DNS Not Resolving Suddenly

NethServer release 7.5.1804 (final)

I came back from the weekend to find that Domain Accounts and Users and Groups are not available. Other odd symptoms point to the Windows 7 computers being unable to communicate with the domain (mapped drives disappearing, authentication issues).

In the web console, the Dashboard and the Users and Groups section both say “Account provider generic error: SSSD exit code 1”

The Domain Accounts section has errors including:

ads_connect: No logon servers are currently available to service the logon request.
Didn’t find the ldap server!

Join is OK

ads_connect: No logon servers are currently available to service the logon request.

Updating packages through the software center or with yum on the command line fails with an error saying the package source cannot be reached (I did not note the exact wording).

Dig seems to use the expected dns server (10.1.10.1):

> dig google.com
> 
> ; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7_5.1 <<>> google.com
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12295
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
> 
> ;; OPT PSEUDOSECTION:
> ; EDNS: version: 0, flags:; udp: 4096
> ;; QUESTION SECTION:
> ;google.com.                    IN      A
> 
> ;; ANSWER SECTION:
> google.com.             300     IN      A       216.58.192.238
> 
> ;; Query time: 23 msec
> ;; SERVER: 10.1.10.1#53(10.1.10.1)
> ;; WHEN: Wed Dec 05 12:15:43 CST 2018
> ;; MSG SIZE  rcvd: 55

But ping says network not reachable.

> ping google.com
> connect: Network is unreachable

Originally, I wasn’t even getting this so I added a line in /etc/resolv.conf to add correct nameserver entries. I did this by editing /etc/e-smith/templates/etc/resolv.conf/40dnsRoleResolver to have the appropriate entries and then ran expand-template /etc/resolv.conf, which wrote the correct version of resolv.conf. Dig picked up the change but after a reboot, I still do not seem to have correct DNS resolution. My nethserver (10.1.10.27) still cannot see my logon server running virtually (I assume) on the same machine at 10.1.10.75.

‘Feels’ like 10.1.10.75 has an issue…

That might be a problem, since the samba4 authentication, however it is implemented, has always been buggy for me.

However, ping returning “Network is unreachable” makes me think this is DNS and the 10.1.10.75 problem is a symptom of DNS.
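
For what it's worth, the wording of the ping error can hint at which layer is failing: "Network is unreachable" is what the kernel reports when it has no route to the destination, whereas a resolver failure normally shows up as "unknown host" or "Name or service not known". A minimal sketch of that distinction follows; the helper name `classify_ping_error` is made up for illustration and the list of messages is not exhaustive.

```shell
# Classify a ping error message as a routing or a DNS problem.
# classify_ping_error is a hypothetical helper, not a standard tool.
classify_ping_error() {
    case "$1" in
        *"Network is unreachable"*)
            echo routing ;;   # no route to the destination
        *"unknown host"*|*"Name or service not known"*|*"Temporary failure in name resolution"*)
            echo dns ;;       # the name could not be resolved
        *)
            echo unknown ;;
    esac
}

# Possible usage on a live system (not run here):
#   classify_ping_error "$(ping -c 1 google.com 2>&1)"
```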

On Red Hat 5-7 I’ve always fixed DNS by changing the nameserver in resolv.conf, but I’m not sure whether NS uses the same network scripts and stack or something proprietary. (NS uses the init.d network script.)

Could the firewall be blocking something here? I haven’t added any new blocking rules. I only allowed 5666 tcp/udp for NRPE.

Here are some interesting results:
From a completely different computer on the network:

nmap 10.1.10.75
Starting Nmap 7.70 ( https://nmap.org ) at 2018-12-05 19:11 Ame
Nmap scan report for 10.1.10.75
Host is up (0.0062s latency).
Not shown: 987 closed ports
PORT      STATE SERVICE
53/tcp    open  domain
88/tcp    open  kerberos-sec
135/tcp   open  msrpc
139/tcp   open  netbios-ssn
389/tcp   open  ldap
445/tcp   open  microsoft-ds
464/tcp   open  kpasswd5
636/tcp   open  ldapssl
3268/tcp  open  globalcatLDAP
3269/tcp  open  globalcatLDAPssl
49152/tcp open  unknown
49153/tcp open  unknown
49154/tcp open  unknown
MAC Address: 9A:82:66:8B:97:91 (Unknown)

Nmap done: 1 IP address (1 host up) scanned in 1.89 seconds

From my nethserver (10.1.10.27):

 nmap 10.1.10.75

Starting Nmap 6.40 ( http://nmap.org ) at 2018-12-05 13:10 CST
nmap: nsock_pool.c:227: nsp_delete: Assertion `nse->iod->events_pending >= 0' failed.
Aborted

Well it appears I have no gateway out. I have no idea how that happened since I have not touched the network stack since the install.

 route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.1.10.0       0.0.0.0         255.255.255.0   U     0      0        0 br0
169.254.0.0     0.0.0.0         255.255.0.0     U     1003   0        0 br0

And there it is:

>  cat /etc/sysconfig/network-scripts/ifcfg-br0
> DEVICE=br0
> BOOTPROTO=none
> GATEWAY=
> IPADDR=10.1.10.27
> NETMASK=255.255.255.0
> NM_CONTROLLED=no
> ONBOOT=yes
> TYPE=Bridge
> USERCTL=no

I have no idea how this could have happened. Looks like it was last written on November 21.

SOLVED(ish)
Anyway, here is what got me up and running. This fix is for a missing default gateway. Identify it by running route -n: if every entry in the Gateway column is 0.0.0.0 (no row has Destination 0.0.0.0 with the UG flags), there is no default gateway and the host does not know how to reach anything outside its own subnet, in my case 10.1.10.0/24.

> echo 'GATEWAY=10.1.10.1' > /etc/e-smith/templates/etc/sysconfig/network/40gateway
> expand-template /etc/sysconfig/network
> grep GATEWAY /etc/sysconfig/network
> GATEWAY=10.1.10.1
> /etc/init.d/network restart

Test to make sure gateway exists:

>  route -n
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> 10.1.10.0       0.0.0.0         255.255.255.0   U     0      0        0 br0
> 169.254.0.0     0.0.0.0         255.255.0.0     U     1003   0        0 br0

WTF, still no default route. Checking the config files turns up another, conflicting GATEWAY entry: an empty GATEWAY= in ifcfg-br0. Remove that entry and restart the network stack.

> grep -R GATEWAY /etc/sysconfig/network-scripts/ifcfg-*;echo $?
> /etc/sysconfig/network-scripts/ifcfg-br0:GATEWAY=
> 0
> 
> sed -i /GATEWAY/d /etc/sysconfig/network-scripts/ifcfg-br0
> 
> grep -R GATEWAY /etc/sysconfig/network-scripts/ifcfg-br0;echo $?
> 1
> 
> /etc/init.d/network restart;route -n
> 
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
> 0.0.0.0         10.1.10.1       0.0.0.0         UG    0      0        0 br0
> 10.1.10.0       0.0.0.0         255.255.255.0   U     0      0        0 br0
> 169.254.0.0     0.0.0.0         255.255.0.0     U     1003   0        0 br0
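
The routing-table check used above can be scripted. This is only a sketch: the function name `has_default_route` is made up, and it assumes the classic `route -n` column layout shown in this post (Destination in column 1, Flags in column 4).

```shell
# Return 0 if the `route -n` output on stdin contains a default route,
# i.e. a row whose Destination is 0.0.0.0 and whose Flags include G.
# has_default_route is a hypothetical helper name.
has_default_route() {
    awk '$1 == "0.0.0.0" && $4 ~ /G/ { found = 1 } END { exit !found }'
}

# Possible usage on a live system (not run here):
#   route -n | has_default_route || echo "no default gateway configured"
```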

Verify

>  ping google.com
> PING google.com (216.58.192.238) 56(84) bytes of data.

Where do I find the config file for br0? I need to make sure that empty GATEWAY= variable never returns. I have no idea how this would have changed but this was the problem and this was a fix for it. Hopefully I included enough information to log this and to help the next poor bastard out.
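
Until the cause is known, one way to at least notice the empty variable coming back is a small check that could be run by hand or from cron. This is a sketch only: `has_empty_gateway` is a made-up name, and it only looks for a bare `GATEWAY=` line in whatever file it is given.

```shell
# Return 0 if the file given as $1 contains a GATEWAY= line with no value.
# has_empty_gateway is a hypothetical helper name.
has_empty_gateway() {
    grep -Eq '^[[:space:]]*GATEWAY=[[:space:]]*$' "$1"
}

# Possible usage on a live system (not run here):
#   has_empty_gateway /etc/sysconfig/network-scripts/ifcfg-br0 \
#       && echo "empty GATEWAY= is back in ifcfg-br0"
```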

Also my 10.1.10.75 domain controller is still not showing up and I still have the error:

Account provider generic error: SSSD exit code 1

Restarting a few services did not fix this problem. The best case scenario is that I just didn’t restart them in the correct order and a reboot will fix it. I wish I could fix this without a reboot so I know what happened and how to fix it in the future.

Worst case scenario is that the missing GATEWAY was a symptom of a larger problem and this is another symptom and a reboot won’t fix it.

Also, I will note that 10.1.10.75 seems to be working. I have the Active Directory Users and Computers tool open on my Windows computer and connected to the Domain Controller, and I am able to adjust users and computers. So I’m optimistic that it is some sort of lingering DNS funkiness and that a reboot will start all services in the correct order. I cannot test this for a few hours since the samba shares are still being used in production.

Some of the devs need to take a look at this issue…

br0, networking etc…

Just to follow up, the nethserver host (10.1.10.27) still cannot resolve the samba domain controller (10.1.10.75) at all, even though it is in DNS and every other computer in my environment gets that entry from DNS. After adding its FQDN to the /etc/hosts file, my nethserver can see the domain again and I no longer get the SSSD error.
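
To confirm that such an entry is actually present, a check along these lines could be used. It is a sketch: `hosts_has_ip` is a made-up helper, and it only tests that a hosts-format file has a non-comment line mapping the given IP to at least one name. Whether a hand-edited /etc/hosts survives template expansion on NethServer is something I have not verified.

```shell
# Return 0 if the hosts-format file $2 has an active (non-comment) entry
# that maps the IP address $1 to at least one hostname.
# hosts_has_ip is a hypothetical helper name.
hosts_has_ip() {
    awk -v ip="$1" '$1 == ip && NF >= 2 && $2 !~ /^#/ { found = 1 } END { exit !found }' "$2"
}

# Possible usage on a live system (not run here):
#   hosts_has_ip 10.1.10.75 /etc/hosts || echo "no hosts entry for the DC"
```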

I don’t need any assistance and am mostly documenting this for people with a similar problem from the internet.

I would think that after making NS a DC with domain accounts, it would be beneficial to add this entry to the e-smith templates automatically. That way, even if DNS is gone, the NS host itself would still be able to find the samba instance it is hosting for authentication.

Hi Jeremy, please note that changes to templates are overwritten when the package containing the template is updated. If you want a custom template, you must create it in
/etc/e-smith/templates-custom/etc/what/ever/template/...

However, to set a gateway for the bridge br0, I suggest doing so in the server-manager:

(Configuration) Network > (br0) Edit > and set the gateway.

Here is the overview of the network setup of one of my (simple) test instances with AD installed:

# db networks show
br0=bridge
    gateway=10.0.0.100
    ipaddr=10.0.0.2
    netmask=255.255.255.0
    role=green
eth0=ethernet
    FwInBandwidth=
    FwOutBandwidth=
    bridge=br0
    role=bridged
ppp0=xdsl-disabled
    AuthType=auto
    FwInBandwidth=
    FwOutBandwidth=
    Password=
    name=PPPoE
    provider=xDSL provider
    role=red
    user=
wlan0=ethernet
    role=

Resulting in this Kernel IP routing table:

# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.0.100      0.0.0.0         UG    0      0        0 br0
10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 br0
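
To compare the configured gateway with the routing table, the gateway property can be pulled out of the `db networks show` output with a small filter. This is a sketch under assumptions: `db_gateway` is a made-up name, and it relies on the layout shown above (an unindented `key=type` line followed by indented `prop=value` lines).

```shell
# Print the gateway property of the network key $1 from `db networks show`
# output on stdin; prints nothing if the key has no gateway property.
# db_gateway is a hypothetical helper name.
db_gateway() {
    awk -v key="$1" '
        /^[^ \t]/ { split($0, a, "="); current = a[1]; next }   # key=type line
        current == key && $1 ~ /^gateway=/ {                    # indented property
            sub(/^gateway=/, "", $1); print $1; exit
        }
    '
}

# Possible usage on a live system (not run here):
#   db networks show | db_gateway br0
```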

EDIT:
Afterwards I noticed that /etc/e-smith/templates/etc/sysconfig/network/40gateway does not exist on a default install, hence it won’t be overwritten. I still recommend creating custom templates in /etc/e-smith/templates-custom to keep an overview of the customizations :grinning:
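
As a rough illustration of where a custom fragment would go: assuming the custom tree simply mirrors the stock tree path for path, the location can be derived from the stock template path. The helper name `custom_template_path` is made up for this sketch.

```shell
# Map a stock template path to the matching path under templates-custom.
# Assumes the custom tree mirrors the stock tree; returns 1 for other paths.
# custom_template_path is a hypothetical helper name.
custom_template_path() {
    case "$1" in
        /etc/e-smith/templates/*)
            printf '%s\n' "/etc/e-smith/templates-custom/${1#/etc/e-smith/templates/}" ;;
        *)
            return 1 ;;
    esac
}

# Possible usage (not run here):
#   custom_template_path /etc/e-smith/templates/etc/sysconfig/network/40gateway
```

After placing a fragment there, the file would still need to be regenerated with expand-template, as done earlier in this thread.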

I’m not sure what is going on here, but this breaks at the beginning of every week and the fix is never the same. Now I am back to SSSD exit code 1 and no one can authenticate against the nethserver domain controller.

The gateway disappeared on me again so I changed it via the web interface. I still don’t understand why this wasn’t set or why it disappeared.