Service sssd has started to crash frequently

NethServer Version: 7.9.2009
Module: sssd

My sssd service has started to crash frequently, which causes all mail to be delivered to the catch-all mailbox and users can't access their mailboxes.
This is the log from shortly before the crash and from the startup almost twelve hours later.

Dec 7 12:00:40 sssd: tkey query failed: GSSAPI error: Major = Unspecified GSS failure. Minor code may provide more information, Minor = Server not found in Kerberos database.
Dec 7 12:02:05 sssd: ; TSIG error with server: tsig verify failure
Dec 7 12:02:08 sssd: ; TSIG error with server: tsig verify failure
Dec 7 12:02:09 sssd: ; TSIG error with server: tsig verify failure
Dec 7 12:02:09 sssd: ; TSIG error with server: tsig verify failure
Dec 7 12:02:09 sssd: tkey query failed: GSSAPI error: Major = Unspecified GSS failure. Minor code may provide more information, Minor = Server not found in Kerberos database.
Dec 7 12:05:08 sssd[sssd]: Child [3230] ('':'%BE_') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Dec 7 12:05:08 sssd[be[]]: Starting up
Dec 7 12:05:39 sssd[sssd]: Child [5316] ('':'%BE_') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Dec 7 12:05:39 sssd[be[]]: Starting up
Dec 7 12:06:07 sssd[sssd]: Child [8638] ('nss':'nss') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Dec 7 12:06:07 sssd[nss]: Starting up
Dec 7 12:06:09 sssd[nss]: Starting up
Dec 7 12:06:10 sssd[sssd]: Child [5350] ('':'%BE_') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Dec 7 12:06:10 sssd[be[]]: Starting up
Dec 7 12:06:13 sssd[nss]: Starting up
Dec 7 12:06:13 sssd[sssd]: Exiting the SSSD. Could not restart critical service [nss].
Dec 7 12:07:05 sssd[pam]: Shutting down
Dec 7 12:07:05 systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
Dec 7 12:07:05 systemd: Unit sssd.service entered failed state.
Dec 7 12:07:05 systemd: sssd.service failed.
Dec 7 22:38:03 sssd[sssd]: Starting up
Dec 7 22:38:03 sssd[be[]]: Starting up
Dec 7 22:38:03 sssd[pam]: Starting up
Dec 7 22:38:03 sssd[nss]: Starting up
Dec 7 22:38:05 sssd: ; TSIG error with server: tsig verify failure
Dec 7 22:38:05 sssd: ; TSIG error with server: tsig verify failure
Dec 7 22:38:06 sssd: ; TSIG error with server: tsig verify failure
Dec 7 22:38:06 sssd: ; TSIG error with server: tsig verify failure
Dec 7 22:38:06 sssd: tkey query failed: GSSAPI error: Major = Unspecified GSS failure. Minor code may provide more information, Minor = Server not found in Kerberos database.

Could it be that there was high load when sssd gave up?

Does restarting the service work?

systemctl restart sssd

Maybe interesting:


The server has been running since 2019 with very low load and few issues; it has only five users. The hardware is barely loaded, which is one reason why I plan to migrate it to a VM for NS8 so I can free up resources for other projects.

It's never a problem to start it from the service panel after it has failed. I did a restart just now, which gave this log output.

Dec 9 16:03:16 sssd[nss]: Shutting down
Dec 9 16:03:16 sssd[be[]]: Shutting down
Dec 9 16:03:16 sssd[pam]: Shutting down
Dec 9 16:03:16 systemd: Stopped System Security Services Daemon.
Dec 9 16:03:16 systemd: Starting System Security Services Daemon…
Dec 9 16:03:17 sssd[sssd]: Starting up
Dec 9 16:03:17 sssd[be[]]: Starting up
Dec 9 16:03:18 sssd[pam]: Starting up
Dec 9 16:03:18 sssd[nss]: Starting up
Dec 9 16:03:18 systemd: Started System Security Services Daemon.
Dec 9 16:03:19 sssd: ; TSIG error with server: tsig verify failure
Dec 9 16:03:19 sssd: ; TSIG error with server: tsig verify failure
Dec 9 16:03:20 sssd: ; TSIG error with server: tsig verify failure
Dec 9 16:03:21 sssd: ; TSIG error with server: tsig verify failure
Dec 9 16:03:21 sssd: tkey query failed: GSSAPI error: Major = Unspecified GSS failure. Minor code may provide more information, Minor = Server not found in Kerberos database.

A few more log entries from the last failure:
sssd.log

(2024-12-07 12:06:13): [sssd] [monitor_restart_service] (0x0010): Process [nss], definitely stopped

sssd_nss.log

(2024-12-07 12:05:22): [nss] [nss_dp_reconnect_init] (0x0010): Could not reconnect to provider.
(2024-12-07 12:06:07): [nss] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(2024-12-07 12:06:07): [nss] [sss_process_init] (0x0010): fatal error setting up backend connector
(2024-12-07 12:06:07): [nss] [nss_process_init] (0x0010): sss_process_init() failed
(2024-12-07 12:06:09): [nss] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(2024-12-07 12:06:09): [nss] [sss_process_init] (0x0010): fatal error setting up backend connector
(2024-12-07 12:06:09): [nss] [nss_process_init] (0x0010): sss_process_init() failed
(2024-12-07 12:06:13): [nss] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(2024-12-07 12:06:13): [nss] [sss_process_init] (0x0010): fatal error setting up backend connector
(2024-12-07 12:06:13): [nss] [nss_process_init] (0x0010): sss_process_init() failed

sssd_pam.log

(2024-12-07 12:01:56): [pam] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(2024-12-07 12:02:02): [pam] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(2024-12-07 12:07:05): [pam] [orderly_shutdown] (0x0010): SIGTERM: killing children

sssd_.log

(2024-12-01 10:18:57): [be[]] [id_callback] (0x0010): The Monitor returned an error [org.freedesktop.DBus.Error.NoReply]

The domain controller is on the same machine and part of NS7.
I have seen a similar error on other joined servers; they fail because there are multiple hostnames for the same IP, so when a reverse lookup is done it sometimes picks the name of a service, e.g. jellyfin, instead of the server's hostname. I have two hostnames, ad and ldap, configured for the IP used by Samba/AD, but I added those years ago, and this problem has only started happening during the last few months.
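For what it's worth, a quick way to see which names a reverse lookup returns for that IP is something like the following (the IP address below is only a placeholder for the address actually used by Samba/AD):

dig -x 192.168.1.10 +short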

What about the load average?

[root@server2 ~]# uptime
 19:56:07 up 1 day,  6:42,  1 user,  load average: 0.11, 0.27, 0.29

It seems to be a long-standing issue:

As you want to migrate the server, I'd recommend one of the workarounds.
You could set another timeout or RestartSec value as explained in the links from my previous post, or just restart sssd via cron if it's not running anymore:
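As a minimal sketch (assuming it goes into root's crontab), a job like this would restart sssd every five minutes if it is no longer active:

*/5 * * * * /usr/bin/systemctl is-active --quiet sssd || /usr/bin/systemctl restart sssd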

My average load is very low.
11:26:31 up 12 days, 1:28, 1 user, load average: 0,23, 0,52, 0,52
I’ll try your suggested workaround and schedule a monitoring/restart.
I had to uncheck it as the solution because at 12:07 today it crashed again. The scheduled job was running, but it did nothing.


You could also try the other workarounds.

But as the server doesn’t have high load, it could be another issue…

Did you already check the hardware?

It seems to always happen at 12:07. Does a backup or some special script run at that time?

It's not always at 12:07; that was only the case for two of the latest crashes, and at that time clamav is updated, which I don't even need since everything inbound is already checked in the firewall.
After digging around in the logs, my theory is that one of the drives in the mirrored RAID is acting up, so I will start by replacing that one.
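For reference, assuming it is an mdadm software mirror, the array and drive health can be checked with something like this (the device name is only an example, and smartmontools needs to be installed):

cat /proc/mdstat
smartctl -H /dev/sda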
Any suggestions what I can do in the short-term until I get a new drive?

Do you have the Statistics (collectd) package from NethServer 7 installed? There you can check if there was high system load at the time of the crash.

If you’re already using some monitoring software (for example Zabbix) you may add a check to monitor the sssd service/logfiles and get alerted in case of issues.

A simpler approach would be a cronjob running the following command every minute. If the error "The Data Provider returned an error" appears in the logs of the last minute, a mail is sent to user@maildomain.tld.

journalctl --since="1 minute ago" | grep -q "The Data Provider returned an error" && echo "There's an SSSD error again." | mail -s "SSSD Error" user@maildomain.tld
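For reference, the corresponding crontab entry (e.g. in root's crontab; the recipient address is just the placeholder from above) could look roughly like this:

* * * * * journalctl --since="1 minute ago" | grep -q "The Data Provider returned an error" && echo "There's an SSSD error again." | mail -s "SSSD Error" user@maildomain.tld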

This way you can check the system when the error occurs and try to restart the service (which may not work, as the recommended cron script wasn't able to do it).

Set up an automatic restart via systemd; it has saved my bacon. When sssd stopped working it was under load; I fixed the load but kept the restart. Not a peep from it in years now.
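A minimal sketch of such a systemd override, created with systemctl edit sssd (the values are just examples):

[Service]
Restart=on-failure
RestartSec=30s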
