Service sssd has started to crash frequently

NethServer Version: 7.9.2009
Module: sssd

My sssd service has started to crash frequently, which causes all mail to be delivered to the catch-all mailbox and users can't access their mailboxes.
This is the log from shortly before the crash and from the startup almost twelve hours later.

Dec 7 12:00:40 sssd: tkey query failed: GSSAPI error: Major = Unspecified GSS failure. Minor code may provide more information, Minor = Server not found in Kerberos database.
Dec 7 12:02:05 sssd: ; TSIG error with server: tsig verify failure
Dec 7 12:02:08 sssd: ; TSIG error with server: tsig verify failure
Dec 7 12:02:09 sssd: ; TSIG error with server: tsig verify failure
Dec 7 12:02:09 sssd: ; TSIG error with server: tsig verify failure
Dec 7 12:02:09 sssd: tkey query failed: GSSAPI error: Major = Unspecified GSS failure. Minor code may provide more information, Minor = Server not found in Kerberos database.
Dec 7 12:05:08 sssd[sssd]: Child [3230] ('':'%BE_') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Dec 7 12:05:08 sssd[be[]]: Starting up
Dec 7 12:05:39 sssd[sssd]: Child [5316] ('':'%BE_') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Dec 7 12:05:39 sssd[be[]]: Starting up
Dec 7 12:06:07 sssd[sssd]: Child [8638] ('nss':'nss') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Dec 7 12:06:07 sssd[nss]: Starting up
Dec 7 12:06:09 sssd[nss]: Starting up
Dec 7 12:06:10 sssd[sssd]: Child [5350] ('':'%BE_') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Dec 7 12:06:10 sssd[be[]]: Starting up
Dec 7 12:06:13 sssd[nss]: Starting up
Dec 7 12:06:13 sssd[sssd]: Exiting the SSSD. Could not restart critical service [nss].
Dec 7 12:07:05 sssd[pam]: Shutting down
Dec 7 12:07:05 systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
Dec 7 12:07:05 systemd: Unit sssd.service entered failed state.
Dec 7 12:07:05 systemd: sssd.service failed.
Dec 7 22:38:03 sssd[sssd]: Starting up
Dec 7 22:38:03 sssd[be[]]: Starting up
Dec 7 22:38:03 sssd[pam]: Starting up
Dec 7 22:38:03 sssd[nss]: Starting up
Dec 7 22:38:05 sssd: ; TSIG error with server: tsig verify failure
Dec 7 22:38:05 sssd: ; TSIG error with server: tsig verify failure
Dec 7 22:38:06 sssd: ; TSIG error with server: tsig verify failure
Dec 7 22:38:06 sssd: ; TSIG error with server: tsig verify failure
Dec 7 22:38:06 sssd: tkey query failed: GSSAPI error: Major = Unspecified GSS failure. Minor code may provide more information, Minor = Server not found in Kerberos database.

Could it be that there was high load when sssd gave up?

Does restarting the service work?

systemctl restart sssd

Maybe interesting:


The server has been running since 2019 with very low load and few issues; it has only five users. The hardware is barely loaded, which is one reason why I plan to migrate it to a VM for NS8 so I can free up resources for other projects.

It's never a problem to start it from the service panel after it has failed. I did a restart just now, which gave this log output.

Dec 9 16:03:16 sssd[nss]: Shutting down
Dec 9 16:03:16 sssd[be[]]: Shutting down
Dec 9 16:03:16 sssd[pam]: Shutting down
Dec 9 16:03:16 systemd: Stopped System Security Services Daemon.
Dec 9 16:03:16 systemd: Starting System Security Services Daemon…
Dec 9 16:03:17 sssd[sssd]: Starting up
Dec 9 16:03:17 sssd[be[]]: Starting up
Dec 9 16:03:18 sssd[pam]: Starting up
Dec 9 16:03:18 sssd[nss]: Starting up
Dec 9 16:03:18 systemd: Started System Security Services Daemon.
Dec 9 16:03:19 sssd: ; TSIG error with server: tsig verify failure
Dec 9 16:03:19 sssd: ; TSIG error with server: tsig verify failure
Dec 9 16:03:20 sssd: ; TSIG error with server: tsig verify failure
Dec 9 16:03:21 sssd: ; TSIG error with server: tsig verify failure
Dec 9 16:03:21 sssd: tkey query failed: GSSAPI error: Major = Unspecified GSS failure. Minor code may provide more information, Minor = Server not found in Kerberos database.

A few more log entries from the last failure:
sssd.log

(2024-12-07 12:06:13): [sssd] [monitor_restart_service] (0x0010): Process [nss], definitely stopped

sssd_nss.log

(2024-12-07 12:05:22): [nss] [nss_dp_reconnect_init] (0x0010): Could not reconnect to provider.
(2024-12-07 12:06:07): [nss] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(2024-12-07 12:06:07): [nss] [sss_process_init] (0x0010): fatal error setting up backend connector
(2024-12-07 12:06:07): [nss] [nss_process_init] (0x0010): sss_process_init() failed
(2024-12-07 12:06:09): [nss] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(2024-12-07 12:06:09): [nss] [sss_process_init] (0x0010): fatal error setting up backend connector
(2024-12-07 12:06:09): [nss] [nss_process_init] (0x0010): sss_process_init() failed
(2024-12-07 12:06:13): [nss] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(2024-12-07 12:06:13): [nss] [sss_process_init] (0x0010): fatal error setting up backend connector
(2024-12-07 12:06:13): [nss] [nss_process_init] (0x0010): sss_process_init() failed

sssd_pam.log

(2024-12-07 12:01:56): [pam] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(2024-12-07 12:02:02): [pam] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(2024-12-07 12:07:05): [pam] [orderly_shutdown] (0x0010): SIGTERM: killing children

sssd_.log

(2024-12-01 10:18:57): [be[]] [id_callback] (0x0010): The Monitor returned an error [org.freedesktop.DBus.Error.NoReply]

The domain controller is on the same machine and part of NS7.
I have seen a similar error on other joined servers; they fail because there are multiple hostnames for the same IP, so when a reverse lookup is done it sometimes picks the name of a service, e.g. jellyfin, instead of the server's hostname. I have two hostnames, ad and ldap, configured for the IP used by Samba/AD, but I added those years ago, and this problem has only started happening during the last few months.
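For what it's worth, a quick way to see which names a reverse lookup returns for that IP is something like the following (the IP address below is only a placeholder for the address actually used by Samba/AD):

dig -x 192.168.1.10 +short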

What about the load average?

[root@server2 ~]# uptime
 19:56:07 up 1 day,  6:42,  1 user,  load average: 0.11, 0.27, 0.29

It seems to be a long-standing issue:

As you want to migrate the server, I'd recommend one of the workarounds.
You could set another timeout or RestartSec value as explained in the links from my previous post, or just restart sssd via cron if it's not running anymore:
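As a minimal sketch (assuming it goes into root's crontab), a job like this would restart sssd every five minutes if it is no longer active:

*/5 * * * * /usr/bin/systemctl is-active --quiet sssd || /usr/bin/systemctl restart sssd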

My average load is very low.
11:26:31 up 12 days, 1:28, 1 user, load average: 0,23, 0,52, 0,52
I’ll try your suggested workaround and schedule a monitoring/restart.
I had to uncheck it as the solution because at 12:07 today it crashed again. The scheduled job was running, but it did nothing.


You could also try the other workarounds.

But as the server doesn’t have high load, it could be another issue…

Did you already check the hardware?

It seems to always happen at 12:07. Does a backup or some special script run at that time?

It's not always at 12:07; that was only the case for two of the latest crashes, and at that time clamav is updated, which I don't even need since everything inbound is already checked in the firewall.
After digging around in the logs, my theory is that one of the drives in the mirrored RAID is acting up, so I will start by replacing that one.
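For reference, assuming it is an mdadm software mirror, the array and drive health can be checked with something like this (the device name is only an example, and smartmontools needs to be installed):

cat /proc/mdstat
smartctl -H /dev/sda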
Any suggestions what I can do in the short-term until I get a new drive?

Do you have the Statistics (collectd) package from NethServer 7 installed? There you can check if there was high system load at the time of the crash.

If you’re already using some monitoring software (for example Zabbix) you may add a check to monitor the sssd service/logfiles and get alerted in case of issues.

A simpler approach would be a cronjob running the following command every minute. If the error "The Data Provider returned an error" appears in the logs of the last minute, a mail is sent to user@maildomain.tld.

journalctl --since="1 minute ago" | grep -q "The Data Provider returned an error" && echo "There's an SSSD error again." | mail -s "SSSD Error" user@maildomain.tld
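For reference, the corresponding crontab entry (e.g. in root's crontab; the recipient address is just the placeholder from above) could look roughly like this:

* * * * * journalctl --since="1 minute ago" | grep -q "The Data Provider returned an error" && echo "There's an SSSD error again." | mail -s "SSSD Error" user@maildomain.tld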

This way you can check the system when the error occurs and try to restart the service (which may not work, as the recommended cron script wasn't able to do it).

Set up an automatic restart via systemd; it has saved my bacon. When sssd stopped working it was under load; I fixed the load but kept the restart. Not a peep from it in years now.
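A minimal sketch of such a systemd override, created with systemctl edit sssd (the values are just examples):

[Service]
Restart=on-failure
RestartSec=30s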
