Oct 30 02:43:23 prometheus.domain.com sssd[sssd][1743]: Child [26651] ('domain.com':'%BE_domain.com') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Oct 30 02:43:23 prometheus.domain.com sssd[be[domain.com]][28052]: Starting up
Oct 30 02:43:40 prometheus.domain.com sssd[sssd][1743]: Child [27061] ('pam':'pam') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Oct 30 02:43:40 prometheus.domain.com sssd[pam][28071]: Starting up
Oct 30 02:43:43 prometheus.domain.com sssd[pam][28074]: Starting up
Oct 30 02:43:47 prometheus.domain.com sssd[pam][28081]: Starting up
Oct 30 02:43:48 prometheus.domain.com sssd[sssd][1743]: Exiting the SSSD. Could not restart critical service [pam].
Oct 30 02:43:54 prometheus.domain.com systemd[1]: sssd.service: main process exited, code=exited, status=1/FAILURE
Oct 30 02:43:54 prometheus.domain.com systemd[1]: Unit sssd.service entered failed state.
Oct 30 02:43:54 prometheus.domain.com systemd[1]: sssd.service failed.
From time to time sssd stops working, I presume when the server is under heavy load, but I don't have many ideas or clues about where to look.
As you can see below, it has stopped six times since February. Of course, while it is down I cannot receive email, and I reject email for well-known users.
[root@prometheus ~]# grep -srni 'sssd.service: main process exited' /var/log/messages-2022*
/var/log/messages-20220206:6157:Feb 4 11:22:15 prometheus systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
/var/log/messages-20220206:6778:Feb 4 11:31:38 prometheus systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
/var/log/messages-20220605:10673:Jun 4 00:51:52 prometheus systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
/var/log/messages-20221030:3897:Oct 25 04:57:20 prometheus systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
/var/log/messages-20221030:14008:Oct 28 02:11:40 prometheus systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
/var/log/messages-20221030:26258:Oct 30 02:43:54 prometheus systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
I don't know. Normally the watchdog triggers a restart every 10 s if the service is unresponsive, but if the restart fails, systemd marks the service as failed and does not restart it.
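To confirm what the unit currently does, the restart policy can be queried directly (standard systemctl usage; it prints Restart=no when no restart policy is configured):

    systemctl show sssd.service -p Restart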
Nice, in the not-nice sense. This occurs on a single installation, but a failure of this service could lead to total unavailability of services for the whole user base.
I hope the orchestra of metal boxes can be better managed…
I had a quick look at our statistics: I found some evidence of SSSD failures, but they are quite rare and I am unable to inspect the causes.
Also, the provided log doesn't reveal anything useful.
I'm quite surprised that the standard systemd unit doesn't already have the Restart directive.
Still, I'm not sure it's a good idea to add such a custom configuration, but we could probably add it without regressions.
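For example, a minimal drop-in override could look like the sketch below. The file name restart.conf and the 5-second delay are arbitrary choices of mine, and on older systemd releases you may need to create the file by hand rather than through systemctl edit:

    # /etc/systemd/system/sssd.service.d/restart.conf
    [Service]
    Restart=on-failure
    RestartSec=5s

After creating it, run systemctl daemon-reload so systemd picks up the change.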
Options usable in SERVICE and DOMAIN sections
timeout (integer)
Timeout in seconds between heartbeats for this service. This is used to ensure that the process is alive and capable of answering requests. Note that after three missed heartbeats the process will terminate itself.
Default: 10
Instead of configuring the automated restart, I’d prefer to avoid the daemon failure. Maybe increasing the timeout to a larger value, e.g. 60, helps in this case.
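A minimal sketch of /etc/sssd/sssd.conf, assuming a single domain named domain.com; per the man-page excerpt above, the option belongs in SERVICE and DOMAIN sections:

    [pam]
    timeout = 60

    [domain/domain.com]
    timeout = 60

sssd must be restarted for the change to take effect.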
@davidep I may have understood incorrectly…
The heartbeat is currently 10 seconds and is not "long enough" to handle these occurrences. Your proposal is to set it to 60 seconds.
So, if a user hits the sssd issue during a login, do they have to wait 60 seconds to find out that the logon failed?
So… the daemon might take more time to fail, but still would not be restarted?
Yes, I had also thought of increasing the timeout, but I am not sure it is worth it; let me explain.
Testing whether a service responds and restarting it otherwise seems to be a common feature.
I have many entries like these:
531:Nov 6 10:00:25 prometheus sssd[be[domain.fr]]: Backend is offline
532:Nov 6 10:01:20 prometheus sssd[be[domain.fr]]: Backend is online
552:Nov 6 10:18:02 prometheus sssd[sssd]: Child [16229] ('domain.fr':'%BE_domain.fr') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
553:Nov 6 10:18:05 prometheus sssd[be[domain.fr]]: Starting up
The example above is the good case: the service became unresponsive and was properly restarted. However, I have other, bad examples like the one below:
Oct 28 02:11:24 prometheus sssd[sssd]: Child [6503] ('domain.fr':'%BE_domain.fr') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Oct 28 02:11:25 prometheus sssd[sssd]: Child [17715] ('pam':'pam') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Oct 28 02:11:25 prometheus sssd[pam]: Starting up
Oct 28 02:11:27 prometheus sssd[pam]: Starting up
Oct 28 02:11:31 prometheus sssd[pam]: Starting up
Oct 28 02:11:31 prometheus sssd[sssd]: Exiting the SSSD. Could not restart critical service [pam].
Oct 28 02:11:32 prometheus sssd[be[domain.fr]]: Starting up
Oct 28 02:11:40 prometheus sssd[be[domain.fr]]: Shutting down
Oct 28 02:11:40 prometheus sssd[nss]: Shutting down
Oct 28 02:11:40 prometheus systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
Oct 28 02:11:40 prometheus systemd: Unit sssd.service entered failed state.
Oct 28 02:11:40 prometheus systemd: sssd.service failed.
So, from my point of view, delaying the watchdog by increasing the timeout should not be needed: probing the service and restarting it when unresponsive is normal sssd behavior. What we need to fix is the case where sssd itself fails to restart a child fast enough. For the record, sssd made three restart attempts within about six seconds before systemd marked the service as failed.
So we could eventually increase the timeout, but we still have to restart the service in case of failure, and the systemd way is obviously nicer than a cron job.
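Once a drop-in like the one sketched earlier is in place, the effective policy can be verified with standard systemctl tooling (the restart delay is exposed as the RestartUSec property):

    systemctl daemon-reload
    systemctl show sssd.service -p Restart -p RestartUSec

One caveat worth checking on this systemd version: the start rate limiting (StartLimitBurst together with StartLimitInterval) can still leave the unit in the failed state if it crashes repeatedly in a short window, so those limits may need tuning as well.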