SSSD stops working

Oct 30 02:43:23 prometheus.domain.com sssd[sssd][1743]: Child [26651] ('domain.com':'%BE_domain.com') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Oct 30 02:43:23 prometheus.domain.com sssd[be[domain.com]][28052]: Starting up
Oct 30 02:43:40 prometheus.domain.com sssd[sssd][1743]: Child [27061] ('pam':'pam') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Oct 30 02:43:40 prometheus.domain.com sssd[pam][28071]: Starting up
Oct 30 02:43:43 prometheus.domain.com sssd[pam][28074]: Starting up
Oct 30 02:43:47 prometheus.domain.com sssd[pam][28081]: Starting up
Oct 30 02:43:48 prometheus.domain.com sssd[sssd][1743]: Exiting the SSSD. Could not restart critical service [pam].
Oct 30 02:43:54 prometheus.domain.com systemd[1]: sssd.service: main process exited, code=exited, status=1/FAILURE
Oct 30 02:43:54 prometheus.domain.com systemd[1]: Unit sssd.service entered failed state.
Oct 30 02:43:54 prometheus.domain.com systemd[1]: sssd.service failed.

From time to time sssd stops working, especially (I presume) when my server is under heavy load. I don't have many ideas or clues about where to look.

As you can see, it has stopped 6 times since February; of course, while it is stopped I cannot receive emails, and emails for well-known users get rejected.

[root@prometheus ~]# grep -srni 'sssd.service: main process exited' /var/log/messages-2022*
/var/log/messages-20220206:6157:Feb  4 11:22:15 prometheus systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
/var/log/messages-20220206:6778:Feb  4 11:31:38 prometheus systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
/var/log/messages-20220605:10673:Jun  4 00:51:52 prometheus systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
/var/log/messages-20221030:3897:Oct 25 04:57:20 prometheus systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
/var/log/messages-20221030:14008:Oct 28 02:11:40 prometheus systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
/var/log/messages-20221030:26258:Oct 30 02:43:54 prometheus systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
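
In the meantime, to act on the "Consult corresponding logs" hint, I plan to raise sssd's own verbosity before the next crash. A minimal sketch, assuming the stock /etc/sssd/sssd.conf layout (debug_level is a standard sssd.conf option; the section names below mirror the domain and pam services from my log and would need adjusting):

[domain/domain.com]
debug_level = 6

[pam]
debug_level = 6

then restart and watch the per-service logs under /var/log/sssd/:

systemctl restart sssd
less /var/log/sssd/sssd_domain.com.log /var/log/sssd/sssd_pam.log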

@stephdl

Hi Stephane

As long as you can start sssd, it's not too much of a hassle…

But if you can't start sssd by hand or with a reboot, what works is:

  • Make sure you have a working backup - and a config backup (see the sketch after this list)
  • Delete your account provider
  • Restore the config from the latest config backup
  • Reboot your server…
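
If it helps, the backup/restore part can be done from the CLI too. A rough sketch, assuming a NethServer 7 box with the stock backup-config / restore-config tools (command names as I recall them, verify against your version's docs; the account provider removal itself happens in the web UI):

# take a fresh configuration backup before touching anything
backup-config

# ... remove the account provider from the web UI ...

# afterwards, restore the saved configuration and reboot
restore-config
reboot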

Haven't had this issue at home or with my clients for a while now, but I used to have it at least once a year (across all 30 clients, not too bad).

Good luck!

My two cents
Andy

2 Likes

It seems to restart after a failure. For now I have set a systemd directive to restart on failure after 10s.

I can confirm that sssd stops when there’s heavy load.
I used a cron job to restart sssd if it had failed, but the systemd directive seems smarter.

3 Likes

systemctl edit sssd

add

[Service]
Restart=on-failure
RestartSec=60s

save then

systemctl daemon-reload

verify

systemctl cat sssd

restart sssd

systemctl restart sssd

verify by killing the PID of sssd; it must restart after 60s

kill -9 $(pidof sssd)

check that the process starts again: systemctl status sssd
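
For reference, the same drop-in can be created without the interactive editor; a minimal sketch (override.conf is the file name systemctl edit creates by default):

mkdir -p /etc/systemd/system/sssd.service.d
cat > /etc/systemd/system/sssd.service.d/override.conf <<'EOF'
[Service]
Restart=on-failure
RestartSec=60s
EOF
systemctl daemon-reload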

However, reading sssd.conf(5): config file for SSSD - Linux man page

I saw the default timeout is set to 10s; maybe we could try to increase it a bit

3 Likes

@stephdl is there a reason why sssd is not already configured like that, or with a 30-second pause?

I don't know; normally the watchdog triggers a restart every 10s if the service is unresponsive, but if the restart fails, systemd sets the service as failed and does not restart it
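
You can see which policy is currently in effect with systemctl show, which prints the unit's effective properties; on a unit without any Restart directive it reports Restart=no:

systemctl show sssd -p Restart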

Nice. In the not-nice meaning. This occurs on a single installation, where this service failing can lead to total unavailability of services for the whole userbase.
I hope that the orchestra of metal boxes can be better managed…

I am happy that I am not the only one concerned; maybe we could think about it after the Halloween break

Cc @giacomo @davidep

Anyhow, maybe it is not relevant to many servers, and this is why we did not see it before

Maybe it is not relevant because when it happens, the answer is "more hardware, so it won't happen again".

1 Like

I had a quick look at our statistics: I've found some evidence of SSSD failures, but they are quite rare and I'm not able to inspect the causes.
Also, the provided log doesn't say anything useful.

I'm quite surprised that the standard systemd unit doesn't already have the Restart directive.
Still, I’m not sure if it’s a good idea to add such a custom configuration, but probably we could add it without regressions.

Let’s see what Davide thinks about it.

2 Likes

This happened to me too on one of my servers; I solved it with a cron job like mrmarkuz did:
*/1 * * * * systemctl is-active --quiet sssd || systemctl restart sssd

2 Likes

From sssd.conf(5) man page:

   Options usable in SERVICE and DOMAIN sections
       timeout (integer)
           Timeout in seconds between heartbeats for this service. This is used to ensure that the process is alive and capable of answering requests. Note that after three missed
           heartbeats the process will terminate itself.

           Default: 10

Instead of configuring the automated restart, I’d prefer to avoid the daemon failure. Maybe increasing the timeout to a larger value, e.g. 60, helps in this case.
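
A minimal sketch of what that could look like in /etc/sssd/sssd.conf (per the man page excerpt above, timeout is valid in SERVICE and DOMAIN sections; the section names here mirror the logs in this thread and would need adjusting):

[pam]
timeout = 60

[domain/domain.com]
timeout = 60

then apply with:

systemctl restart sssd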

1 Like

@davidep I may have understood incorrectly…
The heartbeat is currently 10 seconds and is not "long enough" to handle those occurrences. Your proposal is to set it to 60 seconds.
So, if a user hits the sssd issue during a login phase, do they have to wait 60 seconds to learn that the logon failed?

So… the daemon might take more time to fail, but would still not be restarted anyway?

Yes, I had the idea of increasing the timeout too, but I am not sure it is worth it; let me explain.

It seems it is a common feature: test whether the service is responding, and restart it if not.

I have many entries like these:

531:Nov  6 10:00:25 prometheus sssd[be[domain.fr]]: Backend is offline
532:Nov  6 10:01:20 prometheus sssd[be[domain.fr]]: Backend is online
552:Nov  6 10:18:02 prometheus sssd[sssd]: Child [16229] ('domain.fr':'%BE_domain.fr') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
553:Nov  6 10:18:05 prometheus sssd[be[domain.fr]]: Starting up

The example above is good: the service was unresponsive and was properly restarted. However, I have other, bad examples like the one below.

Oct 28 02:11:24 prometheus sssd[sssd]: Child [6503] ('domain.fr':'%BE_domain.fr') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Oct 28 02:11:25 prometheus sssd[sssd]: Child [17715] ('pam':'pam') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Oct 28 02:11:25 prometheus sssd[pam]: Starting up
Oct 28 02:11:27 prometheus sssd[pam]: Starting up
Oct 28 02:11:31 prometheus sssd[pam]: Starting up
Oct 28 02:11:31 prometheus sssd[sssd]: Exiting the SSSD. Could not restart critical service [pam].
Oct 28 02:11:32 prometheus sssd[be[domain.fr]]: Starting up
Oct 28 02:11:40 prometheus sssd[be[domain.fr]]: Shutting down
Oct 28 02:11:40 prometheus sssd[nss]: Shutting down
Oct 28 02:11:40 prometheus systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
Oct 28 02:11:40 prometheus systemd: Unit sssd.service entered failed state.
Oct 28 02:11:40 prometheus systemd: sssd.service failed.

So from my point of view it should not be necessary to delay the watchdog by increasing the timeout, because testing and restarting an unresponsive service is normal sssd behavior. What we need to fix is the case where sssd fails to restart because it is not fast enough: for the record, in the log above there were 3 restart attempts within 6 seconds before the service was set as failed.

So eventually we could increase the timeout, but we still have to restart the service in case of failure, and the systemd way is obviously nicer than a cron job.

cc @giacomo @davidep

3 Likes

Thanks for this fix Steph, the random sssd failures have been bugging me for some time now.

1 Like