SSSD failing after updates

Greetings all,

Longtime e-smith/SME/% user with an issue on NethServer release 7.6.1810 (final). The server is a standalone mail, LDAP authentication only box, no AD.

I took the below updates during quarterly maintenance on Saturday. Since then, we’re experiencing an intermittent issue where users cannot connect and mail bounces with reports that the user does not exist. I’ve tracked it down to sssd failing, but am not familiar enough with sssd & nss to determine the root cause. Starting/re-starting sssd resolves the issue until it fails again. Attempts to roll back to before the updates fail due to packages being unavailable for download. I’m providing what I’ve gathered, and will provide more upon request.

Thank you in advance to all who assist in resolving this issue.

systemctl status sssd
● sssd.service - System Security Services Daemon
Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2019-10-08 10:28:56 PDT; 18s ago
Process: 35204 ExecStart=/usr/sbin/sssd -i ${DEBUG_LOGGER} (code=exited, status=1/FAILURE)
Main PID: 35204 (code=exited, status=1/FAILURE)

Oct 08 10:28:29 valid host with FQDN removed sssd[nss][47054]: Starting up
Oct 08 10:28:29 valid host with FQDN removed sssd[be[valid domain removed]][47055]: Starting up
Oct 08 10:28:31 valid host with FQDN removed sssd[nss][47178]: Starting up
Oct 08 10:28:35 valid host with FQDN removed sssd[nss][47415]: Starting up
Oct 08 10:28:35 valid host with FQDN removed sssd[35204]: Exiting the SSSD. Could not restart critical service [nss].
Oct 08 10:28:56 valid host with FQDN removed sssd[be[valid domain removed]][47055]: Shutting down
Oct 08 10:28:56 valid host with FQDN removed sssd[pam][35207]: Shutting down
Oct 08 10:28:56 valid host with FQDN removed systemd[1]: sssd.service: main process exited, code=exited, status=1/FAILURE
Oct 08 10:28:56 valid host with FQDN removed systemd[1]: Unit sssd.service entered failed state.
Oct 08 10:28:56 valid host with FQDN removed systemd[1]: sssd.service failed.

From /var/log/messages:

Oct 8 09:40:09 mail systemd: Removed slice User Slice of apache.
Oct 8 09:45:01 mail systemd: Created slice User Slice of apache.
Oct 8 09:45:01 mail systemd: Started Session 354 of user apache.
Oct 8 09:45:04 mail systemd: Removed slice User Slice of apache.
Oct 8 09:50:01 mail systemd: Created slice User Slice of apache.
Oct 8 09:50:01 mail systemd: Started Session 355 of user apache.
Oct 8 09:50:27 mail systemd: Removed slice User Slice of apache.
Oct 8 09:53:55 mail sssd[be[valid domain removed]]: Shutting down
Oct 8 09:53:55 mail sssd[nss]: Shutting down
Oct 8 09:53:55 mail sssd[nss]: Starting up
Oct 8 09:53:55 mail sssd[be[valid domain removed]]: Starting up
Oct 8 09:53:57 mail sssd[nss]: Starting up
Oct 8 09:54:01 mail sssd[nss]: Starting up
Oct 8 09:54:01 mail sssd: Exiting the SSSD. Could not restart critical service [nss].
Oct 8 09:54:11 mail sssd[be[valid domain removed]]: Shutting down
Oct 8 09:54:11 mail sssd[pam]: Shutting down
Oct 8 09:54:11 mail systemd: sssd.service: main process exited, code=exited, status=1/FAILURE
Oct 8 09:54:11 mail systemd: Unit sssd.service entered failed state.
Oct 8 09:54:11 mail systemd: sssd.service failed.
Oct 8 09:55:01 mail systemd: Created slice User Slice of apache.
Oct 8 09:55:01 mail systemd: Started Session 356 of user apache.
Oct 8 09:55:03 mail systemd: Removed slice User Slice of apache.
Oct 8 09:56:51 mail systemd: Created slice User Slice of root.
Oct 8 09:56:51 mail systemd: Started Session c733 of user root.
Oct 8 09:56:51 mail systemd: Removed slice User Slice of root.

From /var/log/sssd/sssd_nss.log:

(Mon Oct 7 18:12:57 2019) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Mon Oct 7 18:12:57 2019) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Mon Oct 7 18:12:57 2019) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Tue Oct 8 09:53:55 2019) [sssd[nss]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Tue Oct 8 09:53:55 2019) [sssd[nss]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Tue Oct 8 09:53:55 2019) [sssd[nss]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Tue Oct 8 09:53:55 2019) [sssd[nss]] [nss_process_init] (0x0010): sss_process_init() failed
(Tue Oct 8 09:53:57 2019) [sssd[nss]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Tue Oct 8 09:53:57 2019) [sssd[nss]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Tue Oct 8 09:53:57 2019) [sssd[nss]] [nss_process_init] (0x0010): sss_process_init() failed
(Tue Oct 8 09:54:01 2019) [sssd[nss]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Tue Oct 8 09:54:01 2019) [sssd[nss]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Tue Oct 8 09:54:01 2019) [sssd[nss]] [nss_process_init] (0x0010): sss_process_init() failed
(Tue Oct 8 10:00:45 2019) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Tue Oct 8 10:00:45 2019) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Tue Oct 8 10:00:45 2019) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]

From yum history info:

Loaded plugins: changelog, fastestmirror, nethserver_events
Transaction ID : 13
Begin time : Sat Oct 5 14:26:25 2019
Begin rpmdb : 865:a3e5819d1f5b2136cb42d24dda0aecb7db479621
End time : 14:36:42 2019 (10 minutes)
End rpmdb : 866:8134c31d05de05f890d0f3e991ca90db1fda9746
User : System
Return-Code : Success
Transaction performed with:
Installed rpm-4.11.3-35.el7.x86_64 base
Installed yum-3.4.3-161.el7.centos.noarch base
Installed yum-plugin-fastestmirror-1.1.31-50.el7.noarch base
Packages Altered:
Updated bind-libs-32:9.9.4-73.el7_6.x86_64 updates
Update 32:9.9.4-74.el7_6.2.x86_64 ce-updates
Updated bind-libs-lite-32:9.9.4-73.el7_6.x86_64 updates
Update 32:9.9.4-74.el7_6.2.x86_64 ce-updates
Updated bind-license-32:9.9.4-73.el7_6.noarch updates
Update 32:9.9.4-74.el7_6.2.noarch ce-updates
Updated bind-utils-32:9.9.4-73.el7_6.x86_64 updates
Update 32:9.9.4-74.el7_6.2.x86_64 ce-updates
Updated certbot-0.31.0-2.el7.noarch epel
Update 0.38.0-1.el7.noarch epel
Updated clamav-0.101.2-1.el7.x86_64 epel
Update 0.101.4-1.el7.x86_64 epel
Updated clamav-filesystem-0.101.2-1.el7.noarch epel
Update 0.101.4-1.el7.noarch epel
Updated clamav-lib-0.101.2-1.el7.x86_64 epel
Update 0.101.4-1.el7.x86_64 epel
Obsoleted clamav-server-systemd-0.101.2-1.el7.x86_64 epel
Updated clamav-update-0.101.2-1.el7.x86_64 epel
Update 0.101.4-1.el7.x86_64 epel
Updated clamd-0.101.2-1.el7.x86_64 epel
Obsoleting clamd-0.101.4-1.el7.x86_64 epel
Updated curl-7.29.0-51.el7.x86_64 base
Update 7.29.0-51.el7_6.3.x86_64 ce-updates
Updated device-mapper-7:1.02.149-10.el7_6.7.x86_64 updates
Update 7:1.02.149-10.el7_6.8.x86_64 ce-updates
Updated device-mapper-event-7:1.02.149-10.el7_6.7.x86_64 updates
Update 7:1.02.149-10.el7_6.8.x86_64 ce-updates
Updated device-mapper-event-libs-7:1.02.149-10.el7_6.7.x86_64 updates
Update 7:1.02.149-10.el7_6.8.x86_64 ce-updates
Updated device-mapper-libs-7:1.02.149-10.el7_6.7.x86_64 updates
Update 7:1.02.149-10.el7_6.8.x86_64 ce-updates
Updated epel-release-7-11.noarch nethserver
Update 7-12.noarch epel
Updated evebox-0.9.0-1.x86_64 nethserver-base
Update 0.10.2-1.x86_64 nethserver-updates
Updated glib2-2.56.1-2.el7.x86_64 base
Update 2.56.1-4.el7_6.x86_64 ce-updates
Updated glibc-2.17-260.el7_6.5.x86_64 updates
Update 2.17-260.el7_6.6.x86_64 ce-updates
Updated glibc-common-2.17-260.el7_6.5.x86_64 updates
Update 2.17-260.el7_6.6.x86_64 ce-updates
Updated httpd-2.4.6-89.el7.centos.x86_64 updates
Update 2.4.6-89.el7.centos.1.x86_64 ce-updates
Updated httpd-tools-2.4.6-89.el7.centos.x86_64 updates
Update 2.4.6-89.el7.centos.1.x86_64 ce-updates
Install kernel-3.10.0-957.27.2.el7.x86_64 ce-updates
Updated kernel-tools-3.10.0-957.12.2.el7.x86_64 updates
Update 3.10.0-957.27.2.el7.x86_64 ce-updates
Updated kernel-tools-libs-3.10.0-957.12.2.el7.x86_64 updates
Update 3.10.0-957.27.2.el7.x86_64 ce-updates
Updated kexec-tools-2.0.15-21.el7_6.3.x86_64 updates
Update 2.0.15-21.el7_6.4.x86_64 ce-updates
Updated libcurl-7.29.0-51.el7.x86_64 base
Update 7.29.0-51.el7_6.3.x86_64 ce-updates
Updated libprelude-4.1.0-3.el7.x86_64 epel
Update 5.0.0-1.el7.x86_64 epel
Updated librsync-1.0.0-1.el7.x86_64 epel
Update 2.0.2-1.el7.x86_64 epel
Updated libsmbclient-4.8.3-4.el7.x86_64 base
Update 4.8.3-6.el7_6.x86_64 ce-updates
Updated libsodium-1.0.17-1.el7.x86_64 epel
Update 1.0.18-1.el7.x86_64 epel
Updated libssh2-1.4.3-12.el7_6.2.x86_64 updates
Update 1.4.3-12.el7_6.3.x86_64 ce-updates
Updated libteam-1.27-5.el7.x86_64 base
Update 1.27-6.el7_6.1.x86_64 ce-updates
Updated libwbclient-4.8.3-4.el7.x86_64 base
Update 4.8.3-6.el7_6.x86_64 ce-updates
Updated lvm2-7:2.02.180-10.el7_6.7.x86_64 updates
Update 7:2.02.180-10.el7_6.8.x86_64 ce-updates
Updated lvm2-libs-7:2.02.180-10.el7_6.7.x86_64 updates
Update 7:2.02.180-10.el7_6.8.x86_64 ce-updates
Updated microcode_ctl-2:2.1-47.2.el7_6.x86_64 updates
Update 2:2.1-47.5.el7_6.x86_64 ce-updates
Updated mod_ssl-1:2.4.6-89.el7.centos.x86_64 updates
Update 1:2.4.6-89.el7.centos.1.x86_64 ce-updates
Updated net-snmp-libs-1:5.7.2-37.el7.x86_64 base
Update 1:5.7.2-38.el7_6.2.x86_64 ce-updates
Updated nethserver-antivirus-1.2.2-1.ns7.noarch nethserver-updates
Update 1.3.1-1.ns7.noarch nethserver-updates
Updated nethserver-backup-config-2.3.1-1.ns7.noarch nethserver-updates
Update 2.4.0-1.ns7.noarch nethserver-updates
Updated nethserver-backup-data-1.5.3-1.ns7.noarch nethserver-updates
Update 1.6.2-1.ns7.noarch nethserver-updates
Updated nethserver-base-3.7.2-1.ns7.noarch nethserver-updates
Update 3.7.3-1.ns7.noarch nethserver-updates
Updated nethserver-cockpit-lib-0.6.0-1.ns7.noarch nethserver-updates
Update 0.15.1-1.ns7.noarch nethserver-updates
Updated nethserver-collectd-3.0.8-1.ns7.noarch nethserver-updates
Update 3.1.0-1.ns7.noarch nethserver-updates
Updated nethserver-duc-1.4.5-1.ns7.noarch nethserver-updates
Update 1.6.0-1.ns7.noarch nethserver-updates
Updated nethserver-ejabberd-1.4.0-1.ns7.noarch nethserver-updates
Update 1.4.1-1.ns7.noarch nethserver-updates
Updated nethserver-fail2ban-1.1.6-1.ns7.noarch nethserver-updates
Update 1.1.10-1.ns7.noarch nethserver-updates
Updated nethserver-firewall-base-3.6.1-1.ns7.noarch nethserver-updates
Update 3.6.6-1.ns7.noarch nethserver-updates
Updated nethserver-httpd-3.2.7-1.ns7.noarch nethserver-updates
Update 3.5.0-1.ns7.noarch nethserver-updates
Updated nethserver-lang-en-1.3.0-4.ns7.noarch nethserver-updates
Update 1.3.0-10.ns7.noarch nethserver-updates
Updated nethserver-mail-common-2.6.2-1.ns7.noarch nethserver-updates
Update 2.7.3-1.ns7.noarch nethserver-updates
Updated nethserver-mail-disclaimer-2.6.2-1.ns7.noarch nethserver-updates
Update 2.7.3-1.ns7.noarch nethserver-updates
Updated nethserver-mail-filter-2.6.2-1.ns7.noarch nethserver-updates
Update 2.7.3-1.ns7.noarch nethserver-updates
Updated nethserver-mail-quarantine-2.6.2-1.ns7.noarch nethserver-updates
Update 2.7.3-1.ns7.noarch nethserver-updates
Updated nethserver-mail-server-2.6.2-1.ns7.noarch nethserver-updates
Update 2.7.3-1.ns7.noarch nethserver-updates
Updated nethserver-mail-smarthost-2.6.2-1.ns7.noarch nethserver-updates
Update 2.7.3-1.ns7.noarch nethserver-updates
Updated nethserver-nextcloud-1.5.1-1.ns7.noarch nethserver-updates
Update 1.6.2-1.ns7.noarch nethserver-updates
Updated nethserver-ntopng-2.1.1-1.ns7.noarch nethserver-base
Update 2.1.3-1.ns7.noarch nethserver-updates
Updated nethserver-nut-1.3.2-1.ns7.noarch nethserver-updates
Update 1.4.1-1.ns7.noarch nethserver-updates
Updated nethserver-pulledpork-2.1.3-1.ns7.noarch nethserver-base
Update 2.1.4-1.ns7.noarch nethserver-updates
Updated nethserver-restore-data-1.3.0-1.ns7.noarch nethserver-updates
Update 1.4.2-1.ns7.noarch nethserver-updates
Updated nethserver-roundcubemail-1.3.0-1.ns7.noarch nethserver-updates
Update 1.3.2-1.ns7.noarch nethserver-updates
Updated nextcloud-16.0.1-1.el7.noarch nethserver-updates
Update 16.0.4-1.el7.noarch nethserver-updates
Updated pulledpork-0.7.3-1.el7.noarch epel
Update 0.7.3-5.ns7.noarch nethserver-updates
Updated python-2.7.5-77.el7_6.x86_64 updates
Update 2.7.5-80.el7_6.x86_64 ce-updates
Updated python-libs-2.7.5-77.el7_6.x86_64 updates
Update 2.7.5-80.el7_6.x86_64 ce-updates
Updated python-perf-3.10.0-957.12.2.el7.x86_64 updates
Update 3.10.0-957.27.2.el7.x86_64 ce-updates
Updated python2-acme-0.31.0-1.el7.noarch epel
Update 0.38.0-1.el7.noarch epel
Updated python2-certbot-0.31.0-2.el7.noarch epel
Update 0.38.0-1.el7.noarch epel
Dep-Install python2-distro-1.2.0-3.el7.noarch epel
Updated python2-josepy-1.1.0-1.el7.noarch nethserver
Update 1.2.0-1.el7.noarch epel
Updated samba-client-4.8.3-4.el7.x86_64 base
Update 4.8.3-6.el7_6.x86_64 ce-updates
Updated samba-client-libs-4.8.3-4.el7.x86_64 base
Update 4.8.3-6.el7_6.x86_64 ce-updates
Updated samba-common-4.8.3-4.el7.noarch base
Update 4.8.3-6.el7_6.noarch ce-updates
Updated samba-common-libs-4.8.3-4.el7.x86_64 base
Update 4.8.3-6.el7_6.x86_64 ce-updates
Updated samba-common-tools-4.8.3-4.el7.x86_64 base
Update 4.8.3-6.el7_6.x86_64 ce-updates
Updated samba-libs-4.8.3-4.el7.x86_64 base
Update 4.8.3-6.el7_6.x86_64 ce-updates
Updated selinux-policy-3.13.1-229.el7_6.12.noarch updates
Update 3.13.1-229.el7_6.15.noarch ce-updates
Updated selinux-policy-targeted-3.13.1-229.el7_6.12.noarch updates
Update 3.13.1-229.el7_6.15.noarch ce-updates
Updated suricata-4.1.4-1.el7.x86_64 epel
Update 4.1.4-3.el7.x86_64 epel
Updated systemd-219-62.el7_6.6.x86_64 updates
Update 219-62.el7_6.9.x86_64 ce-updates
Updated systemd-libs-219-62.el7_6.6.x86_64 updates
Update 219-62.el7_6.9.x86_64 ce-updates
Updated systemd-python-219-62.el7_6.6.x86_64 updates
Update 219-62.el7_6.9.x86_64 ce-updates
Updated systemd-sysv-219-62.el7_6.6.x86_64 updates
Update 219-62.el7_6.9.x86_64 ce-updates
Updated teamd-1.27-5.el7.x86_64 base
Update 1.27-6.el7_6.1.x86_64 ce-updates
Updated tuned-2.10.0-6.el7_6.3.noarch updates
Update 2.10.0-6.el7_6.4.noarch ce-updates
Updated tzdata-2019a-1.el7.noarch updates
Update 2019b-1.el7.noarch ce-updates
Updated vim-minimal-2:7.4.160-5.el7.x86_64 base
Update 2:7.4.160-6.el7_6.x86_64 ce-updates
Updated zstd-1.4.0-1.el7.x86_64 epel
Update 1.4.2-1.el7.x86_64 epel

Maybe @nrauso or @filippo_carletti can tell us more…

In the meantime, clearing the sss cache or purging the sssd state directories could help. You will need to look up the exact commands.
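
For anyone landing here later, this is a rough sketch of what those two steps usually look like on a CentOS 7 based box. It assumes the stock cache location /var/lib/sss/db and that sss_cache is installed alongside SSSD. The run helper and the DRY_RUN variable are made up for this example so nothing is touched until you opt in.

```shell
#!/bin/bash
# Sketch: two ways to reset SSSD's cached data, mildest first.
# "run" and DRY_RUN are hypothetical helpers for this example only:
# commands are printed, not executed, unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Mark every cached entry as expired so it is refreshed on next lookup.
invalidate_cache() {
    run sss_cache -E
}

# Stop sssd, delete its on-disk cache files, then start it again.
# Assumption: the cache lives in /var/lib/sss/db (the usual default).
purge_cache() {
    run systemctl stop sssd
    run find /var/lib/sss/db -type f -name '*.ldb' -delete
    run systemctl start sssd
}

# Example (as root): DRY_RUN=0 invalidate_cache
```

Try invalidate_cache first. purge_cache is the heavier option, and SSSD should recreate the files it needs on the next start.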

I think that @nrauso has always linked these SSSD problems to system load/slowness.
IIRC, Red Hat was developing a fix, but I can’t find any reference right now.
@EmpireSystems could we exclude system load problems? I can’t see anything directly related to SSSD in the list of packages that got updated.
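
One way to test the load theory is to count how often sssd gave up per hour and compare that with load graphs or cron schedules. This is only a sketch: it assumes the traditional syslog layout seen in the /var/log/messages excerpt above (month, day, time, host), and count_sssd_failures is a name invented here.

```shell
#!/bin/bash
# Count "Could not restart critical service" events per day and hour.
# Assumes lines start with "Mon DD HH:MM:SS", as in /var/log/messages.
count_sssd_failures() {
    local logfile="${1:-/var/log/messages}"
    grep -F 'Could not restart critical service' "$logfile" |
        awk '{ split($3, t, ":"); print $1, $2, t[1] ":00" }' |
        sort | uniq -c
}

# Example: count_sssd_failures /var/log/messages
```

If the failures cluster around the same hours every day, a scheduled job (backup, antivirus update, log rotation) would be a reasonable suspect.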

Thanks guys. I have noticed higher loads than prior to the updates, with top showing the following as the most frequent offenders, in no particular order:

lmtp
fail2ban-server
redis-server
sssd_be
slapd
collectd
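
Since top only shows a moving snapshot, a per-command total can make a list like this easier to compare over time. A minimal sketch, with sum_cpu_by_command being a name made up for this example. Note that the %cpu figure from ps is averaged over each process's lifetime, so it will not match top's instantaneous numbers.

```shell
#!/bin/bash
# Sum the CPU percentage of all processes sharing a command name.
# Reads "PCPU COMMAND" lines on stdin and prints totals, largest first.
sum_cpu_by_command() {
    awk '{ cpu = $1; $1 = ""; sub(/^ +/, ""); total[$0] += cpu }
         END { for (name in total) printf "%.1f %s\n", total[name], name }' |
        sort -rn
}

# Example: ps -eo pcpu=,comm= | sum_cpu_by_command | head -n 10
```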

For reference, the server is a Dell PE R430
2x Xeon E5-2609 @1.7GHz for a total of 16 cores/threads
32GB ECC RAM
2x 2TB in MD mirror - no sync issues

Supporting 115 users across 10 locations within the same company/domain (Internet, not AD).
Previous uptime prior to restart was 129 days, without issue.

thx,
jim

Yes, I can confirm: that kind of issue, rare but known, seems to be strongly connected with system load/slowness.
A good workaround is to increase the number of times the sssd services retry connecting to the backend. There is a specific option you can add to the sssd.conf file under the [nss] section:

reconnection_retries (integer)
           Number of times services should attempt to reconnect in the event of a Data Provider crash or restart before they give up

           Default: 3

So you can change this value to 10, for example.

Keep in mind that the sssd.conf file is generated from a template, so the value has to be changed with a custom template fragment, like this:

mkdir -p /etc/e-smith/templates-custom/etc/sssd/sssd.conf
echo -e "reconnection_retries = 10\n" > /etc/e-smith/templates-custom/etc/sssd/sssd.conf/31nss_retryes
signal-event nethserver-sssd-save

It seems to significantly mitigate the issue.
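
To double-check that the fragment really ended up under [nss] in the generated file (this depends on where the template ordering places it, which I am assuming here rather than stating), something like the following can help. ini_value is a helper written for this post, not part of NethServer or SSSD.

```shell
#!/bin/bash
# Print the value of KEY inside [SECTION] of an INI-style file.
# Usage: ini_value FILE SECTION KEY
ini_value() {
    awk -v section="[$2]" -v key="$3" '
        # A line starting with "[" opens a new section.
        /^[[:space:]]*\[/ { in_section = ($1 == section); next }
        in_section && index($0, "=") {
            name = substr($0, 1, index($0, "=") - 1)
            gsub(/[[:space:]]/, "", name)
            if (name == key) {
                value = substr($0, index($0, "=") + 1)
                gsub(/^[[:space:]]+|[[:space:]]+$/, "", value)
                print value
            }
        }
    ' "$1"
}

# Example (as root): ini_value /etc/sssd/sssd.conf nss reconnection_retries
```

If the example prints nothing, the option is either missing or sits in a different section, and the fragment name or number would need adjusting.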

Thanks for the tip. I have applied the suggested template change, which seems to have helped a bit, but I am still seeing issues. Something is obviously causing an excessive load on the server: I’m seeing load averages of 2-3, with spikes to 5 and 6. This is with fail2ban currently disabled because of the auth errors caused by sssd. I don’t recall ever seeing the load hit 2 prior to the update.

Does anyone have thoughts on the excessive load? I don’t believe this box should be struggling to support 100 users with those specs?

Which kind of disk are you using?
Also: which services are you providing via this server?

2x 2TB Dell “Enterprise” SATA - Seagate 7200s
Standard e-mail services - SMTP & IMAP accessed via Outlook on the Desktop. A handful of users access via their phones.

Disk access could be part of the issue, given the large volume of read and write requests generated by redis (rspamd), dovecot, and fail2ban. My current suggestion is to raise the number of retries to 20, but a performance/load analysis of the server should be done, IMVHO.
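
As a first step of such an analysis, it can help to put the load average in relation to the core count. On Linux the load average also counts processes stuck waiting on disk, so a low per-core figure on a machine that still feels slow would point towards I/O rather than CPU. A small sketch (load_per_core is a made-up name; the optional arguments exist only so it can be tried with fixed numbers):

```shell
#!/bin/bash
# Print the 1-minute load average divided by the number of CPU cores.
# Usage: load_per_core [LOAD] [CORES]
# Without arguments it reads /proc/loadavg and asks nproc.
load_per_core() {
    local load="${1:-$(cut -d ' ' -f 1 /proc/loadavg)}"
    local cores="${2:-$(nproc)}"
    awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}

# Example: load_per_core        # live values
#          load_per_core 4 16   # a load of 4 on 16 cores
```

Values well below 1.00 suggest the CPUs are not saturated, in which case the iowait column of top or a tool like iostat would be the next thing to look at.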

I would agree, but this system has been in production for almost a year in this configuration without issue. I didn’t see anything in the updates that could cause disk issues, but will revisit. Thank you.

Hi @EmpireSystems.

Note upfront: I’m not a system administrator by any means; I only have a little experience in “weight watching” embedded ux/bsd systems.
There, ps (more precisely, ps -aux) is my friend for tracking down the resource hogs,
for instance:

CPU usage, highest first:
ps -aux --sort=-%cpu | head

Memory usage, highest first:
ps -aux --sort=-%mem | head

(As always, see man ps for further reading.)

EDIT

Or simply install a tool like htop to watch what is taking up resources, or use any other (web-based) tool of your liking.