Sssd trouble (initially)

jaapvdv · May 13, 2018, 7:51am

NethServer Version: 7.4.1708
Module: sssd, mail

Last Friday, after quite a bit of testing I finally migrated my Nethserver 6.9 mailserver (ESXi VM) to a new Nethserver 7.4 instance (ProxMox VM). I did so after configuring the new server from scratch, syncing email using imapsync and backing up and restoring calendar and address book data using sogo-tool. All without problems.

After bringing the new server online and users started to connect I quickly noticed that sogo showed empty email for all users. Users could login to sogo but sogo could not connect to the mailserver . . .
After checking services I noticed sssd was not running.
Restarting sssd did not help, I decided to reboot the server.

After rebooting, mysql, postfix, postgresql and sogod were not started!

After manually starting these services, users were able to connect until I noticed the following in maillog:

May 12 16:21:48 mail dovecot: master: Warning: service(imap-login): process_limit (400) reached, client connections are being dropped

Upon that I increased the dovecot process limit to 800 using the command:

db configuration setprop dovecot MaxProcesses 800

After that users where again able to connect to the server.

(I also increased the sogo WOWorkerscount to 50, SOGoMaximumSyncInterval to 3540
and SOGoInternalSyncInterval to 30)

This morning I noticed again that sssd was not running and sssd_domain.log showing the following entries:

(Sun May 13 02:43:26 2018) [sssd[be[domain]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Sun May 13 02:49:20 2018) [sssd[be[domain]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Sun May 13 02:54:04 2018) [sssd[be[domain]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Sun May 13 02:54:04 2018) [sssd[be[domain]]] [ldb] (0x0010): A transaction is still active in ldb context [0x5622e608bbc0] on /var/lib/sss/db/cache_eisergon.nl.ldb
(Sun May 13 02:56:25 2018) [sssd[be[domain]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Sun May 13 03:02:48 2018) [sssd[be[domain]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Sun May 13 03:03:58 2018) [sssd[be[domain]]] [orderly_shutdown] (0x0010): SIGTERM: killing children

With now and then the following entry showing up:

(Sun May 13 08:55:36 2018) [sssd[be[doamin]]] [id_callback] (0x0010): The Monitor returned an error [org.freedesktop.DBus.Error.NoReply]

sssd_nss.log shows the following entries:

(Sun May 13 08:20:30 2018) [sssd[nss]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Sun May 13 08:20:30 2018) [sssd[nss]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Sun May 13 08:20:30 2018) [sssd[nss]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Sun May 13 08:20:30 2018) [sssd[nss]] [nss_process_init] (0x0010): sss_process_init() failed
(Sun May 13 08:20:32 2018) [sssd[nss]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Sun May 13 08:20:32 2018) [sssd[nss]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Sun May 13 08:20:32 2018) [sssd[nss]] [nss_process_init] (0x0010): sss_process_init() failed
(Sun May 13 08:20:36 2018) [sssd[nss]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Sun May 13 08:20:36 2018) [sssd[nss]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Sun May 13 08:20:36 2018) [sssd[nss]] [nss_process_init] (0x0010): sss_process_init() failed
(Sun May 13 08:42:28 2018) [sssd[nss]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Sun May 13 08:42:57 2018) [sssd[nss]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Sun May 13 08:51:44 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Sun May 13 08:51:44 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Sun May 13 08:51:45 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Sun May 13 08:51:46 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Sun May 13 08:51:46 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Sun May 13 08:51:55 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Sun May 13 08:51:57 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Sun May 13 08:51:57 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Sun May 13 08:52:08 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]

Again starting sssd does not help, rebooting does, but postfix and sogod did not start at boot. Started them manually and for now the server seems to be running okay. But for how long . . .

This VM was running stable for quite a while (weeks) before taking it into production. It seems sssd trouble is the root cause, anyone an idea on how to solve this? And make NS7 run smooth for at least two years, as NS6 (starting at 6.7) did?

davidep · May 13, 2018, 8:35am

What is your account provider? As you’re upgrading ns6 you should be aware of

The recent upstream release 7.5.1804 complicates the things during these days…

jaapvdv · May 13, 2018, 8:40am

Not upgrading, I migrated, but the account provider is local LDAP. And I am not using 7.5.

davidep · May 13, 2018, 8:45am

What is the openldap-servers version?

rpm -q openldap-servers

jaapvdv · May 13, 2018, 8:54am

openldap-servers-2.4.44-5.el7.x86_64

davidep · May 13, 2018, 8:57am

It could be ok!

…but this one leads to a connection problem with slapd. Look at

 journalctl -u slapd

jaapvdv · May 13, 2018, 9:00am

– Logs begin at Sun 2018-05-13 08:48:37 CEST, end at Sun 2018-05-13 11:00:01 CEST. –
May 13 08:51:46 mail.eisergon.nl systemd[1]: Starting OpenLDAP Server Daemon…
May 13 08:52:00 mail.eisergon.nl runuser[1197]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
May 13 08:52:23 mail.eisergon.nl runuser[1197]: pam_unix(runuser:session): session closed for user ldap
May 13 08:52:23 mail.eisergon.nl runuser[1687]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
May 13 08:52:23 mail.eisergon.nl runuser[1687]: pam_unix(runuser:session): session closed for user ldap
May 13 08:52:23 mail.eisergon.nl runuser[1691]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
May 13 08:52:23 mail.eisergon.nl runuser[1691]: pam_unix(runuser:session): session closed for user ldap
May 13 08:52:23 mail.eisergon.nl runuser[1693]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
May 13 08:52:23 mail.eisergon.nl runuser[1695]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
May 13 08:52:23 mail.eisergon.nl runuser[1695]: pam_unix(runuser:session): session closed for user ldap
May 13 08:52:23 mail.eisergon.nl runuser[1697]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
May 13 08:52:23 mail.eisergon.nl runuser[1699]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
May 13 08:52:23 mail.eisergon.nl runuser[1699]: pam_unix(runuser:session): session closed for user ldap
May 13 08:52:23 mail.eisergon.nl runuser[1701]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
May 13 08:52:23 mail.eisergon.nl runuser[1701]: pam_unix(runuser:session): session closed for user ldap
May 13 08:52:23 mail.eisergon.nl runuser[1703]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
May 13 08:52:24 mail.eisergon.nl runuser[1703]: pam_unix(runuser:session): session closed for user ldap
May 13 08:52:24 mail.eisergon.nl runuser[1713]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
May 13 08:52:24 mail.eisergon.nl runuser[1713]: pam_unix(runuser:session): session closed for user ldap
May 13 08:52:24 mail.eisergon.nl runuser[1715]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
May 13 08:52:24 mail.eisergon.nl runuser[1715]: pam_unix(runuser:session): session closed for user ldap
May 13 08:52:24 mail.eisergon.nl runuser[1717]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
May 13 08:52:24 mail.eisergon.nl runuser[1717]: pam_unix(runuser:session): session closed for user ldap
May 13 08:52:24 mail.eisergon.nl runuser[1719]: pam_unix(runuser:session): session opened for user ldap by (uid=0)
May 13 08:52:24 mail.eisergon.nl runuser[1719]: pam_unix(runuser:session): session closed for user ldap
May 13 08:52:24 mail.eisergon.nl slapd[1722]: @(#) $OpenLDAP: slapd 2.4.44 (Aug 4 2017 14:23:27) $
mockbuild@c1bm.rdu2.centos.org:/builddir/build/BUILD/openldap-2.4.44/openldap-2.4.44/servers/slapd
May 13 08:52:29 mail.eisergon.nl systemd[1]: Started OpenLDAP Server Daemon.

jaapvdv · May 13, 2018, 1:44pm

Well, just 15 minutes ago, again authentication errors.

auth: Error: auth worker: Aborted request: Lookup timed out
Exiting the SSSD. Could not restart critical service [nss].
pam_sss(crond:session): Request to sssd failed. Connection refused

restarted dovecot, postfix, sogod and sssd

seemed to work at first, but right now I am again suffering authentication errors.

Will reboot and see what happens.

jaapvdv · May 14, 2018, 11:03am

Just took the 7.4 mailserver offline and reverted to the 6.9 vm.

Somehow I am having no success with Nethserver 7.