NethServer Version: 7.4.1708
Module: sssd, mail
Last Friday, after quite a bit of testing I finally migrated my Nethserver 6.9 mailserver (ESXi VM) to a new Nethserver 7.4 instance (ProxMox VM). I did so after configuring the new server from scratch, syncing email using imapsync and backing up and restoring calendar and address book data using sogo-tool. All without problems.
After bringing the new server online and users started to connect I quickly noticed that sogo showed empty email for all users. Users could login to sogo but sogo could not connect to the mailserver . . .
After checking services I noticed sssd was not running.
Restarting sssd did not help, I decided to reboot the server.
After rebooting, mysql, postfix, postgresql and sogod were not started!
After manually starting these services, users were able to connect until I noticed the following in maillog:
May 12 16:21:48 mail dovecot: master: Warning: service(imap-login): process_limit (400) reached, client connections are being dropped
Upon that I increased the dovecot process limit to 800 using the command:
db configuration setprop dovecot MaxProcesses 800
After that users where again able to connect to the server.
(I also increased the sogo WOWorkerscount to 50, SOGoMaximumSyncInterval to 3540
and SOGoInternalSyncInterval to 30)
This morning I noticed again that sssd was not running and sssd_domain.log showing the following entries:
(Sun May 13 02:43:26 2018) [sssd[be[domain]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Sun May 13 02:49:20 2018) [sssd[be[domain]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Sun May 13 02:54:04 2018) [sssd[be[domain]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Sun May 13 02:54:04 2018) [sssd[be[domain]]] [ldb] (0x0010): A transaction is still active in ldb context [0x5622e608bbc0] on /var/lib/sss/db/cache_eisergon.nl.ldb
(Sun May 13 02:56:25 2018) [sssd[be[domain]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Sun May 13 03:02:48 2018) [sssd[be[domain]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Sun May 13 03:03:58 2018) [sssd[be[domain]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
With now and then the following entry showing up:
(Sun May 13 08:55:36 2018) [sssd[be[doamin]]] [id_callback] (0x0010): The Monitor returned an error [org.freedesktop.DBus.Error.NoReply]
sssd_nss.log shows the following entries:
(Sun May 13 08:20:30 2018) [sssd[nss]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Sun May 13 08:20:30 2018) [sssd[nss]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Sun May 13 08:20:30 2018) [sssd[nss]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Sun May 13 08:20:30 2018) [sssd[nss]] [nss_process_init] (0x0010): sss_process_init() failed
(Sun May 13 08:20:32 2018) [sssd[nss]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Sun May 13 08:20:32 2018) [sssd[nss]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Sun May 13 08:20:32 2018) [sssd[nss]] [nss_process_init] (0x0010): sss_process_init() failed
(Sun May 13 08:20:36 2018) [sssd[nss]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Sun May 13 08:20:36 2018) [sssd[nss]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Sun May 13 08:20:36 2018) [sssd[nss]] [nss_process_init] (0x0010): sss_process_init() failed
(Sun May 13 08:42:28 2018) [sssd[nss]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Sun May 13 08:42:57 2018) [sssd[nss]] [orderly_shutdown] (0x0010): SIGTERM: killing children
(Sun May 13 08:51:44 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Sun May 13 08:51:44 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Sun May 13 08:51:45 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Sun May 13 08:51:46 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Sun May 13 08:51:46 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Sun May 13 08:51:55 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Sun May 13 08:51:57 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Sun May 13 08:51:57 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Sun May 13 08:52:08 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
Again starting sssd does not help, rebooting does, but postfix and sogod did not start at boot. Started them manually and for now the server seems to be running okay. But for how long . . .
This VM was running stable for quite a while (weeks) before taking it into production. It seems sssd trouble is the root cause, anyone an idea on how to solve this? And make NS7 run smooth for at least two years, as NS6 (starting at 6.7) did?