NS7 sucks - NS8 is not yet usable

For some time now, I have been seeing more and more problems with my NethServer 7, which has been running in production for about 3 years. The most important module for me is DokuWiki.
Issues:

  1. the users and groups keep disappearing (Account provider generic error: SSSD exit code 1)
    /var/log/messages:
    Jan 4 16:52:45 daho-nethserver sssd: tkey query failed: GSSAPI error: Major = Unspecified GSS failure. Minor code may provide more information, Minor = Server not found in Kerberos database.
  2. backups fail again and again
  3. brand new: Cockpit no longer starts; the old Server Manager on port 980 is still running
  4. the root partition is 80% full, as the Zabbix database continues to grow

What I did to address the issues:

  1. DC issues
  • systemctl restart nsdc - brings back the users and groups, but not permanently
  • Restore a config backup - brings back the users and groups, but not permanently
  • yum reinstall nethserver-dc - my last intervention a few minutes ago; I don't know the long-term result.

  2. Backup issue
    restic unlock -r /mnt/backup-dokuwiki resolves the locks, but not permanently. Next day, next lock.

  3. Cockpit: no solution. I use the old Server Manager.

  4. Zabbix DB:
    I have tried to shrink the DB, but have not been able to complete the procedure.

BTW:
I thought it would be a good idea not to put much energy into the aging installation and to migrate to NS8 instead, since DokuWiki exists there as a module.
My attempt to migrate from NS7 to NS8 has failed so far because I could not restore the DokuWiki content. The support I have received here in the forum has not solved it yet.

My first priority is to get access to Cockpit again. The other errors annoy me and make me nervous about overall stability, especially as they keep increasing.

If it were possible to quickly restore the DokuWiki content in the NS8 installation, I would not put any more effort into the NS7 server and shut it down.
Either way, I need further support with both.

Sincerely, Marko

This is most likely a problem with the default DNS for the DC. I have faced this problem almost 20 times, and each time I learned something new.
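One way to test the DNS theory is to ask a specific DNS server for the SRV records that Active Directory clients look up. The following is only a sketch: the function names are made up for illustration, the domain and server address in the example are placeholders (not values from this thread), and it assumes dig (package bind-utils) is installed.

```shell
#!/bin/bash
# Hypothetical helpers, not part of NethServer.

# Print the SRV record names an AD client typically resolves for a domain.
ad_srv_names() {
    local domain="$1"
    printf '%s\n' "_kerberos._tcp.${domain}" "_ldap._tcp.${domain}"
}

# Query each SRV name against one specific DNS server.
# Both arguments are placeholders that you have to supply yourself.
check_dc_dns() {
    local domain="$1" server="$2" name
    for name in $(ad_srv_names "$domain"); do
        echo "== ${name} via ${server}"
        dig +short @"${server}" "${name}" SRV
    done
}

# Example with placeholder values:
#   check_dc_dns ad.example.org 192.0.2.1
```

If the DC answers these queries but the DNS server the host normally uses returns nothing, that would point in the direction of the DNS setup rather than the DC itself.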

As for DokuWiki, kindly confirm whether the module has an NS7 to NS8 migration.

When the locks occur, is a related restic/backup process running in the background?
To get some info on the lock:

restic list locks
restic cat lock your-lock-id  # can give the PID holding the lock

Is prune configured, and how often does it run? The prune command locks the repository exclusively, preventing other processes from accessing it while pruning.

More info from the logs would help to understand the root cause.
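The two commands above can be combined into a small loop that shows, for every lock, whether the process holding it is still alive. This is a rough sketch under a few assumptions: the repository path is the one from the first post, the repository password is already provided through the environment, the lock JSON contains a "pid" field, and the check runs as root on the same host that created the lock. The function names are invented for this example.

```shell
#!/bin/bash
# Assumption: repository path taken from the first post; override with REPO=...
REPO="${REPO:-/mnt/backup-dokuwiki}"

# Extract the numeric "pid" value from the JSON printed by `restic cat lock`.
lock_pid() {
    sed -n 's/.*"pid"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print "running" if the PID exists on this host, otherwise "stale".
# Only meaningful as root and on the host that created the lock.
pid_state() {
    if kill -0 "$1" 2>/dev/null; then
        echo running
    else
        echo stale
    fi
}

# List all locks with their PID and state.
report_locks() {
    local id pid
    for id in $(restic -r "$REPO" list locks); do
        pid=$(restic -r "$REPO" cat lock "$id" | lock_pid)
        if [ -n "$pid" ]; then
            echo "$id pid=$pid $(pid_state "$pid")"
        else
            echo "$id pid=unknown"
        fi
    done
}

# Usage: report_locks
```

A lock reported as stale would suggest that a backup or prune run was interrupted; a lock reported as running would suggest that two jobs overlap.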


At the moment, there are no locks. I will check again tomorrow.

I set up the backup via the Cockpit GUI. If prune is not configured there, then it is not. I have never heard of this before.

I have been using OPNSense as my default DNS server for years. I have not changed the DNS configuration.

Didn't work for me: Unsuccessful NS8 migration of a simple NS7 server - #23 by davidep

And this one: NS8: Dokuwiki configuration

systemctl -l status cockpit-user.socket cockpit.socket cockpit
# systemctl -l status cockpit-user.socket cockpit.socket cockpit
● cockpit-user.socket - Cockpit Web Service Socket for Users
   Loaded: loaded (/usr/lib/systemd/system/cockpit-user.socket; enabled; vendor preset: disabled)
   Active: active (listening) since Thu 2024-01-04 16:48:55 CET; 3h 6min ago
     Docs: man:cockpit-ws(8)
   Listen: [::]:9191 (Stream)

Jan 04 16:48:55 daho-nethserver.home.dargels.de systemd[1]: Listening on Cockpit Web Service Socket for Users.

● cockpit.socket - Cockpit Web Service Socket
   Loaded: loaded (/usr/lib/systemd/system/cockpit.socket; enabled; vendor preset: disabled)
   Active: active (listening) since Thu 2024-01-04 16:48:55 CET; 3h 6min ago
     Docs: man:cockpit-ws(8)
   Listen: [::]:9090 (Stream)

Jan 04 16:48:55 daho-nethserver.home.dargels.de systemd[1]: Starting Cockpit Web Service Socket.
Jan 04 16:48:55 daho-nethserver.home.dargels.de systemd[1]: Listening on Cockpit Web Service Socket.

● cockpit.service - Cockpit Web Service
   Loaded: loaded (/usr/lib/systemd/system/cockpit.service; static; vendor preset: disabled)
  Drop-In: /etc/systemd/system/cockpit.service.d
           └─nethserver.conf
   Active: inactive (dead) since Thu 2024-01-04 16:58:02 CET; 2h 57min ago
     Docs: man:cockpit-ws(8)
  Process: 5703 ExecStart=/usr/libexec/cockpit-ws (code=exited, status=0/SUCCESS)
  Process: 5699 ExecStartPre=/usr/sbin/remotectl certificate --ensure --user=root --group=cockpit-ws --selinux-type=etc_t (code=exited, status=0/SUCCESS)
 Main PID: 5703 (code=exited, status=0/SUCCESS)

Jan 04 16:56:32 daho-nethserver.home.dargels.de systemd[1]: Starting Cockpit Web Service...
Jan 04 16:56:32 daho-nethserver.home.dargels.de remotectl[5699]: /usr/bin/chcon: can't apply partial context to unlabeled file ‘/etc/cockpit/ws-certs.d/99-nethserver.cert’
Jan 04 16:56:32 daho-nethserver.home.dargels.de remotectl[5699]: remotectl: couldn't change SELinux type context 'etc_t' for certificate: /etc/cockpit/ws-certs.d/99-nethserver.cert: Child process exited with code 1
Jan 04 16:56:32 daho-nethserver.home.dargels.de systemd[1]: Started Cockpit Web Service.
Jan 04 16:56:32 daho-nethserver.home.dargels.de cockpit-ws[5703]: Using certificate: /etc/cockpit/ws-certs.d/99-nethserver.cert

I seem to recall that the same SELinux message regarding Cockpit is logged without causing problems, so no concern here.

You might have to dig deeper (or try to start cockpit service and check status again) to find something relevant preventing cockpit from starting. Otherwise:

journalctl -u cockpit
-- Logs begin at Thu 2024-01-04 16:48:44 CET, end at Thu 2024-01-04 21:28:54 CET. --
Jan 04 16:49:11 daho-nethserver.home.dargels.de systemd[1]: Starting Cockpit Web Service...
Jan 04 16:49:11 daho-nethserver.home.dargels.de remotectl[2438]: /usr/bin/chcon: can't apply partial context to unlabele
Jan 04 16:49:11 daho-nethserver.home.dargels.de remotectl[2438]: remotectl: couldn't change SELinux type context 'etc_t'
Jan 04 16:49:11 daho-nethserver.home.dargels.de systemd[1]: Started Cockpit Web Service.
Jan 04 16:49:11 daho-nethserver.home.dargels.de cockpit-ws[2441]: Using certificate: /etc/cockpit/ws-certs.d/99-nethserv
Jan 04 16:56:32 daho-nethserver.home.dargels.de systemd[1]: Starting Cockpit Web Service...
Jan 04 16:56:32 daho-nethserver.home.dargels.de remotectl[5699]: /usr/bin/chcon: can't apply partial context to unlabele
Jan 04 16:56:32 daho-nethserver.home.dargels.de remotectl[5699]: remotectl: couldn't change SELinux type context 'etc_t'
Jan 04 16:56:32 daho-nethserver.home.dargels.de systemd[1]: Started Cockpit Web Service.
Jan 04 16:56:32 daho-nethserver.home.dargels.de cockpit-ws[5703]: Using certificate: /etc/cockpit/ws-certs.d/99-nethserv
lines 1-11/11 (END)

Not relevant (I think): it didn't provide additional info on the problem that happened later (16:58:xx).

Maybe a service restart and checking its status afterwards could be a faster way to get relevant errors.
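A minimal sketch of that restart-and-check approach follows. It restarts the service, prints its state, and shows the journal since the restart without the pager (so lines are not cut off at the terminal width), while hiding the two SELinux/remotectl messages that are assumed to be harmless here. The function names and the two-second wait are arbitrary choices for this example.

```shell
#!/bin/bash
# Hide the two remotectl/chcon messages considered harmless in this thread.
drop_known_noise() {
    grep -v -e "can't apply partial context" \
            -e "couldn't change SELinux type context"
}

# Restart Cockpit, then print its state and the full journal since the restart.
restart_and_show() {
    local since
    since=$(date '+%Y-%m-%d %H:%M:%S')
    systemctl restart cockpit
    sleep 2   # arbitrary pause so the service has time to log
    systemctl is-active cockpit
    journalctl -u cockpit --since "$since" --no-pager -l | drop_known_noise
}

# Usage (as root): restart_and_show
```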

After rebooting the server:

-- Logs begin at Fri 2024-01-05 10:12:34 CET, end at Fri 2024-01-05 10:15:03 CET. --
Jan 05 10:13:00 daho-nethserver.home.dargels.de systemd[1]: Starting Cockpit Web Service...
Jan 05 10:13:00 daho-nethserver.home.dargels.de systemd[1]: Started Cockpit Web Service.
Jan 05 10:13:00 daho-nethserver.home.dargels.de cockpit-ws[2443]: Using certificate: /etc/cockpit/ws-certs.d/99-nethserv
lines 1-4/4 (END)

After restarting the service:

Jan 05 10:17:35 daho-nethserver.home.dargels.de systemd[1]: Starting Cockpit Web Service...
Jan 05 10:17:35 daho-nethserver.home.dargels.de remotectl[3055]: /usr/bin/chcon: can't apply partial context to unlabele
Jan 05 10:17:35 daho-nethserver.home.dargels.de remotectl[3055]: remotectl: couldn't change SELinux type context 'etc_t'
Jan 05 10:17:35 daho-nethserver.home.dargels.de systemd[1]: Started Cockpit Web Service.
Jan 05 10:17:35 daho-nethserver.home.dargels.de cockpit-ws[3058]: Using certificate: /etc/cockpit/ws-certs.d/99-nethserv
lines 1-9/9 (END)

I cannot see any new relevant information.

Checking this puts me on the right track…

Jan  5 10:15:03 daho-nethserver sshd[2679]: Accepted keyboard-interactive/pam for root from 10.99.3.2 port 63649 ssh2

10.99.3.2 is my IP when I'm connected to the OPNSense VPN.

I disconnected and, lo and behold, Cockpit starts.
Cockpit was accessible for only a few seconds, just enough time to add 10.99.3.0/255.255.255.0 to the trusted networks; then the session was closed.

I have to find out why I keep getting kicked out, even when I'm not on the VPN.


Users and groups are lost again. :frowning:


Did you check the system memory usage? The following commands should work on NS7 too

Yes, no abnormalities.

[root@daho-nethserver ~]# dmesg | grep "Out of memory"
[root@daho-nethserver ~]# dmesg | grep oom-killer
[root@daho-nethserver ~]#
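Since the kernel ring buffer read by dmesg is emptied at every reboot, the same search can also be run against /var/log/messages. The helper below is a small sketch that combines the two patterns used above, adds "killed process" as a third, and ignores case; the function name is made up for this example.

```shell
#!/bin/bash
# Keep only lines that mention the OOM killer (case-insensitive).
# Reads stdin when called without arguments, otherwise the given files.
oom_lines() {
    grep -iE 'out of memory|oom-killer|killed process' "$@"
}

# Usage:
#   dmesg | oom_lines
#   oom_lines /var/log/messages
#   free -m      # current memory and swap usage
```

Empty output from both sources would support the conclusion that memory pressure is not the cause.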