One of our NethServers crashed twice in the last few days. This only started after I reinstalled its hypervisor and changed from ext4 to ZFS, and both times it happened during a backup of the VM to Proxmox Backup Server, so I first suspected the VM crashed because of insufficient RAM. But as it happened again yesterday after I limited the RAM usage of ZFS, I want to try reinstalling all installed packages including their dependencies, to be sure there is no corrupted piece of software that could cause these random panics. If the problem persists, the next step will be to learn how to capture a kernel dump with kdump/kexec and examine it, or to ask here for further advice on where to look for information that could reveal the reason for the crash. /var/log/messages did not show anything.
Speaking of /var/log/messages: I see tons of useless messages like the one below. How can those be suppressed?
Nov 3 17:31:01 hostname systemd: Started Session 1212 of user root.
In the meantime I first ran yum reinstall @iso. Next I'd like to reinstall all components including their dependencies, so I can create a list of installed packages with rpm -qa --qf "%{NAME}\n" | sort > /somewhere/installed-software.log
Or will it be sufficient to reinstall only the nethserver-* packages, creating the list with rpm -qa --qf "%{NAME}\n" | sort | grep "nethserver-" > /somewhere/installed-software.log? Will their dependencies also be reinstalled in that case, or do I have to use the full list from the first example?
I plan to run the reinstall with yum -y reinstall $(cat /somewhere/installed-software.log) (a plain yum -y install would skip packages that are already installed), or do you have a better example with rpm? And what do I need to pay attention to, as the configuration of the installed software must persist? Thanks in advance for any advice.
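To make the plan above easier to review before anything is changed, here is a minimal sketch of it as small shell functions. The function names are hypothetical and only for illustration; the rpm query is the one from this post. The last function only prints the yum command instead of running it, so the package list can be checked first.

```shell
#!/bin/sh
# Sketch only: helper names are made up for this post.

# Print installed package names, one per line (same query as above).
installed_names() {
    rpm -qa --qf '%{NAME}\n' | sort -u
}

# Keep only names that start with the given prefix (empty prefix keeps all).
filter_prefix() {
    grep -- "^$1"
}

# Read package names on stdin and print (not run) the yum command.
print_reinstall_cmd() {
    printf 'yum -y reinstall'
    while read -r pkg; do
        if [ -n "$pkg" ]; then
            printf ' %s' "$pkg"
        fi
    done
    printf '\n'
}
```

Usage would be something like installed_names | filter_prefix nethserver- | print_reinstall_cmd. As far as I understand, yum reinstall only reinstalls the packages named on the command line and does not also reinstall dependencies that are already installed, so the full list would be needed to cover everything. Running rpm -Va first to verify the installed files against the RPM database might already show whether anything is actually corrupted.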
This started after a disk failure that corrupted our host at exactly the moment (Murphy) when I had broken my md RAID mirror to convert the host from ext4 to ZFS, so I had to reinstall the host (Proxmox). Only since then has our external NethServer crashed about once every day or two. I first suspected RAM usage, as ZFS can be demanding. After limiting ZFS to 8 GB (the system only has 32 GB in total) and configuring the VMs accordingly, I could rule that out as the reason.
I then started thinking that there could be a problem with a component installed on my external Neth (serving Nextcloud, CalDAV/CardDAV, IMAP, rspamd, firewall services…). Having talked with Andy Wismer, I now think the Proxmox vzdump I restored after the fresh install of the hypervisor may already have been corrupt. That led me to consider an XFS filesystem check, but I had doubts about it and finally preferred a clean reinstall and a restore from the restic backups I had taken regularly. The restore takes some time, but this way I have the best chance of getting this Neth stable again.
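For reference, the ZFS memory limit mentioned above is usually set through the zfs_arc_max module parameter, which takes a value in bytes. The following is a small sketch that only computes and prints the option line; the function name is hypothetical, and it assumes the 8 gig limit is meant as 8 GiB.

```shell
#!/bin/sh
# Sketch only: prints the modprobe option line for a given ARC limit in GiB.
# Assumption: the limit is expressed in GiB (1 GiB = 1024^3 bytes).
arc_max_line() {
    gib=$1
    printf 'options zfs zfs_arc_max=%s\n' "$((gib * 1024 * 1024 * 1024))"
}
```

Calling arc_max_line 8 prints the line that would go into a file such as /etc/modprobe.d/zfs.conf on the Proxmox host. When the root filesystem is on ZFS, the initramfs probably has to be regenerated and the host rebooted before the limit takes effect.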
While doing the disaster recovery I made one small mistake: I updated the core packages before adding the subscription back, but I think that is not too bad. One other observation: the OpenVPN site-to-site tunnel was configured, but on the restored node the static pre-shared key was missing, so I had to copy and paste it from the other Neth, which acts as the client for this connection.
Right now the system is restoring, and I am looking forward to seeing whether everything runs as stably as before once that is done. One thing I learned from this disaster is that I have to implement some way of automatically copying the config backups to an external location.
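A minimal sketch of what such an automatic copy could look like, assuming the external location is already mounted as a directory. The function name and all paths are hypothetical, and the location of the config backup file is an assumption that has to be checked on the actual system.

```shell
#!/bin/sh
# Sketch only: copy one backup file into a destination directory,
# prefixing the copy with a timestamp so older copies are kept.
copy_config_backup() {
    src=$1        # path of the config backup file (assumption: a single file)
    dest_dir=$2   # directory on the external location (assumption: mounted)
    if [ ! -f "$src" ]; then
        echo "missing backup file: $src" >&2
        return 1
    fi
    mkdir -p "$dest_dir" || return 1
    stamp=$(date +%Y%m%d-%H%M%S)
    cp -p "$src" "$dest_dir/$stamp-$(basename "$src")"
}
```

This could be called from a daily cron job, for example copy_config_backup /var/lib/nethserver/backup/backup-config.tar.xz /mnt/external/neth-config, where both paths are placeholders. For a remote host, the cp line could be swapped for rsync or scp over SSH.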
While reinstalling I also realized that kdump is enabled by default, so if the problem unexpectedly persists, I will have to start learning how to look for errors in the kernel dumps it captures via kexec, but I hope that will not be necessary.
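As a first step in that direction, a small sketch that lists any dumps kdump has already written. It assumes the default dump directory /var/crash; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: list vmcore files (and the vmcore-dmesg text files next to
# them) below the crash directory. Assumption: kdump writes to /var/crash.
find_vmcores() {
    crash_dir=${1:-/var/crash}
    if [ ! -d "$crash_dir" ]; then
        return 0
    fi
    find "$crash_dir" -type f -name 'vmcore*' | sort
}
```

Whether the service is running can be checked with systemctl is-active kdump. If a dump exists, the vmcore-dmesg.txt file beside it is plain text and may already show the panic message. Deeper analysis is normally done with the crash utility, which needs the matching kernel-debuginfo package installed.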
On the old system I saw many entries in /var/log/messages like this one:
Nov 3 17:31:01 hostname systemd: Started Session 1212 of user root.
I think I read somewhere that this is normal and comes from cron jobs, but that it can be suppressed. It would be cool to learn how to get rid of those lines.
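One approach that is commonly suggested for this is an rsyslog filter that discards these messages before they reach /var/log/messages. The sketch below writes such a filter to a file given as an argument. The function name and the file name in the usage note are hypothetical, and the filter only matches the two message texts named in it.

```shell
#!/bin/sh
# Sketch only: write an rsyslog drop-in that discards systemd session
# start messages. The target path is passed in by the caller.
write_session_filter() {
    target=$1
    cat > "$target" <<'EOF'
# Discard "Starting Session" / "Started Session" messages from systemd.
if $programname == "systemd" and ($msg contains "Starting Session" or $msg contains "Started Session") then stop
EOF
}
```

On the server this would be something like write_session_filter /etc/rsyslog.d/ignore-systemd-session.conf followed by systemctl restart rsyslog. This assumes rsyslog.conf includes /etc/rsyslog.d/*.conf before the rule that writes to /var/log/messages. The messages remain visible in the systemd journal; only the copy in /var/log/messages is dropped.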