Following Core Update NS8 no longer accessible

My current situation is almost identical to this.

I clicked to apply the latest core update, which appeared to stall at 16%. After some time, when it still appeared not to be moving along, I went to the Logs page and started poking around to see if anything seemed out of place. In the logs I saw a large number of “permission denied” errors. After clicking around the menus a little more, the whole admin interface locked up and wouldn’t respond to anything.

After rebooting the system, nothing is working: none of the services started, no IP was assigned and, most importantly, redis fails to start. Here’s the output from the first restart attempt:

What other information can I collect?
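I can run something like the following if it helps; the redis unit name is my guess at the NS8 layout, so treat it as a sketch rather than gospel:

# error-level messages from the current boot
journalctl -b -p err --no-pager
# status of the failing service (unit name assumed)
systemctl status redis
# confirm whether any address was actually assigned
ip addr show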

Cheers.

This also happened on my test system: the main node works, but the secondary one disappeared. The cluster admin interface on the secondary node tells me to administer it on the main one.

@EddieA I found some info, maybe it helps:

Are there files in /etc/pam.d/* or is sss included in /etc/nsswitch.conf?
See also 1768954 – podman is not working, complains about sd-bus call
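A minimal way to run those two checks, assuming the standard file locations:

# list the PAM configuration fragments, if any
ls -l /etc/pam.d/
# see whether sss appears in the name service switch configuration
grep sss /etc/nsswitch.conf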

@nzsolt I don’t know if the issues are similar, as @EddieA uses a single-node cluster which doesn’t work anymore, while you lost a node.

Did you check whether the WireGuard IP of the master (usually 10.5.4.1) is reachable from the second node?

ping 10.5.4.1

Are you using a VPS? Maybe it helps to reload the firewall on the nodes:

firewall-cmd --reload

Only just found some time to look at this.

I started up another instance of NS8 alongside this broken one, so I could compare the contents of files/directories/systemctl status/etc. as I worked through those references to see if any were relevant. Unfortunately nothing helped and I was back to square one, so I started scouring and comparing the startup logs of both machines. Digging deeper into one of the permission denied errors, on /etc/chrony.keys, it really didn’t make any sense at all until I spotted that /etc had permissions set only at the owner (root) level. The group and other permission bits were both NONE. No wonder that hardly anything was working.
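For anyone comparing their own system, this is the kind of check that shows the problem (GNU stat/ls, run on both instances):

# octal mode, owner:group and path of the suspect entries
stat -c '%a %U:%G %n' /etc /etc/chrony.keys
# long listing of the /etc directory entry itself
ls -ld /etc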

Setting those to the normal read/execute got me a lot further, but it showed that the /etc/nethserver directory had been similarly crippled, with access for root at the owner level only. I hoped correcting this would be the final fix, but no.
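For the record, the correction amounted to something like this; 755 is the stock mode for /etc, and I’m assuming the same holds for /etc/nethserver on a fresh install, so it’s worth confirming against the comparison instance first:

# restore read/execute for group and other on both directories
chmod 755 /etc /etc/nethserver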

Now, this is what I get trying to access the UI:

[screenshot: error message shown by the admin UI]

So, what’s my next step here? Have I got another crippled directory, and how did /etc and /etc/nethserver end up with the permissions they had?

Cheers,
Eddie


Extrapolating from the premise that there might still be more directories with blown permissions, I ran a couple of “find” jobs on the two servers looking for directories with 700 permissions and then compared the results (a sketch of the commands follows the list below). This threw up a few more that needed fixing on the dead upgraded server:

/var/lib/nethserver
/var/lib/nethserver/cluster
/var/lib/nethserver/cluster/actions
/var/lib/nethserver/cluster/actions/join-node
/var/lib/nethserver/cluster/actions/remove-repository
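
Roughly what that comparison looked like; the search scope, the output file names and the 755 target mode are my own choices here, so check them against the healthy server rather than taking them verbatim:

# on each server: list directories whose mode is exactly 700 (owner-only)
find / -xdev -type d -perm 700 2>/dev/null | sort > dirs-700.txt
# diff against the list copied over from the healthy server
diff dirs-700.txt dirs-700-healthy.txt
# restore group/other read+execute on the directories only present in the broken server's list
chmod 755 /var/lib/nethserver /var/lib/nethserver/cluster \
    /var/lib/nethserver/cluster/actions \
    /var/lib/nethserver/cluster/actions/join-node \
    /var/lib/nethserver/cluster/actions/remove-repository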

After correcting the permissions on these, and restarting, it appears that I now have a fully functioning server again.

But it really would be nice to know how the Core Upgrade managed to mangle so many directory permissions as to make the server unusable.

Cheers.
