Addressing random failures during NS8 Core updates

davidep · January 9, 2025, 3:30pm

There have been reports of apparently random failures during NS8 Core updates. Symptoms vary; for example, cluster-admin becomes inaccessible and an “API not found” error is displayed when accessing /cluster-admin over HTTP.

What these failed updates have in common is that some directories involved in the Core update are left with incorrect permissions. The most notable examples are /etc and /var: when their permissions are wrong, many other errors follow. For example:

[root@rl1 ~]# ls -ld /etc/
drwx------. 99 root root 8192 Jan  9 11:33 /etc/

To get a full list of directories with incorrect permissions, run this command:

( cd / ; ls -ld $(grep '/$' /var/lib/nethserver/node/state/coreimage.lst) ) | grep -- ------
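For anyone who prefers a check that does not depend on parsing ls output, here is a minimal sketch of the same idea. It assumes, as the one-liner above does, that directory entries in coreimage.lst end with "/" and are relative to the root directory. The function name list_restricted_dirs and its optional second argument are hypothetical, added only so the check can be tried against a test tree.

```shell
# Hypothetical helper: list_restricted_dirs LIST_FILE [ROOT]
# Prints "MODE PATH" for every directory entry in LIST_FILE (lines ending
# with "/") whose group and other permission bits are all cleared,
# e.g. mode 700 as in the /etc example above.
list_restricted_dirs() (
    list_file=$1
    root=${2:-/}    # assumption: entries are relative to this directory
    grep '/$' "$list_file" | while IFS= read -r entry; do
        # Skip entries that do not exist as directories on this node
        [ -d "$root/$entry" ] || continue
        mode=$(stat -c '%a' "$root/$entry")
        # The leading 0 makes the shell read the mode as octal;
        # 077 masks the group and other permission bits
        if [ $(( 0$mode & 077 )) -eq 0 ]; then
            printf '%s %s\n' "$mode" "$entry"
        fi
    done
)
```

On a node it could be called as list_restricted_dirs /var/lib/nethserver/node/state/coreimage.lst. Empty output would mean none of the listed directories has lost its group and other permissions.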

While we work to reproduce and isolate the bug, I want to share a possible remediation procedure:

1. Fix directories with incorrect permissions, e.g., chmod -c 0755 ....
2. Reboot the node.
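The fix step can be sketched as a small wrapper that defaults to a dry run, so the planned chmod calls can be reviewed before anything changes. The function name fix_dir_mode and the DRY_RUN variable are hypothetical. The paths to fix are left to the operator, taken from the output of the listing command above.

```shell
# Hypothetical helper: fix_dir_mode MODE DIR...
# With DRY_RUN unset or set to 1, only prints the chmod command it would run.
# With DRY_RUN=0, runs "chmod -c MODE DIR" for each existing directory.
fix_dir_mode() (
    mode=$1
    shift
    for dir in "$@"; do
        # Never touch anything that is not an existing directory
        [ -d "$dir" ] || { echo "skip (not a directory): $dir" >&2; continue; }
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "would run: chmod -c $mode $dir"
        else
            chmod -c "$mode" "$dir"
        fi
    done
)
```

For example, fix_dir_mode 0755 /etc /var prints the planned commands, and DRY_RUN=0 fix_dir_mode 0755 /etc /var applies them. The 0755 mode is taken from the example above and may not suit every directory; when unsure, compare with the same path on a healthy node before applying it.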

Ref Mattermost

davidep · January 14, 2025, 9:36am

The bug is recorded on GitHub as NethServer/dev issue #7250, “Corrupted system directory permissions after Core update”.

As a workaround, follow the remediation procedure described in the first post above.
