After installing a very current Debian 12 from a downloaded ISO image on Proxmox, updating it, and making sure the FQDN and static IP are set correctly, the install of NS8 according to the instructions here fails:
Actually, everything seems to run through just fine until this section of output while running the bash installer:
Created symlink /etc/systemd/system/default.target.wants/redis.service → /etc/systemd/system/redis.service.
Generating cluster password:
Generating api-server password:
Generating node password:
AUTH failed: WRONGPASS invalid username-password pair or user is disabled.
OK
OK
OK
3
3
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
Start API server and core agents:
Created symlink /etc/systemd/system/multi-user.target.wants/api-server.service → /etc/systemd/system/api-server.service.
Created symlink /etc/systemd/system/default.target.wants/agent@cluster.service → /etc/systemd/system/agent@.service.
Created symlink /etc/systemd/system/default.target.wants/agent@node.service → /etc/systemd/system/agent@.service.
Created symlink /etc/systemd/system/default.target.wants/rclone-webdav.service → /etc/systemd/system/rclone-webdav.service.
Grant initial permissions:
Install Traefik:
<7>podman-pull-missing ghcr.io/nethserver/traefik:2.2.1
Trying to pull ghcr.io/nethserver/traefik:2.2.1...
Getting image source signatures
Copying blob sha256:64740ac9b8b758509c59ba37a98734ffcc728913955d51fb9c71c9eda801f5ff
Copying config sha256:f1acbfc376147395931b5c16bdb0de7111e94c27770c5a037b7381a7838f9c0f
Writing manifest to image destination
Storing signatures
f1acbfc376147395931b5c16bdb0de7111e94c27770c5a037b7381a7838f9c0f
<7>extract-ui ghcr.io/nethserver/traefik:2.2.1
Extracting container filesystem ui to /var/lib/nethserver/cluster/ui/apps/traefik1
ui/index.html
06ef8f5e179b637ae02cbeb2914ce90474ebbb5d7bc2676d0cb75dd9c8ea3e31
Assertion failed
File "/var/lib/nethserver/cluster/actions/add-module/50update", line 223, in <module>
agent.assert_exp(create_module_result['exit_code'] == 0) # Ensure create-module is successful
[root@suma-ns8 ~]#
To me, it seems that first Redis is failing due to some auth issue, then traefik barfs with a related error…
The VM has the following allocated:
8 CPU cores
16 GB RAM
1.6 TB Disk space, formatted in XFS on Debian
Storage is NOT ZFS !!!
The VM is stored in a qcow2 format on a PVE dedicated NAS.
Average load on Proxmox is under 10%…
This should NOT happen on a freshly installed VM (Debian).
I’ve seen other issues with Redis / NS8, but they seem a bit old (2023) or concern too little VM memory, which is not really an issue here IMHO…
I do hope this has nothing (yet) to do with Redis changing their license.
I’ve seen similar issues, but most are from 2022 or early 2023, so I doubt they have the same cause, especially as those involved too little RAM…
If Redis service is running, then it seems OK. I don’t even get to login, as the cluster page is up, but Redis is for stats etc, and as it’s not started, probably other stuff isn’t started either…
Andy, thank you for sharing the full log. I edited your post including the interesting log excerpt. Specifically this is the recurring error:
The write-hosts script cannot connect to the Traefik API endpoint at 127.0.0.1:80 to retrieve a list of host names. This feature was recently introduced to configure additional host names in the DNSMasq module. I suppose there’s a race with Traefik startup, which seems regular in the next lines /cc @Tbaile
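For anyone who wants to check by hand whether the endpoint is reachable at a given moment, here is a minimal sketch of a TCP probe. It is not the actual write-hosts code; the function name `endpoint_is_up` and the one-second timeout are assumptions made for illustration, and only the address 127.0.0.1:80 comes from the log.

```python
import socket


def endpoint_is_up(host: str = "127.0.0.1", port: int = 80, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper: it only tests that something is listening,
    not that the Traefik API answers correctly.
    """
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `endpoint_is_up()` right after the unit starts and again a few seconds later would show whether the failure is only a matter of timing.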
After that, Traefik seems to start correctly…
However, given the previous exit code, after 30 seconds, Systemd decides to stop the unit:
Note that the container seems to ignore SIGTERM (another issue). After being killed with signal 9, the unit is then restarted repeatedly.
I suppose your machine is faster or provides more parallelism than the developer’s typical environment. I’ll try to reproduce this bug. Meanwhile, you can recover the installation by trying to restart the Traefik service. If the error persists, please try the following workarounds.
The first one is a blind attempt: it changes the traefik.service unit type to notify.
And thanks for the feedback / suggestions. I will have time to try them this afternoon, the VM is still “ready”.
Probably not… It is a “fat” server, a Supermicro 4U rack unit, with 2 CPUs on sockets, but the box, even though updated with new disks, etc., is still 10 years old… 96 GB RAM and a total of 8 cores isn’t impressive nowadays… And all storage is on 5400 rpm disks… But it’s a solid workhorse, with LAN, a dedicated storage LAN and a dedicated backup LAN with PBS. The replacement is next to it, an HPE 2U server, MUCH more powerful and in the same Proxmox cluster. This one is equipped with fast NVMe storage.
Both to no avail. No login possible afterwards, neither with the existing password nor with the default admin password (Nethesis,1234)
Note: in the second case, tried without the “+” in the beginning!
I’ve implemented the backoff in the PR, however I cannot reproduce the issue.
storage is on 5400 Disks
This bug is most likely caused by the increased latency and delay of the disks, but I’ve only got machines running on NVMe.
The backoff time starts at 0.5 seconds and is multiplied on every failed attempt, maxing out at 10 retries (27.5 seconds in total), which should give the service plenty of time to start up.
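To make the numbers concrete, here is a small sketch of a retry schedule that matches them. It is not the code from the PR: the names `backoff_delays` and `retry_with_backoff` are hypothetical, and the linear schedule (0.5 s times the attempt number) is an assumption, chosen because it is the reading under which 10 retries add up to the quoted 27.5 seconds.

```python
import time
from typing import Callable, List

BASE_DELAY = 0.5   # seconds, as quoted in the post
MAX_RETRIES = 10   # as quoted in the post


def backoff_delays(base: float = BASE_DELAY, retries: int = MAX_RETRIES) -> List[float]:
    """Assumed linear schedule: 0.5 s, 1.0 s, 1.5 s, ... up to base * retries."""
    return [base * attempt for attempt in range(1, retries + 1)]


def retry_with_backoff(
    check: Callable[[], bool],
    base: float = BASE_DELAY,
    retries: int = MAX_RETRIES,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() until it returns True, sleeping longer after each failure.

    Makes one initial attempt plus at most `retries` retries and
    returns the result of the last attempt.
    """
    for delay in backoff_delays(base, retries):
        if check():
            return True
        sleep(delay)  # wait before the next retry
    return check()    # final attempt after the last delay
```

With the defaults, `sum(backoff_delays())` is 27.5, so a service that needs longer than that to come up would still fail under this schedule.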
You can try the fix using the module ghcr.io/nethserver/traefik:setting-backoff-on-init on installation.
A how-to is here for anyone who stumbles upon this thread.
Thanks, I will have time to test this.
I do have a couple of Proxmox hosts to try this on; indeed, I am attempting more or less the same Debian 12 VM on four different environments.
Two are fairly well-powered Proxmox hosts, with more than enough RAM and fully NVMe-equipped storage (also ample!).
Only for one client do I have an issue: there, the storage needed is too big.
That one is the “largest” NethServer, using local-lvm for the migration. Afterwards it will be migrated (Proxmox cluster) to the much faster, newer HP server with NVMe storage…