Clean install of NS8 on Debian 12 fails with Redis error

Hi

After installing a very current Debian 12 from a downloaded ISO image on Proxmox, updating it, and making sure the FQDN and static IP are set correctly, the NS8 install fails when following the instructions here:

https://docs.nethserver.org/projects/ns8/en/latest/install.html#install-linux-section

Actually, everything seems to run through just fine until this part of the output from the bash installer:

Created symlink /etc/systemd/system/default.target.wants/redis.service → /etc/systemd/system/redis.service.
Generating cluster password:
Generating api-server password:
Generating node password:
AUTH failed: WRONGPASS invalid username-password pair or user is disabled.
OK
OK
OK
3
3
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
Start API server and core agents:
Created symlink /etc/systemd/system/multi-user.target.wants/api-server.service → /etc/systemd/system/api-server.service.
Created symlink /etc/systemd/system/default.target.wants/agent@cluster.service → /etc/systemd/system/agent@.service.
Created symlink /etc/systemd/system/default.target.wants/agent@node.service → /etc/systemd/system/agent@.service.
Created symlink /etc/systemd/system/default.target.wants/rclone-webdav.service → /etc/systemd/system/rclone-webdav.service.
Grant initial permissions:
Install Traefik:
<7>podman-pull-missing ghcr.io/nethserver/traefik:2.2.1
Trying to pull ghcr.io/nethserver/traefik:2.2.1...
Getting image source signatures
Copying blob sha256:64740ac9b8b758509c59ba37a98734ffcc728913955d51fb9c71c9eda801f5ff
Copying config sha256:f1acbfc376147395931b5c16bdb0de7111e94c27770c5a037b7381a7838f9c0f
Writing manifest to image destination
Storing signatures
f1acbfc376147395931b5c16bdb0de7111e94c27770c5a037b7381a7838f9c0f
<7>extract-ui ghcr.io/nethserver/traefik:2.2.1
Extracting container filesystem ui to /var/lib/nethserver/cluster/ui/apps/traefik1
ui/index.html
06ef8f5e179b637ae02cbeb2914ce90474ebbb5d7bc2676d0cb75dd9c8ea3e31
Assertion failed
  File "/var/lib/nethserver/cluster/actions/add-module/50update", line 223, in <module>
    agent.assert_exp(create_module_result['exit_code'] == 0) # Ensure create-module is successful

[root@suma-ns8 ~]# 

To me, it seems that first Redis is failing due to some auth issue, then traefik barfs with a related error…

The VM has the following allocated:

8 CPU cores
16 GB RAM
1.6 TB Disk space, formatted in XFS on Debian

Storage is NOT ZFS !!!
The VM is stored in a qcow2 format on a PVE dedicated NAS.

Average load on Proxmox is under 10%…

This should NOT happen on a freshly installed VM (Debian).

I’ve seen other reports of Redis / NS8 issues, but they seem a bit old (2023) or were caused by too little VM memory, which IMHO is not really an issue here…

I do hope this has nothing (yet) to do with Redis changing their license.

Does anyone have any ideas?

My 2 cents
Andy

Give me 15min…

Even 30 or 60, whatever it takes…

Everything was backed up before the NS8 install, but even doing a fresh install would be fine!

Yes, there is a problem:

I got that twice, at exactly the same spot,
and on two different Proxmox hosts (in the same cluster, but with VERY different hardware). Both are more than powerful enough…

???

Do you know how long this problem has existed? I installed it twice yesterday…

How can I check whether my servers are affected?

I’ve seen similar issues, but most are from 2022 or early 2023, so I doubt they have the same cause, especially since those were due to too little RAM…

:slight_smile:

If the Redis service is running, then it seems OK. I can’t even get to the login: the cluster page is up, but Redis is used for stats etc., and as it’s not started, other stuff probably isn’t started either…

:frowning:

Maybe @davidep or @dnutan have an idea…

My 2 cents
Andy

The line below is ugly, but harmless:

This one fails instead. Check in the logs:

api-server-logs logs -e module -m dump -n traefik1 -l 2000

or in the journal:

journalctl _UID=$(id -u traefik1)
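
For readers unfamiliar with the journal filter above: `_UID=` matches entries logged by processes running under that numeric user ID, and `id -u traefik1` resolves the module user to that ID. Below is a minimal Python sketch of the same lookup, using only the standard library. The helper name `journalctl_command` is hypothetical and not part of NS8.

```python
import pwd


def journalctl_command(module_user: str) -> list[str]:
    """Build the journalctl argv that selects entries logged under a user's UID.

    Hypothetical helper for illustration; it only mirrors the shell one-liner
    `journalctl _UID=$(id -u <user>)`.
    """
    uid = pwd.getpwnam(module_user).pw_uid  # same lookup as `id -u <user>`
    return ["journalctl", f"_UID={uid}"]


# Usage on an NS8 node (the traefik1 user must exist there):
#   import subprocess
#   subprocess.run(journalctl_command("traefik1"), check=False)
```

On a machine without the `traefik1` user, `pwd.getpwnam` raises `KeyError`, which is a quick hint that the module user was never created.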

api-server-logs logs -e module -m dump -n traefik1 -l 2000

returns:

[SOCKET] error executing Cmd for dump: exit status 1

journalctl _UID=$(id -u traefik1)

returns (edited by davidep)

Apr 22 11:32:59 suma-ns8 agent@traefik1[7584]: task/module/traefik1/9c180fe3-6306-437b-8dc7-9ff70f711ed4: create-module/50create is starting
Apr 22 11:32:59 suma-ns8 agent@traefik1[7584]: Created symlink /home/traefik1/.config/systemd/user/default.target.wants/traefik.service → /home/traefik1/.config/systemd/user/traefik.service.
Apr 22 11:32:59 suma-ns8 systemd[7568]: Reloading.
Apr 22 11:32:59 suma-ns8 systemd[7568]: Started certificate-exporter.path - Monitor acme.json file for changes.
Apr 22 11:32:59 suma-ns8 systemd[7568]: Starting traefik.service - Traefik edge proxy...
Apr 22 11:33:00 suma-ns8 podman[7722]: 2024-04-22 11:33:00.039506051 +0200 CEST m=+0.037350906 image pull  docker.io/traefik:v2.11.0
Apr 22 11:33:00 suma-ns8 podman[7722]: 2024-04-22 11:33:00.352978168 +0200 CEST m=+0.350823018 volume create traefik-acme
Apr 22 11:33:00 suma-ns8 podman[7722]: 
Apr 22 11:33:00 suma-ns8 podman[7722]: 2024-04-22 11:33:00.362324061 +0200 CEST m=+0.360168941 container create 725dc1daba9319f328e8bccd216b574f9d4a6d94c71d62006ecfef10b13c0db3 (image=docker.io/library/traefik:v2.11.0, name=traefik, org.opencontainers.image.version=v2.11.0, org.opencontainers.image.description=A modern reverse-proxy, org.opencontainers.image.documentation=https://docs.traefik.io, org.opencontainers.image.source=https://github.com/traefik/traefik, org.opencontainers.image.title=Traefik, PODMAN_SYSTEMD_UNIT=traefik.service, org.opencontainers.image.url=https://traefik.io, org.opencontainers.image.vendor=Traefik Labs)
Apr 22 11:33:00 suma-ns8 systemd[7568]: Started libpod-725dc1daba9319f328e8bccd216b574f9d4a6d94c71d62006ecfef10b13c0db3.scope - libcrun container.
Apr 22 11:33:00 suma-ns8 podman[7722]: 2024-04-22 11:33:00.477176659 +0200 CEST m=+0.475021578 container init 725dc1daba9319f328e8bccd216b574f9d4a6d94c71d62006ecfef10b13c0db3 (image=docker.io/library/traefik:v2.11.0, name=traefik, org.opencontainers.image.documentation=https://docs.traefik.io, org.opencontainers.image.source=https://github.com/traefik/traefik, org.opencontainers.image.title=Traefik, PODMAN_SYSTEMD_UNIT=traefik.service, org.opencontainers.image.url=https://traefik.io, org.opencontainers.image.vendor=Traefik Labs, org.opencontainers.image.version=v2.11.0, org.opencontainers.image.description=A modern reverse-proxy)
Apr 22 11:33:00 suma-ns8 podman[7722]: 2024-04-22 11:33:00.485718025 +0200 CEST m=+0.483562878 container start 725dc1daba9319f328e8bccd216b574f9d4a6d94c71d62006ecfef10b13c0db3 (image=docker.io/library/traefik:v2.11.0, name=traefik, org.opencontainers.image.documentation=https://docs.traefik.io, org.opencontainers.image.source=https://github.com/traefik/traefik, org.opencontainers.image.title=Traefik, PODMAN_SYSTEMD_UNIT=traefik.service, org.opencontainers.image.url=https://traefik.io, org.opencontainers.image.vendor=Traefik Labs, org.opencontainers.image.version=v2.11.0, org.opencontainers.image.description=A modern reverse-proxy)
Apr 22 11:33:00 suma-ns8 traefik1[7722]: 725dc1daba9319f328e8bccd216b574f9d4a6d94c71d62006ecfef10b13c0db3
Apr 22 11:33:01 suma-ns8 traefik1[7742]: Traceback (most recent call last):
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 174, in _new_conn
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     conn = connection.create_connection(
Apr 22 11:33:01 suma-ns8 traefik1[7742]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 96, in create_connection
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     raise err
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 86, in create_connection
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     sock.connect(sa)
Apr 22 11:33:01 suma-ns8 traefik1[7742]: ConnectionRefusedError: [Errno 111] Connection refused
Apr 22 11:33:01 suma-ns8 traefik1[7742]: During handling of the above exception, another exception occurred:
Apr 22 11:33:01 suma-ns8 traefik1[7742]: Traceback (most recent call last):
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 704, in urlopen
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     httplib_response = self._make_request(
Apr 22 11:33:01 suma-ns8 traefik1[7742]:                        ^^^^^^^^^^^^^^^^^^^
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 399, in _make_request
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     conn.request(method, url, **httplib_request_kw)
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 239, in request
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     super(HTTPConnection, self).request(method, url, body=body, headers=headers)
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/lib/python3.11/http/client.py", line 1282, in request
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     self._send_request(method, url, body, headers, encode_chunked)
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/lib/python3.11/http/client.py", line 1328, in _send_request
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     self.endheaders(body, encode_chunked=encode_chunked)
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/lib/python3.11/http/client.py", line 1277, in endheaders
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     self._send_output(message_body, encode_chunked=encode_chunked)
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/lib/python3.11/http/client.py", line 1037, in _send_output
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     self.send(msg)
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/lib/python3.11/http/client.py", line 975, in send
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     self.connect()
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 205, in connect
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     conn = self._new_conn()
Apr 22 11:33:01 suma-ns8 traefik1[7742]:            ^^^^^^^^^^^^^^^^
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 186, in _new_conn
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     raise NewConnectionError(
Apr 22 11:33:01 suma-ns8 traefik1[7742]: urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f18a96a4e50>: Failed to establish a new connection: [Errno 111] Connection refused
Apr 22 11:33:01 suma-ns8 traefik1[7742]: During handling of the above exception, another exception occurred:
Apr 22 11:33:01 suma-ns8 traefik1[7742]: Traceback (most recent call last):
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/local/agent/pyenv/lib/python3.11/site-packages/requests/adapters.py", line 486, in send
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     resp = conn.urlopen(
Apr 22 11:33:01 suma-ns8 traefik1[7742]:            ^^^^^^^^^^^^^
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 788, in urlopen
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     retries = retries.increment(
Apr 22 11:33:01 suma-ns8 traefik1[7742]:               ^^^^^^^^^^^^^^^^^^
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 592, in increment
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     raise MaxRetryError(_pool, url, error or ResponseError(cause))
Apr 22 11:33:01 suma-ns8 traefik1[7742]: urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=80): Max retries exceeded with url: /626ae439-6b9f-41fe-b6ab-8822929c2ba7/api/http/routers (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f18a96a4e50>: Failed to establish a new connection: [Errno 111] Connection refused'))
Apr 22 11:33:01 suma-ns8 traefik1[7742]: During handling of the above exception, another exception occurred:
Apr 22 11:33:01 suma-ns8 traefik1[7742]: Traceback (most recent call last):
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/home/traefik1/.config/bin/write-hosts", line 23, in <module>
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     response = requests.get(f'http://127.0.0.1/{api_path}/api/http/routers').json()
Apr 22 11:33:01 suma-ns8 traefik1[7742]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/local/agent/pyenv/lib/python3.11/site-packages/requests/api.py", line 73, in get
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     return request("get", url, params=params, **kwargs)
Apr 22 11:33:01 suma-ns8 traefik1[7742]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/local/agent/pyenv/lib/python3.11/site-packages/requests/api.py", line 59, in request
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     return session.request(method=method, url=url, **kwargs)
Apr 22 11:33:01 suma-ns8 traefik1[7742]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/local/agent/pyenv/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     resp = self.send(prep, **send_kwargs)
Apr 22 11:33:01 suma-ns8 traefik1[7742]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/local/agent/pyenv/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     r = adapter.send(request, **kwargs)
Apr 22 11:33:01 suma-ns8 traefik1[7742]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/usr/local/agent/pyenv/lib/python3.11/site-packages/requests/adapters.py", line 519, in send
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     raise ConnectionError(e, request=request)
Apr 22 11:33:01 suma-ns8 traefik1[7742]: requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=80): Max retries exceeded with url: /626ae439-6b9f-41fe-b6ab-8822929c2ba7/api/http/routers (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f18a96a4e50>: Failed to establish a new connection: [Errno 111] Connection refused'))
Apr 22 11:33:01 suma-ns8 traefik1[7742]: During handling of the above exception, another exception occurred:
Apr 22 11:33:01 suma-ns8 traefik1[7742]: Traceback (most recent call last):
Apr 22 11:33:01 suma-ns8 traefik1[7742]:   File "/home/traefik1/.config/bin/write-hosts", line 25, in <module>
Apr 22 11:33:01 suma-ns8 traefik1[7742]:     raise Exception(f'Error reaching traefik daemon: {e}')
Apr 22 11:33:01 suma-ns8 traefik1[7742]: Exception: Error reaching traefik daemon: HTTPConnectionPool(host='127.0.0.1', port=80): Max retries exceeded with url: /626ae439-6b9f-41fe-b6ab-8822929c2ba7/api/http/routers (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f18a96a4e50>: Failed to establish a new connection: [Errno 111] Connection refused'))
Apr 22 11:33:01 suma-ns8 systemd[7568]: traefik.service: Control process exited, code=exited, status=1/FAILURE
Apr 22 11:33:01 suma-ns8 traefik[7738]: time="2024-04-22T09:33:01Z" level=info msg="Configuration loaded from file: /etc/traefik/traefik.yaml"
Apr 22 11:33:01 suma-ns8 traefik[7738]: time="2024-04-22T09:33:01Z" level=info msg="Traefik version 2.11.0 built on 2024-02-12T15:26:45Z"
Apr 22 11:33:01 suma-ns8 traefik[7738]: time="2024-04-22T09:33:01Z" level=info msg="\nStats collection is disabled.\nHelp us improve Traefik by turning this feature on :)\nMore details on: https://doc.traefik.io/traefik/contributing/data-collection/\n"
Apr 22 11:33:01 suma-ns8 traefik[7738]: time="2024-04-22T09:33:01Z" level=info msg="Starting provider aggregator aggregator.ProviderAggregator"
Apr 22 11:33:01 suma-ns8 traefik[7738]: time="2024-04-22T09:33:01Z" level=info msg="Starting provider *file.Provider"
Apr 22 11:33:01 suma-ns8 traefik[7738]: time="2024-04-22T09:33:01Z" level=info msg="Starting provider *traefik.Provider"
Apr 22 11:33:01 suma-ns8 traefik[7738]: time="2024-04-22T09:33:01Z" level=info msg="Starting provider *acme.ChallengeTLSALPN"
Apr 22 11:33:01 suma-ns8 traefik[7738]: time="2024-04-22T09:33:01Z" level=info msg="Starting provider *acme.Provider"
Apr 22 11:33:01 suma-ns8 traefik[7738]: time="2024-04-22T09:33:01Z" level=info msg="Testing certificate renew..." ACME CA="https://acme-v02.api.letsencrypt.org/directory" providerName=acmeServer.acme
Apr 22 11:33:01 suma-ns8 traefik[7738]: time="2024-04-22T09:33:01Z" level=warning msg="IPWhiteList is deprecated, please use IPAllowList instead." routerName=ApisEndpointHttp@file entryPointName=http
Apr 22 11:33:01 suma-ns8 traefik[7738]: time="2024-04-22T09:33:01Z" level=warning msg="No domain found in rule Path(`/cluster-admin`) || PathPrefix(`/cluster-admin/`), the TLS options applied for this router will depend on the SNI of each request" entryPointName=https routerName=ApiServer-https@file
Apr 22 11:33:01 suma-ns8 traefik[7738]: time="2024-04-22T09:33:01Z" level=warning msg="IPWhiteList is deprecated, please use IPAllowList instead." entryPointName=http routerName=ApisEndpointHttp@file
Apr 22 11:33:01 suma-ns8 traefik[7738]: time="2024-04-22T09:33:01Z" level=warning msg="No domain found in rule Path(`/cluster-admin`) || PathPrefix(`/cluster-admin/`), the TLS options applied for this router will depend on the SNI of each request" entryPointName=https routerName=ApiServer-https@file
Apr 22 11:33:01 suma-ns8 systemd[7568]: Started certificate-exporter.service - Export acme.json changes.
Apr 22 11:33:02 suma-ns8 traefik1[7762]: ACME TLS certificates for Traefik were not found in /home/traefik1/.local/share/containers/storage/volumes/traefik-acme/_data/acme.json: Expecting value: line 1 column 1 (char 0)
Apr 22 11:34:31 suma-ns8 systemd[7568]: traefik.service: State 'stop-sigterm' timed out. Killing.
Apr 22 11:34:31 suma-ns8 systemd[7568]: traefik.service: Killing process 7738 (conmon) with signal SIGKILL.
Apr 22 11:34:31 suma-ns8 systemd[7568]: traefik.service: Main process exited, code=killed, status=9/KILL
Apr 22 11:34:31 suma-ns8 podman[7767]: 2024-04-22 11:34:31.803778107 +0200 CEST m=+0.160586683 container stop 725dc1daba9319f328e8bccd216b574f9d4a6d94c71d62006ecfef10b13c0db3 (image=docker.io/library/traefik:v2.11.0, name=traefik, org.opencontainers.image.description=A modern reverse-proxy, org.opencontainers.image.documentation=https://docs.traefik.io, org.opencontainers.image.source=https://github.com/traefik/traefik, org.opencontainers.image.title=Traefik, org.opencontainers.image.url=https://traefik.io, org.opencontainers.image.vendor=Traefik Labs, org.opencontainers.image.version=v2.11.0, PODMAN_SYSTEMD_UNIT=traefik.service)
Apr 22 11:34:31 suma-ns8 podman[7767]: 2024-04-22 11:34:31.922931609 +0200 CEST m=+0.279740199 container remove 725dc1daba9319f328e8bccd216b574f9d4a6d94c71d62006ecfef10b13c0db3 (image=docker.io/library/traefik:v2.11.0, name=traefik, org.opencontainers.image.source=https://github.com/traefik/traefik, org.opencontainers.image.title=Traefik, org.opencontainers.image.url=https://traefik.io, org.opencontainers.image.vendor=Traefik Labs, org.opencontainers.image.version=v2.11.0, PODMAN_SYSTEMD_UNIT=traefik.service, org.opencontainers.image.description=A modern reverse-proxy, org.opencontainers.image.documentation=https://docs.traefik.io)
Apr 22 11:34:31 suma-ns8 traefik1[7767]: 725dc1daba9319f328e8bccd216b574f9d4a6d94c71d62006ecfef10b13c0db3
Apr 22 11:34:31 suma-ns8 systemd[7568]: traefik.service: Failed with result 'exit-code'.
Apr 22 11:34:31 suma-ns8 systemd[7568]: Failed to start traefik.service - Traefik edge proxy.
Apr 22 11:34:31 suma-ns8 systemd[7568]: traefik.service: Consumed 1.524s CPU time.
Apr 22 11:34:31 suma-ns8 agent@traefik1[7584]: Job for traefik.service failed because the control process exited with error code.
Apr 22 11:34:31 suma-ns8 agent@traefik1[7584]: See "systemctl --user status traefik.service" and "journalctl --user -xeu traefik.service" for details.
Apr 22 11:34:31 suma-ns8 agent@traefik1[7584]: task/module/traefik1/9c180fe3-6306-437b-8dc7-9ff70f711ed4: action "create-module" status is "aborted" (1) at step 50create
Apr 22 11:34:32 suma-ns8 systemd[7568]: traefik.service: Scheduled restart job, restart counter is at 1.

The file is too big, I will have to find a different solution…
Here for a quick solution.

My 2 cents
Andy

Andy, thank you for sharing the full log. I edited your post to include the interesting log excerpt. Specifically, this is the recurring error:

The write-hosts script cannot connect to the Traefik API endpoint at 127.0.0.1:80 to retrieve a list of host names. This feature was recently introduced to configure additional host names in the DNSMasq module. I suppose there’s a race with the Traefik startup, which otherwise looks normal in the following log lines. /cc @Tbaile

After that, Traefik seems to start correctly…

However, given the previous exit code, after 30 seconds, Systemd decides to stop the unit:

Note that the container seems to ignore SIGTERM (another issue). After being killed with signal 9, the unit is then restarted repeatedly.


I suppose your machine is faster or provides more parallelism than the developer’s typical environment. I’ll try to reproduce this bug. Meanwhile, you can recover the installation by trying to restart the Traefik service. If the error persists, please try the following workarounds.

The first one is a blind attempt: it changes the traefik.service unit type to notify.

diff --git a/imageroot/systemd/user/traefik.service b/imageroot/systemd/user/traefik.service
index 1ff2a81..1e5a3c6 100644
--- a/imageroot/systemd/user/traefik.service
+++ b/imageroot/systemd/user/traefik.service
@@ -25,7 +25,7 @@ ExecStartPost=runagent write-hosts
 ExecStop=/usr/bin/podman stop --ignore --cidfile %t/traefik.ctr-id -t 10
 ExecStopPost=/usr/bin/podman rm --ignore -f --cidfile %t/traefik.ctr-id
 PIDFile=%t/traefik.pid
-Type=forking
+Type=notify
 WorkingDirectory=%S/state
 SyslogIdentifier=%u

The second one inserts a 3-second delay between starting the Traefik container and running the write-hosts script.

diff --git a/imageroot/systemd/user/traefik.service b/imageroot/systemd/user/traefik.service
index 1ff2a81..fcf85ae 100644
--- a/imageroot/systemd/user/traefik.service
+++ b/imageroot/systemd/user/traefik.service
@@ -21,6 +21,7 @@ ExecStart=/usr/bin/podman run \
     --volume=./configs:/etc/traefik/configs:Z \
     --volume=./custom_certificates:/etc/traefik/custom_certificates:Z \
     ${TRAEFIK_IMAGE}
+ExecStartPost=sleep 3
 ExecStartPost=runagent write-hosts
 ExecStop=/usr/bin/podman stop --ignore --cidfile %t/traefik.ctr-id -t 10
 ExecStopPost=/usr/bin/podman rm --ignore -f --cidfile %t/traefik.ctr-id
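
A fixed 3-second delay is a guess about how long Traefik needs. As an illustration only, a script could instead poll until the port accepts TCP connections. The sketch below assumes the endpoint from the log (127.0.0.1, port 80); the function name `wait_for_port` and its default timings are hypothetical and not taken from the NS8 code. Note that an accepted TCP connection does not prove the HTTP API is ready to answer, so this narrows the race rather than eliminating it.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.5) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout.

    Hypothetical helper; the timeout and interval values are assumptions.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError (e.g. ConnectionRefusedError)
            # while nothing is listening on the port yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Usage before querying the Traefik API:
#   if not wait_for_port("127.0.0.1", 80):
#       raise SystemExit("Traefik did not open port 80 in time")
```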

To edit the unit, run these commands:

runagent -m traefik1
vi ../systemd/user/traefik.service
systemctl --user daemon-reload

Hi @davidep

And thanks for the feedback / suggestions. I will have time to try them this afternoon, the VM is still “ready”.

Probably not… It is a “fat” server, a Supermicro 4U rack unit with two CPU sockets, but the box, even though updated with new disks etc., is still 10 years old… 96 GB RAM and a total of 8 cores aren’t impressive nowadays… :slight_smile: And all storage is on 5400 rpm disks… But it’s a solid workhorse, with LAN, a dedicated storage LAN and a dedicated backup LAN with PBS. The replacement is next to it, an HPE 2U server, MUCH more powerful and in the same Proxmox cluster. This one is equipped with fast NVMe storage.

Thanks!

Feedback coming…

My 2 cents
Andy

Thank you Andy for the thorough info!

Opening a PR to try to fix the issue here.

Bug filed here

Hi @davidep , @Tbaile

After a few unrelated hardware issues, like a dead onboard 4x NIC on a 4.5-year-old HPE ProLiant ML350 server:

I finally got to test the suggested changes:

and

Both to no avail. No login was possible afterwards, neither with the existing password nor with the default admin password (Nethesis,1234).
Note: in the second case, I added the line without the leading “+”!

Now testing Debian with Btrfs…

:slight_smile:

My 2 cents
Andy

Hi

Results with Debian / Btrfs look a bit better, but still no success…

Installing collected packages: resolvelib, pyasn1, ptyprocess, lockfile, typing-extensions, semver, requests, regex-engine, PyYAML, pycparser, psutil, pexpect, packaging, multidict, MarkupSafe, ldap3, hiredis, frozenlist, docutils, dnspython, async-timeout, yarl, redis, python-daemon, Jinja2, cffi, aiosignal, pycares, cryptography, brotlipy, ansible-runner, aiohttp, ansible-core, aiodns
  Attempting uninstall: requests
    Found existing installation: requests 2.28.1
    Not uninstalling requests at /usr/lib/python3/dist-packages, outside environment /usr/local/agent/pyenv
    Can't uninstall 'requests'. No files were found to uninstall.
Successfully installed Jinja2-3.1.2 MarkupSafe-2.1.3 PyYAML-6.0 aiodns-3.0.0 aiohttp-3.8.4 aiosignal-1.3.1 ansible-core-2.15.1 ansible-runner-2.3.1 async-timeout-4.0.2 brotlipy-0.7.0 cffi-1.15.1 cryptography-41.0.1 dnspython-2.3.0 docutils-0.20 frozenlist-1.4.1 hiredis-2.2.3 ldap3-2.9.1 lockfile-0.12.2 multidict-6.0.4 packaging-23.0 pexpect-4.8.0 psutil-5.9.4 ptyprocess-0.7.0 pyasn1-0.4.8 pycares-4.3.0 pycparser-2.21 python-daemon-3.0.1 redis-5.0.1 regex-engine-1.1.0 requests-2.31.0 resolvelib-1.0.1 semver-3.0.1 typing-extensions-4.6.3 yarl-1.9.2
Setup registry:
Add firewalld core rules:
Write initial cluster environment state
Write initial node environment state
Generating a new RSA key pair for SSH:
Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/id_rsa
Your public key has been saved in /root/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:StSH5ceu4OiBx69wKS46g9jvAmQqWEfXJzxRt0xkVrw root@awr9-ns8
The key's randomart image is:
+---[RSA 3072]----+
|       o.oo.*o.  |
|    . ..=+.B ..  |
|   . .. o+o =  . |
| o. ..   . o  E  |
|=. .  . S   .    |
|=    + = . .     |
|+o  + O . .      |
|= +. * o         |
|.+ =+ o..        |
+----[SHA256]-----+
Adding id_rsa.pub to module skeleton dir:
Add /etc/hosts entries:
Generate WireGuard VPN key pair:
YIpHgkhGofop5m0UCM5sNh4gGKWDS7ZwFUN++Xslbnc=
Start Redis DB:
Created symlink /etc/systemd/system/default.target.wants/redis.service → /etc/systemd/system/redis.service.
Generating cluster password:
Generating api-server password:
Generating node password:
AUTH failed: WRONGPASS invalid username-password pair or user is disabled.
OK
OK
OK
3
3
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
Start API server and core agents:
Created symlink /etc/systemd/system/multi-user.target.wants/api-server.service → /etc/systemd/system/api-server.service.
Created symlink /etc/systemd/system/default.target.wants/agent@cluster.service → /etc/systemd/system/agent@.service.
Created symlink /etc/systemd/system/default.target.wants/agent@node.service → /etc/systemd/system/agent@.service.
Created symlink /etc/systemd/system/default.target.wants/rclone-webdav.service → /etc/systemd/system/rclone-webdav.service.
Grant initial permissions:
Install Traefik:
<7>podman-pull-missing ghcr.io/nethserver/traefik:2.2.1
Trying to pull ghcr.io/nethserver/traefik:2.2.1...
Getting image source signatures
Copying blob sha256:64740ac9b8b758509c59ba37a98734ffcc728913955d51fb9c71c9eda801f5ff
Copying config sha256:f1acbfc376147395931b5c16bdb0de7111e94c27770c5a037b7381a7838f9c0f
Writing manifest to image destination
Storing signatures
f1acbfc376147395931b5c16bdb0de7111e94c27770c5a037b7381a7838f9c0f
<7>extract-ui ghcr.io/nethserver/traefik:2.2.1
Extracting container filesystem ui to /var/lib/nethserver/cluster/ui/apps/traefik1
ui/index.html
9786e8da2435c61f84024db03fae728714315775e558b93dd36c41898e1b060d
{'module_id': 'traefik1', 'image_name': 'traefik', 'image_url': 'ghcr.io/nethserver/traefik:2.2.1'}
Setting admin password to default Nethesis,1234:
True

NethServer cluster-admin UI:
  - https://awr9-ns8.r9.anwi.ch/cluster-admin/
  - https://172.25.90.20/cluster-admin/

[root@awr9-ns8 ~]# 

I do get a login screen, but when trying to log in with the stated URLs / username / password (NS default), I get a “network error”…

As though redis is not correctly started or something…

:frowning:

My 2 cents
Andy

I’ve implemented the backoff in the PR, however I cannot reproduce the issue.

storage is on 5400 Disks

This bug most likely surfaces because of the higher latency of those disks, but I’ve only got machines running on NVMe :sweat_smile:
The backoff starts at 0.5 seconds and is multiplied by the attempt number after every failed attempt, up to 10 retries (27.5 seconds of waiting in total), which should give the service plenty of time to start up.
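
To make the arithmetic concrete: waits of 0.5 s, 1.0 s, ... up to 5.0 s over 10 retries add up to 27.5 s. The following is a rough Python sketch of such a linear backoff, not the code from the PR. The function name `call_with_linear_backoff` and the injectable `sleep` parameter are hypothetical, and it assumes the delay is the base time multiplied by the retry number.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_linear_backoff(func: Callable[[], T], retries: int = 10,
                             base_delay: float = 0.5,
                             sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func once, then retry up to `retries` times on any exception.

    Before retry number n the function waits base_delay * n seconds
    (assumed schedule: 0.5 s, 1.0 s, ... 5.0 s with the defaults).
    """
    try:
        return func()
    except Exception as exc:
        last_error = exc
    for attempt in range(1, retries + 1):
        sleep(base_delay * attempt)  # linear growth, not exponential
        try:
            return func()
        except Exception as exc:
            last_error = exc
    # Every attempt failed: surface the most recent error to the caller.
    raise last_error


# Usage with the request seen in the traceback (api_path is site-specific):
#   import requests
#   routers = call_with_linear_backoff(
#       lambda: requests.get(f"http://127.0.0.1/{api_path}/api/http/routers").json())
```

Passing `sleep` as a parameter keeps the schedule easy to check without actually waiting.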

You can try the fix by using the module image ghcr.io/nethserver/traefik:setting-backoff-on-init during installation.
The how-to is here, for anyone who stumbles upon this thread.

Hi @Tbaile

Thanks, I will have time to test this.
I do have a couple of Proxmox hosts to try this on; indeed, I am attempting more or less the same Debian 12 VM in four different environments.

Two are fairly well-powered Proxmox hosts, with more than enough RAM and fully NVMe-equipped storage (also ample!).
I only have an issue for one client: there, the needed storage is too big.

I will post feedback / logs here.

My 2 cents
Andy
