Cannot join node to cluster

alex1 · August 9, 2023, 11:47am

Hello,
I have 2 VirtualBox VMs, one leader and one worker, both running a clean Debian 12 install.
After joining, the worker node immediately goes offline in the admin panel. This is what I found in syslog:

2023-08-09T11:05:37.867897+02:00 NodeDebian agent@cluster[5337]: Traceback (most recent call last):
2023-08-09T11:05:37.868038+02:00 NodeDebian agent@cluster[5337]: File “/var/lib/nethserver/cluster/actions/join-node/30start_replication”, line 64, in
2023-08-09T11:05:37.868110+02:00 NodeDebian agent@cluster[5337]: cluster.vpn.initialize_wgconf(ip_address, listen_port, peer={
2023-08-09T11:05:37.868162+02:00 NodeDebian agent@cluster[5337]: File “/usr/local/agent/pypkg/cluster/vpn.py”, line 36, in initialize_wgconf
2023-08-09T11:05:37.868198+02:00 NodeDebian agent@cluster[5337]: peer_ep_address = socket.getaddrinfo(peer_hostname, peer_port, proto=socket.IPPROTO_UDP)[0][4][0]
2023-08-09T11:05:37.868242+02:00 NodeDebian agent@cluster[5337]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-08-09T11:05:37.868281+02:00 NodeDebian agent@cluster[5337]: File “/usr/lib/python3.11/socket.py”, line 962, in getaddrinfo
2023-08-09T11:05:37.869672+02:00 NodeDebian agent@cluster[5337]: for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
2023-08-09T11:05:37.869738+02:00 NodeDebian agent@cluster[5337]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-08-09T11:05:37.869780+02:00 NodeDebian agent@cluster[5337]: socket.gaierror: [Errno -2] Name or service not known
2023-08-09T11:05:37.936889+02:00 NodeDebian agent@cluster[5337]: task/cluster/51a774be-c186-4194-9592-794a3dba2a10: action “join-node” status is “aborted” (1) at step 30start_replication

Removing “proto=” from the following line of /usr/local/agent/pypkg/cluster/vpn.py made the error go away:

peer_ep_address = socket.getaddrinfo(peer_hostname, peer_port, proto=socket.IPPROTO_UDP)[0][4][0]
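
For anyone hitting the same traceback: the failing lookup can be reproduced outside the agent with a few lines of Python. This is only a sketch. `resolve_endpoint` is a hypothetical helper and not part of NS8, and the host name and port in the usage note are placeholders.

```python
import socket


def resolve_endpoint(hostname, port):
    """Return the first address getaddrinfo() finds, or None if the lookup fails."""
    try:
        # Same call shape as the vpn.py line shown in the traceback.
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_UDP)
    except socket.gaierror as exc:
        # "[Errno -2] Name or service not known" ends up here.
        print(f"cannot resolve {hostname!r}: {exc}")
        return None
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return infos[0][4][0]
```

Run on the worker, something like `resolve_endpoint("leader.example.org", 51820)` returning `None` would suggest that the worker cannot resolve the leader's name at all, whatever the arguments passed to `getaddrinfo`.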

When I tried to join again, it unfortunately failed to connect.
Again from syslog:

2023-08-09T12:42:39.892910+02:00 NodeDebian agent@cluster[488]: Leader response is successful: the new node ID is node/8!
…
2023-08-09T12:42:41.312262+02:00 NodeDebian firewalld[560]: ERROR: NAME_CONFLICT: new_service(): ‘ns-wireguard’
2023-08-09T12:42:41.318858+02:00 NodeDebian agent@cluster[488]: Error: NAME_CONFLICT: new_service(): ‘ns-wireguard’
2023-08-09T12:42:41.374372+02:00 NodeDebian agent@cluster[488]: task/cluster/193597fc-4035-4956-8d82-481a8cff4143: action “join-node” status is “aborted” (26) at step 20wgboot

I should mention that for the second try I uninstalled NS8 and installed it again. Strangely, I don’t see any VPN connection on the worker machine. Could WireGuard be the problem?

davidep · August 9, 2023, 12:20pm

The uninstall procedure could have left the ns-wireguard service definition behind. Run it again and make sure the definition is gone, using the firewall-cmd command.

I cannot give you the exact command options right now, but --help and the manual page should give you an idea.
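
One possible way to do that cleanup, wrapped in Python so it can be scripted. It is a sketch only: it assumes the leftover is a permanent firewalld service named `ns-wireguard` (the name from the error message) and relies on the `--get-services`, `--delete-service` and `--reload` options of `firewall-cmd`. The function names are made up for illustration. Check the manual page before running it as root.

```python
import subprocess


def service_is_defined(get_services_output, name="ns-wireguard"):
    """firewall-cmd --get-services prints a space-separated list of service names."""
    return name in get_services_output.split()


def cleanup_commands(name="ns-wireguard"):
    """Commands that delete a permanent service definition and reload firewalld."""
    return [
        ["firewall-cmd", "--permanent", f"--delete-service={name}"],
        ["firewall-cmd", "--reload"],
    ]


def remove_leftover_service(name="ns-wireguard"):
    """Delete the service only if firewalld still knows about it (needs root)."""
    listed = subprocess.run(
        ["firewall-cmd", "--permanent", "--get-services"],
        capture_output=True, text=True, check=True,
    ).stdout
    if not service_is_defined(listed, name):
        return False  # nothing left behind, nothing to do
    for cmd in cleanup_commands(name):
        subprocess.run(cmd, check=True)
    return True
```

If `remove_leftover_service()` returns `False`, the definition was already gone and the NAME_CONFLICT probably has another cause.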

davidep · August 9, 2023, 12:30pm

What is the leader node's VPN endpoint? You can check it by running this command on the leader:

redis-cli hget node/1/vpn endpoint

The host name or IP address must be reachable from the worker node.

Alternative command

api-cli run get-cluster-status | jq

alex1 · August 10, 2023, 9:17am

Hi,

thanks for the quick response. The cluster now works fine. It turned out that the problem was name resolution.
Adding the hosts to /etc/hosts solved the problem.

My sincere thanks
Alex
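
For later readers, a rough illustration of the /etc/hosts fix: a Python sketch that checks whether a hosts file already maps a name and formats the line to add if it does not. The address `192.0.2.10` and the name `leader.example.org` are placeholders, not values from this cluster, and the helper names are invented for the example.

```python
def hosts_has_name(hosts_text, hostname):
    """True if any non-comment line of a hosts file lists the given name."""
    wanted = hostname.lower()
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into "address name [alias ...]".
        fields = line.split("#", 1)[0].lower().split()
        if wanted in fields[1:]:
            return True
    return False


def hosts_line(ip_address, hostname):
    """Format a single hosts entry: address first, then the name."""
    return f"{ip_address} {hostname}"


if __name__ == "__main__":
    with open("/etc/hosts", encoding="utf-8") as fh:
        current = fh.read()
    if not hosts_has_name(current, "leader.example.org"):
        # Placeholder values; append the printed line to /etc/hosts as root.
        print(hosts_line("192.0.2.10", "leader.example.org"))
```

The entry has to exist on the node that performs the lookup, which in this thread is the worker resolving the leader's name.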