Thanks for the additional steps. I ran them and I’m getting a familiar error again: the migration fails to connect and points to WireGuard as the cause. WireGuard says it can’t resolve the endpoint, which is set to $COMPANY_NAME-main.$NETH8_IP.
I used the command from the web page to get the error logs:
[root@neth ~]# echo '{"action":"login","Host":"$NETH8_DOMAIN","User":"...","Password":"...","TLSVerify":"disabled"}' | /usr/bin/setsid /usr/bin/sudo /usr/libexec/nethserver/api/nethserver-ns8-migration/connection/update | jq
{
"steps": 2,
"pid": 16527,
"args": "",
"event": "nethserver-ns8-migration-save"
}
{
"step": 1,
"pid": 16527,
"action": "S05generic_template_expand",
"event": "nethserver-ns8-migration-save",
"state": "running"
}
{
"progress": "0.50",
"time": "0.108804",
"exit": 0,
"event": "nethserver-ns8-migration-save",
"state": "done",
"step": 1,
"pid": 16527,
"action": "S05generic_template_expand"
}
{
"step": 2,
"pid": 16527,
"action": "S90adjust-services",
"event": "nethserver-ns8-migration-save",
"state": "running"
}
{
"progress": "1.00",
"time": "0.84124",
"exit": 256,
"event": "nethserver-ns8-migration-save",
"state": "done",
"step": 2,
"pid": 16527,
"action": "S90adjust-services"
}
{
"pid": 16527,
"status": "failed",
"event": "nethserver-ns8-migration-save"
}
Traceback (most recent call last):
File "/usr/sbin/ns8-join", line 152, in <module>
subprocess.run(['/sbin/e-smith/signal-event', '-j', 'nethserver-ns8-migration-save'], check=True)
File "/usr/lib64/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/sbin/e-smith/signal-event', '-j', 'nethserver-ns8-migration-save']' returned non-zero exit status 1.
{
"id": "1697654108",
"type": "CommandFailed",
"message": "See /var/log/messages"
}
Looking at /var/log/messages, it appears the agent tried to reach the other node over WireGuard even though the WireGuard service failed to start:
[root@neth ~]# tail /var/log/messages
Oct 18 11:35:08 neth systemd: wg-quick@wg0.service failed.
Oct 18 11:35:08 neth esmith::event[16527]: Job for wg-quick@wg0.service failed because the control process exited with error code. See "systemctl status wg-quick@wg0.service" and "journalctl -xe" for details.
Oct 18 11:35:08 neth esmith::event[16527]: [WARNING] restart service wg-quick@wg0 failed!
Oct 18 11:35:08 neth systemd: Reloading.
Oct 18 11:35:08 neth esmith::event[16527]: [INFO] service httpd reload
Oct 18 11:35:08 neth systemd: Reloading The Apache HTTP Server.
Oct 18 11:35:08 neth systemd: Reloaded The Apache HTTP Server.
Oct 18 11:35:08 neth esmith::event[16527]: Action: /etc/e-smith/events/actions/adjust-services FAILED: 1 [0.84124]
Oct 18 11:35:08 neth esmith::event[16527]: Event: nethserver-ns8-migration-save FAILED
Oct 18 11:35:27 neth agent: Task queue pop error: dial tcp 10.5.4.1:6379: i/o timeout
The status of that service shows it was given an endpoint name it can’t resolve. Where is “-main” coming from? I don’t see it in the node name on the Neth8 side.
[root@neth ~]# systemctl status wg-quick@wg0.service
● wg-quick@wg0.service - WireGuard via wg-quick(8) for wg0
Loaded: loaded (/usr/lib/systemd/system/wg-quick@.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2023-10-18 11:35:08 MST; 3min 34s ago
Docs: man:wg-quick(8)
man:wg(8)
https://www.wireguard.com/
https://www.wireguard.com/quickstart/
https://git.zx2c4.com/wireguard-tools/about/src/man/wg-quick.8
https://git.zx2c4.com/wireguard-tools/about/src/man/wg.8
Main PID: 16638 (code=exited, status=1/FAILURE)
Oct 18 11:35:08 $NETH8_DOMAIN systemd[1]: Starting WireGuard via wg-quick(8) for wg0...
Oct 18 11:35:08 $NETH8_DOMAIN wg-quick[16638]: [#] ip link add wg0 type wireguard
Oct 18 11:35:08 $NETH8_DOMAIN wg-quick[16638]: [#] wg setconf wg0 /dev/fd/63
Oct 18 11:35:08 $NETH8_DOMAIN wg-quick[16638]: Name or service not known: `$COMPANY_NAME-main.$NETH8_IP:55820'
Oct 18 11:35:08 $NETH8_DOMAIN wg-quick[16638]: Configuration parsing error
Oct 18 11:35:08 $NETH8_DOMAIN wg-quick[16638]: [#] ip link delete dev wg0
Oct 18 11:35:08 $NETH8_DOMAIN systemd[1]: wg-quick@wg0.service: main process exited, code=exited, status=1/FAILURE
Oct 18 11:35:08 $NETH8_DOMAIN systemd[1]: Failed to start WireGuard via wg-quick(8) for wg0.
Oct 18 11:35:08 $NETH8_DOMAIN systemd[1]: Unit wg-quick@wg0.service entered failed state.
Oct 18 11:35:08 $NETH8_DOMAIN systemd[1]: wg-quick@wg0.service failed.
Here are the relevant configs:
[root@neth ~]# config show ns8
ns8=configuration
Host=$NETH8_DOMAIN
LeaderIpAddress=10.5.4.1
Password=...
TLSVerify=disabled
User=...
[root@neth ~]# config show wg-quick@wg0
wg-quick@wg0=service
Address=10.5.4.29
RemoteEndpoint=$COMPANY_NAME-main.$NETH8_IP:55820
RemoteKey=...
RemoteNetwork=10.5.4.0/24
SecretKey=...
status=enabled
[root@neth ~]# config show agent
agent=service
status=enabled
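In case it helps anyone reproduce this, here is a rough sketch of how the RemoteEndpoint value could be checked from the shell. The function names are my own (hypothetical), the example endpoint uses a documentation address rather than my real values, and bracketed IPv6 endpoints are not handled. It only splits the host:port string, asks the system resolver through getent, and flags hosts that look like a name with an IPv4 address appended.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for sanity-checking a WireGuard endpoint string.
# Assumes the endpoint has the form host:port (no bracketed IPv6 literal).

# Print the host part: everything before the last colon.
endpoint_host() { printf '%s\n' "${1%:*}"; }

# Print the port part: everything after the last colon.
endpoint_port() { printf '%s\n' "${1##*:}"; }

# Return 0 if the host part resolves through the system resolver (NSS).
endpoint_resolves() {
    getent hosts "$(endpoint_host "$1")" >/dev/null
}

# Return 0 if the host looks like "<name>.<IPv4>", e.g. example-main.192.0.2.10,
# which is the shape of the value that fails in the log above.
endpoint_looks_like_name_dot_ip() {
    local host re
    host="$(endpoint_host "$1")"
    re='^[A-Za-z][A-Za-z0-9-]*\.([0-9]{1,3}\.){3}[0-9]{1,3}$'
    [[ "$host" =~ $re ]]
}

# Example with a made-up endpoint (192.0.2.0/24 is reserved for documentation):
#   endpoint_looks_like_name_dot_ip 'example-main.192.0.2.10:55820' && echo "suspicious endpoint"
```

On my system something like `endpoint_resolves "$(config getprop wg-quick@wg0 RemoteEndpoint)" || echo "endpoint does not resolve"` should report the failure, assuming `config getprop` returns the same value shown above.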