NS7 to NS8 migration tool fails to connect

djx · October 2, 2023, 5:11pm

I had already suspected the “|” in my password, so I changed the password I was using, but the problem persists. After looking more closely at db get ns8 and db getjson ns8, I can see that the mangled information from the previous password is still there. I’m guessing the migration tool doesn’t wipe it clean and instead keeps updating the configuration object, which leaves it in a non-working state.

I just ran db delete ns8 and confirmed the keys were gone. After trying it again with a password that doesn’t contain | it seems to work, but the migration tool is unable to open up.

I uninstalled the migration tool and then re-installed it, and now the configurations appear correct.

However, now upon trying to connect I see this error from wireguard:
Oct 02 09:57:54 $NETH7_URI wg-quick[4045]: Name or service not known: '$COMPANY_NAME-main.$NETH8_IP:55820'

Since uninstalling and re-installing the migration tool I did not type in $COMPANY_NAME-main in any dialog box, so I’m not sure where this string is coming from.

I manually edited /etc/wireguard/wg0.conf and removed this string so it just has the IP, but it gets overwritten again when trying to reconnect. I have a grep running now to try to find where this string is coming from, but I’m about out of time to debug this.
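The "Name or service not known" message is what the resolver reports when a name lookup fails, so wg-quick is treating the whole string before the port as a hostname. A minimal sketch of checking a wg0.conf endpoint for that problem, assuming the standard `Endpoint = host:port` syntax; the sample text and the `example-main.203.0.113.10` value are made-up placeholders, not the real configuration:

```python
import ipaddress

def find_endpoints(conf_text):
    """Return the values of all 'Endpoint = ...' lines in a wg-quick config."""
    endpoints = []
    for line in conf_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "endpoint":
            endpoints.append(value.strip())
    return endpoints

def split_endpoint(endpoint):
    """Split 'host:port' on the last colon and validate the port."""
    host, sep, port = endpoint.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"not a host:port endpoint: {endpoint!r}")
    return host, int(port)

def is_ip_literal(host):
    """True if host is a plain IPv4/IPv6 address (no DNS lookup needed)."""
    try:
        ipaddress.ip_address(host.strip("[]"))
    except ValueError:
        return False
    return True

# Hypothetical stand-in for the contents of /etc/wireguard/wg0.conf
SAMPLE = """\
[Peer]
PublicKey = ...
Endpoint = example-main.203.0.113.10:55820
"""

for endpoint in find_endpoints(SAMPLE):
    host, port = split_endpoint(endpoint)
    kind = "IP literal" if is_ip_literal(host) else "hostname (needs DNS)"
    print(f"{host} port {port}: {kind}")
```

With a prefix glued onto the IP, the host part is no longer an IP literal, so it has to go through DNS and fails there.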

djx · October 10, 2023, 1:05pm

Is there anything I can do to get some more support on this? I was thinking of using my existing NS7 to help test NS8 beta - I’m not afraid of a few bugs and a little hassle; but this bug happening before I can even attempt the migration has me concerned. I have a second server primed and ready to go, but if this is unlikely to get resolved soon I’d rather shut it down and save some money until then.

davidep · October 10, 2023, 3:51pm

Maybe that string is under the wg-quick@wg0 key

config show wg-quick@wg0

djx · October 18, 2023, 3:21am

I don’t see it there either:

wg-quick@wg0=service
    Address=10.5.4.25
    RemoteEndpoint=$NETH8_IP:55820
    RemoteKey=...
    RemoteNetwork=10.5.4.0/24
    SecretKey=...
    status=enabled

One thing I do notice is that on the Neth8 instance the wireguard config shows:

[Interface]
Address = 10.5.4.1/32
ListenPort = 55820
PrivateKey = ...

[Peer]
PublicKey = ...
AllowedIPs = 10.5.4.12/32
PersistentKeepalive = 25

Two things stand out. First, the mask in the Neth8 wireguard config is set to /32 while the Neth7 config is set to /24. Second, the RemoteKey and SecretKey in the Neth7 config don’t seem to match the PrivateKey and PublicKey in the Neth8 config - maybe they’re not supposed to?
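On the keys: in WireGuard a peer's PublicKey is derived from the other side's private key, so a private key on one host never appears verbatim on the other. Assuming SecretKey is the Neth7 private key and RemoteKey is the Neth8 public key (an assumption based on the property names), the four values would not be expected to match as text. The addresses can be compared directly, though. A minimal sketch using the values quoted above:

```python
import ipaddress

def peer_allows(address, allowed_ips):
    """True if address falls inside any network of a comma-separated AllowedIPs value."""
    addr = ipaddress.ip_address(address)
    networks = (ipaddress.ip_network(item.strip(), strict=False)
                for item in allowed_ips.split(","))
    return any(addr in net for net in networks)

# Values copied from the two configs in this post
ns7_address = "10.5.4.25"            # Address on the Neth7 side
ns8_allowed_ips = "10.5.4.12/32"     # AllowedIPs of the [Peer] on the Neth8 side
ns7_remote_network = "10.5.4.0/24"   # RemoteNetwork on the Neth7 side

print(peer_allows(ns7_address, ns8_allowed_ips))
print(peer_allows("10.5.4.1", ns7_remote_network))
```

The first check is the interesting one: the only peer listed on the Neth8 side allows 10.5.4.12, not the 10.5.4.25 currently assigned to Neth7, which would be consistent with that peer entry coming from an earlier join attempt.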

davidep · October 18, 2023, 6:55am

This is an error!

Try repeating the above procedure. This time, before installing the package again, also delete the wg-quick@wg0 and agent keys.

config delete ns8
config delete wg-quick@wg0
config delete agent

It seems ok for me

Bug filed here

djx · October 18, 2023, 8:30pm

Thanks for the additional steps, I ran them and I’m getting a familiar error again. The migration fails to connect and points to Wireguard as the cause. Wireguard says it can’t locate the endpoint, and the endpoint is set as $COMPANY_NAME-main.$NETH8_IP

Using the command from the web page to get the error logs:

[root@neth ~]#  echo '{"action":"login","Host":"$NETH8_DOMAIN","User":"...","Password":"...","TLSVerify":"disabled"}' | /usr/bin/setsid /usr/bin/sudo /usr/libexec/nethserver/api/nethserver-ns8-migration/connection/update | jq
{
  "steps": 2,
  "pid": 16527,
  "args": "",
  "event": "nethserver-ns8-migration-save"
}
{
  "step": 1,
  "pid": 16527,
  "action": "S05generic_template_expand",
  "event": "nethserver-ns8-migration-save",
  "state": "running"
}
{
  "progress": "0.50",
  "time": "0.108804",
  "exit": 0,
  "event": "nethserver-ns8-migration-save",
  "state": "done",
  "step": 1,
  "pid": 16527,
  "action": "S05generic_template_expand"
}
{
  "step": 2,
  "pid": 16527,
  "action": "S90adjust-services",
  "event": "nethserver-ns8-migration-save",
  "state": "running"
}
{
  "progress": "1.00",
  "time": "0.84124",
  "exit": 256,
  "event": "nethserver-ns8-migration-save",
  "state": "done",
  "step": 2,
  "pid": 16527,
  "action": "S90adjust-services"
}
{
  "pid": 16527,
  "status": "failed",
  "event": "nethserver-ns8-migration-save"
}
Traceback (most recent call last):
  File "/usr/sbin/ns8-join", line 152, in <module>
    subprocess.run(['/sbin/e-smith/signal-event', '-j', 'nethserver-ns8-migration-save'], check=True)
  File "/usr/lib64/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/sbin/e-smith/signal-event', '-j', 'nethserver-ns8-migration-save']' returned non-zero exit status 1.
{
  "id": "1697654108",
  "type": "CommandFailed",
  "message": "See /var/log/messages"
}
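For output like the above, the failing step can be picked out mechanically. A minimal sketch, assuming the output was saved as text; it walks the concatenated JSON objects, skips anything that is not JSON (such as the traceback), and lists the steps that finished with a non-zero exit. The `SAMPLE` string is an abbreviated copy of the output above:

```python
import json

def iter_json_objects(text):
    """Yield each JSON object found in text, skipping non-JSON lines in between."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            return
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1   # not a JSON object here, keep scanning
            continue
        yield obj
        pos = end

def failed_steps(text):
    """Return (step, action, exit) for every finished step with a non-zero exit."""
    return [(obj["step"], obj["action"], obj["exit"])
            for obj in iter_json_objects(text)
            if obj.get("state") == "done" and obj.get("exit", 0) != 0]

SAMPLE = """
{"step": 1, "action": "S05generic_template_expand", "state": "done", "exit": 0}
{"step": 2, "action": "S90adjust-services", "state": "running"}
{"step": 2, "action": "S90adjust-services", "state": "done", "exit": 256}
Traceback (most recent call last):
{"id": "1697654108", "type": "CommandFailed", "message": "See /var/log/messages"}
"""

for step, action, code in failed_steps(SAMPLE):
    # 256 looks like a raw wait status; shifted right by 8 bits it gives 1
    print(f"step {step} ({action}) failed: exit={code}, status={code >> 8}")
```

Here only S90adjust-services is reported; the template expansion step completed with exit 0, so the failure is in restarting services rather than in generating the config files.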

Looking at /var/log/messages, it looks like the agent still tried to reach the other node over wireguard (10.5.4.1:6379) even though the wg-quick@wg0 service failed to start:

[root@neth ~]# tail /var/log/messages
Oct 18 11:35:08 neth systemd: wg-quick@wg0.service failed.
Oct 18 11:35:08 neth esmith::event[16527]: Job for wg-quick@wg0.service failed because the control process exited with error code. See "systemctl status wg-quick@wg0.service" and "journalctl -xe" for details.
Oct 18 11:35:08 neth esmith::event[16527]: [WARNING] restart service wg-quick@wg0 failed!
Oct 18 11:35:08 neth systemd: Reloading.
Oct 18 11:35:08 neth esmith::event[16527]: [INFO] service httpd reload
Oct 18 11:35:08 neth systemd: Reloading The Apache HTTP Server.
Oct 18 11:35:08 neth systemd: Reloaded The Apache HTTP Server.
Oct 18 11:35:08 neth esmith::event[16527]: Action: /etc/e-smith/events/actions/adjust-services FAILED: 1 [0.84124]
Oct 18 11:35:08 neth esmith::event[16527]: Event: nethserver-ns8-migration-save FAILED
Oct 18 11:35:27 neth agent: Task queue pop error: dial tcp 10.5.4.1:6379: i/o timeout

Looking at that service we can see it dialed the wrong address. Where is “-main” coming from? I don’t see this in the Neth8 side node name.

[root@neth ~]# systemctl status wg-quick@wg0.service
● wg-quick@wg0.service - WireGuard via wg-quick(8) for wg0
   Loaded: loaded (/usr/lib/systemd/system/wg-quick@.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2023-10-18 11:35:08 MST; 3min 34s ago
     Docs: man:wg-quick(8)
           man:wg(8)
           https://www.wireguard.com/
           https://www.wireguard.com/quickstart/
           https://git.zx2c4.com/wireguard-tools/about/src/man/wg-quick.8
           https://git.zx2c4.com/wireguard-tools/about/src/man/wg.8
 Main PID: 16638 (code=exited, status=1/FAILURE)

Oct 18 11:35:08 $NETH8_DOMAIN systemd[1]: Starting WireGuard via wg-quick(8) for wg0...
Oct 18 11:35:08 $NETH8_DOMAIN wg-quick[16638]: [#] ip link add wg0 type wireguard
Oct 18 11:35:08 $NETH8_DOMAIN wg-quick[16638]: [#] wg setconf wg0 /dev/fd/63
Oct 18 11:35:08 $NETH8_DOMAIN wg-quick[16638]: Name or service not known: `$COMPANY_NAME-main.$NETH8_IP:55820'
Oct 18 11:35:08 $NETH8_DOMAIN wg-quick[16638]: Configuration parsing error
Oct 18 11:35:08 $NETH8_DOMAIN wg-quick[16638]: [#] ip link delete dev wg0
Oct 18 11:35:08 $NETH8_DOMAIN systemd[1]: wg-quick@wg0.service: main process exited, code=exited, status=1/FAILURE
Oct 18 11:35:08 $NETH8_DOMAIN systemd[1]: Failed to start WireGuard via wg-quick(8) for wg0.
Oct 18 11:35:08 $NETH8_DOMAIN systemd[1]: Unit wg-quick@wg0.service entered failed state.
Oct 18 11:35:08 $NETH8_DOMAIN systemd[1]: wg-quick@wg0.service failed.

Configs:

[root@neth ~]# config show ns8
ns8=configuration
    Host=$NETH8_DOMAIN
    LeaderIpAddress=10.5.4.1
    Password=...
    TLSVerify=disabled
    User=...

[root@neth ~]# config show wg-quick@wg0
wg-quick@wg0=service
    Address=10.5.4.29
    RemoteEndpoint=$COMPANY_NAME-main.$NETH8_IP:55820
    RemoteKey=...
    RemoteNetwork=10.5.4.0/24
    SecretKey=...
    status=enabled

[root@neth ~]# config show agent
agent=service
    status=enabled

davidep · October 18, 2023, 8:46pm

Did you replace $NETH8_DOMAIN to hide the real host name in your post, or is this the literal log line?

In the second case you need to fix both the DB and the host name.

djx · October 18, 2023, 8:47pm

Yes, I replaced the sensitive information. I’ll send you the full file directly.

djx · October 20, 2023, 3:43pm

After a lot of help from @davidep I got my NS7 node to connect to the new NS8 cluster.

It looks like putting a “|” in the password really messed up the configuration for migration. However, resetting the NS7 config isn’t enough to fix it; the incorrect connection attempt from NS7 also left my NS8 node in a non-connectable state.

Here’s what I did that worked:

NS7 - Uninstalled Migration tool
NS7 - Ran the config clean commands mentioned by Davide:

config delete ns8
config delete wg-quick@wg0
config delete agent

NS7 - Re-installed Migration Tool
NS8 - Rebuilt the server and re-installed NS8
NS7 - Connected to NS8 using the IP address in place of the FQDN

When trying to connect to NS8 using a domain name the connection failed, even though my NS7 server correctly resolves that name to the IP address - maybe wireguard isn’t?
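One way to narrow this down is to resolve the name through the system resolver (getaddrinfo), which is also where the "Name or service not known" text comes from. A minimal sketch, to be run on the NS7 host; the host name passed in is whatever ends up in the endpoint, and the 127.0.0.1 value below is only a placeholder so the example runs anywhere:

```python
import socket

def resolve(host, port):
    """Resolve host through the system resolver; return the sorted unique addresses."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_DGRAM)
    except socket.gaierror as err:
        # A failed lookup typically reports "Name or service not known"
        print(f"{host}: {err}")
        return []
    return sorted({info[4][0] for info in infos})

# Placeholder value; substitute the NS8 host name and port being tested
print(resolve("127.0.0.1", 55820))
```

If the plain NS8 host name resolves here but the connection still fails, the next thing to compare is the exact string stored in RemoteEndpoint, since a mangled value would fail the lookup even when the real name resolves fine.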

Hope this helps someone else!
