NS7 to NS8 migration tool fails to connect

The cluster wasn’t “working”. It’s a fresh install that I made by following the migration steps: NethServer 7 migration — NS8 documentation

Now that Beta 2 is out I reset the new server and gave migration another try. The NS7 still had the new NS8 node configured, so I clicked the button to disconnect from that node. I waited until it confirmed it had been disconnected and then I tried again.

I’m still getting an error, and not even getting as far as the VPN connection setup. Here’s the error:

Odd number of elements in hash assignment at /usr/share/perl5/vendor_perl/esmith/db.pm line 273.
{
  "steps": 2,
  "pid": 16509,
  "args": "",
  "event": "nethserver-ns8-migration-save"
}
{
  "step": 1,
  "pid": 16509,
  "action": "S05generic_template_expand",
  "event": "nethserver-ns8-migration-save",
  "state": "running"
}
{
  "progress": "0.50",
  "time": "0.108469",
  "exit": 0,
  "event": "nethserver-ns8-migration-save",
  "state": "done",
  "step": 1,
  "pid": 16509,
  "action": "S05generic_template_expand"
}
{
  "step": 2,
  "pid": 16509,
  "action": "S90adjust-services",
  "event": "nethserver-ns8-migration-save",
  "state": "running"
}
{
  "progress": "1.00",
  "time": "0.7759",
  "exit": 256,
  "event": "nethserver-ns8-migration-save",
  "state": "done",
  "step": 2,
  "pid": 16509,
  "action": "S90adjust-services"
}
{
  "pid": 16509,
  "status": "failed",
  "event": "nethserver-ns8-migration-save"
}
Traceback (most recent call last):
  File "/usr/sbin/ns8-join", line 152, in <module>
    subprocess.run(['/sbin/e-smith/signal-event', '-j', 'nethserver-ns8-migration-save'], check=True)
  File "/usr/lib64/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/sbin/e-smith/signal-event', '-j', 'nethserver-ns8-migration-save']' returned non-zero exit status 1.
{
  "id": "1696202961",
  "type": "CommandFailed",
  "message": "See /var/log/messages"
}

I’m guessing it has something to do with this function that is splitting out DB configuration and for some reason comes up with an incorrect number of arguments for a proper key/value pair:

sub _db_string_to_type_and_hash ($)
{
    my ($arg) = @_;
    return ('', ()) unless defined $arg;

    # The funky regex is to avoid escaped pipes.
    # If you specify a negative limit empty trailing fields are omitted.
    return split(/(?<!\\)\|/, $arg, -1);
}

Looking into this further, it appears the DB configuration is being saved in a way that this function doesn’t expect. Pulling it as JSON it makes sense:

{"props":{"admin":"admin","":"","disabled":"disabled","User":"admin","TLSVerify":"disabled","LeaderIpAddress":"10.5.4.1","Password":"$PASSWORD","Host":"$HOSTNAME","enabled":"enabled","$SOME_WEIRD_STRING":"TLSVerify"},"name":"ns8","type":"configuration"}

but pulling it as a raw value (which is being passed in to the function above) returns this garbled mess:

configuration|||$SOME_WEIRD_STRING|TLSVerify|Host|$HOST|LeaderIpAddress|10.5.4.1|Password|$PASSWORD|TLSVerify|disabled|User|admin|admin|admin|disabled|disabled|enabled|enabled

A quick regex test shows that the first | in configuration||| gets caught by the regex.
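
To make the pairing problem concrete, here is a minimal Python sketch that reuses the same pattern as the Perl function above (a pipe not preceded by a backslash). The record string, the password `pa|ss`, and the helper names `split_record` and `pair_up` are hypothetical and are not taken from esmith/db.pm. It only illustrates how one unescaped pipe inside a value can leave an odd number of fields, which is the kind of list that makes Perl warn about an odd number of elements in a hash assignment.

```python
import re

# Same idea as the Perl pattern: match a pipe that is NOT preceded by a backslash.
UNESCAPED_PIPE = re.compile(r"(?<!\\)\|")


def split_record(raw):
    """Split a raw record into (type, list of property fields)."""
    # re.split keeps trailing empty fields, like Perl's split with a negative limit.
    fields = UNESCAPED_PIPE.split(raw)
    return fields[0], fields[1:]


def pair_up(fields):
    """Pair fields as key/value and report whether one field is left over."""
    pairs = list(zip(fields[0::2], fields[1::2]))
    return pairs, len(fields) % 2 == 1


if __name__ == "__main__":
    # Hypothetical record: the password "pa|ss" was stored without escaping the pipe.
    broken = "configuration|Password|pa|ss|User|admin"
    # The same record with the pipe escaped by a backslash.
    escaped = "configuration|Password|pa\\|ss|User|admin"
    for raw in (broken, escaped):
        rec_type, fields = split_record(raw)
        pairs, leftover = pair_up(fields)
        print(rec_type, pairs, "leftover field" if leftover else "ok")
```

In the broken record the value is cut in two, so every later key and value shifts by one position and the last field has no partner. With the pipe escaped, the fields pair up cleanly.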

@giacomo or @davidep - I see both of you in GitHub, perhaps you can provide some insight? :slight_smile:

I’m too nervous to play around with this RegEx in my prod instance, since I’m not sure if adjusting it will break other things.

Do you have a | (pipe) character in the NS8 admin password? The e-smith DB does not support strings containing it, and the UI validation logic may not have protected the input data well enough.

:face_with_raised_eyebrow: is this a… known fact?

Yes, it is a limitation or bug of e-smith that has never been fixed…

Thanks for sharing.
Would I lose a bet that this particular detail never made it into the NethServer documentation?

No. Consider that user input must be validated, and free strings, like passwords, are not stored in the e-smith DB.
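
As an illustration of the validation mentioned here, a check along these lines could reject a value before it is written to the DB. This is only a sketch in Python: the function names and the choice of `|` as the single forbidden character are assumptions for the example, not the migration tool's actual validation code.

```python
def find_unsafe_chars(value, forbidden="|"):
    """Return the forbidden characters that appear in value, sorted."""
    return sorted(set(value) & set(forbidden))


def check_prop_value(name, value):
    """Raise ValueError if value contains a character the DB cannot store."""
    bad = find_unsafe_chars(value)
    if bad:
        raise ValueError(f"{name} contains unsupported character(s): {' '.join(bad)}")
    return value


if __name__ == "__main__":
    # Hypothetical inputs: the first passes, the second is rejected.
    for candidate in ("correct-horse", "pa|ss"):
        try:
            check_prop_value("Password", candidate)
            print("accepted")
        except ValueError as err:
            print("rejected:", err)
```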

I’d suggest adding this detail to the migration procedure documentation.

I did think of the “|” in my password before, so I changed the password I was using, but the problem persists. After looking closer at db get ns8 and db getjson ns8 I can see that the mangled information from the previous password is still there. I’m guessing the migration tool doesn’t wipe it clean and instead keeps updating the configuration object, which leaves it in a non-working state.
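
A quick way to spot leftovers like these is to compare the props in the JSON dump against the props a clean record is expected to have. The Python sketch below assumes the expected set is the five props shown by `config show ns8` on the working setup further down this thread. The helper name `unexpected_props` and the sample JSON are made up for illustration.

```python
import json

# Props that appear in a clean "config show ns8" output (assumption: this list is complete).
EXPECTED_PROPS = {"Host", "LeaderIpAddress", "Password", "TLSVerify", "User"}


def unexpected_props(record_json, expected=EXPECTED_PROPS):
    """Return the sorted prop names in a JSON record that are not expected."""
    record = json.loads(record_json)
    return sorted(key for key in record.get("props", {}) if key not in expected)


if __name__ == "__main__":
    # Shortened, hypothetical dump with the same shape as the JSON shown earlier.
    sample = (
        '{"props":{"admin":"admin","":"","User":"admin","Host":"ns8.example.org"},'
        '"name":"ns8","type":"configuration"}'
    )
    print(unexpected_props(sample))
```

Any name it reports (an empty key, or a key that is really a former value such as admin) points to a record that was written while the fields were shifted.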

I just ran db delete ns8 and confirmed the keys were gone. After trying it again with a password that doesn’t contain | it seems to work, but the migration tool is unable to open up.

I uninstalled the migration tool and then re-installed it, and now the configurations appear correct.

However, now upon trying to connect I see this error from wireguard:
Oct 02 09:57:54 $NETH7_URI wg-quick[4045]: Name or service not known: '$COMPANY_NAME-main.$NETH8_IP:55820'

Since uninstalling and re-installing the migration tool I did not type in $COMPANY_NAME-main in any dialog box, so I’m not sure where this string is coming from.

I manually edited /etc/wireguard/wg0.conf and removed this string so it just has the IP, but it gets overwritten again when trying to reconnect. I have a grep running now to try to find where this string is coming from, but I’m about out of time to debug this.

Is there anything I can do to get some more support on this? I was thinking of using my existing NS7 to help test NS8 beta - I’m not afraid of a few bugs and a little hassle; but this bug happening before I can even attempt the migration has me concerned. I have a second server primed and ready to go, but if this is unlikely to get resolved soon I’d rather shut it down and save some money until then.

Maybe that string is under the wg-quick@wg0 key

config show wg-quick@wg0

I don’t see it there either:

wg-quick@wg0=service
    Address=10.5.4.25
    RemoteEndpoint=$NETH8_IP:55820
    RemoteKey=...
    RemoteNetwork=10.5.4.0/24
    SecretKey=...
    status=enabled

One thing I do notice is that on the Neth8 instance the wireguard config shows:

[Interface]
Address = 10.5.4.1/32
ListenPort = 55820
PrivateKey = ...

[Peer]
PublicKey = ...
AllowedIPs = 10.5.4.12/32
PersistentKeepalive = 25

First, the mask in the Neth8 WireGuard config is set to /32 while the Neth7 config is set to /24. Also, the RemoteKey and SecretKey in the Neth7 config don’t seem to match the PrivateKey and PublicKey in the Neth8 config. Maybe they’re not supposed to?

This is an error!

Try repeating the above procedure. This time, however, before installing the package again, also delete the wg-quick@wg0 and agent keys.

config delete ns8
config delete wg-quick@wg0
config delete agent

It seems ok for me


Bug filed here

Thanks for the additional steps, I ran them and I’m getting a familiar error again. The migration fails to connect and points to Wireguard as the cause. Wireguard says it can’t locate the endpoint, and the endpoint is set as $COMPANY_NAME-main.$NETH8_IP

Using the command from the web page to get the error logs:

[root@neth ~]#  echo '{"action":"login","Host":"$NETH8_DOMAIN","User":"...","Password":"...","TLSVerify":"disabled"}' | /usr/bin/setsid /usr/bin/sudo /usr/libexec/nethserver/api/nethserver-ns8-migration/connection/update | jq
{
  "steps": 2,
  "pid": 16527,
  "args": "",
  "event": "nethserver-ns8-migration-save"
}
{
  "step": 1,
  "pid": 16527,
  "action": "S05generic_template_expand",
  "event": "nethserver-ns8-migration-save",
  "state": "running"
}
{
  "progress": "0.50",
  "time": "0.108804",
  "exit": 0,
  "event": "nethserver-ns8-migration-save",
  "state": "done",
  "step": 1,
  "pid": 16527,
  "action": "S05generic_template_expand"
}
{
  "step": 2,
  "pid": 16527,
  "action": "S90adjust-services",
  "event": "nethserver-ns8-migration-save",
  "state": "running"
}
{
  "progress": "1.00",
  "time": "0.84124",
  "exit": 256,
  "event": "nethserver-ns8-migration-save",
  "state": "done",
  "step": 2,
  "pid": 16527,
  "action": "S90adjust-services"
}
{
  "pid": 16527,
  "status": "failed",
  "event": "nethserver-ns8-migration-save"
}
Traceback (most recent call last):
  File "/usr/sbin/ns8-join", line 152, in <module>
    subprocess.run(['/sbin/e-smith/signal-event', '-j', 'nethserver-ns8-migration-save'], check=True)
  File "/usr/lib64/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/sbin/e-smith/signal-event', '-j', 'nethserver-ns8-migration-save']' returned non-zero exit status 1.
{
  "id": "1697654108",
  "type": "CommandFailed",
  "message": "See /var/log/messages"
}

Looking at /var/log/messages, it looks like it tried to connect to the other node via wireguard even though the wireguard connection failed:

[root@neth ~]# tail /var/log/messages
Oct 18 11:35:08 neth systemd: wg-quick@wg0.service failed.
Oct 18 11:35:08 neth esmith::event[16527]: Job for wg-quick@wg0.service failed because the control process exited with error code. See "systemctl status wg-quick@wg0.service" and "journalctl -xe" for details.
Oct 18 11:35:08 neth esmith::event[16527]: [WARNING] restart service wg-quick@wg0 failed!
Oct 18 11:35:08 neth systemd: Reloading.
Oct 18 11:35:08 neth esmith::event[16527]: [INFO] service httpd reload
Oct 18 11:35:08 neth systemd: Reloading The Apache HTTP Server.
Oct 18 11:35:08 neth systemd: Reloaded The Apache HTTP Server.
Oct 18 11:35:08 neth esmith::event[16527]: Action: /etc/e-smith/events/actions/adjust-services FAILED: 1 [0.84124]
Oct 18 11:35:08 neth esmith::event[16527]: Event: nethserver-ns8-migration-save FAILED
Oct 18 11:35:27 neth agent: Task queue pop error: dial tcp 10.5.4.1:6379: i/o timeout

Looking at that service we can see it dialed the wrong address. Where is “-main” coming from? I don’t see this in the Neth8 side node name.

[root@neth ~]# systemctl status wg-quick@wg0.service
● wg-quick@wg0.service - WireGuard via wg-quick(8) for wg0
   Loaded: loaded (/usr/lib/systemd/system/wg-quick@.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2023-10-18 11:35:08 MST; 3min 34s ago
     Docs: man:wg-quick(8)
           man:wg(8)
           https://www.wireguard.com/
           https://www.wireguard.com/quickstart/
           https://git.zx2c4.com/wireguard-tools/about/src/man/wg-quick.8
           https://git.zx2c4.com/wireguard-tools/about/src/man/wg.8
 Main PID: 16638 (code=exited, status=1/FAILURE)

Oct 18 11:35:08 $NETH8_DOMAIN systemd[1]: Starting WireGuard via wg-quick(8) for wg0...
Oct 18 11:35:08 $NETH8_DOMAIN wg-quick[16638]: [#] ip link add wg0 type wireguard
Oct 18 11:35:08 $NETH8_DOMAIN wg-quick[16638]: [#] wg setconf wg0 /dev/fd/63
Oct 18 11:35:08 $NETH8_DOMAIN wg-quick[16638]: Name or service not known: `$COMPANY_NAME-main.$NETH8_IP:55820'
Oct 18 11:35:08 $NETH8_DOMAIN wg-quick[16638]: Configuration parsing error
Oct 18 11:35:08 $NETH8_DOMAIN wg-quick[16638]: [#] ip link delete dev wg0
Oct 18 11:35:08 $NETH8_DOMAIN systemd[1]: wg-quick@wg0.service: main process exited, code=exited, status=1/FAILURE
Oct 18 11:35:08 $NETH8_DOMAIN systemd[1]: Failed to start WireGuard via wg-quick(8) for wg0.
Oct 18 11:35:08 $NETH8_DOMAIN systemd[1]: Unit wg-quick@wg0.service entered failed state.
Oct 18 11:35:08 $NETH8_DOMAIN systemd[1]: wg-quick@wg0.service failed.
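
The rejected endpoint is a prefix glued onto an IP address, so it is no longer an IP literal and gets treated as a host name that then fails to resolve. A small Python sketch can tell the two cases apart before the value reaches wg-quick. The function names and sample addresses (a documentation-range IP and the made-up prefix acme-main) are assumptions for illustration, and IPv6 endpoints in brackets are not handled.

```python
import ipaddress


def split_endpoint(endpoint):
    """Split 'host:port' into (host, port). Bracketed IPv6 is not handled."""
    host, sep, port = endpoint.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {endpoint!r}")
    return host, int(port)


def is_ip_literal(host):
    """Return True if host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def describe_endpoint(endpoint):
    """Classify an endpoint as an IP literal or a name that needs DNS resolution."""
    host, port = split_endpoint(endpoint)
    kind = "ip" if is_ip_literal(host) else "name"
    return {"host": host, "port": port, "kind": kind}


if __name__ == "__main__":
    # Hypothetical values mirroring the shape of the endpoints in this thread.
    for endpoint in ("203.0.113.10:55820", "acme-main.203.0.113.10:55820"):
        print(describe_endpoint(endpoint))
```

An endpoint classified as a name only works if that name resolves on the NS7 host, which is where the prefixed value breaks.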

Configs:

[root@neth ~]# config show ns8
ns8=configuration
    Host=$NETH8_DOMAIN
    LeaderIpAddress=10.5.4.1
    Password=...
    TLSVerify=disabled
    User=...

[root@neth ~]# config show wg-quick@wg0
wg-quick@wg0=service
    Address=10.5.4.29
    RemoteEndpoint=$COMPANY_NAME-main.$NETH8_IP:55820
    RemoteKey=...
    RemoteNetwork=10.5.4.0/24
    SecretKey=...
    status=enabled

[root@neth ~]# config show agent
agent=service
    status=enabled

Did you replace $NETH8_DOMAIN to hide the real host name in your post, or is this the literal log line?

In the second case you need to fix both the DB and the host name.

Yes, I replaced the sensitive information. I’ll send you the full file directly.

After a lot of help from @davidep I got my NS7 node to connect to the new NS8 cluster.

It looks like putting a “|” in the password really messed up the configuration for migration. However, resetting the NS7 config isn’t enough to fix it; the incorrect connection attempt from NS7 also left my NS8 node in a non-connectable state.

Here’s what I did that worked:

  • NS7 - Uninstalled the migration tool
  • NS7 - Ran the config clean commands mentioned by Davide:
      config delete ns8
      config delete wg-quick@wg0
      config delete agent
  • NS7 - Re-installed the migration tool
  • NS8 - Rebuilt the server and re-installed NS8
  • NS7 - Connected to NS8 using the IP address in place of the FQDN

When trying to connect to NS8 using a domain address the connection failed, even though my NS7 server is correctly resolving the IP address - maybe wireguard isn’t?

Hope this helps someone else!
