NS8 migration error due to locally set "flag"

NethServer Version: 7.9.2009
Module: NS8 migration

Hi @davidep

This is my second attempt to migrate a current NethServer 7.9 installation to a new NS8.
The first attempt was made a week ago, but not finished.
The target NS8 is a fresh install of RC1.

However:

Any attempt to connect to the NS8 cluster fails.
Internal DNS resolution is fully working.
A second attempt produces an error message saying:

This is a leftover from the attempt at migration last week

The problem seems to stem from a local flag set last week:

subprocess.run(['/sbin/e-smith/signal-event', '-j', 'nethserver-ns8-migration-save'], check=True)
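As an aside, the call above uses check=True, so any non-zero exit surfaces only as a bare CalledProcessError with no hint about what actually failed. A minimal sketch (not the actual ns8-join code) of how such a call could capture and report the event's stderr instead:

```python
import subprocess

def run_event(cmd):
    """Run a command, raising a readable error that includes the
    command's stderr instead of a bare CalledProcessError."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(
            f"{' '.join(cmd)} failed (exit {proc.returncode}): {proc.stderr.strip()}"
        )
    return proc.stdout
```

With this, the failing event's own error output would end up in the API response instead of only in /var/log/messages.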

Meanwhile, the NS8 cluster itself (freshly installed on Debian 12 according to the instructions) shows no AD domain:


Output of the error from the command line:

[root@dbzg-nethserver ~]#  echo '{"action":"login","Host":"dbzg-ns8.domainname.com","User":"admin","Password":"XXXXX","TLSVerify":"disabled"}' | /usr/bin/setsid /usr/bin/sudo /usr/libexec/nethserver/api/nethserver-ns8-migration/connection/update | jq
{
  "steps": 2,
  "pid": 7249,
  "args": "",
  "event": "nethserver-ns8-migration-save"
}
{
  "step": 1,
  "pid": 7249,
  "action": "S05generic_template_expand",
  "event": "nethserver-ns8-migration-save",
  "state": "running"
}
{
  "progress": "0.50",
  "time": "0.086752",
  "exit": 0,
  "event": "nethserver-ns8-migration-save",
  "state": "done",
  "step": 1,
  "pid": 7249,
  "action": "S05generic_template_expand"
}
{
  "step": 2,
  "pid": 7249,
  "action": "S90adjust-services",
  "event": "nethserver-ns8-migration-save",
  "state": "running"
}
{
  "progress": "1.00",
  "time": "0.784994",
  "exit": 256,
  "event": "nethserver-ns8-migration-save",
  "state": "done",
  "step": 2,
  "pid": 7249,
  "action": "S90adjust-services"
}
{
  "pid": 7249,
  "status": "failed",
  "event": "nethserver-ns8-migration-save"
}
Traceback (most recent call last):
  File "/usr/sbin/ns8-join", line 151, in <module>
    subprocess.run(['/sbin/e-smith/signal-event', '-j', 'nethserver-ns8-migration-save'], check=True)
  File "/usr/lib64/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/sbin/e-smith/signal-event', '-j', 'nethserver-ns8-migration-save']' returned non-zero exit status 1.
{
  "id": "1700872873",
  "type": "CommandFailed",
  "message": "See /var/log/messages"
}
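A side note on the numbers above: the event steps report "exit": 256, while the traceback says "non-zero exit status 1". These agree: 256 is a raw wait(2)-style status, with the real exit code stored in the high byte, which Python's os module can decode:

```python
import os

# The event output reports "exit": 256, a raw wait(2) status:
# the real exit code lives in the high byte (status >> 8).
raw_status = 256
exit_code = os.WEXITSTATUS(raw_status)  # equivalent to raw_status >> 8 here
print(exit_code)  # → 1
```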


Note:

Bad programming: instead of actually verifying the status during a connection, a “locally set flag” is used, possibly left over from a migration attempt with a faulty connection.

Actually verifying the status would give a correct result and allow the migration, and it would only take a few seconds longer.
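To illustrate the point, "actually verifying" could be as simple as probing the NS8 host before trusting any stored flag. A minimal sketch under assumptions (the function name and default port are illustrative, not the migration tool's actual API):

```python
import socket

def ns8_reachable(host, port=443, timeout=5.0):
    """Return True if a TCP connection to the NS8 host succeeds,
    rather than trusting a previously stored flag."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```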



Any suggestions on how this error can be corrected?

My 2 cents
Andy

1 Like

I hit a similar issue recently. Since the error message is misleading, try a page reload; you should then be able to proceed.

I’m sure it can be fixed, thanks for the report!

1 Like

Hi @davidep

I’m still getting this error:

And the output from the console is as follows:

[root@dbzg-nethserver ~]#  echo '{"action":"login","Host":"dbzg-ns8.kardiopraxis-grafenau.ch","User":"admin","Password":"XXXXX","TLSVerify":"disabled"}' | /usr/bin/setsid /usr/bin/sudo /usr/libexec/nethserver/api/nethserver-ns8-migration/connection/update | jq
{
  "steps": 2,
  "pid": 30829,
  "args": "",
  "event": "nethserver-ns8-migration-save"
}
{
  "step": 1,
  "pid": 30829,
  "action": "S05generic_template_expand",
  "event": "nethserver-ns8-migration-save",
  "state": "running"
}
{
  "progress": "0.50",
  "time": "0.079562",
  "exit": 256,
  "event": "nethserver-ns8-migration-save",
  "state": "done",
  "step": 1,
  "pid": 30829,
  "action": "S05generic_template_expand"
}
{
  "step": 2,
  "pid": 30829,
  "action": "S90adjust-services",
  "event": "nethserver-ns8-migration-save",
  "state": "running"
}
{
  "progress": "1.00",
  "time": "0.709119",
  "exit": 256,
  "event": "nethserver-ns8-migration-save",
  "state": "done",
  "step": 2,
  "pid": 30829,
  "action": "S90adjust-services"
}
{
  "pid": 30829,
  "status": "failed",
  "event": "nethserver-ns8-migration-save"
}
Traceback (most recent call last):
  File "/usr/sbin/ns8-join", line 151, in <module>
    subprocess.run(['/sbin/e-smith/signal-event', '-j', 'nethserver-ns8-migration-save'], check=True)
  File "/usr/lib64/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/sbin/e-smith/signal-event', '-j', 'nethserver-ns8-migration-save']' returned non-zero exit status 1.
{
  "id": "1700909643",
  "type": "CommandFailed",
  "message": "See /var/log/messages"
}
[root@dbzg-nethserver ~]#

Clearing the browser cache (Firefox) doesn’t really help.

Running the migration from a local Windows 10 PC using Edge, I also get the following error:

Name Resolution from this PC works:


To me, it seems this setting in NethServer is causing the error:

‘nethserver-ns8-migration-save’]’ returned non-zero exit status 1.

Somewhere on NS7, there is a leftover entry for the old NS8 server, causing this error.

NS8 does display connection attempts from NS7, but no errors.

How can this entry be removed?
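For reference, /var/lib/nethserver/db/configuration is a plain-text e-smith database with one record per line (key=type|prop1|value1|prop2|value2|...), so a leftover entry can be located by parsing it. A sketch under assumptions: the "ns8" key name and sample values are illustrative, not taken from the failing system:

```python
def parse_esmith_db(text):
    """Parse e-smith flat-db lines of the form
    key=type|prop1|value1|prop2|value2|... into a dict of dicts."""
    records = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, rest = line.partition("=")
        fields = rest.split("|")
        rec = {"type": fields[0]}
        props = fields[1:]
        for i in range(0, len(props) - 1, 2):
            rec[props[i]] = props[i + 1]
        records[key] = rec
    return records

# Hypothetical example record; scan for migration leftovers.
sample = "ns8=configuration|Host|dbzg-ns8.example.com|TLSVerify|disabled"
leftovers = {k: v for k, v in parse_esmith_db(sample).items() if "ns8" in k}
```

On a real system, the same scan could be run over the contents of the configuration db file to confirm whether such an entry survived the failed attempt.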

My 2 cents
Andy

1 Like

Any other information in /var/log/messages?

I cannot find anything that points to the origin of the error :frowning:

As an alternative, can you list the steps that reproduce the error?

Hi @davidep

I looked hard, even installed locate and ran updatedb, but even then I couldn’t find anything.
Each attempt created a wg0 interface, yes, but there was no configuration entry left, even after manually erasing entries in
/var/lib/nethserver/db/configuration


All to no avail. :frowning:

So, on Friday evening, after the client finished work, I rolled up my sleeves and restored the VM from BEFORE my first NS8 migration attempt (well knowing I would have to restore one week of data after the migration!). That trick worked!

I only migrated the AD and file server, as this client is not using mail locally. For doctors it’s kind of like in Italy, where companies and individuals can use a state-provided mail service with a guaranteed, checked and validated owner. This state system covers all health providers, insurances, hospitals and the like, and it’s mandatory.

The migration took its expected time and finished well!

After that came the post-migration tasks: correcting internal DNS on OPNsense (our firewall, DHCP and DNS server), restoring data from Proxmox PBS, and checking all the doctors’ devices that need to store data in the AD environment, like ultrasound, X-ray and so on. All worked.

Today I received a big thanks from my client for a punctual migration (no big delays like the Berlin airport!), with everything working as expected!

For any other client, issues would cost money, time, or a combination of both. In the health industry, issues CAN be life-threatening!

Proxmox with PBS saved the day, as did the migration from NS7 to NS8. I’m happy with both!

So for me a BIG relief: the client is happy, and a productive NS8 RC1 deserves applause for our devs!

My 2 cents
Andy

2 Likes

Thank you Andy,

:thinking: it’s a smart move to complete the migration, but we’ve lost any proof of the bug, right?

Maybe your client was on holiday and didn’t notice it :joy:

Hi @davidep

Not when it’s a doctor’s practice or a hospital, where IT errors can have life-threatening consequences.

For any other client it’s just time and costs, but health demands more responsibility!

I am aware not everyone sees it this way; just look at the monocultures in health institutions all over the world today!


Sounds like you as a dev have never had to deal with serious sysadmins like myself? :slight_smile:

You are aware of my old saying:

“Better a backup too many than one too little!”?

Of course I have a complete backup of the last state of the VM (with the latest data), the version which produced the AD domain issue when migrating.
I can transfer the VM to my home, restore it on my system, and provide you or the dev team with direct access (I have a good Internet connection) to troubleshoot the issue.

I can provide WebGUI and SSH access (I know both help a lot when troubleshooting as a dev!).
How’s that?

Tell me if you would like that; I can start migrating the VM tonight (the client has less outgoing Internet bandwidth than I do, but that will improve at the end of November, in three days, new contract signed, fed up with the national monopoly provider!). So I can provide the VM and access from Monday onwards.


My 2 cents
Andy

3 Likes


1 Like

First migration to NS8, ever :face_holding_back_tears:

And a fully productive system from day one !!!

Actual statement from my client later this morning:

“Es funktioniert bis jetzt alles einwandfrei!”
Translation:
“So far, everything is working perfectly!”

My Kudos to the Dev Team!

A proud long-term NethServer user and system integrator!

Many Thanks!
Andy

6 Likes

At least for the first screenshot, I think I found the origin. It’s just cosmetic, but it causes a lot of confusion and is really misleading.

If it’s just cosmetic, why does the migration tool then refuse to connect?

1 Like

Because then you try again and hit a validation error!

Just to clarify:

I hit a cosmetic error, but the migration does not move on, and a “flag” is set somewhere.
The next time, the error is no longer cosmetic, because that flag is set?

So HOW is one supposed to get past a “cosmetic” error???

My 2 cents
Andy

1 Like

On retry, the validation check finds an existing user domain in the cluster: that’s the “flag” that prevents you from going further. The flag is set on NS8, which is why you cannot clear it on the NS7 side.

Do not worry, I’m preparing a fix to handle it :slight_smile:

1 Like

Well, NS8 doesn’t “show” any cluster, even after refreshing the browser.
Even a reboot of NS8 doesn’t show any cluster.

?

1 Like

The other error originates from an attempt to check the status of the cluster: the check fails, and another misleading error is shown: “The NS8 cluster already has a user domain”.

A reboot of the node should restart the VPN, but if that does not happen, the same error is displayed again and again.

1 Like

Do I remember correctly that you were looking for a valid use case for a node reboot?

1 Like

A fix for the apparent blocking “flag” is now available! Please see the instructions here to test it:

1 Like