Full Disaster Recovery Findings

Thought it was about time to try a full disaster recovery scenario, especially given the bad luck I’ve had applying updates in the past. So this is going to be a long post, mostly a big cut/paste of screen shots.

Downloaded and booted the latest Rocky image, selected Restore Cluster, and followed the instructions. Here are the details following the “Something went wrong” message:






I also have a copy of the full Task Trace if that would be useful.

Looking at the core apps, there are now 2 copies of loki:

And both appear to be running.

The selection for the nethforge repository was not retained:

Within the Trace I see:

Job for traefik.service failed because the control process exited with error code.
See “systemctl --user status traefik.service” and “journalctl --user -xeu traefik.service” for details

[screenshot: systemctl status and journalctl output]

The No Entries is the response to the journalctl command.

I’m not sure how many of these are minor issues that can be ignored (I can see at least one, maybe two), or which are portents of failure down the road if I were to continue using this system, which I’m not going to do other than to pull information should you need it.

***** Update *****

Just noticed that one of the failures was trying to add an instance of traefik2.

I’m guessing that, had that worked, it would then be like loki: two instances, both running. Is that really what was supposed to happen?

Cheers.


Regarding the loki and traefik issues, I had similar results. The issue is that only one instance of each can be present on a particular node. I ended up removing the default instances from the command line via the remove-module command.

Hope that helps.


Thank you for sharing this information.

It appears that the following issues occurred:

  • Mail encountered some non-fatal errors (related to the certificate and binding to the account provider). These can be resolved later in the app’s Settings page.
  • The Webserver experienced a fatal restore error (IndexError). /cc @stephdl
  • Traefik also failed to restore. The service startup is failing because ports 80 and 443 are already in use by the core Traefik instance. I believe Traefik should be excluded from the restore process, as certificates and routes are managed by the cluster backup. Installing a second Traefik instance is unnecessary.
  • On the other hand, a second instance of Loki is allowed, as documented here.
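The Traefik failure described above, a restored second instance trying to bind ports already held by the core instance, can be reproduced in miniature. This is only an illustrative sketch (loopback and an ephemeral port stand in for the core Traefik on 80/443), not NS8 code:

```python
import socket

# Stand-in for the core Traefik instance: bind and listen on a port.
core = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
core.bind(("127.0.0.1", 0))          # 0 = let the OS pick a free port
core.listen()
port = core.getsockname()[1]

# A restored second instance trying the same port fails immediately,
# which is why the second traefik service exits with an error at startup.
err = None
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))
except OSError as e:
    err = e
    print(f"bind failed: {e.strerror}")
finally:
    second.close()
    core.close()
```

On Linux this prints “bind failed: Address already in use”, the same EADDRINUSE condition that makes the restored traefik.service fail its control process.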

Where is the evidence?

Here it is:

Sadly it is a screenshot; I’d love the text version instead!


Except both instances of loki are currently running. Shouldn’t loki and traefik be treated as special cases, with the core version updated with the settings from the backup instead of trying to restore the backup?

I don’t see a 3-dot menu to remove the inactive loki. *** Ignore, I found where this is now. ***

Traceback (most recent call last):
  File "/home/webserver1/.config/actions/restore-module/06copyenv", line 41, in <module>
    nginx_tcp_port = env_tcp_ports[2]
                     ~~~~~~~~~~~~~^^^
IndexError: list index out of range
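The traceback shows the restore action assuming at least three TCP ports in the module environment, while the restored module evidently had fewer allocated. A minimal sketch of the failure and a defensive variant (the port values are hypothetical; the actual fix in the webserver module may differ):

```python
# Hypothetical restored allocation with only two ports instead of three.
env_tcp_ports = [20001, 20002]

# This is the pattern that crashed 06copyenv:
try:
    nginx_tcp_port = env_tcp_ports[2]
except IndexError:
    nginx_tcp_port = None

# A guard avoids the crash entirely when the list is too short:
nginx_tcp_port = env_tcp_ports[2] if len(env_tcp_ports) > 2 else None
print(nginx_tcp_port)  # None with only two ports allocated
```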

Or were you asking for everything I took as a screen shot to be provided as text? If so, I can capture each text box separately.

And as indicated earlier, I have the full trace as well.

Cheers.

I don’t think it’s quite as simple as that.

Original settings:

[screenshot: original settings]

Restored server:

[screenshot: restored server settings]

*** Update ***

Just to add, in case it makes a difference, especially with regard to the Samba AD domain with the 10.5.4.0/24 address: my NS8 was migrated from an NS7 instance, not built from scratch as NS8.

Cheers.

Retracted

For webserver, the fix is coming.


Hello mates and @EddieA,

I would be really pleased if you could test and validate the fix to webserver and backup.

Please read here to do the QA.

Please ask for whatever you need for this QA.


That restored without error, and the SFTPGo settings were correctly retained. I’m guessing that it’s not possible to restore with the same instance number, as the restore creates webserver2. Not that it really matters.

I’m also guessing that I won’t be able to test this as a disaster recovery scenario until the fix is released so that the restore will pull down the updated code.

Talking of disaster recovery, is there anything happening regarding samba not restoring correctly (which causes other issues in mail), or my suggestion for loki and traefik to not attempt a restore but instead update the core version (or maybe delete the core version before a restore)?

Cheers.

Good and expected.

I will act the same way as you did.

What was the error of samba AD?

Take a look at the screenshots I pasted above. The original samba1 instance had an address on my internal network: 192.168.0.225. The restored samba1 is on the VPN: 10.5.4.1. Also look at the statistics for the domain. The original had users/groups/provider/File server. The restored samba has only provider.

Note, as I pointed out above, this is a migrated NS7 → NS8 instance, not one built from scratch on NS8.

Cheers.


Sorry if I’m getting into another topic at this point.
In your screenshot I see an AD domain structure that is unfamiliar to me.

Base DN DC=domain,DC=tld
Bind DN ldapservice@domain.tld
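The Base DN shown follows the standard AD convention of mapping each DNS label of the realm to a DC= component, so a realm with no subdomain label yields exactly two components. A quick sketch (the domain names are placeholders):

```python
def base_dn(realm: str) -> str:
    """Map a DNS realm like 'ad.example.org' to its AD Base DN."""
    return ",".join(f"DC={label}" for label in realm.split("."))

# An NS7-style sub.domain.tld realm gives three components:
print(base_dn("ad.example.org"))   # DC=ad,DC=example,DC=org

# The screenshot's DC=domain,DC=tld corresponds to a realm with
# no subdomain label at all:
print(base_dn("example.org"))      # DC=example,DC=org
```

This is why a two-component Base DN suggests the domain was created directly as domain.tld rather than the usual sub.domain.tld.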

Was this structure already like this on the old NethServer 7 and was it therefore carried over accordingly? Or is this something new introduced by the migration?

How did you even manage to do this on your NethServer 7? As far as I know, only an AD with sub.domain.tld could be created for domains like domain.tld.

Why am I asking? I have already created various NethServer 7 ADs and also (as a test) various NS8 ADs, but I have not yet migrated any of them to NS8. So I would like to know more about the “before and after” state.

Thanks!