Full Disaster Recovery Findings

Thought it was about time to try a full disaster recovery scenario, especially given the bad luck I’ve had applying updates in the past. So this is going to be a long post, mostly a big cut/paste of screen shots.

Downloaded and booted the latest Rocky image, selected Restore Cluster, and followed the instructions. Here are the details following the “Something went wrong” message:






I also have a copy of the full Task Trace if that would be useful.

Looking at the core apps, there are now 2 copies of loki:

And both appear to be running.

The selection for the nethforge repository was not retained:

Within the Trace I see:

Job for traefik.service failed because the control process exited with error code.
See “systemctl --user status traefik.service” and “journalctl --user -xeu traefik.service” for details

[screenshot: systemctl status and journalctl output]

The No Entries is the response to the journalctl command.

I’m not sure how many of these are minor issues that can be ignored (I can see at least one, maybe two), or which are portents of failure down the road if I were to continue using this system, which I’m not going to do other than to pull information should you need it.

***** Update *****

Just noticed that one of the failures was trying to add an instance of traefik2.

I’m guessing that, had that worked, it would then be like loki: two instances, both running. Is that really what was supposed to happen?

Cheers.


Regarding the loki and traefik issues, I had similar results. The issue is that only one instance of each can be present on a particular node. I ended up removing the default instances from the command line via the remove-module command.

Hope that helps.


Thank you for sharing this information.

It appears that the following issues occurred:

  • Mail encountered some non-fatal errors (related to the certificate and binding to the account provider). These can be resolved later in the app’s Settings page.
  • The Webserver experienced a fatal restore error (IndexError). /cc @stephdl
  • Traefik also failed to restore. The service startup is failing because ports 80 and 443 are already in use by the core Traefik instance. I believe Traefik should be excluded from the restore process, as certificates and routes are managed by the cluster backup. Installing a second Traefik instance is unnecessary.
  • On the other hand, a second instance of Loki is allowed, as documented here.
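The Traefik failure described above, a restored second instance trying to bind ports already held by the core instance, can be reproduced in miniature. This is only an illustrative sketch (loopback and an ephemeral port stand in for the core Traefik on 80/443), not NS8 code:

```python
import socket

# Stand-in for the core Traefik instance: bind and listen on a port.
core = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
core.bind(("127.0.0.1", 0))          # 0 = let the OS pick a free port
core.listen()
port = core.getsockname()[1]

# A restored second instance trying the same port fails immediately,
# which is why the second traefik service exits with an error at startup.
err = None
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))
except OSError as e:
    err = e
    print(f"bind failed: {e.strerror}")
finally:
    second.close()
    core.close()
```

On Linux this prints “bind failed: Address already in use”, the same EADDRINUSE condition that makes the restored traefik.service fail its control process.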

Where is the evidence?

Here it is:

Sadly it is a screenshot; I’d love the text version instead!


Except both instances of loki are currently running. Shouldn’t loki and traefik be treated as special cases, with the core version updated with the settings from the backup instead of trying to restore the backup?

I don’t see a 3-dot menu to remove the inactive loki. *** Ignore, I found where this is now. ***

Traceback (most recent call last):
  File "/home/webserver1/.config/actions/restore-module/06copyenv", line 41, in <module>
    nginx_tcp_port = env_tcp_ports[2]
                     ~~~~~~~~~~~~~^^^
IndexError: list index out of range
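The traceback shows the restore action assuming at least three TCP ports in the module environment, while the restored module evidently had fewer allocated. A minimal sketch of the failure and a defensive variant (the port values are hypothetical; the actual fix in the webserver module may differ):

```python
# Hypothetical restored allocation with only two ports instead of three.
env_tcp_ports = [20001, 20002]

# This is the pattern that crashed 06copyenv:
try:
    nginx_tcp_port = env_tcp_ports[2]
except IndexError:
    nginx_tcp_port = None

# A guard avoids the crash entirely when the list is too short:
nginx_tcp_port = env_tcp_ports[2] if len(env_tcp_ports) > 2 else None
print(nginx_tcp_port)  # None with only two ports allocated
```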

Or were you asking for everything I took as a screen shot to be provided as text? If so, I can capture each text box separately.

And as indicated earlier, I have the full trace as well.

Cheers.

I don’t think it’s quite as simple as that.

Original settings:

[screenshot: original settings]

Restored server:

[screenshot: restored server settings]

*** Update ***

Just to add, in case it makes a difference, especially with regard to the Samba AD domain with the 10.5.4.0/24 address: my NS8 was migrated from an NS7 instance, not built from scratch as NS8.

Cheers.

Retracted

For webserver, the fix is coming.


Hello mates and @EddieA,

I would be really pleased if you could test and validate the fix to webserver and backup.

Please read here to do the QA.

Please ask for whatever you need for this QA.


That restored without error, and the SFTPGo settings were correctly retained. I’m guessing that it’s not possible to restore with the same instance number, as the restore creates webserver2. Not that it really matters.

I’m also guessing that I won’t be able to test this as a disaster recovery scenario until the fix is released so that the restore will pull down the updated code.

Talking of disaster recovery, is there anything happening regarding samba not restoring correctly (which causes other issues in mail), or my suggestion for loki and traefik to not attempt a restore but instead update the core version (or maybe delete the core version before a restore)?

Cheers.

Good and expected.

I will act the same way as you did.

What was the error of samba AD?

Take a look at the screenshots I pasted above. The original samba1 instance had an address on my internal network: 192.168.0.225. The restored samba1 is on the VPN: 10.5.4.1. Also look at the statistics for the domain. The original had users/groups/provider/File server. The restored samba has only provider.

Note, as I pointed out above, this is a migrated NS7 → NS8 instance, not one built from scratch on NS8.

Cheers.


Sorry if I’m getting into another topic at this point.
In your screenshot I see an AD domain structure that is unfamiliar to me.

Base DN DC=domain,DC=tld
Bind DN ldapservice@domain.tld
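The Base DN shown follows the standard AD convention of mapping each DNS label of the realm to a DC= component, so a realm with no subdomain label yields exactly two components. A quick sketch (the domain names are placeholders):

```python
def base_dn(realm: str) -> str:
    """Map a DNS realm like 'ad.example.org' to its AD Base DN."""
    return ",".join(f"DC={label}" for label in realm.split("."))

# An NS7-style sub.domain.tld realm gives three components:
print(base_dn("ad.example.org"))   # DC=ad,DC=example,DC=org

# The screenshot's DC=domain,DC=tld corresponds to a realm with
# no subdomain label at all:
print(base_dn("example.org"))      # DC=example,DC=org
```

This is why a two-component Base DN suggests the domain was created directly as domain.tld rather than the usual sub.domain.tld.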

Was this structure already like this on the old NethServer 7 and was it therefore carried over accordingly? Or is this something new introduced by the migration?

How did you even manage to do this on your NethServer 7? As far as I know, only an AD with sub.domain.tld could be created for domains like domain.tld.

Why am I asking? I have already created various NethServer 7 ADs and also (as a test) various NS8 ADs, but I have not yet migrated any of them to NS8. So I would like to know more about the “before and after” state.

Thanks!