Disaster Recovery Total Fail with spinning drive

NethServer Version: 8
Module: backup

I recently (last week) set up an NS8 server on a PC running Debian 12 with a number of services (email + Roundcube, Samba, Nextcloud, web server). I had also installed CrowdSec, but later removed it due to excessive CPU utilization. I tested backup/restore (with a few sample files, etc.) by uninstalling NS8 and then restoring from backup to the same PC; all seemed OK.

After migrating data from an old server (about 190 GB), all seemed to be working well on the new server. However, numerous attempts to test disaster recovery to a different, similarly configured PC have all failed spectacularly:

  - Sometimes the cluster restore itself fails; I usually need to uninstall NS8 and start over before it succeeds.
  - After the app restore, the server is not functional: most services are inaccessible (even if I perform the procedure with the network forwarded to the test server instead of the working one, so DNS etc. is correct).
  - CrowdSec does not seem to be removed from the cluster configuration backup on uninstall: it is still included in the app restore list, and sometimes gets re-installed even though restore is not enabled and I don't restore it.
  - Samba does not re-install, and email does not work.
  - The disk is accessed continuously after the restore, and the cluster-admin UI is partially broken (just spinning). A reboot does not help. Yikes!

Is this a valid test, or does it need to restore to the same IP? Is there something else I need to be aware of?

As an aside, I notice that uninstalling NS8 does not clean out the firewall rules and can leave things (e.g. SSH) externally inaccessible, so I clear all firewall rules after the uninstall.

2 Likes

Hi Randall, thank you for testing the disaster recovery procedure!

You’re raising several points that might need to be discussed separately, but I’ll try to answer them all here.

  1. Disaster recovery can work with a different IP address, although that’s not the typical scenario. The core system and most NS8 apps are not bound to a specific IP, but there are exceptions. Samba is one of them—refer to its manual page for instructions on fixing its IP address after a restore. Samba file server — NS8 documentation

  2. Firewalld services should be cleared by the uninstall.sh script—there is code that attempts to handle this. If you changed the SSH port using a firewalld port redirect, I’d expect that setting to remain unchanged because SSH is not managed by NS8 core.

  3. Does the “similarly configured PC” have resources comparable to the production one? Try restoring applications gradually to help identify any individual issues.

  4. Check the System Logs page. If you can find any error messages there, they might help understand what’s happening.

3 Likes

Thanks for the reply.

The point of this is to be able to periodically verify that the backup/restore process is working, to gain confidence in my data security. As such, it is necessary to use a different address to avoid the need to shut down the production server. Perhaps there is a better process for verifying backups? FWIW, on my now half-dead test server I attempted to change the IP but got a timeout error.

The test machine actually has more resources than the production server (except a slower disk: mechanical instead of SSD), so that shouldn't be an issue.

I’ll try re-doing this restore one-by-one to see how it goes …

Yes, manually restoring one app at a time from the Backup page is possible and less resource-intensive.

Please note that spinning drives have very poor performance and do not meet NS8 installation requirements[1]. Disk slowness can cause service startup timeouts, which are difficult to troubleshoot. While restoring, keep an eye on iowait—top is a simple way to get an idea of disk pressure.
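If sysstat isn't installed, a rough iowait check is possible straight from `/proc/stat`; this is a minimal sketch (field layout per the Linux proc(5) man page):

```shell
# On the aggregate "cpu" line of /proc/stat, the 5th value (awk field $6)
# is cumulative iowait time in clock ticks. Sample it twice; a fast-growing
# delta during the restore means the system is waiting heavily on the disk.
t1=$(awk '/^cpu /{print $6}' /proc/stat)
sleep 2
t2=$(awk '/^cpu /{print $6}' /proc/stat)
echo "iowait ticks over 2s: $((t2 - t1))"
```

`top` shows the same figure live as the `wa` percentage in its CPU summary line, and `iostat -x 2` (from the sysstat package) adds per-device detail.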


  1. System requirements — NS8 documentation ↩︎

2 Likes

Hmmm… sounds a bit fragile. I might have to go buy another SSD to check that out.

Before I read your reply I shut down my production server, re-configured the test server to use the same IP, and ran a real-world restore. It still failed, but this time the UI caught it and provided a trace, which I captured. You can download it at the following link within the next 24 hours: Send

Thanks

I found in the trace that the Nextcloud services are timing out.
You could try increasing the service timeout value (see Continuing NS8 Nextcloud + other problems - #17 by stephdl), but I recommend using at least an SSD.

3 Likes

Thanks for the feedback. I'm not sure how to change the timeout on a service that is being installed, since by the time I could change it, it would have already failed. I'm also surprised that this one failure takes out so much else. It seems to me that if the timeouts are not correct, they should be addressed at installation. Given that spinning disks are still quite common, especially where larger storage is needed, this strikes me as a bit of a deficiency (and a quick fix).

So, I guess the short-term solution is to buy an SSD (just for this test), or restore everything but Nextcloud and then somehow cobble together a fix for its timeout?

Thanks

Did you already try restoring the apps one by one? That needs fewer resources, so maybe there would be no more timeouts, or at least fewer errors.

You could increase the timeouts after a service fails. The services can be restarted by saving the app settings or on the CLI; see Howto manage or customize NS8 podman containers
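For the record, a timeout increase can be applied after the failed restore as a systemd drop-in override on the affected unit. A minimal sketch follows; note that the unit name `nextcloud-app.service` is only an assumption here (list the real units with `systemctl --user list-units` while logged in as the module's Unix user):

```shell
# Raise a unit's start timeout with a systemd drop-in override.
# NS8 apps run as rootless user units; "nextcloud-app.service" is a
# placeholder -- substitute the unit that actually timed out.
unit="nextcloud-app.service"
dropin_dir="$HOME/.config/systemd/user/$unit.d"
mkdir -p "$dropin_dir"
cat > "$dropin_dir/timeout.conf" <<'EOF'
[Service]
# Allow 10 minutes to start instead of systemd's 90-second default
TimeoutStartSec=600
EOF
# Apply and retry (run these as the module's Unix user):
# systemctl --user daemon-reload
# systemctl --user restart nextcloud-app.service
```

Saving the app settings in the UI should also restart the service once the override is in place.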

It's also possible, for example, to add an HDD for apps that don't need many resources; see Disk usage — NS8 documentation

The timeouts should work when an SSD is used; that's a system requirement, see System requirements — NS8 documentation
Changing the timeouts is a workaround to still use slower drives.

1 Like

Thanks for the info. I haven’t tried restoring one at a time yet.
I’ll investigate these options.

Thanks

1 Like

But I think it's a sensible use case. If you want to change hosting providers, you have to move the entire server/node.
Maybe it would be a good idea to integrate a migration tool, like the NS7-to-NS8 one.

1 Like

So I guess a good question to ask is: what would be a usable method for verifying server backups? Having to shut down a production server while testing a backup, just to avoid an address conflict, would seem to strongly discourage the best practice of testing periodically, regardless of the hardware used to verify. An untested backup is of little more use than none at all.

FWIW, I just completed a restore to my test machine, initially restoring only the Loki and Traefik apps (this appeared successful, but there were still no users/groups). I then restored Samba (assuming this restores the accounts), and after changing the address as suggested, it became available and appears to be working. So far, so good. If this is part of a usable strategy, it might be worthwhile to document a prescribed procedure. I'll let you know how this proceeds.

2 Likes

If the goal is migrating applications to a new hosting/hardware, the clone/move function is the way to go, see Software center — NS8 documentation.

We’re planning to make the clone/move a multiple steps procedure, like the migration tool does: start, sync, finish.

I think you’re on the right track, and I agree to better document this scenario.

Card NethServer · GitHub

2 Likes

More testing results: I am restoring services (beyond those listed above) one at a time. Samba appeared to go OK and became visible and available after updating its address as suggested above, so that's a success. The mail restore seemed to complete successfully, but later, with all relevant ports forwarded to the test server instead of production, email connections were refused. Roundcube seemed to install OK, but is not accessible (Bad Gateway). The web server would not install; it failed before even attempting the restore (a trace is available at Send). Nextcloud appeared to install (the spinner stopped), but in fact failed; I was unable to find a log indicating the problem. Several reverse proxies to another server worked OK, but none of the NS8-hosted pages were available. In addition, the cluster status page is broken (several frames randomly fail to update, changing on refresh), though other pages seem to work.

Unfortunately, given these findings, I think I’ll need to find another solution to hosting my services, since I have been unable to gain any confidence at all that my data is secure.

Sad to read that. Did you test using an SSD?

As regards the error, it seems the module wasn’t installed correctly.

{
  "context": {
    "action": "determine-restore-eligibility",
    "data": {
      "path": "webserver/a2829401-7757-4569-a110-525734d31d52",
      "repository": "ab907cfe-03b4-57c3-b41c-29742a582ac4",
      "snapshot": "7ecf78164c77d3dfbbc58a536e2edfcac0a0d50e6dac51c4be45e958c78c1525"
    },
    "extra": {
      "eventId": "f482fedc-7141-464c-9a5a-cb1b051f042c",
      "isNotificationHidden": true,
      "title": "Determine restore eligibility"
    },
    "id": "6155b8e9-bd16-40da-a229-c6e1df1884ab",
    "parent": "",
    "queue": "cluster/tasks",
    "timestamp": "2025-06-22T19:18:57.190975599Z",
    "user": "admin"
  },
  "status": "validation-failed",
  "progress": 0,
  "subTasks": [],
  "validated": false,
  "result": {
    "error": "restic --option=rclone.program=/usr/local/bin/rclone-wrapper dump 7ecf78164c77d3dfbbc58a536e2edfcac0a0d50e6dac51c4be45e958c78c1525 state/environment\n",
    "exit_code": 2,
    "file": "task/cluster/6155b8e9-bd16-40da-a229-c6e1df1884ab",
    "output": [
      {
        "error": "module_not_available",
        "field": "none",
        "parameter": "none",
        "value": ""
      }
    ]
  }
}

As regards the other apps, did you try to reconfigure them by clicking save in the app settings?

Unfortunately, I don’t have another SSD large enough to hold my data, and would have to acquire another one.

It's interesting that the (webserver) module was not installed correctly: the app is installed and working on the production server, backups are enabled for it, and backups have been performed daily for the last week or so without error, although I don't think any data for this app has changed since installation. As for the other apps, this was a simple restore operation, and I made no changes beyond changing the Samba address before continuing with the other apps.

It's unfortunate that I've had so much trouble with this. The server is otherwise performing very nicely. My initial recovery test with (small) sample data seemed to go well, but the process is failing with my migrated data (unfortunately, I began using the migrated server prematurely). I would respectfully suggest that if this is a timeout issue caused by a slower disk, the timeout be relaxed to allow for slower media. It should have no effect under normal operation, but could improve reliability in circumstances like these, which can only be good.

Systemd’s default service timeout is 90 seconds, which is quite generous and includes any pre-start scripts. If those scripts contribute to hitting the timeout, it’s worth investigating because it may indicate a bug.

Still, if a service can’t start in that time, it’s usually due to a system issue, and it’s safer not to force the startup under load.

As for spinning disks: while custom volume-level mounts of spinning disks are possible and we’re discussing improvements in that area, SSDs remain essential for reliably loading application images due to their faster random access times.

1 Like