Snapshots are not backups

Hello NS community, I thought I would share a self-inflicted disaster scenario we experienced over the weekend and how, with the assistance of @Andy_Wismer, we were able to restore everything within hours.

Background: A small business had a primary XCP-NG 8.1 hypervisor that was upgraded to the 8.2 LTS release. This server was configured to back up to a dedicated TrueNAS Core machine every Wednesday & Saturday, with nightly snapshots. Company data is stored on and backed up to the TrueNAS Core machine nightly, then automatically pushed to Backblaze and several other TrueNAS boxes. NethServer, along with several other Windows/Linux virtual machines, provides various services such as Guacamole, AD, email and QuickBooks.

Scope & Impact: The planned software upgrade to XCP-NG 8.2 LTS was done on Friday night with the assistance of an on-site technician & remote IPMI support. Our tech casually inspected the recent Xen-Orchestra backup jobs before upgrading via USB. During the upgrade the bootloader on the host disk was modified such that the host would no longer boot, even though the upgrade wizard reported a successful upgrade. I suspect the new UEFI settings were not compatible. After several failed attempts to recover/repair the installation, the decision was made, due to time considerations, to do a clean install of XCP-NG 8.2 and restore from backup. Reinstalling the hypervisor resulted in a complete loss of all current virtual machines, snapshots and backup/configuration settings.

Disaster Recovery Plan in Action: When we got past our oh $h*t moment we dusted off the Disaster Recovery plan created in early 2020. Once Xen-Orchestra was re-established, all VMs were successfully restored from delta backups in about 1.5 hours. 95% of the missing data was imported from TrueNAS, with the exception of about one week's worth of email received after the 10/3/21 backup. Here we identified a gap in our backup strategy: the snapshots lived on the hypervisor and were lost with it, so anything protected only by snapshots was not recoverable. Snapshots are not qualified backups.

We got lucky: This particular setup has a small email footprint with only 1 primary account in NethServer, accessed via SOGo ActiveSync & Outlook 2019, where we captured the PST file. We were able to identify ~65 emails within the Outlook profile from 10/3 to 10/10 that needed to be re-imported into NethServer. Here @Andy_Wismer came to the rescue, as ActiveSync/Outlook is horrible about adding/syncing new content. We devised the following strategy: create a new temporary account on NethServer, connect it to Outlook 2019 via IMAP, copy the missing emails into the temporary account so they sync back to NethServer, and finally use imapsync to copy the emails internally from the temporary account to the original account.
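For anyone wanting to repeat the last step, here is a minimal sketch of how the imapsync call could be assembled. It is not the exact command we ran: the host name, account names and password file paths are hypothetical placeholders, and it assumes both accounts live on the same IMAP server. It starts with `--dry`, so imapsync only reports what it would copy.

```python
import shlex


def build_imapsync_command(host, source_user, target_user,
                           source_passfile, target_passfile, dry_run=True):
    """Build an imapsync command copying one mailbox into another on the same server."""
    cmd = [
        "imapsync",
        # host1/user1 is the account the mail is read from
        "--host1", host, "--user1", source_user, "--passfile1", source_passfile,
        # host2/user2 is the account the mail is copied into
        "--host2", host, "--user2", target_user, "--passfile2", target_passfile,
    ]
    if dry_run:
        cmd.append("--dry")  # show what would be copied without changing anything
    return cmd


# All values below are hypothetical placeholders, not the real setup.
command = build_imapsync_command(
    host="mail.example.com",
    source_user="temp-restore@example.com",
    target_user="owner@example.com",
    source_passfile="/root/secret1",
    target_passfile="/root/secret2",
)
print(shlex.join(command))
```

Once the dry run lists only the expected messages, the same call with `dry_run=False` gives the command that performs the copy.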

With all of that behind me, two final thoughts. First, I cannot thank NethServer or this community enough. Without being able to bounce ideas, ask silly questions or have access to the years of experience freely offered up here, there is no way I’d be able to be as successful as I’ve been. Second, make sure you back your stuff up & semi-regularly audit those backups. Nothing is worse than realizing you don’t have what you expect.
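As a starting point for that kind of audit, here is a minimal sketch that flags a backup directory whose newest file is older than expected. The directory layout, the file name and the 24 hour limit are assumptions for illustration only. It checks freshness, not content, so it does not replace an occasional test restore.

```python
import tempfile
import time
from pathlib import Path


def newest_backup_age_hours(backup_dir):
    """Return the age in hours of the most recently modified file, or None if there is none."""
    root = Path(backup_dir)
    if not root.is_dir():
        return None
    mtimes = [p.stat().st_mtime for p in root.rglob("*") if p.is_file()]
    if not mtimes:
        return None
    return (time.time() - max(mtimes)) / 3600


def audit(backup_dir, max_age_hours):
    """Return a one-line OK/FAIL verdict for a backup directory."""
    age = newest_backup_age_hours(backup_dir)
    if age is None:
        return "FAIL: no backup files found"
    if age > max_age_hours:
        return f"FAIL: newest backup is {age:.1f} h old (limit {max_age_hours} h)"
    return f"OK: newest backup is {age:.1f} h old"


# Demo against a throwaway directory; point it at a real backup target instead.
with tempfile.TemporaryDirectory() as demo_dir:
    print(audit(demo_dir, max_age_hours=24))           # empty directory fails
    (Path(demo_dir) / "mail-backup.tar").write_bytes(b"demo")
    print(audit(demo_dir, max_age_hours=24))           # fresh file passes
```

Run from cron or a monitoring job, a FAIL line is the cue to look at the backup job before the data is actually needed.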


@royceb

Hi Royce

In German there’s an old saying:

“Vertrauen ist gut, Kontrolle ist besser”

Trust is fine, but double-checking is better!

I think it’s usually attributed to Lenin…

:slight_smile:

Still very much valid for backups!
Test one before you trust.

It’s mostly true that a backup system, once it has completed one full backup set, will probably complete a million more, storage permitting…

However - does the backup contain what I need?

My 2 cents
Andy


I also had a similar AHA moment this week, albeit on a much smaller scale, since it didn’t affect any virtual machines, only payload data.

All the company files stored on an SMB share were gone, really ALL of them!
Fortunately, I had a separate daily backup job just for that share. Within half an hour, the data was restored with one click, no loss. If I had had to fish the data out of the global machine backup, it would have been difficult.

Afterwards, it turned out that one of the employees had moved all the data from the share to her personal document directory in Nextcloud.

Incidentally, I have a similar backup job for all mail accounts every 30 minutes, so any mail loss is limited to a maximum of 30 minutes.

By the way… like many others here, I owe a lot to @Andy_Wismer .
