Design Considerations for a VM Host and Nethserver 8

Dear Community,

I would like to open up some design decisions for discussion. This time, it’s not primarily about the design decisions made by the Nethserver 8 developers, but about considerations around the design of new overall setups, which in my case regularly rely on VM hosts, virtual machines, and containers.

Decision 1 - The VM Host

Proxmox VE (PVE) is a fantastic VM host, and when combined with the Proxmox Backup Server (PBS), it truly leaves little to be desired. At least not at the scale of the systems I manage. These are usually single PVE hosts, often with an integrated PBS. I used to work with VMware (and earlier with VirtualBox-based VM hosts), but PVE is the best solution for me, not least because it is highly customizable, just the way I like it in a Linux environment. (My PVE is often installed on top of Debian.)

Decision 2 - The Filesystem on the VM Host (Proxmox VE)

The root system is usually ext4, which is perfectly sufficient for me because the system disk is only used for booting and storing configurations. If anything goes wrong here, it can be quickly restored.

What matters more is the filesystem used to store the VM images. This filesystem should at least be encrypted, because it often contains sensitive data, and the legal requirements are clear in this regard. Instead of configuring a separate encrypted filesystem for each VM (which would have to be unlocked each time), I decided to encrypt the VM storage as a whole using LUKS and unlock it after PVE starts (manually or by another method, without storing the key on the PVE itself). On top of that, I usually use an ext4 filesystem, which has worked flawlessly so far.
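
Roughly, the unlock step after a PVE reboot looks like this; the device path, mapper name, and mount point below are placeholders for illustration, not my actual configuration:

```python
#!/usr/bin/env python3
"""Minimal sketch: unlock the LUKS-encrypted VM store after PVE has booted
and mount the ext4 filesystem inside it. Device, mapper name and mount
point are placeholders for illustration only."""
import subprocess

DEVICE = "/dev/nvme1n1p1"     # encrypted partition holding the VM images (assumed)
MAPPER = "vmstore"            # name of the dm-crypt mapping
MOUNTPOINT = "/mnt/vmstore"   # a PVE directory storage points here

def run(cmd):
    # Run a command and abort if it fails.
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Open the LUKS container; cryptsetup asks for the passphrase interactively,
#    so the key never has to be stored on the PVE host itself.
run(["cryptsetup", "open", DEVICE, MAPPER])

# 2. Mount the ext4 filesystem that lives inside the container.
run(["mount", f"/dev/mapper/{MAPPER}", MOUNTPOINT])

# 3. (Optional) re-enable the PVE directory storage defined on that path,
#    e.g. `pvesm set <storage-id> --disable 0`.
```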

However, it’s always worth considering new methods. ZFS is an interesting option for VM storage, but for my purposes, it is far too complex and resource-intensive. Moreover, I prefer solutions that are traditionally set up (I avoid LVM where possible) and can be manually managed. Additionally, it is said that ZFS does not work well with consumer hardware and can put significant strain on it.

To still take advantage of checksums and a well-functioning RAID1 with reliable “self-healing”, I use a combination of LUKS and BTRFS in my current PVE setup. Specifically, this involves two NVMe drives of different types, each encrypted with LUKS. On top of the two LUKS mappers, a BTRFS filesystem spanning both devices was created as a RAID1 volume. After PVE starts and the LUKS containers are unlocked, the BTRFS RAID1 is available and runs smoothly.
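
For anyone who wants to reproduce this, the setup boils down to roughly the following; device names, mapper names, and the mount point are again placeholders:

```python
#!/usr/bin/env python3
"""Sketch: create a BTRFS RAID1 volume (data and metadata mirrored) across
two LUKS-encrypted NVMe drives. Device and mapper names are assumptions."""
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

# Encrypt both drives and open them (cryptsetup prompts for the passphrase).
for dev, name in [("/dev/nvme0n1p1", "vmraid0"), ("/dev/nvme1n1p1", "vmraid1")]:
    run(["cryptsetup", "luksFormat", dev])
    run(["cryptsetup", "open", dev, name])

# One BTRFS filesystem spanning both mappers, with data (-d) and metadata (-m)
# in the RAID1 profile, so every block exists on both drives.
run(["mkfs.btrfs", "-L", "vmstore",
     "-d", "raid1", "-m", "raid1",
     "/dev/mapper/vmraid0", "/dev/mapper/vmraid1"])

# Mounting either member device brings up the whole RAID1 volume.
run(["mount", "/dev/mapper/vmraid0", "/mnt/vmstore"])

# A periodic scrub verifies all checksums and lets BTRFS repair bad copies
# from the healthy mirror ("self-healing").
run(["btrfs", "scrub", "start", "/mnt/vmstore"])
```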

Additionally, there is a hard disk set up as a BTRFS single (also on LUKS), which receives daily snapshots from the BTRFS RAID. This way, you can quickly return to a working VM state, also outside of the integrated PVE snapshots. The BTRFS snapshots can easily be written back, linked back into place, and used (or used directly from the backup disk).
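
A daily job of that kind could look roughly like this; the subvolume layout and paths are assumptions, and for true incrementals the send command would also get the previous snapshot as a parent:

```python
#!/usr/bin/env python3
"""Sketch: daily read-only snapshot of the VM image subvolume on the RAID1
volume, replicated to the single-disk BTRFS backup with send/receive.
Paths and subvolume names are assumptions."""
import datetime
import subprocess

SOURCE = "/mnt/vmstore/images"        # subvolume holding the VM images
SNAPDIR = "/mnt/vmstore/.snapshots"   # snapshots kept on the RAID1 volume
BACKUP = "/mnt/backup"                # BTRFS single disk (also on LUKS)

today = datetime.date.today().isoformat()
snap = f"{SNAPDIR}/images-{today}"

# 1. Read-only snapshot (required for btrfs send).
subprocess.run(["btrfs", "subvolume", "snapshot", "-r", SOURCE, snap], check=True)

# 2. Stream the snapshot to the backup disk; add "-p <previous-snapshot>"
#    to the send command for an incremental transfer.
send = subprocess.Popen(["btrfs", "send", snap], stdout=subprocess.PIPE)
subprocess.run(["btrfs", "receive", BACKUP], stdin=send.stdout, check=True)
send.wait()
```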

The PBS solution is excellent, especially because you can mirror all backups to a remote PBS. This would even cover regional disaster scenarios—though with long recovery times. Restoring individual files (within the VM) is also possible.

As an additional emergency solution (if you need to access secure data natively without PVE or PBS), there are also regular (incremental) rsync “pull” backups of most VM content. These backups can also be nicely distributed remotely.
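
Such a pull job, run on the backup host, can be as simple as the following sketch; the hostname, paths, and the “latest” symlink scheme are placeholders:

```python
#!/usr/bin/env python3
"""Sketch: incremental rsync "pull" backup run from the backup host, using
--link-dest so unchanged files are hard-linked to the previous run.
Hostname, paths and the 'latest' symlink scheme are placeholders."""
import datetime
import os
import subprocess

SRC = "root@vmguest.example.lan:/srv/data/"   # pulled from inside the VM
DEST_BASE = "/backup/vmguest"
today = datetime.date.today().isoformat()
dest = os.path.join(DEST_BASE, today)
latest = os.path.join(DEST_BASE, "latest")    # symlink to the previous run

os.makedirs(DEST_BASE, exist_ok=True)

cmd = ["rsync", "-aAXH", "--delete", "--numeric-ids"]
if os.path.islink(latest):
    cmd.append(f"--link-dest={latest}")       # incremental: reuse unchanged files
cmd += [SRC, dest]
subprocess.run(cmd, check=True)

# Point "latest" at the run that just finished.
if os.path.islink(latest):
    os.unlink(latest)
os.symlink(dest, latest)
```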

This is how I would establish the PVE configuration going forward.

Relation to Nethserver

To tie this back to Nethserver: previous Nethserver 7 installations used Restic for network backups (local and remote). Unfortunately, I haven’t found a useful alternative for Nethserver 8 yet. The built-in backup solution is primarily geared towards cloud storage or separately configured S3 alternatives, which is too complex for my purposes. I would prefer a local backup (to a local drive or VM image) from which I could restore containers or migrate to a new node. Unfortunately, this has never worked properly for me, which has prevented the move to Nethserver 8 so far.
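
For reference, the NS7-era Restic jobs were along these lines; the repository location, password file, retention values, and backed-up path are placeholders, not the actual NS7 configuration:

```python
#!/usr/bin/env python3
"""Sketch of a Restic job of the kind used on the NS7 installations: back up
a data directory to a local (or remote sftp) repository and apply a simple
retention policy. Repository, password file and paths are placeholders."""
import os
import subprocess

env = dict(os.environ,
           RESTIC_REPOSITORY="/mnt/backup/restic",   # could also be sftp:user@host:/path
           RESTIC_PASSWORD_FILE="/root/.restic-password")

def restic(*args):
    subprocess.run(["restic", *args], env=env, check=True)

restic("backup", "/var/lib/nethserver")   # data to protect (placeholder path)
restic("forget", "--prune",               # thin out old snapshots
       "--keep-daily", "7", "--keep-weekly", "4", "--keep-monthly", "6")
```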

Decision 3 - Filesystems for Nethserver 8

For the new Nethserver, the question also arises of which filesystems to prefer within Nethserver 8 or other (mostly Linux-based) VMs.

Data integrity seems to be fundamentally ensured by the BTRFS RAID1, but if bit errors do occur (for whatever reason), they will always be attributed to the entire VM image on the PVE host rather than to individual files.

So, to potentially identify and selectively replace “corrupted” files within a VM, a filesystem that at least keeps track of such errors would be beneficial. Again, BTRFS comes to mind.

For Nethserver 8, this would be feasible for “/home” using a BTRFS formatted disk (VM image).
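
A minimal sketch of what that could look like inside the VM, assuming NS8 gets a second virtual disk (called /dev/vdb here) dedicated to /home; on a single-device BTRFS the checksums cannot repair data, but a scrub at least reports which files are affected:

```python
#!/usr/bin/env python3
"""Sketch: put /home on a dedicated BTRFS-formatted virtual disk inside the
NS8 VM and run a periodic scrub so checksum errors point at individual
files (via the kernel log). Device name and options are assumptions."""
import subprocess

DEVICE = "/dev/vdb"      # second virtual disk attached to the NS8 VM (assumed)
MOUNTPOINT = "/home"

def run(cmd):
    subprocess.run(cmd, check=True)

# One-time setup (then add a matching entry to /etc/fstab):
run(["mkfs.btrfs", "-L", "home", DEVICE])
run(["mount", "-o", "compress=zstd", DEVICE, MOUNTPOINT])

# Periodic integrity check: a single-device BTRFS cannot repair data errors,
# but the scrub flags them and the kernel log names the affected files.
run(["btrfs", "scrub", "start", "-B", MOUNTPOINT])
run(["btrfs", "device", "stats", MOUNTPOINT])
```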

Unfortunately, it turned out early on that RHEL clones don’t support BTRFS. One could certainly use a different kernel or modules, but would that be a reliable base for a production system?

To natively support BTRFS, one would have to consider a Debian-based system, but my previous experiences with Nethserver 8 on Debian have been less than ideal compared to, for example, Rocky 9.

What have your experiences been with Nethserver 8 on Debian?

For certain (professional) scenarios, a Debian base might not be feasible due to the lack of “subscription” options. Unfortunately, this affects the very configurations for which the considerations mentioned above were made.

Moreover, there are occasional reports of incompatibilities between BTRFS and Docker containers, which would particularly affect Nethserver 8.

What are your experiences with BTRFS in general, or specifically under Nethserver 8? What do the developers say about such a configuration?

Best regards,
Yummiweb

Hi @yummiweb

Actually, in the beginning, Red Hat was one of the major users of and contributors to BtrFS; sadly, they dropped it soon after. For anyone familiar with Synology NAS: there, BtrFS is rock-solid.

I have about 10 NS8 systems in production, all based on Debian, using BtrFS as the file system and running in Proxmox, and most of the Proxmox hosts themselves are fully ZFS-equipped systems.

→ Just as a side note: I actually tested in PBS whether a file-level restore inside a VM using BtrFS is possible. It is, and fully supported! Some of these systems have 20-30 users and 1.5 TB on NS8, and they have now been running for over a year without issues.

Besides, ZFS is also much easier than you expect, and while there are some “gotchas” with consumer hardware, these mostly concern SMR vs CMR hard disks (spinners)… SMR disks are not usable for ZFS, but are also problematic with BtrFS…
I’d strongly suggest revisiting your stance on ZFS; it’s easier than you expect!
The one really important thing when setting up ZFS is correctly setting memory limits.
→ Planning in a “swap” partition is old school and not specific to ZFS, but the Proxmox setup doesn’t handle this automatically. It is easy to do when planned in!
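
As a rough sketch, limiting the ARC boils down to a single modprobe option (see the “Limit ZFS Memory Usage” link below); the 8 GiB value here is only an example, adjust it to the host’s RAM:

```python
#!/usr/bin/env python3
"""Sketch of the ARC limit from the Proxmox wiki ("Limit ZFS Memory Usage"):
cap zfs_arc_max via a modprobe option. The 8 GiB value is only an example."""
import pathlib
import subprocess

arc_max_bytes = 8 * 1024**3    # example: limit the ARC to 8 GiB

# Persist the limit for future boots.
pathlib.Path("/etc/modprobe.d/zfs.conf").write_text(
    f"options zfs zfs_arc_max={arc_max_bytes}\n")

# Apply it immediately (requires the zfs module to be loaded).
pathlib.Path("/sys/module/zfs/parameters/zfs_arc_max").write_text(str(arc_max_bytes))

# Rebuild the initramfs so the limit also applies at boot, as the wiki
# recommends when the root filesystem lives on ZFS.
subprocess.run(["update-initramfs", "-u", "-k", "all"], check=True)
```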

BTW:
For several years, Proxmox used XFS as a standard file system, also a rock-solid file system (without really being journaled…). NS7 used XFS as standard.
Ext4 is usable, but crappy with both small and large files. Much too “dated”.

All my Proxmox hosts have at least 16 GB of swap, and on NS8 I set either 8 or 16 GB of swap, depending on the size of the setup…

→ Proxmox handles the default install on hardware MUCH better with ZFS than Debian does; Proxmox is also the major contributor to ZFS on Linux. I install my Proxmox from the ISO on a USB stick.
Then (having ONLY the system disks connected) I set this up as a ZFS mirror and, under “Advanced”, deduct 8 GB from each system disk. This gives me my 16 GB of swap, split across both disks.
After installation, I set the ZFS memory limits and hook up the intended storage disks, so I can configure them correctly from the Proxmox GUI as ZFS with redundancy support, whatever is suitable for the use case.
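
On the command line, those two steps amount to roughly the following; all device names are assumptions, and the pool layout (striped mirrors) matches the RAID10 example below:

```python
#!/usr/bin/env python3
"""Sketch of the CLI equivalent of the steps above: turn the space left free
on the two system disks into swap, and build a RAID10-style ZFS pool
(striped mirrors) from four data disks. All device names are assumptions;
the Proxmox GUI can create the pool just as well."""
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

# Swap on the 8 GB left free at the end of each system disk
# (partitions created beforehand, e.g. with sgdisk or fdisk).
for part in ["/dev/nvme0n1p4", "/dev/nvme1n1p4"]:
    run(["mkswap", part])
    run(["swapon", part])    # plus matching entries in /etc/fstab

# RAID10-style pool: two mirrored pairs striped together.
run(["zpool", "create", "-o", "ashift=12", "tank",
     "mirror", "/dev/sda", "/dev/sdb",
     "mirror", "/dev/sdc", "/dev/sdd"])
```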

4x 20 TB Seagate Exos enterprise class spinners (7200 RPM) in ZFS RAID10…
(An additional one as “hot spare”, just in case…)

Still 20% RAM free, yet swap is in use. This system came preconfigured with hardware-RAIDed NVMe. We intend to add SSDs as RAID with additional swap. As anyone can see, Linux runs better with ample swap than without, and here 16 GB is needed! :slight_smile:

Follow these two suggestions and mastering ZFS is much easier!

ZFS on Proxmox:
https://pve.proxmox.com/wiki/ZFS_on_Linux#zfs_swap
Read here “Limit ZFS Memory Usage”

ZFS Tips and Tricks on Proxmox:
https://pve.proxmox.com/wiki/ZFS:_Tips_and_Tricks#Example_configurations_for_running_Proxmox_VE_with_ZFS
Read here " Snapshot of LXC on ZFS"

Good to know:

It is possible to, e.g., install NS8 on Debian.
If the client needs commercial, professional support options, spin up a new VM based on Rocky.
Use the NS8 backup / restore options to essentially move all applications and the master node itself to the new Rocky-based node.
Set IP and DNS correctly, especially for mail, AD et al.
Reboot, remove the Debian-based node, and you’re essentially “good to go”.


My 2 cents
Andy


Dear Andy,

Thank you for your attention and your $10,000 response!

Just as a side note: I actually tested in PBS whether a file-level restore inside a VM using BtrFS is possible. It is, and fully supported! Some of these systems have 20-30 users and 1.5 TB on NS8, and they have now been running for over a year without issues.

Your report alone is very helpful to me and encourages me to implement the first NETH8 migration now, Debian-based with a BTRFS file system.

Besides, ZFS is also much easier than you expect, and while there are some “gotchas” with consumer hardware, these mostly concern SMR vs CMR hard disks (spinners)… SMR disks are not usable for ZFS, but are also problematic with BtrFS…
I’d strongly suggest revisiting your stance on ZFS; it’s easier than you expect!

It’s not so much the sheer complexity that puts me off, especially since ZFS in PVE is very easy to configure via the GUI.

It’s the hardware effort it requires: not only a certain amount of RAM (ECC if possible), but also a corresponding number of drives (each with as much of its own cache as possible), some even for special tasks. And the drives alone require sufficient connection options.

And from what you read, it’s all very flexible, but unfortunately only “upwards”. That means drives can be replaced or added, but with some decisions you would probably be stuck or would have to set up the entire pool again.

As soon as you run a larger setup with its own storage or even a cluster, that would clearly be my first choice. But for most of the scenarios I manage, that’s outside the scope. It’s more about providing “fast but small” and “slow but large” storage (see my NETH8 problem with the mounts), and even for that, as RAID1, at least 4 disks plus a system disk are required.

I know that you can “throw it all together” in ZFS if you have a fast cache disk, but you don’t do that because you want to; you do it because you have to, so that the storage can offer all VMs as much performance as possible. Whether the VMs request this regularly or even simultaneously is not something you can influence in many scenarios (mixed hosting).

My scenarios are more specific and above all smaller: typical Nethserver scenarios on VMs rather than on bare metal, and even then my setup sometimes seems a bit too complex (even if everything is well thought out).

Nevertheless, I’ll take a look at ZFS; maybe I’ll come around to it at some point. (And I assume that your ZFS recommendation is at least as useful as your advice on PBS.)

The Nethserver 8 on Debian/BTRFS installation does not stand in the way of the storage decision, as that could be adjusted later.
So I will probably stick with the BTRFS storage in PVE for the time being and report back if necessary.

Until then, many thanks for your assessment!

Note:
What I somehow miss in my research (it’s probably too specific) is a file system or software RAID for a mixed RAID1 of hard disks and SSDs.
In other words, a mirror of a fast “primary” drive and a slow “secondary” drive, without limiting the basic performance to the performance of the HDD. The mirror is actually only needed to occasionally correct detected bit errors, and when that is the case, the overall performance may drop for a short time. But normally the “primary” SSD should determine the performance, which is unfortunately not the case with the solutions I have found so far.

This CAN be done.

Unraid offers “mixing” any disks you want.

But still: spinners have very different issues from flash-based drives like SSDs and NVMe (though good quality matters very much for both), AND keeping the disks in sync becomes more and more difficult the larger the difference in speed and access time is.

In other words, the complete sentence should be:

You CAN do it, yes, but it’s about as advisable as jumping out of a high-story building’s window… :wink:

I would not want to burn my fingers with that…

I do keep fast and high-volume storage available (also as ZFS where possible), but in different pools.
And just for your information:

I’m sure you’ll agree this box isn’t “underpowered”…

/dev/sdf is the boot / system disk for Proxmox; it’s NVMe-based, so very fast. There is spare LVM space I can use (nearly 2 TB!).

This system was supplied pre-installed with Proxmox; I’d have preferred to reinstall the box, but the client is king…

The additional disks are all Enterprise Class 20 TB spinners with 7200 RPM.

ZFS in RAID10

And here is the major file share storage:
A VM with OMV (As AD member!)

Take a note of the disk allocations…
The LVM storage is fast, NVMe-based; the ZFS is on fast spinners…

This storage is mainly used for CAD data, with really HUGE files, easily 100-200 GB each.
Incremental PBS backup takes less than 5 minutes.

→ None of this has ANY caching!
Clients are very happy with this setup!

I still have plenty of empty bays on the Dell server if I ever need caching… :slight_smile:

Just to showcase that not all tweaks or options need to be used when using ZFS; it’s more than fast enough! The IO of the hardware must be decent, though…

A good thing to remember when discussing spinners versus flash-based storage:
Spinners came with built-in acoustic feedback as an alarm sign.
(They made screeching or clicking noises.)
So you were able to “hear” them before they died. Solid state just dies silently…

:slight_smile:

My 2 cents
Andy

I have been using Btrfs on my Fedora Silverblue laptop for five years, and it has worked flawlessly.

Regarding NS8, we use LVM-VDO on some production servers, both in cloud VPS and on-premise bare-metal nodes. LVM-VDO is a block device that provides data deduplication and compression. It is particularly efficient when a node hosts multiple instances of the same application. For example, we run many NethVoice instances on the same node. However, since container images often share similarities across different applications, I believe deduplication is generally beneficial.
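
As a rough sketch of what setting this up can look like with the lvm2 VDO support; the VG name, sizes, and mount point below are placeholders, and real sizes depend on how well the data deduplicates:

```python
#!/usr/bin/env python3
"""Sketch: create an LVM-VDO volume (deduplication + compression) and put a
filesystem on it. VG name, sizes and the mount point are placeholders."""
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

# 100 GiB of physical space presented as a 300 GiB logical volume;
# VDO deduplicates and compresses transparently underneath.
run(["lvcreate", "--type", "vdo",
     "--name", "ns8data",
     "--size", "100G",
     "--virtualsize", "300G",
     "vg_ns8"])

run(["mkfs.xfs", "-K", "/dev/vg_ns8/ns8data"])   # -K skips discards on the new volume
run(["mount", "/dev/vg_ns8/ns8data", "/var/lib/containers"])   # example mount point
```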
