Proxmox HA chat

Hi mates

I recently had a new wave of curiosity about Proxmox and how to avoid a SPOF (single point of failure). Even though we can back up or rsync our data, bringing up a new server can take a really long time. That is the real weakness.
I have been using Proxmox for years. I run two instances: one at home for development (although I have since moved to VirtualBox for dev, it is really fast on SSD) and one online hosting a few VMs (Debian, SME Server, NethServer). It all runs smoothly, but even with backups and some rsync scripts to get the data off the machine, a recovery could take days.

The real strength of Proxmox for me right now is snapshots, and I trust them: before an upgrade, take a snapshot; if something goes wrong, roll back to the previous state. It is incredible, but I am not sure many people use it professionally. I bet/worry/fear that IT guys still install NethServer directly on bare metal.

So my goal was to test some HA with Proxmox, during my holidays of course :smiley:

This is what I have read or tested myself:

  • Poor man's HA (easy)
    Use two Proxmox nodes in a cluster, each on a ZFS array, and replicate the VM to the other node. You can schedule the sync as often as every minute if you want; what you lose is the data written since the last sync. The downside is that if you lose the node running the VM, there is no automatic HA and the VM will not start by itself on the replica host, because HA needs three votes to keep quorum and avoid split-brain (the same VM running simultaneously on two nodes). So either you issue three commands by hand to start the replicated VM, or you use a QDevice (Debian-based, because of corosync 3 in Proxmox 6): a third host, which could probably even be a Raspberry Pi, that provides the deciding vote for quorum. You can also add a third Proxmox node, but the cost goes up.

  • Use shared storage (good but not satisfying)
    Use three Proxmox nodes with remote shared storage over the network, such as NFS. This introduces a single point of failure: even though HA can live-migrate the VM, if you lose the storage, everything is down. The network is also the bottleneck.

  • Use distributed storage like Ceph or DRBD (needs good skills)
    This is real HA, but the cost increases massively: you need at least three real servers with two NICs each, and the storage NIC should be 10 Gbit/s. Because the storage is distributed across all hosts, it is a kind of network RAID: if you lose one host, the VM can still run on the other nodes and you have not lost your data. This is really interesting, but you need a deep understanding of what you are doing and good sysadmin training. Ceph is well integrated in Proxmox; DRBD is too, by its own developers.
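
To make the quorum argument in the first option concrete, here is a minimal Python sketch of the majority rule that corosync-style clusters rely on: a partition may act only if it holds more than half of the expected votes. The function name and the one-vote-per-member assumption are mine, for illustration only; this is not Proxmox code.

```python
def has_quorum(expected_votes: int, votes_present: int) -> bool:
    """Return True if the surviving partition holds a strict majority.

    Assumption: every member (node or QDevice) carries exactly one vote.
    """
    needed = expected_votes // 2 + 1  # strict majority of the expected votes
    return votes_present >= needed


# Two-node cluster, one node lost: 1 of 2 votes, so no quorum and no automatic HA.
print(has_quorum(expected_votes=2, votes_present=1))

# Two nodes plus a QDevice (or a third node), one node lost: 2 of 3 votes.
print(has_quorum(expected_votes=3, votes_present=2))
```

With two votes, losing either node drops the survivor below the majority, which is why the replicated VM has to be started by hand. A third vote lets the surviving node keep quorum.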

I am not a sysadmin guy, so how do you use Proxmox to avoid a SPOF and get HA?

Going down the cluster path with Proxmox, we enter the more professional kind of setup for a virtualization platform. And I like that! :wink:
As soon as you have multiple nodes for HA purposes, you also need a quorum mechanism, such as a quorum disk. Quorum is absolutely necessary if you want the cluster to fail VMs over from one server to another.
Reading a bit on the Proxmox forums and wiki shows that their documentation needs some serious updating…
This discussion might be of value here: https://forum.proxmox.com/threads/pve5-and-quorum-device.37183/
Unfortunately I can’t try or test HA since I don’t have a 2nd server… :-/

You could install a virtual 3-node HA cluster with Ceph storage (HCI), just for testing. I did it for my A-Level exam… You have to configure nested virtualization on the one Proxmox host that runs the three other Proxmox servers (the cluster)…

edit: PROXMOX VE WITH CEPH – HYPER-CONVERGENCE

I really appreciate this topic, as I’m also using Proxmox.
I’m using 2 real servers. One is a little Fujitsu T100 and the other is a newer X200 S8.
The T100 runs the replica (slave) and the X200 runs the production machine.
The sync interval is 15 min. Losing the last 15 min is OK in our case.

I had to do a disaster recovery and had some trouble, but within about 2 hours I had a rudimentary working environment back on the T100. No data was lost. A few hours later, and with a little help, I also had vmail, SOGo and some other stuff back. That environment worked for 3-4 days until I had the X200 repaired. Then I synced everything back and put the main machine back in production.
The real downtime was only about 2 hours. For my use this is really OK in relation to the costs.

Some time ago @wahmed was active in our community. I think he can give the most profound insight into this topic, as he is the author of “Mastering Proxmox”.

On the Proxmox forums and wiki I saw that they used a Raspberry Pi as a third server so that full-blown failover services could be added…


https://pve.proxmox.com/wiki/Raspberry_Pi_as_third_node

Yes, but the warning not to use a Raspberry Pi in a production environment does not push me in that direction :slight_smile:

As @fausp wrote, everything can be tested on Proxmox itself; the server must support nested virtualisation, and all drivers/hardware must use virtio.
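
A quick way to check the nested-virtualisation requirement on a Linux/KVM host is to read the kernel module parameter. The sketch below assumes the usual sysfs paths for the kvm_intel and kvm_amd modules; the helper names are made up for illustration.

```python
from pathlib import Path
from typing import Optional

# Usual sysfs locations of the KVM "nested" parameter (Intel and AMD hosts).
NESTED_PARAMS = (
    Path("/sys/module/kvm_intel/parameters/nested"),
    Path("/sys/module/kvm_amd/parameters/nested"),
)


def parse_nested_flag(raw: str) -> bool:
    """The kernel reports 'Y' or '1' when nesting is enabled, 'N' or '0' otherwise."""
    return raw.strip() in ("Y", "y", "1")


def nested_virtualisation_enabled() -> Optional[bool]:
    """Return True/False from the first parameter file found, or None if no KVM module is loaded."""
    for param in NESTED_PARAMS:
        if param.is_file():
            return parse_nested_flag(param.read_text())
    return None


if __name__ == "__main__":
    print("nested virtualisation:", nested_virtualisation_enabled())
```

A result of `None` simply means neither module was found, for example on a machine that is not a KVM host.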

I know that Proxmox has made a cluster simulator that you can install to test it.

@flatspin, I still keep tabs on the NethServer forum although I don't say much. :slight_smile:

Proxmox HA has come a very long way since I first started using Proxmox years ago. It is so much easier now. But I have to say I am personally not a fan of it in a large environment. It causes too much unnecessary shifting of VMs. Of course, I am strictly talking about the Proxmox HA feature. If I am not mistaken, with the latest Proxmox release there is no need for a 3rd node for fencing.

As for the general storage HA that @stephdl is talking about, if I am understanding it right: Ceph is, in my opinion, the best distributed, shared storage for VMs. Again, it works really well in a large environment. For a small environment, down to a 2-node Proxmox cluster, I believe Replication works great. There is no need for shared storage, so it is extremely budget friendly. Downtime is also minimal and it works out of the box. Since the replication can be scheduled every 5 minutes, the amount of lost work is kept very small. By using RaidZ2/Z3 locally on 2 nodes together with Replication, the SPOF is greatly reduced.
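
As a rough illustration of how the replication schedule bounds the lost work, the sketch below estimates the worst-case window: a failure just before a sync finishes loses everything written since the previous sync started. The function name and the 1-minute transfer time are assumptions for illustration, not measurements.

```python
def worst_case_loss_minutes(interval_min: float, transfer_min: float = 0.0) -> float:
    """Upper bound on lost work for snapshot-based replication.

    Assumption: a snapshot is taken at the start of each run and only
    becomes usable on the other node once its transfer has completed.
    """
    if interval_min <= 0 or transfer_min < 0:
        raise ValueError("interval must be positive and transfer time non-negative")
    return interval_min + transfer_min


# Intervals mentioned in this thread: 1, 5 and 15 minutes.
for interval in (1, 5, 15):
    loss = worst_case_loss_minutes(interval, transfer_min=1)
    print(f"sync every {interval:>2} min -> up to {loss:g} min of lost work")
```

The shorter the interval, the smaller the window, as long as each run finishes before the next one is due.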

We use Proxmox snapshots strictly for versioning purposes, as they never get backed up with a Proxmox full backup. Mostly for testing updates/patches on live VMs.

Thanks for your reply. Good to know that you’re still around and hearing us. :slight_smile:
