VM start fails after Proxmox update

Hello, I have installed the update to pve-manager/6.2-6/ee1d7754 on two nodes (Community Edition). Since then, the VMs won’t start anymore, neither automatically nor manually.
But in the task log the status is “OK”.

Node 1 with subscription remained untouched on pve-manager/6.2-4/9824574a.
What can I do to get the VMs running again?
Which information is still needed?

Best regards, Marko

You could boot into an older Proxmox kernel until you/we can find the problem?

@capote

Hi Marko

I have a few “No-Subscription” Proxmox installations around, and my official Home-Proxmox (with subscription). Versions are the same as yours: 6.2-6 on No-Subscription, 6.2-4 with subscription.

At the moment, all VMs start as expected.

If you move a non-starting VM (for testing) to the Proxmox 6.2-4 node, will it start without issues?

Are there any relevant log entries for Proxmox / QEMU?
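On a default install, these are the first places I would check (service names and paths assumed from a stock Debian-based PVE setup):

journalctl -u pvedaemon -u pveproxy -u pvestatd --since "1 hour ago"
journalctl -u pve-cluster --since "1 hour ago"

The per-task logs live under /var/log/pve/tasks/, and general messages end up in /var/log/syslog.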

My 2 cents
Andy

Hi Andy,
Simple and stupid question… I don’t know how to move a VM without migrating, and migrating is not possible.
I tried to delete the VM100 - no success and no error messages.
Then I tried to restore the VM100 (Zabbix-Server) VM from Backup. That produced the error “unable to restore VM 100 - can’t lock file ‘/var/lock/qemu-server/lock-100.conf’ - got timeout (500)”
The second attempt, immediately after the error message, restored the VM, and VM100 started fine.

In the next step I tried to restore VM200; it was finally successful, but with some hiccups:

restore vma archive: lzop -d -c /mnt/pve/VZDump-Backup/dump/vzdump-qemu-200-2020_06_16-00_00_03.vma.lzo | vma extract -v -r /var/tmp/vzdumptmp4661.fifo - /var/tmp/vzdumptmp4661
CFG: size: 701 name: qemu-server.conf
DEV: dev_id=1 size: 34359738368 devname: drive-scsi0
CTIME: Tue Jun 16 00:00:04 2020
error during cfs-locked 'storage-Disk-Images' operation: got lock timeout - aborting command
Formatting '/mnt/pve/Disk-Images/images/200/vm-200-disk-0.qcow2', fmt=qcow2 size=34359738368 
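(For reference, the equivalent one-step restore from the shell would be something like the following - the archive path and storage name are taken from the log above, and --force is only needed because a VM with that ID already exists:)

qmrestore /mnt/pve/VZDump-Backup/dump/vzdump-qemu-200-2020_06_16-00_00_03.vma.lzo 200 --storage Disk-Images --force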

I am curious to know what the cause might have been, so I can prevent it from happening again.
Thank you, Marko

Hi fausp,
I operate my nodes headless. How can I remotely boot into another kernel?
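From what I have read it should be possible via GRUB without console access, roughly like this - untested, and it assumes the node boots via GRUB (not systemd-boot) and that GRUB_DEFAULT=saved is configured in /etc/default/grub:

grep -E "menuentry |submenu " /boot/grub/grub.cfg | cut -d "'" -f2
grub-reboot "SUBMENU-TITLE>OLDER-KERNEL-ENTRY"
reboot

(SUBMENU-TITLE and OLDER-KERNEL-ENTRY are placeholders for the titles printed by the first command; grub-reboot only changes the default for the next boot.)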

@capote

Hi Marko

It’s high time that you set this up…
-> Fast Migration Cluster

Requirement: all Proxmox nodes are in the same network (no other requirements!)

All three should be at the same update level, or at least very close.
Next step is to create the cluster.

Here is an excerpt from my personal Proxmox cheat list:

Create the cluster

Log in via SSH to the first Proxmox VE node. Use a unique name for your cluster; this name cannot be changed later.

Create:

hp1# pvecm create YOUR-CLUSTER-NAME

pvecm create PVE-CLUST

To check the state of the cluster:

hp1# pvecm status

Adding nodes to the Cluster

Log in via SSH to the other Proxmox VE nodes. Please note that the nodes must not hold any VMs yet. (If they do, you will get conflicts with identical VMIDs - as a workaround, use vzdump to back them up and restore them to a different VMID after the cluster configuration.)

WARNING: Adding a node to the cluster will delete its current /etc/pve/storage.cfg. If you have VMs stored on the node, be prepared to add your storage locations back if necessary. Even though the storage locations disappear from the GUI, your data is still there.
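To be on the safe side, keep a copy of the node’s storage definitions before joining, so you can re-add them afterwards (the target filename is just an example):

hp2# cp /etc/pve/storage.cfg /root/storage.cfg.before-join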

Add the current node to the cluster:

hp2# pvecm add IP-ADDRESS-CLUSTER

For IP-ADDRESS-CLUSTER use an IP from an existing cluster node.

To check the state of the cluster:

hp2# pvecm status

Do this for all cluster members.
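Afterwards, every member should list all nodes:

hp2# pvecm nodes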

If a VM is locked, you can unlock it with:

qm unlock VMID
(Number)
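For example, for the VM 100 from earlier in this thread:

qm unlock 100

Whether a lock is set at all can be seen in the VM config (qm config 100 shows a “lock:” line while the VM is locked).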

Now you have a cluster which can live migrate any VMs!
A live migration, depending on RAM, takes about 90 seconds if using shared storage (VMs are on NAS / SAN) !!!
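From the shell, the same can be triggered with, for example:

qm migrate VMID TARGET-NODE --online

(TARGET-NODE is the name of the destination cluster node; --online keeps the VM running during the move.)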

Try it, it’s that easy!

Andy


I did, actually:
root@proxmox:~# pvecm status
Cluster information
-------------------
Name: pvecluster
Config Version: 8
Transport: knet
Secure auth: on

Quorum information
------------------
Date:             Mon Jun 22 22:29:25 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.2c0
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      3
Quorum:           3  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.3.200 (local)
0x00000002          1 192.168.3.201
0x00000003          1 192.168.3.204
root@proxmox:~# 

I have also activated fencing (I guess…).

That would have been the decisive clue:
qm unlock VMID
(Number)

“Unfortunately” everything is back up and running now, but I’ll remember that.

Now you have a cluster which can live migrate any VMs!
A live migration, depending on RAM, takes about 90 seconds if using shared storage (VMs are on NAS / SAN) !!!

Yes, normally yes, but this time no migration could be triggered; nothing happened.
Thanks anyway, Marko

@capote

Hi Marko

This is also important to know in cluster operation:

Any time a node is not working in a three-node cluster, your cluster loses “Quorate” status, meaning the cluster is no longer synchronized across all nodes.
You can’t start any VM anymore, nor can you migrate…

Use this command:

pvecm expected 1

This sets the required votes for Quorate-Status to one.
Now you can edit locked files in the PVE cluster config, boot VMs, and migrate.
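A quick way to check whether the override is (still) needed is to look at the vote counts (the grep just shortens the output):

pvecm status | grep -E "Expected votes|Total votes|Quorum|Quorate"

If “Total votes” is at or above “Quorum”, the cluster is already Quorate and the override is not needed.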

My 2 cents
Andy

OK, I did it.
Now I get…

root@proxmox:~# pvecm status
Cluster information
-------------------
Name: pvecluster
Config Version: 8
Transport: knet
Secure auth: on

Quorum information
------------------
Date:             Mon Jun 22 22:41:32 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.2c0
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      3
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.3.200 (local)
0x00000002          1 192.168.3.201
0x00000003          1 192.168.3.204
root@proxmox:~#

What is the difference?

You ONLY need to use that command when you have a cluster problem, or one or more nodes are not working correctly.

If you need to edit files in /etc/pve, an unlock will not help, but “pvecm expected 1” will help!

This should only be used if the cluster is no longer “Quorate”.
See Proxmox “Datacenter” -> Summary.

In your current case, I’d reboot ALL 3 Proxmox nodes, just to make sure your cluster is working as it should, eliminating the “pvecm expected 1” override!

My 2 cents
Andy


OK, I understand now. Thank you very much.
But I hadn’t realized that my cluster was in a bad state.

It CAN happen (very rarely) during an update.

In January 2020, our national provider (think T-Online for Germany) had two major outages. Even emergency police / ambulance and other services were disrupted. The only service working halfway well was the mobile network. The police and other emergency organizations even published a mobile number, as the normal emergency number could not be reached…

Without knowing this (I found out the next day…), I did an update on the network shown (SHZG), and the NethServer and one Proxmox got corrupted updates.

The next day I had to go there and fix the problem…

Andy

Sorry for the late answer… I think you already got it :slight_smile:

@capote and @Andy_Wismer
Could you please also translate your German parts for non-German-speaking people? I think it’s interesting for others too.

@m.traeumner

Hi Michael

Corrected - sorry, I sometimes get carried away, forgetting we’re not in a private chat!
Sh*t happens!
At least I clean up my mess myself! :slight_smile:

Andy
