Help - I have an emergency

Yes, I'll come back with a status update. In the meantime, thanks for your incredible support.

Hi all, thanks a lot again for your unbelievable help!!!

The plan was that the new system would be up and running Monday morning. All the problems started after this damned new Samsung T7 2 TB disk was plugged into the server for a backup. By then we had migrated all our systems to new hardware and software, and on Sunday night everything was up and running. At least it seems this brand new 2 TB disk was the culprit: some hours later the system panicked, and that is when I realized there were quite a few checksum errors on one of the two internal 2 TB NVMe disks. Only after this happened did we realize that the first time this disk was used, plugged into a new HP laptop, it had fried that laptop's RAM too…

We are waiting for onsite support to bring a replacement. Carsten and Andy, I cannot thank you enough. Btw Carsten, the USB sticks onto which you helped me back up all the VMs saved my life. After our lengthy and much appreciated call, I went on looking into the dumps and, as you had already suggested, decided that it was not really worth backing them up, so I deleted the pool. As I was already very tired, I forgot :facepalm: to save the VM configs from the db file and only thought about it after having destroyed the pool. I then thought never mind, could be worse, and recreated the VMs manually. Even the Windows machine, which I was afraid would lose its licence and need a new one, showed status activated, so no real harm done.

As I was not sure what would happen if I created the disks pointing to rpool-old (I was afraid they would be overwritten by a new empty disk), I created the VMs pointing to the empty rpool on the new Proxmox install and then edited the generated conf file to point to the disk in rpool-old, which worked fine.
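For anyone who wants to repeat the conf edit, here is a minimal sketch of the idea and not the exact steps used here. It assumes the usual Proxmox disk line format (`scsi0: <storage>:<volume>,size=...`) and that rpool-old has been added as a Proxmox storage. The storage IDs `local-zfs` and `old-zfs` and the function name `repoint_disks` are hypothetical.

```python
import re

# Keys that normally denote VM disks in a Proxmox VM conf file.
DISK_KEY = re.compile(r"^(?:scsi|sata|virtio|ide)\d+$")


def repoint_disks(conf_text, old_storage, new_storage):
    """Return conf_text with disk lines moved from old_storage to new_storage.

    Only the storage ID in front of the first ':' of the value is changed;
    the volume name and all options are kept as they are.
    """
    out = []
    for line in conf_text.splitlines():
        key, sep, value = line.partition(": ")
        if sep and DISK_KEY.match(key) and value.startswith(old_storage + ":"):
            value = new_storage + value[len(old_storage):]
            line = key + sep + value
        out.append(line)
    return "\n".join(out)


# Hypothetical conf content, for illustration only.
example = "cores: 2\nscsi0: local-zfs:vm-100-disk-0,size=32G\nide2: none,media=cdrom"
print(repoint_disks(example, "local-zfs", "old-zfs"))
```

The sketch only transforms text. Writing the result back to the real conf file is deliberately left out, and it is worth keeping a copy of the original first.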

Out of curiosity, I then deleted the VM. That is when one last panic came up, as I realized that the disk in rpool-old (with the real data) was gone too. But I realized soon enough (before going nuts :slight_smile: ) that fortunately we had saved everything to the two 500 GB sticks, so it was easy, after what you had taught me, to get the VM disk back with zfs send/receive.
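As a rough illustration of the send/receive restore, the sketch below only builds the command line and does not run anything. The pool name `stick1`, the dataset and snapshot names, and the restore target are assumptions, not the names actually used here. `zfs send` needs a snapshot as its source, which is why the helper checks for an `@`.

```python
import shlex


def send_receive_pipeline(source_snapshot, target_dataset):
    """Build a 'zfs send | zfs receive' shell pipeline as a string."""
    if "@" not in source_snapshot:
        raise ValueError("zfs send needs a snapshot, e.g. pool/dataset@name")
    send = ["zfs", "send", source_snapshot]
    receive = ["zfs", "receive", target_dataset]
    # shlex.join quotes each argument so the string is safe to paste into a shell.
    return " | ".join(shlex.join(cmd) for cmd in (send, receive))


# Hypothetical names: a backup pool on a USB stick and a target in rpool-old.
print(send_receive_pipeline("stick1/vm-100-disk-0@rescue", "rpool-old/vm-100-disk-0"))
```

Printing the command instead of executing it leaves a chance to double-check source and target before anything is written to a pool.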

The second disk is apparently fried, as it soon started producing checksum errors again after we plugged it back in and the resilver had completed.

There is one annoying thing remaining. As soon as we boot with both M.2 slots in use, the NIC no longer works on the newly installed Proxmox with separate boot disk / data pool. In the BIOS the NIC has link (yellow and green LEDs blinking), but as soon as Proxmox starts booting the LEDs go dark and the NIC doesn’t come up at all. We have seen this on two different machines of almost the same type. A BIOS reset and update did not help.

I am curious whether that persists or whether it will work with the replaced NVMe disk. However, on the initial install, with no separate boot pool, the NIC worked with both M.2 disks online, whereas now the Proxmox install was always done with both M.2 slots empty, as we wanted to avoid losing data by misconfiguring the Proxmox setup; the disks were only added after a successful Proxmox installation. I will try to use this problem to make HP replace not only my M.2 disk but also the mainboard. Frankly, I even distrust the RAM and CPU of this machine now, as I don't know what else this rotten Samsung T7 could have fried.

What is great, though, is that after two workdays the remaining M.2 disk shows zero read, write, or checksum errors :ok_hand:
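For keeping an eye on those counters, a small sketch that reads the READ/WRITE/CKSUM columns from `zpool status` text and lists devices where any counter is non-zero. The sample text, the pool name `datapool`, and the device names are made up for illustration. The counters are kept as strings because `zpool status` may abbreviate large values.

```python
def devices_with_errors(status_text):
    """Map device name -> (read, write, cksum) for rows with a non-zero counter."""
    bad = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True  # the device table starts after this header
            continue
        if not in_table:
            continue
        if not fields:
            in_table = False  # a blank line ends the device table
            continue
        if len(fields) < 5:
            continue  # section labels or rows without counters
        name, _state, read, write, cksum = fields[:5]
        if (read, write, cksum) != ("0", "0", "0"):
            bad[name] = (read, write, cksum)
    return bad


# Made-up sample in the usual layout, not output from this machine.
SAMPLE = """\
  pool: datapool
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        datapool     ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            nvme0n1  ONLINE       0     0     0
            nvme1n1  ONLINE       0     0    12

errors: No known data errors
"""

print(devices_with_errors(SAMPLE))
```

An empty result means every listed device reports zero read, write, and checksum errors.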

Tomorrow we will plug two other NVMe disks with no important data into the M.2 slots and try to boot a live Linux, e.g. sysrescuecd, to see if the NIC comes up while booting. Depending on the result, we will know whether the problem can be solved by installing Proxmox while both M.2 slots contain a disk, or whether we can blame HP that the machine has a problem, until we get it repaired so that both M.2 slots and the NIC do their damn job :laughing:

We had migrated back to our old systems on Tuesday, and then, thanks to the great help of Andy and Carsten, we were able to bring our new systems back online, so I had a hell of a lot of work to catch up on. Besides serving our clients, we had all our users sitting in front of new systems with a lot of questions, so there was no time to come back and give a status update earlier. Sorry for that.

I still have some minor problems to solve, like learning about the NethServer firewall / Shorewall, but that's stuff for a different story/thread.

Once more, you guys rock, and you saved me with your kind support. I was near panic and almost went nuts, and you guys really did a hell of a job, not only helping me solve the issues but also helping me calm down so I wouldn't go nuts or have a heart attack. So again: THAAANKKK YOUUUUU!!!

We booted from sysrescuecd with both M.2 slots equipped with dummy disks (without our actual prod data), and the NIC works fine. So I guess the Proxmox installation has a problem, since the NIC gets deactivated when Proxmox boots with both M.2 slots occupied. I hope that re-installing Proxmox on the boot pool on a SATA SSD, while both M.2 slots have a (dummy) NVMe disk installed, will solve this. I don’t really understand yet why Proxmox deactivates the NIC, as initially (when we had no separate boot pool on a SATA disk) Proxmox was installed on a mirror of two NVMe disks and there was no problem with NIC/LAN.

We reinstalled Proxmox on a SATA drive (ZFS) while both M.2 slots were equipped with NVMe disks, and everything is all right, meaning the NIC is active. Apparently Proxmox is a bit sensitive, as we now see that the NIC gets deactivated during boot when one of the two M.2 disks is removed. However, the replacement disk is installed and the mirror is resilvered and online. If an experienced ZFS user wants to help me reconfigure/optimize my data pool, I’d be happy, but that can wait; there is no hurry.