NethServer in Proxmox corrupted

capote · November 13, 2023, 9:31am

NethServer Version: 7.9
Proxmox: 8.0.4

I have just found that my NethServer hangs.
After a hard restart of the PVE node, I see this error on the console:

[    5.665308] XFS (dm-0): Internal error XFS_WANT_CORRUPTED_GOTO at line 1753 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_extent+0xaa/0x150 [xfs]
[    5.665916] XFS (dm-0): Internal error xfs_trans_cancel at line 993 of file fs/xfs/xfs_trans.c. Caller xfs_efi_recover+0x18e/0x1c0 [xfs]
[    5.666490] XFS (dm-0): Corruption of in memory data detected. Shutting down filesystem
[    5.666545] XFS (dm-0): Please umount the filesystem and rectify the problem(s)
[    5.666608] XFS (dm-0): Failed to recover intents

Generating "/run/initramfs/rdsosreport.txt"

Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.

:/#

Since access via the web console or SSH is not possible, I don’t know what I can do. I am not able to follow the instructions in the console message.

I have already restored an old backup, and it boots. That reassured me at first.
But at some point the error occurs again, and I cannot identify the trigger or the conditions.

Does anyone have any ideas?

Sincerely, Marko

Andy_Wismer · November 13, 2023, 9:55am

Hi @capote

A file system can become corrupted, even a virtual one!
This can usually be repaired. Unfortunately, XFS cannot easily be repaired in place: the system needs to be booted from a different system. Ext4 can be repaired on the fly, but it also needs repairs more often!
On the other hand, how stable is XFS? How long has your server been running without any file system issues? Probably a very long time!
If the error keeps recurring (in a virtual system), there’s probably a hardware defect somewhere (chip, I/O, RAM, etc.).

Start up the VM with a working backup and verify that it works!
Shut down the NethServer VM.
Boot it from the latest SystemRescueCD image.
This can easily repair NethServer's XFS file system (also used in Proxmox as default, if not set to use ZFS).
Use xfs_repair… Note: the NethServer 7 main filesystem is on LVM. The logical volume must NOT be mounted when running xfs_repair! (Ideal for SystemRescue!)

After that, reboot the NethServer and make a backup right away in Proxmox.
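
For the backup step, a minimal sketch of what the command line on the Proxmox host could look like. The VMID (100) and the storage name (local) are placeholders and not values from this thread; vzdump and its --mode, --storage and --compress options are the standard Proxmox backup tool and flags. The helper only prints the command so it can be checked before it is run.

```shell
# Sketch: build the vzdump command for one VM.
# "backup_cmd" is a hypothetical helper name; VMID and storage are placeholders.
backup_cmd() {
    vmid=$1
    storage=$2
    # Refuse anything that is not a plain number, to avoid typos in the VMID.
    case $vmid in
        ''|*[!0-9]*) echo "VMID must be numeric" >&2; return 1 ;;
    esac
    echo "vzdump $vmid --mode snapshot --storage $storage --compress zstd"
}
```

Calling backup_cmd 100 local prints the command; paste it into a root shell on the PVE node once the VMID and storage name match your setup.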

Watch the VM over the next couple of days; it should be OK!

My 2 cents
Andy

From my notes:

Boot from CD: SystemRescueCD:

xfs_repair /dev/sda1

xfs_repair /dev/VolGroup/lv_root

Done!

Note: SystemRescue has a GUI option (startx), but this is not needed here; the console is enough!
And as it’s a VM, the paths above are always correct!

/dev/sda1 is the default boot partition of NS7
/dev/VolGroup/lv_root is where the actual file system resides…

→ When repairing, do both!
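
A slightly more cautious version of the notes above, as a sketch only. It assumes you are root on the SystemRescue console and that nothing from the NethServer disk is mounted. The helper names run and repair_xfs are made up for this example; vgchange -ay and xfs_repair -n (check only, modify nothing) are standard LVM and xfsprogs options. By default it only prints the commands.

```shell
# Sketch: check, then repair, each XFS device passed as an argument.
# DRY_RUN=1 (the default) prints each command instead of running it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repair_xfs() {
    # Activate all LVM volume groups so /dev/VolGroup/lv_root shows up.
    run vgchange -ay
    for dev in "$@"; do
        # -n: no-modify mode, reports what a repair would change.
        run xfs_repair -n "$dev"
        # The actual repair; the device must not be mounted.
        run xfs_repair "$dev"
    done
}
```

With the paths from the notes this would be DRY_RUN=0 repair_xfs /dev/sda1 /dev/VolGroup/lv_root. If xfs_repair refuses to run because of a dirty log, its own error message explains the options (mount to replay the log, or -L as a last resort, which can lose recent changes), so read that message before going further.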

capote · November 13, 2023, 10:09am

Thank you Andy,

How long has your server been running without any file system issues?

My NethServer has been running stably for about two years.
I will try your suggestion.

Sincerely, Marko

LayLow · November 13, 2023, 11:58am

Indeed! Normal life is applicable in a virtual world.

capote · November 13, 2023, 1:44pm

I have done it and it seems to have worked. I’ll have to monitor it over the next few days.

Sincerely, Marko

france · November 13, 2023, 3:40pm

Hi, did you have the server in production with the previous version, or did you install Proxmox from scratch? Is the VM disk on NFS on a NAS, or local?

capote · November 13, 2023, 4:45pm

I upgraded Proxmox from 7.x.
The VM disks are on NFS mounts.
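
Since the VM disks sit on NFS, one thing that can be checked quickly is how the share is mounted. Nothing in this thread shows that NFS was the cause; this is only a sketch. The nfs(5) man page warns that "soft" mounts can cause silent data corruption in some cases, so the hypothetical helper below lists NFS mount points that use the soft option, reading input in /proc/mounts format.

```shell
# Sketch: print the mount point of every NFS mount that uses the "soft" option.
# "nfs_soft_mounts" is a made-up helper name; input is /proc/mounts format:
#   device mountpoint fstype options dump pass
nfs_soft_mounts() {
    awk '$3 ~ /^nfs4?$/ && $4 ~ /(^|,)soft(,|$)/ { print $2 }'
}
```

On the Proxmox host this would be run as nfs_soft_mounts < /proc/mounts. No output means no NFS mount uses soft.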

france · November 13, 2023, 6:18pm

Then I have to worry … I have the same configuration, but on Proxmox 7 at the moment.

Andy_Wismer · November 13, 2023, 8:32pm

@france

No real need to worry. If the error crops up (rare!), it can be fixed in Proxmox 7 just as easily as in Proxmox 8.
Keeping a copy of SystemRescueCD ready in the Proxmox ISO library always makes sense!

My 2 Cents
Andy

capote · November 13, 2023, 10:57pm

When I now connect to the NethServer console in Proxmox, I see this:

Andy_Wismer · November 13, 2023, 11:07pm

Any signs of hardware errors, e.g. when rebooting Proxmox?

capote · November 14, 2023, 1:25pm

No, Proxmox is still running fine.

filippo_carletti · November 14, 2023, 1:30pm

When I experienced virtual machine corruption it was caused by failing/failed hardware.

Andy_Wismer · November 14, 2023, 1:56pm

Just by looking at it?

So, were any “real” tests done: SMART checked for all disks, RAM test?

Almost the ONLY thing that makes a UN*X / Linux “barf” core like that is a hardware failure…
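
As a sketch of what such a disk check could look like on the Proxmox host: smartctl comes from the smartmontools package, -H prints the overall health verdict and -A the attribute table. The helper name smart_check and the disk names are placeholders; list your real disks with lsblk first. By default it only prints the commands. For RAM, a memtest86+ run from a boot medium is the usual approach and cannot be scripted from the running system.

```shell
# Sketch: print (or, with RUN=1, execute) a SMART check for each given disk.
# "smart_check" is a hypothetical helper name.
smart_check() {
    for disk in "$@"; do
        if [ "${RUN:-0}" = "1" ]; then
            # -H: overall health self-assessment, -A: vendor attributes
            smartctl -H -A "$disk"
        else
            echo "smartctl -H -A $disk"
        fi
    done
}
```

For example, RUN=1 smart_check /dev/sda /dev/sdb as root. A long self-test can be started separately with smartctl -t long on each disk and read back later with smartctl -l selftest.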

My 2 cents
Andy

LayLow · November 14, 2023, 2:02pm

Agree.

Andy_Wismer · November 14, 2023, 2:08pm

When booting such a host (suspected of hardware errors!), one almost MUST have a screen hooked up and watch it during boot, and for about 5 minutes after boot!!!

A lot of errors are shown on the screen; they are also buried deep in certain log files, if you know what you’re looking for.

Watching the screen is much easier!

(Normally, only the login appears, no error messages!)
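
If no screen is attached, the kernel log can be searched afterwards instead. A rough sketch: the pattern list is an assumption about what typically hints at storage or hardware trouble, not a complete catalogue, and hw_errors is a made-up helper name. It filters whatever is piped into it.

```shell
# Sketch: keep only log lines that commonly point at storage or hardware problems.
# The patterns are illustrative; extend them for your hardware.
hw_errors() {
    grep -Ei 'I/O error|Internal error|corrupt|Machine check|EDAC|not responding'
}
```

Typical use on the host would be journalctl -k -b --no-pager | hw_errors or dmesg | hw_errors. An empty result does not prove the hardware is healthy.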

My 2 cents
Andy

schulzstefan · November 14, 2023, 7:11pm

@capote

I assume you have already googled for issues or bugs when updating from Proxmox 7 to 8. You probably know this one: Nach Update auf 8... nach kurzer Zeit nicht mehr erreichbar | Proxmox Support Forum

It’s in German. As I don’t know which kernel you are using right now: basically, the thread says that downgrading to an older kernel, e.g. 5.15.108-1-pve, solves the problem of the machine becoming unreachable.

I don’t know whether a fix has already been rolled out. You could check, or roll back, just to see whether you reach stability again.

regards,
stefan

capote · November 16, 2023, 12:09am

Thank you Stefan,

Reportedly, kernel 6.2.16-16 fixes the problems.

I’m on Kernel 6.2.16-19.
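
A small sketch for checking this on any node: compare the running kernel (uname -r) against the version that reportedly contains the fix. The helper name kernel_at_least is made up, and it relies on sort -V (version sort) from GNU coreutils, which Proxmox ships.

```shell
# Sketch: succeed if version $1 is greater than or equal to version $2.
# Uses GNU "sort -V"; "kernel_at_least" is a hypothetical helper name.
kernel_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}
```

For example: kernel_at_least "$(uname -r)" 6.2.16-16 && echo "kernel is new enough". If you do want to test an older kernel, proxmox-boot-tool kernel list shows what is installed and proxmox-boot-tool kernel pin selects one for the next boots; whether the 5.15 kernel is still installed after the upgrade is something to verify first.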

Before I carry out any experiments, I will test the hardware as Andy suggested.
I just don’t have the time.

Sincerely, Marko

PS: German is fine