NethServer in Proxmox corrupted

NethServer Version: 7.9
Proxmox: 8.0.4

I have just noticed that my NethServer keeps hanging.
After a hard restart of the PVE node, I see this error on the console:

[    5.665308] XFS (dm-0): Internal error XFS_WANT_CORRUPTED_GOTO at line 1753 of file fs/xfs/libxfs/xfs_alloc.c.
Caller xfs_free_extent+0xaa/0x150 [xfs]
[    5.665916] XFS (dm-0): Internal error xfs_trans_cancel at line 993 of file fs/xfs/xfs_trans.c.
Caller xfs_efi_recover+0x18e/0x1c0 [xfs]
[    5.666490] XFS (dm-0): Corruption of in memory data detected.
Shutting down filesystem
[    5.666545] XFS (dm-0): Please umount the filesystem and rectify the problem(s)
[    5.666608] XFS (dm-0): Failed to recover intents
Generating "/run/initramfs/rdsosreport.txt"
Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.
:/#

Since access via the web console or SSH is not possible, I don’t know what I could do. I cannot carry out the instructions from the console message.

I have already restored an old backup, and it boots. That reassured me at first.
But at some point the error occurs again, and I cannot identify the trigger or the conditions.

Does anyone have any ideas?

Sincerely, Marko

Hi @capote

A file system can become corrupted, even a virtual one!
This can usually be repaired. Unfortunately, with XFS this is not easy to do in place: the machine has to be booted from a different system. Ext4 can be repaired on the fly, but it also needs repairs more often!
On the other hand, how stable is XFS? How long has your server been running without any file system issues? Probably a very long time!
If the error keeps recurring (in a virtual system), there’s probably a hardware defect somewhere (chip, I/O, RAM, etc.).

Start the VM from a working backup and verify that it works!
Shut down the NethServer VM.
Boot it from the latest SystemRescueCD image.
This can easily repair NethServer’s XFS file system (also used in Proxmox as default, if not set to use ZFS).
Use xfs_repair… Note: the NethServer 7 main file system is on LVM. The logical volume must NOT be mounted while xfs_repair runs! (Which makes SystemRescue ideal!)

After that, reboot the NethServer and make a backup right away in Proxmox.

Watch and observe the VM over the next couple of days, it should be OK!

My 2 cents
Andy

From my notes:

Boot from CD: SystemRescueCD:

xfs_repair /dev/sda1

xfs_repair /dev/VolGroup/lv_root

Done!

Note: SystemRescue has a GUI option (startx), but this is not needed here, console is enough!
And as it’s a VM, the paths above are always correct!

/dev/sda1 is the default boot partition of NS7
/dev/VolGroup/lv_root is where the actual file system resides…

→ When repairing, do both!
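
For anyone who wants the steps above as one sequence, here is a minimal sketch, to be run from the SystemRescue console. It assumes the two device paths from the notes above and adds two things that are not in the notes: `vgchange -ay` to make sure the logical volume is active (but not mounted), and a read-only pass with `xfs_repair -n` first. The `run` helper and `DRY_RUN` variable are just illustration; by default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch only: run from the SystemRescue console with the NethServer disk attached.
# By default the commands are only printed; set DRY_RUN=0 to really execute them.
DRY_RUN=${DRY_RUN:-1}

# run CMD...: print the command and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# repair_all DEVICE...: check, then repair, each XFS device in turn.
repair_all() {
    run vgchange -ay                  # activate LVM volume groups, do NOT mount them
    for dev in "$@"; do
        run xfs_repair -n "$dev"      # -n: inspect only, change nothing
        run xfs_repair "$dev"         # the actual repair
    done
}

# Device paths taken from the notes above (assumed unchanged on your VM).
repair_all /dev/sda1 /dev/VolGroup/lv_root
```

If xfs_repair refuses to start because the log is dirty, it says so itself. Its `-L` option zeroes the log and can lose the most recent changes, so treat that as a last resort.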

:slight_smile:

Thank you Andy,

How long has your server been running without any file system issues?

My NethServer has been running stably for about two years.
I will try your suggestion.

Sincerely, Marko

Indeed! Normal life is applicable in a virtual world.

I have done it and it seems to have worked. I’ll have to monitor it over the next few days.

Sincerely, Marko

Hi, did you have the server in production with the previous version, or did you install Proxmox from scratch? Is the VM disk on NFS on a NAS, or local?

I upgraded Proxmox from 7.x.
The VM is on an NFS mount.

Then I have to worry … I have the same configuration, but I am still on Proxmox 7 at the moment.

@france

No real need to worry. If the error crops up (rare!), it can be fixed in Proxmox 7 just as easily as in Proxmox 8.
Keeping a copy of SystemRescueCD ready in the Proxmox ISO library always makes sense!

My 2 Cents
Andy

When I now connect to the NethServer console in Proxmox, this is what I see:

:face_vomiting:

Any signs of a hardware error, e.g. when rebooting Proxmox?

No, Proxmox is still running fine.

When I experienced virtual machine corruption it was caused by failing/failed hardware.

Just by looking at it?

So, were any “real” tests done: SMART checked for all disks, RAM test?

Almost the ONLY thing that makes a UN*X / Linux “barf” core like that is a hardware failure…
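
A quick way to do the SMART part on the Proxmox host is sketched below. It assumes the smartmontools package is installed and that the disks show up as /dev/sdX or /dev/nvmeXnY; the function name `smart_check` is only for illustration. Since the VM disk lives on NFS here, the disks in the NAS deserve the same check. A RAM test such as memtest86+ has to be booted separately and is not covered.

```shell
#!/bin/sh
# Sketch: quick SMART check for all disks on the host (needs smartmontools).
# Assumption: disks appear as /dev/sdX or /dev/nvmeXnY; pass other names as arguments.

smart_check() {
    for disk in "$@"; do
        [ -b "$disk" ] || continue    # skip anything that is not a block device
        echo "== $disk =="
        smartctl -H "$disk"           # overall health verdict
        smartctl -A "$disk"           # attribute table (reallocated, pending sectors, ...)
    done
}

smart_check /dev/sd? /dev/nvme?n?
```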

My 2 cents
Andy

Agree.

When booting such a host (suspected of hardware errors!), one almost MUST have a screen hooked up and observe the screen during boot, and about 5 minutes after boot!!!

A lot of errors are shown on the screen, and in the very depths of certain logfiles, when you know what you’re looking for.

Watching the screen is much easier!

(Normally, only the login appears, no error messages!)
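
If no screen can be hooked up, roughly the same information can be pulled from the kernel log after boot. This is only a sketch: the search pattern is my own guess at what is interesting for this case (XFS, I/O errors, NFS timeouts, machine check events), and the function names are made up for the example.

```shell
#!/bin/sh
# Sketch: show error-level kernel messages of the current boot and
# keep only the lines that look storage- or hardware-related.

kernel_errors() {
    # -k: kernel messages only, -b: current boot, -p err: priority "err" and worse
    journalctl -k -b -p err --no-pager
}

storage_lines() {
    # Pattern is an assumption; extend it for your hardware.
    grep -Ei 'xfs|i/o error|nfs:|mce:|hardware error'
}

kernel_errors 2>/dev/null | storage_lines || echo "no matching kernel messages"
```

Run it on both the Proxmox host and inside the VM, since an NFS hiccup on the host can show up as XFS corruption in the guest.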

My 2 cents
Andy

@capote

I assume you have already googled for issues or bugs when updating from Proxmox 7 to 8. You probably know this one: Nach Update auf 8... nach kurzer Zeit nicht mehr erreichbar | Proxmox Support Forum

It’s in German. I don’t know which kernel you are using right now, but basically the thread says that downgrading to an older kernel, e.g. 5.15.108-1-pve, solves the problem of the machine becoming unreachable.

I don’t know if a fix has already been rolled out. You could check, or roll back, just to see whether stability returns.
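
A rollback test could look roughly like the sketch below, run on the Proxmox host. It assumes the old kernel is still installed after the upgrade and uses the kernel subcommands of `proxmox-boot-tool`; the `run` and `pin_kernel` helpers are only for illustration. By default it prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch for the Proxmox host: pin an older kernel as the default boot entry.
# By default the commands are only printed; set DRY_RUN=0 to really execute them.
DRY_RUN=${DRY_RUN:-1}
run() { echo "+ $*"; if [ "$DRY_RUN" = "0" ]; then "$@"; fi; }

pin_kernel() {
    run uname -r                               # kernel currently running
    run proxmox-boot-tool kernel list          # kernels available to boot
    run proxmox-boot-tool kernel pin "$1"      # boot this one by default
}

# Version taken from the thread mentioned above; it must appear in the list output.
pin_kernel 5.15.108-1-pve
```

After a reboot, `uname -r` should report the pinned version. `proxmox-boot-tool kernel unpin` goes back to booting the newest kernel.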

regards,
stefan

Thank you Stefan,

It is assumed that kernel 6.2.16-16 fixes the problems.

I’m on Kernel 6.2.16-19.

Before I carry out any experiments, I will test the hardware as Andy suggested.
I just don’t have the time right now.

Sincerely, Marko

PS: German is fine :slight_smile: