Proxmox snapshot of NethServer virtual server is reported as faulty

jfernandez · January 27, 2020, 9:21pm

NethServer Version: 7.7.1908 (final)

I have a NethServer VM (used as a mail server) running on Proxmox 5.4 (VM ID 9007). Last Sunday (26 January), a scrub cron job on my ZFS pool reported the following:

# zpool status -xv
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 5h11m with 1 errors on Sun Jan 26 07:11:47 2020
config:

	NAME                                 STATE     READ WRITE CKSUM
	rpool                                ONLINE       0     0     1
	  mirror-0                           ONLINE       0     0     0
	    sda3                             ONLINE       0     0     0
	    sdb3                             ONLINE       0     0     0
	  mirror-1                           ONLINE       0     0     2
	    sdc                              ONLINE       0     0     2
	    ata-ST2000DM008-2FR102_WFL11A52  ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

        rpool/data/vm-9007-disk-0@before_update_20200120:<0x1>

journalctl -xef -p 3 and dmesg haven’t reported any errors regarding the NS VM virtual disk, but I would like to run a filesystem check and also clear the error message on my ZFS pool.

Since NS uses LVM, I don’t know the procedure for running a filesystem check. Any help, please?

Andy_Wismer · January 27, 2020, 9:48pm

Hi,
NS does use LVM, but the filesystem is XFS. Boot the VM from a rescue image, for example SystemRescueCD, and then the filesystem can be repaired. AFAIK XFS doesn’t run a filesystem check at boot the way ext4 can. Usually it’s stable enough…

SystemRescueCD can handle both the boot partition and the LVM (root) volume, and it also contains xfs_repair…

Use xfs_repair…

See here for some details: https://serverfault.com/questions/777299/proper-way-to-deal-with-corrupt-xfs-filesystems

This works…
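As a rough sketch of the steps described above, assuming the VM has been booted from a rescue image and none of the NethServer filesystems are mounted: the script activates LVM, lists the logical volumes, and runs xfs_repair in check-only mode. The `ROOT_LV` path is a placeholder (an assumption, not the real volume name) and has to be replaced with whatever `lvs` reports. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch only: run from a rescue environment with the filesystem unmounted.
# ROOT_LV is a placeholder (assumption); take the real path from `lvs`.
ROOT_LV="${ROOT_LV:-/dev/VG_NAME/LV_NAME}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

check_xfs_root() {
    run vgchange -ay               # activate all LVM volume groups
    run lvs                        # list logical volumes to find the root LV
    run xfs_repair -n "$ROOT_LV"   # -n: check only, report problems, change nothing
}

check_xfs_root
```

If the check-only pass reports damage, running xfs_repair on the same volume without `-n` attempts the actual repair.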

My 2 cents

france · January 27, 2020, 11:20pm

Hi, I also run NethServer 7 on Proxmox 6.1.5. I had a problem similar to yours in the past, but without snapshot corruption. While Proxmox was backing up the NethServer 7 VM, data transfer on ZFS was slow, and all I found in the NethServer 7 logs were CPU latency errors or clock loss caused by the slowness and the pauses during the snapshot, without any corruption. I currently use the qemu-guest-agent add-on and have moved the VM to SSD while I wait to configure my ZFS volume with an SSD cache disk.

bobtskutter · January 28, 2020, 8:40am

@jfernandez I don’t claim to be a ZFS expert, but…
Mirror-1 appears to have the faults, mirror-0 is OK. Have you got mirror-1 running off a different storage controller to mirror-0?
The storage controller for mirror-1 looks faulty to me. Check cables, cards & power connectors.

If you have two storage controllers (A/B) and 4 disks (1, 2, 3, 4), you need to set up the mirrors so that each mirror has one disk from each controller.
e.g.
mirror-0 = disk1_controllerA + disk3_controllerB
mirror-1 = disk2_controllerA + disk4_controllerB

That way you can tolerate a controller failure and still have data redundancy.
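To illustrate that layout, here is a minimal sketch that pairs disks across two controllers and prints the matching `zpool create` command. The pool name `tank` and the disk names are hypothetical placeholders; on a real system you would use the device paths from /dev/disk/by-id. The script only prints the command, since creating a pool wipes the disks involved.

```shell
#!/bin/sh
# Sketch: build a `zpool create` command that pairs disks across two controllers.
# The pool name "tank" and the disk names are hypothetical placeholders.
build_zpool_create() {
    pool=$1
    a_disks=$2   # space-separated disks on controller A
    b_disks=$3   # space-separated disks on controller B, same count
    vdevs=""
    i=1
    for a in $a_disks; do
        # Pick the i-th disk from controller B as the mirror partner.
        b=$(echo "$b_disks" | cut -d' ' -f"$i")
        vdevs="$vdevs mirror $a $b"
        i=$((i + 1))
    done
    # Only print the command; it is not executed here.
    echo "zpool create $pool$vdevs"
}

build_zpool_create tank "disk1 disk2" "disk3 disk4"
# prints: zpool create tank mirror disk1 disk3 mirror disk2 disk4
```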

FYI, your rpool is a stripe across two mirrored vdevs. ZFS spreads data across the two mirrors, and each mirror keeps two copies of its share. You can tolerate the failure of 1 disk in each mirror, but not of both disks in the same mirror. Here both disks in mirror-1 show checksum errors, which is why you’re getting a data loss error.
Have a look at these links:
https://docs.oracle.com/cd/E19253-01/819-5461/gazhv/index.html
https://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/zfs-zpool.html
or have a read about RaidZ1.

FYI, the scrub error comes from Proxmox and ZFS, not NethServer. AFAIK your Neth VM disk is OK; it’s the snapshot data that’s corrupt, i.e. the OLD VM disk data from before you upgraded.

A ZFS snapshot holds the OLD data as it was at the moment the snapshot was created.

regards
bob

jfernandez · January 28, 2020, 7:27pm

So if I just delete the snapshot and re-run the scrub, the error should disappear, right?

bobtskutter · January 28, 2020, 8:02pm

Yes, I believe so. I’ve never been in this situation myself, but I’ve seen other posts elsewhere about fixing such errors, and they all say “delete the file and then scrub the pool”.
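A minimal sketch of that "delete, then scrub" sequence, using the pool, VM ID, and snapshot name from the zpool status output earlier in the thread. It assumes the snapshot was created through Proxmox, so it is removed with `qm delsnapshot` rather than directly with `zfs destroy`. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Sketch only: names are taken from the zpool status output in this thread.
POOL="rpool"
VMID="9007"
SNAP="before_update_20200120"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

fix_snapshot_error() {
    # Remove the snapshot through Proxmox so the VM config stays consistent.
    run qm delsnapshot "$VMID" "$SNAP"
    # Start a new scrub; this returns immediately and runs in the background.
    run zpool scrub "$POOL"
    # Show scrub progress and the current list of files with errors.
    run zpool status -v "$POOL"
}

fix_snapshot_error
```

Once a scrub has completed cleanly, `zpool clear rpool` resets the READ/WRITE/CKSUM counters. In some cases the error entry is reported to linger until a further scrub has completed, so it may be worth scrubbing twice before concluding that the error is still there.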

regards
bob