[SOLVED] Help 😓 Issues with md

Hi there

Anybody want to give me a hand?

One of my servers failed last night:

A Fail event had been detected on md device /dev/md2.

It could be related to component device /dev/sdb2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [linear] [multipath] [raid10] 
md4 : active raid1 sda4[0] sdb4[1]
     3884960704 blocks [2/2] [UU]
     bitmap: 10/29 pages [40KB], 65536KB chunk

md2 : active raid1 sdb2[1](F) sda2[0]
     20478912 blocks [2/1] [U_]

unused devices: <none>

Then, 4 hours later :

A DegradedArray event had been detected on md device /dev/md2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [linear] [multipath] [raid10] 
md4 : active raid1 sda4[0] sdb4[1]
     3884960704 blocks [2/2] [UU]
     bitmap: 4/29 pages [16KB], 65536KB chunk

md2 : active raid1 sdb2[2](F) sda2[0]
     20478912 blocks [2/1] [U_]

unused devices: <none>
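
For reference, this is how I'm inspecting the arrays in more detail (just the standard mdadm status calls, using the device names above):

mdadm --detail /dev/md2      # overall array state; shows which member failed
mdadm --examine /dev/sdb2    # superblock info for the suspect partition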

The server had crashed. It's a Proxmox server.

I rebooted and it started (still in degraded mode, as expected), but minutes later it crashed again.

Now it starts in maintenance mode only :frowning:

Boot messages:

And a minute later:

Does it ring a bell?

Thanks :grimacing:

Hi @pagaille

This does not look like a “standard” Proxmox disk layout.
This looks more like a hand-made setup.
Proxmox, if not using ZFS, will use LVM.

Please show the contents of your /etc/fstab

Second: does sdb show up at all?
It could be that one of the two disks is badly defective.
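
A quick sketch of how to check (assuming the suspect disk is still /dev/sdb; smartctl comes with the smartmontools package):

lsblk                  # does sdb appear at all?
dmesg | grep -i sdb    # any kernel I/O errors for it?
smartctl -a /dev/sdb   # SMART health status and error counters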

My 2 cents
Andy


That’s OVH’s Proxmox image. I don’t remember having changed anything in that respect, but it was 7 years ago :blush:

OK: OVH is an option, but it isn’t standard Proxmox.


My tip:

First verify that both disks are working correctly, e.g. by booting the system with SystemRescue and using it to test both disks individually.

If the disks seem OK, then it would be worth trying an md remirror.
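
Something along these lines (a sketch, assuming the failed member is still sdb2 as in your output):

smartctl -t long /dev/sdb    # start a long surface self-test (takes hours)
smartctl -a /dev/sdb         # read the result once the test has finished

# If the disk tests OK, drop the failed member and re-add it to remirror
mdadm /dev/md2 --remove /dev/sdb2
mdadm /dev/md2 --add /dev/sdb2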

Can you show the full output of cat /proc/mdstat?

My 2 cents
Andy


sdb is dead, but it has just been replaced. I’ll try to remirror it and hope it works; see the sketch below.
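
Roughly the plan (assuming the new disk shows up as /dev/sdb again, and GPT disks, so sgdisk from the gdisk package; on MBR the equivalent would be sfdisk -d piped into sfdisk):

sgdisk -R /dev/sdb /dev/sda    # replicate sda's partition table onto the new sdb
sgdisk -G /dev/sdb             # randomize GUIDs so the copy doesn't clash

mdadm /dev/md2 --add /dev/sdb2    # re-add the new partitions to the mirrors
mdadm /dev/md4 --add /dev/sdb4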

I don’t understand why the server crashed, since there is still one working disk :slightly_frowning_face:

MD RAID is software RAID; there is no hardware isolation like a hardware RAID controller provides.

In rare situations this can crash a server. And hardware errors are one of the few things that can take down ANY Linux / UN*X server!

Good luck with the rebuild!

My 2 cents
Andy


I reconstructed md2. It was fast - too fast for my taste.

Still not working. md4 (no idea what could be inside it) is “inactive”.

If you have any idea, shoot.


EDIT: oh, I found this: mdadm RAID array with failed drive set to inactive after boot

I would say, seeing the output, that the config lost the so-called RAID personality.

Above, md4 seems to be asking which personality to use.
md2, on the other hand, is clearly labeled.


Sometimes one can’t see the forest for the trees


:slight_smile:

But it would help to know what filesystem is in place on both md devices.

Proxmox 7 years ago tended to use XFS.

But you can still redefine the md RAID with the appropriate personality (RAID1) and restart the mirroring.
(Re-add sdb4? A sketch follows.)
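
A sketch of that, with the device names from your output (stop the half-assembled array first):

mdadm --stop /dev/md4                            # stop the inactive array
mdadm --assemble /dev/md4 /dev/sda4 /dev/sdb4    # reassemble from its members

# If sdb4 gets rejected during assembly, add it back afterwards
mdadm /dev/md4 --add /dev/sdb4

blkid /dev/md2 /dev/md4    # this also answers the filesystem question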

→ Good luck!

My 2 cents
Andy


The server is starting!! Still some errors, but at least I have an SSH console :sweat_smile:


THANK YOU Andy. Everything is working again. :sweat_smile:


For the record:

OVH configured its Proxmox image like this (easier to figure out with a GUI):

Software RAID:

  • md2 → mounted as / (20 GB)
  • md4 → LVM → mounted on /var/lib/vz as ext4 (3.5 TB)

Partitions:
sda3 (reported as a RAID partition): swap space (1 GB)
sdb3: same

That’s why the machine was booting (md2 was working normally) but not much else.

Now why did md4 break during the crash? No idea. The config file was OK. All I had to do was:

mdadm --run /dev/md4              # start the partially assembled (inactive) array
mdadm --readwrite /dev/md4        # clear the read-only flag
mdadm /dev/md4 --add /dev/sdb4    # re-add the replaced member

And then it started the recovery and that was all.
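
I watched the resync the usual way:

watch -n 5 cat /proc/mdstat    # live rebuild progress
mdadm --detail /dev/md4        # rebuild percentage and array state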

Recap and conclusion:

  • The hard disk broke during the night (mail received)
  • In the morning the system had crashed (not expected)
  • I restarted: it booted normally (md4 was therefore OK)
  • Then it crashed (again, as it had during the night): I believe that was the moment md4 went inactive.

I’m tempted to think software RAID is not as reliable as it should be. Is there anything to do besides switching to hardware RAID?


ZFS looks like the way to go



@pagaille

7 years ago, Proxmox itself did not really support MD-based systems; it was possible with a Debian install and Proxmox on top.

Proxmox now supports so many more different setups (Samba for VMs - I’d still prefer NFS).

Disks can also die with ZFS, but it is Rock Solid (written in capitals!).

I have switched all systems to ZFS storage, except for three systems using hardware RAID for the system disk. These three all have a PCIe NVMe controller with mirrored ZFS-formatted NVMe drives.
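
For illustration, a mirrored pool is a one-liner (pool name and device paths here are made up):

# Two-way NVMe mirror; ashift=12 assumes 4K physical sectors
zpool create -o ashift=12 tank mirror /dev/nvme0n1 /dev/nvme1n1
zpool status tank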

My 2 cents
Andy

Yep. At that time ZFS was still very young and quite intimidating. Yesterday I played with it on a VM and I must say it’s pretty convincing.

7 years ago? Fully agree.

I started to move my clients about 5 years ago. Never any issues with ZFS.
Disks died, disks were replaced. And back in sync much faster than md.


Today, I would not want to miss it.

zfs send, zfs receive, syncing - all done by the file system, not the OS - amazing.
And Rock Solid.
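
E.g. replicating a dataset to another box (all names are placeholders):

zfs snapshot tank/vmdata@nightly    # point-in-time snapshot
zfs send tank/vmdata@nightly | ssh backuphost zfs receive -F backup/vmdata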

My 2 cents
Andy