XFS RAID5 and smartd error

loryaegis · July 1, 2020, 9:13am

NethServer Version: 7.8 2003
Module: storage

Good morning, everyone. I’m facing a very strange problem.
I have a RAID5 array and it appears that one of the drives, /dev/sdb, is malfunctioning.

I can’t remove it through the Cockpit interface: it tells me to add a new disk first… but if I don’t remove the old one first, I can’t insert a new one.

I think there’s something wrong with the disk; I’d like to remove it and have it replaced…

This is what we found in the NethServer logs:
Device: /dev/sdb [SAT], 14 Offline uncorrectable sectors
Device: /dev/sdb [SAT], 14 Currently unreadable (pending) sectors

[root@srv01 ~]# sudo smartctl --quietmode=errorsonly --all /dev/sdb
ATA Error Count: 50 (device log contains only the most recent five errors)
Error 50 occurred at disk power-on lifetime: 20 hours (0 days + 20 hours)
Error 49 occurred at disk power-on lifetime: 20 hours (0 days + 20 hours)
Error 48 occurred at disk power-on lifetime: 2 hours (0 days + 2 hours)
Error 47 occurred at disk power-on lifetime: 2 hours (0 days + 2 hours)
Error 46 occurred at disk power-on lifetime: 2 hours (0 days + 2 hours)


[root@srv01 ~]# smartctl -a /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1127.13.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf Pro
Device Model:     ST2000NE0025-2FL101
Serial Number:    ZDS1AJSQ
LU WWN Device Id: 5 000c50 0c4ffd5c1
Firmware Version: EN02
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul  1 11:05:53 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline 
data collection: 		(  575) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 221) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x50bd)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   067   049   044    Pre-fail  Always       -       11451636
  3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   061   060   045    Pre-fail  Always       -       1467594
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       20 (233 71 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       5
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   052   052   000    Old_age   Always       -       48
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   064   040    Old_age   Always       -       32 (Min/Max 31/33)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       17
194 Temperature_Celsius     0x0022   032   040   000    Old_age   Always       -       32 (0 24 0 0 0)
195 Hardware_ECC_Recovered  0x001a   070   064   000    Old_age   Always       -       11451636
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       14
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       14
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       19 (2 138 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       9039839
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3912366611

SMART Error Log Version: 1
ATA Error Count: 50 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 50 occurred at disk power-on lifetime: 20 hours (0 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d4 00 82 4f c2 00 00      03:12:00.223  SMART EXECUTE OFF-LINE IMMEDIATE
  b0 d0 01 00 4f c2 00 00      03:12:00.167  SMART READ DATA
  ec 00 01 00 00 00 00 00      03:12:00.145  IDENTIFY DEVICE
  ec 00 01 00 00 00 00 00      03:12:00.132  IDENTIFY DEVICE
  ea 00 00 00 00 00 a0 00      03:11:56.571  FLUSH CACHE EXT

Error 49 occurred at disk power-on lifetime: 20 hours (0 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d4 00 82 4f c2 00 00      03:11:46.684  SMART EXECUTE OFF-LINE IMMEDIATE
  b0 d0 01 00 4f c2 00 00      03:11:46.625  SMART READ DATA
  ec 00 01 00 00 00 00 00      03:11:46.614  IDENTIFY DEVICE
  ec 00 01 00 00 00 00 00      03:11:46.613  IDENTIFY DEVICE
  ea 00 00 00 00 00 a0 00      03:11:46.548  FLUSH CACHE EXT

Error 48 occurred at disk power-on lifetime: 2 hours (0 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 58 ff ff ff 4f 00      00:49:38.574  WRITE FPDMA QUEUED
  61 00 c0 ff ff ff 4f 00      00:49:38.574  WRITE FPDMA QUEUED
  61 00 50 ff ff ff 4f 00      00:49:38.574  WRITE FPDMA QUEUED
  60 00 48 ff ff ff 4f 00      00:49:38.574  READ FPDMA QUEUED
  60 00 b0 ff ff ff 4f 00      00:49:38.573  READ FPDMA QUEUED

Error 47 occurred at disk power-on lifetime: 2 hours (0 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 b0 ff ff ff 4f 00      00:49:26.819  READ FPDMA QUEUED
  60 00 48 ff ff ff 4f 00      00:49:26.819  READ FPDMA QUEUED
  61 00 50 ff ff ff 4f 00      00:49:26.819  WRITE FPDMA QUEUED
  61 00 c0 ff ff ff 4f 00      00:49:26.819  WRITE FPDMA QUEUED
  61 00 58 ff ff ff 4f 00      00:49:26.818  WRITE FPDMA QUEUED

Error 46 occurred at disk power-on lifetime: 2 hours (0 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 80 0f 20 00  Error: UNC at LBA = 0x00200f80 = 2101120

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00      00:24:09.371  READ FPDMA QUEUED
  60 00 80 80 0f 20 40 00      00:24:09.370  READ FPDMA QUEUED
  2f 00 01 13 00 00 a0 00      00:24:09.370  READ LOG EXT
  ef 10 02 00 00 00 a0 00      00:24:09.369  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00      00:24:09.369  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended captive    Completed: read failure       90%        20         2099214
# 2  Extended captive    Completed: read failure       90%        20         2099214

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@srv01 ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] 
md127 : active raid5 sda3[0] sdd1[4] sdb1[1] sdc1[2]
      5856384000 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 5/15 pages [20KB], 65536KB chunk

unused devices: <none>

Andy_Wismer · July 1, 2020, 7:15pm

@loryaegis

Good morning Laurinz

NethServer uses mdadm to create RAID arrays (level 1, 5 or others). You can use the usual mdadm commands to remove, add, deactivate or activate a disk in the RAID.

Make sure you have a working backup!!!

Example for you:

Find your RAID first with, e.g.:

cat /proc/mdstat
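
If the box has more than one array, a small helper can pick out the one that holds the suspect disk. This is only a sketch: `find_array_for` is a made-up name, not part of mdadm, and it just scans text in the /proc/mdstat format on standard input.

```shell
# Hypothetical helper (not an mdadm command): print "<array> <member>"
# for every md array whose member list contains a partition of the
# given disk. Reads /proc/mdstat-style text on stdin.
find_array_for() {
  awk -v disk="$1" '
    $1 ~ /^md/ && $2 == ":" {
      for (i = 3; i <= NF; i++) {
        member = $i
        sub(/\[.*$/, "", member)   # strip the "[1]" slot index and flags such as "(F)"
        if (index(member, disk) == 1)
          print "/dev/" $1, "/dev/" member
      }
    }'
}

# Usage on a live system:
#   find_array_for sdb < /proc/mdstat
```

With the mdstat output posted above, this should report /dev/md127 together with /dev/sdb1.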

Then (from man mdadm…):

MANAGE MODE
       Usage: mdadm device options... devices...

This usage will allow individual devices in an array to be failed, removed or added. It is possible to perform multiple operations with one command. For example:

`mdadm /dev/md0 -f /dev/hda1 -r /dev/hda1 -a /dev/hda1`

will firstly mark /dev/hda1 as faulty in /dev/md0 and will then remove it from the array and finally add it back in as a spare. However only one md array can be affected by a single command.

Mark the disk as faulty and remove it
Replace the disk
Add the new disk as a hot spare

Make the RAID use the hot spare!
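
Applied to the array from the first post, the steps above could look roughly like the sketch below. The names /dev/md127 and /dev/sdb1 come from the posted /proc/mdstat; that the replacement disk shows up again as /dev/sdb with a prepared partition sdb1 is an assumption, so check before running anything. The helpers `run`, `retire_member` and `add_member` are made up for this example. By default nothing is executed, the commands are only printed; set RUN=1 to run them for real.

```shell
# Dry-run wrapper (hypothetical): print the command unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '+ %s\n' "$*"
  fi
}

# Step 1: mark the member as faulty, then remove it from the array.
retire_member() {
  array=$1
  member=$2
  run mdadm "$array" --fail "$member"
  run mdadm "$array" --remove "$member"
}

# Step 2 (after the physical swap and after partitioning the new disk):
# add the new partition. On a degraded array md normally starts
# rebuilding onto it by itself; progress shows up in /proc/mdstat.
add_member() {
  array=$1
  member=$2
  run mdadm "$array" --add "$member"
  run mdadm --detail "$array"
}

# Example plan for the array in this thread (device names assumed):
#   retire_member /dev/md127 /dev/sdb1
#   ...swap the disk, create a partition at least as large as the old one...
#   add_member /dev/md127 /dev/sdb1
```

The two halves are kept separate on purpose, because the disk swap has to happen in between.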

Now your RAID should be repaired, or at least in the process of resyncing…

After the RAID is repaired, reboot your NethServer from a copy of SystemrescueCD and do a check / repair on the XFS filesystem. Unlike ext4, XFS isn’t checked or repaired on every boot; the repair needs to be done with the filesystem (partition) unmounted. I use SystemrescueCD for this, it works quite well. (See xfs_repair, e.g. on Google.)
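
A rough sketch of that check, run from the rescue system with the filesystem unmounted. The device path is an assumption: if LVM sits on top of the array, the XFS filesystem lives on a logical volume and not directly on /dev/md127, and the array may be assembled under a different name when booted from the rescue CD. `run` and `xfs_step` are made-up helper names; by default the commands are only printed, set RUN=1 to execute them.

```shell
# Dry-run wrapper (hypothetical): print the command unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '+ %s\n' "$*"
  fi
}

# xfs_step check  <device>  -> xfs_repair -n (no-modify mode, only reports problems)
# xfs_step repair <device>  -> xfs_repair (actually writes fixes)
xfs_step() {
  mode=$1
  dev=$2
  case "$mode" in
    check)  run xfs_repair -n "$dev" ;;
    repair) run xfs_repair "$dev" ;;
    *)      echo "usage: xfs_step check|repair <device>" >&2; return 2 ;;
  esac
}

# Example (device name assumed, filesystem must not be mounted):
#   xfs_step check  /dev/md127
#   xfs_step repair /dev/md127
```

Running the check first and reading its report before doing the real repair is the more cautious order.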

Hope this helps

My 2 cents
Andy

loryaegis · July 2, 2020, 9:04am

Thank you, very kind. Maybe it was an obvious answer, but it’s a delicate case.

Very kind of you!