Recently I started getting SMART warnings from one of the disks in my home NAS (a QNAP TS-419P II armel/kirkwood device running Debian Jessie):
Device: /dev/disk/by-id/ata-ST3000DM001-1CH166_W1F2QSV6 [SAT], Self-Test Log error count increased from 0 to 1
This meant it was time to swap that disk out of the RAID5 array.
Since every time this happens I have to go and look up what to do again, I've decided to write it down this time.
I configure SMART to refer to devices by-id (which gives me their model and serial number), so first I needed to figure out what the kernel was calling this device (although mdadm is happy with the by-id path, various other bits are not):
# readlink /dev/disk/by-id/ata-ST3000DM001-1CH166_W1F2QSV6
../../sdd
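An equally good way to get the mapping for all of the disks at once, I think, is just to list the symlinks:
# ls -l /dev/disk/by-id/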
Next I needed to mark the device as failed in the array:
# mdadm --detail /dev/md0
/dev/md0:
[...]
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
[...]
Number Major Minor RaidDevice State
5 8 48 0 active sync /dev/sdd
1 8 32 1 active sync /dev/sdc
6 8 16 2 active sync /dev/sdb
4 8 0 3 active sync /dev/sda
# mdadm --fail /dev/md0 /dev/disk/by-id/ata-ST3000DM001-1CH166_W1F2QSV6
mdadm: set /dev/disk/by-id/ata-ST3000DM001-1CH166_W1F2QSV6 faulty in /dev/md0
# mdadm --detail /dev/md0
/dev/md0:
[...]
Active Devices : 3
Working Devices : 3
Failed Devices : 1
Spare Devices : 0
[...]
Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 32 1 active sync /dev/sdc
6 8 16 2 active sync /dev/sdb
4 8 0 3 active sync /dev/sda
5 8 48 - faulty /dev/sdd
If it had been the RAID subsystem rather than SMART monitoring which had first spotted the issue then this would have happened already (and I would have received a different mail, from the RAID checks rather than from SMART).
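For reference, those RAID mails come from mdadm's monitor mode, which picks up the destination from the MAILADDR line in /etc/mdadm/mdadm.conf; I believe a one-shot test run is the quickest way to confirm that path works (it should send a test alert for each array):
# grep MAILADDR /etc/mdadm/mdadm.conf
# mdadm --monitor --scan --oneshot --test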
Once the disk is marked as failed it can actually be removed from the array:
# mdadm --remove /dev/md0 /dev/disk/by-id/ata-ST3000DM001-1CH166_W1F2QSV6
mdadm: hot removed /dev/disk/by-id/ata-ST3000DM001-1CH166_W1F2QSV6 from /dev/md0
And finally tell the kernel to delete the device:
# echo 1 > /sys/block/sdd/device/delete
At this point I could physically swap the disks.
While doing so I noticed some interesting messages in dmesg, either from the echo to the delete node in sysfs or from the physical swap of the disks:
[1779238.656459] md: unbind<sdd>
[1779238.659455] md: export_rdev(sdd)
[1779258.686720] sd 3:0:0:0: [sdd] Synchronizing SCSI cache
[1779258.700507] sd 3:0:0:0: [sdd] Stopping disk
[1779259.377589] ata4.00: disabled
[1779371.126202] ata4: exception Emask 0x10 SAct 0x0 SErr 0x180000 action 0x6 frozen
[1779371.133740] ata4: edma_err_cause=00000020 pp_flags=00000000, SError=00180000
[1779371.141003] ata4: SError: { 10B8B Dispar }
[1779371.145309] ata4: hard resetting link
[1779371.468708] ata4: SATA link down (SStatus 0 SControl 300)
[1779371.474340] ata4: EH complete
[1779557.416735] ata4: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xe frozen
[1779557.424356] ata4: edma_err_cause=00000010 pp_flags=00000000, dev connect
[1779557.431264] ata4: SError: { PHYRdyChg DevExch }
[1779557.436008] ata4: hard resetting link
[1779563.357089] ata4: link is slow to respond, please be patient (ready=0)
[1779567.449096] ata4: SRST failed (errno=-16)
[1779567.453316] ata4: hard resetting link
I wonder if I should have used another method to detach the disk, perhaps poking the controller rather than the disk (which rang a vague bell in my memory from last time this happened), but in the end the disk is broken and the kernel seems to have coped, so I'm not too worried about it.
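For next time, I think the trick to finding the right bit of the controller to poke is to follow the disk's sysfs link before deleting it; the resolved path should include the ataN port and hostN SCSI host the disk hangs off (sdd here being the disk I was about to pull):
# readlink -f /sys/block/sdd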
It looked like the new disk had already been recognised:
[1779572.593471] scsi 3:0:0:0: Direct-Access ATA HGST HDN724040AL A5E0 PQ: 0 ANSI: 5
[1779572.604187] sd 3:0:0:0: [sdd] 7814037168 512-byte logical blocks: (4.00 TB/3.63 TiB)
[1779572.612171] sd 3:0:0:0: [sdd] 4096-byte physical blocks
[1779572.618252] sd 3:0:0:0: Attached scsi generic sg3 type 0
[1779572.626754] sd 3:0:0:0: [sdd] Write Protect is off
[1779572.631771] sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
[1779572.632588] sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[1779572.665609] sdd: unknown partition table
[1779572.671522] sd 3:0:0:0: [sdd] Attached SCSI disk
[1779855.362331] sdd: unknown partition table
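Had it not appeared by itself, I believe the usual incantation is to ask the relevant SCSI host to rescan; going by the 3:0:0:0 in the dmesg above that would have been host3 here:
# echo "- - -" > /sys/class/scsi_host/host3/scan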
As it was there was no need to figure out how to perform a SCSI rescan, so I went straight to identifying that the new disk was called:
/dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1338P4GY8ENB
and then tried to do a SMART conveyance self-test with:
# smartctl -t conveyance /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1338P4GY8ENB
But this particular drive doesn't seem to support that, so I went straight to editing /etc/smartd.conf to replace the old disk with the new one and:
# service smartmontools reload
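For the record, I believe smartctl -c is the way to see which self-tests a drive claims to support, and the smartd.conf change is just swapping the device path on a line of roughly this shape (the -d sat, mail address and test schedule here are illustrative rather than copied from my actual config):
# smartctl -c /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1338P4GY8ENB
/dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1338P4GY8ENB -a -d sat -m root -s (S/../.././02)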
With all that I was ready to add the new disk to the array:
# mdadm --add /dev/md0 /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1338P4GY8ENB
mdadm: added /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1338P4GY8ENB
# mdadm --detail /dev/md0
/dev/md0:
[...]
State : clean, degraded, recovering
Active Devices : 3
Working Devices : 4
Failed Devices : 0
Spare Devices : 1
[...]
Rebuild Status : 0% complete
[...]
Number Major Minor RaidDevice State
5 8 48 0 spare rebuilding /dev/sdd
1 8 32 1 active sync /dev/sdc
6 8 16 2 active sync /dev/sdb
4 8 0 3 active sync /dev/sda
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdd[5] sda[4] sdb[6] sdc[1]
5860538880 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
[>....................] recovery = 0.0% (364032/1953512960) finish=1162.4min speed=28002K/sec
So now all that was left was to wait about 20 hours, with fingers crossed that a second disk didn't die in the meantime (spoiler: it didn't).
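While waiting, watch on /proc/mdstat is enough to keep an eye on progress, and if the rebuild needs hurrying along (or reining in so the box stays usable) the knobs are, as far as I know, the usual md speed limits; the 50000 below is just an example value, not a recommendation:
# watch -d cat /proc/mdstat
# sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
# sysctl -w dev.raid.speed_limit_min=50000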