Recently I started getting SMART warnings from one of the disks in my home NAS (a QNAP TS-419P II armel/kirkwood device running Debian Jessie):
Device: /dev/disk/by-id/ata-ST3000DM001-1CH166_W1F2QSV6 [SAT], Self-Test Log error count increased from 0 to 1
This meant it was time to swap that disk out of the RAID5 array.
Since every time this happens I have to go and look up what to do again, I've decided to write it down this time.
I configure SMART to refer to devices by-id (which gives me their model and serial number), so first I needed to figure out what the kernel was calling this device (although mdadm is happy with the by-id path, various other bits are not):
# readlink /dev/disk/by-id/ata-ST3000DM001-1CH166_W1F2QSV6
../../sdd
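An equally good way to get the mapping for all of the disks at once, I think, is just to list the symlinks:
# ls -l /dev/disk/by-id/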
Next I needed to mark the device as failed in the array:
# mdadm --detail /dev/md0
/dev/md0:
[...]
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
[...]
Number Major Minor RaidDevice State
5 8 48 0 active sync /dev/sdd
1 8 32 1 active sync /dev/sdc
6 8 16 2 active sync /dev/sdb
4 8 0 3 active sync /dev/sda
# mdadm --fail /dev/md0 /dev/disk/by-id/ata-ST3000DM001-1CH166_W1F2QSV6
mdadm: set /dev/disk/by-id/ata-ST3000DM001-1CH166_W1F2QSV6 faulty in /dev/md0
# mdadm --detail /dev/md0
/dev/md0:
[...]
Active Devices : 3
Working Devices : 3
Failed Devices : 1
Spare Devices : 0
[...]
Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 32 1 active sync /dev/sdc
6 8 16 2 active sync /dev/sdb
4 8 0 3 active sync /dev/sda
5 8 48 - faulty /dev/sdd
If it had been the RAID subsystem rather than SMART monitoring which had first spotted the issue then this would have happened already (and I would have received a different mail, from the RAID checks rather than from SMART).
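For reference, those RAID mails come from mdadm's monitor mode, which picks up the destination from the MAILADDR line in /etc/mdadm/mdadm.conf; I believe a one-shot test run is the quickest way to confirm that path works (it should send a test alert for each array):
# grep MAILADDR /etc/mdadm/mdadm.conf
# mdadm --monitor --scan --oneshot --test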
Once the disk is marked as failed it can actually be removed from the array:
# mdadm --remove /dev/md0 /dev/disk/by-id/ata-ST3000DM001-1CH166_W1F2QSV6
mdadm: hot removed /dev/disk/by-id/ata-ST3000DM001-1CH166_W1F2QSV6 from /dev/md0
And finally tell the kernel to delete the device:
# echo 1 > /sys/block/sdd/device/delete
At this point I could physically swap the disks.
While doing so I noticed some interesting messages in dmesg, either from the echo to the delete node in sysfs or from the physical swap of the disks:
[1779238.656459] md: unbind<sdd>
[1779238.659455] md: export_rdev(sdd)
[1779258.686720] sd 3:0:0:0: [sdd] Synchronizing SCSI cache
[1779258.700507] sd 3:0:0:0: [sdd] Stopping disk
[1779259.377589] ata4.00: disabled
[1779371.126202] ata4: exception Emask 0x10 SAct 0x0 SErr 0x180000 action 0x6 frozen
[1779371.133740] ata4: edma_err_cause=00000020 pp_flags=00000000, SError=00180000
[1779371.141003] ata4: SError: { 10B8B Dispar }
[1779371.145309] ata4: hard resetting link
[1779371.468708] ata4: SATA link down (SStatus 0 SControl 300)
[1779371.474340] ata4: EH complete
[1779557.416735] ata4: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xe frozen
[1779557.424356] ata4: edma_err_cause=00000010 pp_flags=00000000, dev connect
[1779557.431264] ata4: SError: { PHYRdyChg DevExch }
[1779557.436008] ata4: hard resetting link
[1779563.357089] ata4: link is slow to respond, please be patient (ready=0)
[1779567.449096] ata4: SRST failed (errno=-16)
[1779567.453316] ata4: hard resetting link
I wonder if I should have used another method to detach the disk, perhaps poking the controller rather than the disk (which rang a vague bell in my memory from last time this happened), but in the end the disk is broken and the kernel seems to have coped, so I'm not too worried about it.
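For next time, I think the trick to finding the right bit of the controller to poke is to follow the disk's sysfs link before deleting it; the resolved path should include the ataN port and hostN SCSI host the disk hangs off (sdd here being the disk I was about to pull):
# readlink -f /sys/block/sdd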
It looked like the new disk had already been recognised:
[1779572.593471] scsi 3:0:0:0: Direct-Access ATA HGST HDN724040AL A5E0 PQ: 0 ANSI: 5
[1779572.604187] sd 3:0:0:0: [sdd] 7814037168 512-byte logical blocks: (4.00 TB/3.63 TiB)
[1779572.612171] sd 3:0:0:0: [sdd] 4096-byte physical blocks
[1779572.618252] sd 3:0:0:0: Attached scsi generic sg3 type 0
[1779572.626754] sd 3:0:0:0: [sdd] Write Protect is off
[1779572.631771] sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
[1779572.632588] sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[1779572.665609] sdd: unknown partition table
[1779572.671522] sd 3:0:0:0: [sdd] Attached SCSI disk
[1779855.362331] sdd: unknown partition table
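Had it not appeared by itself, I believe the usual incantation is to ask the relevant SCSI host to rescan; going by the 3:0:0:0 in the dmesg above that would have been host3 here:
# echo "- - -" > /sys/class/scsi_host/host3/scan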
As it was there was no need to figure out how to perform a SCSI rescan, so I went straight to identifying that the new disk was called:
/dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1338P4GY8ENB
and then tried to do a SMART conveyance self-test with:
# smartctl -t conveyance /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1338P4GY8ENB
But this particular drive doesn't seem to support that, so I went straight to editing /etc/smartd.conf to replace the old disk with the new one and:
# service smartmontools reload
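For the record, I believe smartctl -c is the way to see which self-tests a drive claims to support, and the smartd.conf change is just swapping the device path on a line of roughly this shape (the -d sat, mail address and test schedule here are illustrative rather than copied from my actual config):
# smartctl -c /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1338P4GY8ENB
/dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1338P4GY8ENB -a -d sat -m root -s (S/../.././02)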
With all that I was ready to add the new disk to the array:
# mdadm --add /dev/md0 /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1338P4GY8ENB
mdadm: added /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1338P4GY8ENB
# mdadm --detail /dev/md0
/dev/md0:
[...]
State : clean, degraded, recovering
Active Devices : 3
Working Devices : 4
Failed Devices : 0
Spare Devices : 1
[...]
Rebuild Status : 0% complete
[...]
Number Major Minor RaidDevice State
5 8 48 0 spare rebuilding /dev/sdd
1 8 32 1 active sync /dev/sdc
6 8 16 2 active sync /dev/sdb
4 8 0 3 active sync /dev/sda
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdd[5] sda[4] sdb[6] sdc[1]
5860538880 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
[>....................] recovery = 0.0% (364032/1953512960) finish=1162.4min speed=28002K/sec
So now all that was left was to wait about 20 hours, with fingers crossed that a second disk didn't die in the meantime (spoiler: it didn't).
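While waiting, watch on /proc/mdstat is enough to keep an eye on progress, and if the rebuild needs hurrying along (or reining in so the box stays usable) the knobs are, as far as I know, the usual md speed limits; the 50000 below is just an example value, not a recommendation:
# watch -d cat /proc/mdstat
# sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
# sysctl -w dev.raid.speed_limit_min=50000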