Mdadm – Failed disk recovery (unreadable disk)
Mdadm – Failed disk recovery (unreadable disk)
Well,
After 9 more months I ran into a nother disk failure. (First disk failure found here https://www.matraex.com/mdadm-failed-disk-recovery/)
But this time, The system was unable to read the disk at all
#fdisk /dev/sdb
This process just hung for a few minutes. It seems I couldn’t simply run a few commands like before to remove and add the disk back to the software RAID. So I had to replace the disk. Before I went to the datacenter I ran
#mdadm /dev/md0 --remove /dev/sdb1
I physically went to our data center, found the disk that showed the failure (it was disk sdb so I ‘assumed’ it was the center disk out of three, but I was able to verify since it was not blinking from normal disk activity. I removed the disk, swapped it out for one that I had sitting their waiting for this to happen, and replaced it. Then I ran a command to make sure the disk was correctly partitioned to be able to fit into the array
#fdisk /dev/sdb
This command did not hang, but responded with cannot read disk. Darn, looks like some error happened within the OS or on the backplane that made it so a newly added disk wasn’t readable. I scheduled a restart on the server later when the server came back up, fdisk could read the disk. It looks like I had used the disk for something before, but since I had put it in my spare disk pile, I knew I could delete it and I partitioned it with one partion to match what the md was expecting (same as the old disk)
#fdisk /dev/sdb
>d 2 -deletes the old partition 2
>d 1 -deletes the old partition 1
>c -creates a new partion
>p – sets the new partion as primary
>1 – sets the new partion as number 1
>> <ENTER> – just press enter to accept the defaults starting cylinder
>> <ENTER> – just press enter to accept the defaults ending cylinder
>> w – write the partion changes to disk
>> Ctrl +c – break out of fdisk
Now the partition is ready to add back to the raid array
#mdadm /dev/md0 –add /dev/sdb1
And we can immediately see the progress
#mdadm /dev/md0 --detail /dev/md0: Version : 00.90.03 Creation Time : Wed Jul 18 00:57:18 2007 Raid Level : raid5 Array Size : 140632704 (134.12 GiB 144.01 GB) Device Size : 70316352 (67.06 GiB 72.00 GB) Raid Devices : 3 Total Devices : 3 Preferred Minor : 0 Persistence : Superblock is persistent Update Time : Sat Feb 22 10:32:01 2014 State : active, degraded, recovering Active Devices : 2 Working Devices : 3 Failed Devices : 0 Spare Devices : 1 Layout : left-symmetric Chunk Size : 64K Rebuild Status : 0% complete UUID : fe510f45:66fd464d:3035a68b:f79f8e5b Events : 0.537869 Number Major Minor RaidDevice State 0 8 1 0 active sync /dev/sda1 3 8 17 1 spare rebuilding /dev/sdb1 2 8 33 2 active sync /dev/sdc1
And then to see the progress of rebuilding
#cat /proc/mdadm Personalities : [raid1] [raid6] [raid5] [raid4] md0 : active raid5 sdb1[3] sda1[0] sdc1[2] 140632704 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U] [==============>......] recovery = 71.1% (50047872/70316352) finish=11.0min speed=30549K/sec md1 : active raid1 sda2[0] 1365440 blocks [2/1] [U_]
Wow in the time I have been blogging this, already 71 percent rebuilt!, but wait! what is this, md1 is failed? I check my monitor and what do I find but another message that shows that md1 failed with the reboot. I was so used to getting the notice saying md0 was down I did not notice that md1 did not come backup with the reboot! How can this be?
It turnd out that sdb was in use on both md1 and md0, but even through sdb could not be read at all on /dev/sdb and /dev/sdb1 failed out of the md0 array, somehow the raid subsystem had not noticed and degraded the md1 array even though the entire sdb disk was not respoding (perhaps sdb2 WAS responding back then just not sdb), who knows at this point. Maybe the errors on the old disk could have been corrected by the reboot if I had tried that before replacing the disk, but that doesn’t matter any more, All I know is that I have to repartion the sdb device in order to support both the md0 and md1 arrays.
I had to wait until sdb finished rebuilding, then remove it from md0, use fdisk to destroy the partitions, build new partitions matching sda and add the disk back to md0 and md1