Author: Michael Blood
MDADM – Failed disk recovery (too many disk errors)
This only happens once every couple of years, but occasionally a SCSI disk on one of our servers has too many errors and is kicked out of the md array.
And… we have to rebuild it. Perhaps we should replace it, since it appears to be having problems, but really, the I in RAID is inexpensive (or something), so I would rather lean toward being frugal with the disks and replace them only when required.
I can never remember off the top of my head the commands to recover, so this time I am going to blog it so I can easily find it.
First step, take a look at the status of the arrays on the disk:
#cat /proc/mdstat
(I don't have a copy of what the failed drive looks like since I didn't start blogging until after)
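Since no copy of the failed state survives, here is a minimal sketch of how a degraded array could be spotted from that file. It assumes the usual mdstat layout, where each array's status ends in a field like [UU] and a missing member shows up as an underscore (for example [U_U]). The function name `degraded_arrays` is hypothetical, not something from mdadm.

```shell
#!/bin/sh
# List md arrays whose member-status field (e.g. [U_U]) contains an underscore,
# which mdstat uses to mark a missing or failed member.
# Reads /proc/mdstat by default; pass a file name to check saved output instead.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/      { name = $1 }      # remember which array the next lines describe
        /\[[U_]*_[U_]*\]/  { print name }     # status field with at least one "_"
    ' "${1:-/proc/mdstat}"
}

# Usage (on the server itself):
#   degraded_arrays
```

A healthy system prints nothing, so the function is easy to drop into a cron job or a login banner.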
Sometimes an infrequent disk error can cause md to fail a hard drive and remove it from an array, even though the disk is fine.
That is what happened in this case, and I knew the disk was at least partially good. The partition that failed was /dev/sdb1, which was part of a RAID 5 array. Another partition on that same device is part of a RAID 1 array, and that RAID 1 is still healthy, so I knew the disk itself was fine. So I am only re-adding the disk to the array so it can rebuild. If the disk has a second problem in the next few months, I will go ahead and replace it, since the issue that happened tonight probably indicates a disk that is beginning to fail but still has lots of life in it.
The simple process is:
#mdadm /dev/md0 --remove /dev/sdb1
This removes the faulty disk. It is also the point where you would physically replace the disk in the machine; since I am only going to rebuild the disk, I skip that and move to the next step.
#mdadm /dev/md0 --re-add /dev/sdb1
The disk started to reload and VOILA! We are rebuilding and will be back online in a few minutes.
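The remove and re-add steps can be wrapped in a small helper so they are harder to mistype next time. This is only a sketch: the function name `readd_member` and the `RUN` switch are made up for illustration, and by default it just prints the commands instead of running them. It assumes the member has already been marked faulty by md, as it was in this case.

```shell
#!/bin/sh
# Print (or, with RUN=1, execute) the two mdadm steps that put a kicked-out
# member back into its array: remove it, then re-add it.
readd_member() {
    array=$1     # e.g. /dev/md0
    member=$2    # e.g. /dev/sdb1
    for action in --remove --re-add; do
        if [ "${RUN:-0}" = 1 ]; then
            mdadm "$array" "$action" "$member" || return 1   # stop at the first failure
        else
            echo "mdadm $array $action $member"              # dry run: show the command only
        fi
    done
}

# Usage:
#   readd_member /dev/md0 /dev/sdb1          # shows what would be run
#   RUN=1 readd_member /dev/md0 /dev/sdb1    # actually runs it (as root)
```

For a physically replaced disk, the new partition is normally added with --add rather than --re-add.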
Now take a look at the status of the arrays:
#cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[3] sdc1[2] sda1[0]
      140632704 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
      [=======>.............]  recovery = 35.2% (24758528/70316352) finish=26.1min speed=29020K/sec

md1 : active raid1 sda2[0] sdb2[1]
      1365440 blocks [2/2] [UU]
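To keep an eye on the rebuild without reading the whole file each time, the progress line can be pulled out with awk. This is a rough sketch that assumes the recovery line looks like the one in the output above ("recovery = NN% ... finish=NNmin"); the function name `rebuild_progress` is hypothetical.

```shell
#!/bin/sh
# Print "array percent eta" for every array that is currently rebuilding.
# Reads /proc/mdstat by default; pass a file name to parse saved output instead.
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                      # array the following lines belong to
        /recovery =|resync =/ {
            pct = ""; eta = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/)       pct = $i            # e.g. 35.2%
                if ($i ~ /^finish=/) eta = substr($i, 8) # text after "finish="
            }
            print name, pct, eta
        }
    ' "${1:-/proc/mdstat}"
}

# Usage: re-run every 30 seconds until the rebuild is done
#   while sleep 30; do rebuild_progress; done
```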
In case you want to do any troubleshooting on what happened, this command is useful for looking into the logs:
#grep mdadm /var/log/syslog -A10 -B10
But this command is the one that I use to see the important events related to the failure and rebuild. As I am typing this, the rebuild is just over 60% complete, which you can see in the log:
#grep mdadm /var/log/syslog
Jun 15 21:02:02 xxxxxx mdadm: Fail event detected on md device /dev/md0, component device /dev/sdb1
Jun 15 22:03:16 xxxxxx mdadm: RebuildStarted event detected on md device /dev/md0
Jun 15 22:11:16 xxxxxx mdadm: Rebuild20 event detected on md device /dev/md0
Jun 15 22:19:16 xxxxxx mdadm: Rebuild40 event detected on md device /dev/md0
Jun 15 22:27:16 xxxxxx mdadm: Rebuild60 event detected on md device /dev/md0
You can see from the times that it took me just over an hour to respond and start the rebuild. (I know, that seems too long if I were to just do this remotely, but when I got the notice I went on site, since I thought I would have to do a physical swap. I had to wait a bit while the colo security verified my ID, and I was probably moving a little slow after some nachos at Jalepeno’s.) Once the rebuild started, it took about 8 minutes per 20% of the disk to rebuild, going by the log timestamps.
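The timing arithmetic can be read straight off the log. The sketch below assumes the traditional syslog layout shown in the post (month, day, time, host, then "mdadm:") and prints each mdadm event with the minutes elapsed since the previous one. The function name `md_events` is hypothetical, and the math ignores seconds and does not handle a rebuild that crosses midnight.

```shell
#!/bin/sh
# Summarise mdadm monitor events from a syslog file: time, event name,
# and whole minutes since the previous event.
md_events() {
    awk '
        $5 ~ /^mdadm(\[[0-9]+\])?:$/ && $7 == "event" {
            split($3, t, ":")                       # $3 is HH:MM:SS
            now = t[1] * 60 + t[2]                  # minutes since midnight, seconds ignored
            if (seen) { printf "%s %s (+%d min)\n", $3, $6, now - prev }
            else      { print $3, $6 }
            prev = now; seen = 1
        }
    ' "${1:-/var/log/syslog}"
}

# Usage:
#   md_events                    # current syslog
#   md_events /var/log/syslog.1  # an older, rotated log
```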
————————-
Update: 9 months later the disk finally gave out and I had to manually replace the disk. I blogged again:
https://www.matraex.com/mdadm-failed-d…nreadable-disk/
Employee Appreciation Night – Idaho Steelheads Hockey
The crew at Matraex headed down to the Century Link Arena to watch the Idaho Steelheads play against the Colorado SomethingSomethings. The group of 13 (including some family that came along) sat just behind the east goal and caught some great action.
Unfortunately the Steelheads lost, but everyone enjoyed the show!
Family Bowling Night
At 6:00 on Wednesday, October 3rd, Matraex employees and their families had a bowling night at Big Al’s. It was great to get together with coworkers, spouses and the kids and spend some time having fun and getting to know the families. John was the experienced bowler of our bunch, scoring an impressive 199. We are learning that John is a man of many talents.
Bowling at Big Als
Wednesday October 3rd
6:00
http://www.ilovebigals.com/meridian/
New Website Launched
River Raft Trip
Matraex held their 2nd 5th Annual Matraex River Rafting trip on Saturday July 21st. Nick, Michael, Vlade, Taner, Janae and John headed to the Lower South Fork of the Payette this year for some fun in the sun. Aside from a 2 hour delay from a boat stuck on a rock and John, Vlade and Michael nearly drowning in the last rapid, not much else happened here.
Like each of the other annual River Rafting trips, we finished up the evening with a dinner at the Sonora Mexican Restaurant in Horseshoe Bend.
River raft trip
Saturday July 21
Lower South Fork
http://www.cascaderaft.com/
Fastlane Racing
At 5:30 on Monday June 18th, Matraex employees met at Fastlane Racing for a couple of quick go-kart races. Nick has the most experience and led the pack, leaving Taner, Michael and John to follow.
Fastlane Racing
Monday June 18th at 5:30
http://www.fastlaneboise.com/
Employee Appreciation Dinner
At 7:00pm April 16th Matraex had an Employee Appreciation Dinner at Berryhill & Co in Downtown Boise.
5 of the Matraex Employees brought their spouses for an evening filled with eating, eating and more eating as well as a little drinking.
A good time was had by all.
Berryhill & Co
Employee Appreciation Dinner
Monday April 16th
7:00 pm
http://johnberryhillrestaurants.com/
Fixed Hacked Site – PHP injection
Today a customer called me about a PHP website that was popping up viruses all over the place.
I loaded up the site and there it was: the page immediately redirected to a spyware / virus type site that tried to convince me to download their software to fix a problem. Since I knew better, I carefully answered the browser prompts to make sure I closed out and left the page without opening anything malicious.
Then I went back to the page that had the problem and tried to load it again. But the problem was GONE!
After a bit more investigation I found that the people who wrote the virus had dropped a cookie on my machine so that I was allowed back into the site. I am sure this trick helps them keep the virus on a site for longer, because the site administrators may not recognize it as an ongoing problem (or even a problem that their site caused).
In digging I found that each PHP page on the site had some PHP code added to the top of it.
something like
This was on a single line at the top of the file and even the administrator who had noticed the odd code at the top passed over it not thinking it was malicious.
However, the text inside the encoded string was VERY malicious. I decoded it and found several PHP functions and additional encoded strings.
I decided it wasn't worth figuring out everything they did with the code, and instead decided to just clean it up. I assumed that the code probably helped replicate itself by checking that ALL other PHP pages on the site also had the same code in them, so if someone removed the code from one page and the code then ran on another page, it put itself back where it had been removed.
Anyway, it was pretty sophisticated, but it was easy for me to find the problem: I just opened and looked at the PHP file and saw code that shouldn't have been there.
A cool way that I found where the problem was, before even opening the PHP file, was to use HTTPWatch to see exactly which files were downloaded from which site in the browser. I use the free version of the software and it has met all my needs so far. It is similar to Firebug in Firefox.
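Because the injected code sat on the first line of every PHP file, the affected files can be listed from the shell before cleaning them up. The sketch below is an assumption-heavy starting point: the post does not say how the string was encoded, so the patterns here (eval, base64_decode, gzinflate, str_rot13) are just common choices for this kind of injection, and the function name `suspect_php_files` is hypothetical. Expect false positives on legitimate code that uses those functions.

```shell
#!/bin/sh
# List .php files under a directory whose FIRST line contains a call that is
# commonly used to hide injected code. The pattern list is a guess, not a
# signature of the specific attack described in the post.
suspect_php_files() {
    find "${1:-.}" -type f -name '*.php' | while IFS= read -r f; do
        if head -n 1 "$f" | grep -Eq 'eval[[:space:]]*\(|base64_decode|gzinflate|str_rot13'; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage:
#   suspect_php_files /var/www/example-site
```

Reviewing the list by hand before editing anything is the safer route, since every copy has to be cleaned at the same time if the code really does re-insert itself.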
Great SQL Formatting Tool
We often deal with very complex, dynamically generated SQL Statements which run from our applications.
If we need to debug them for any reason we often have to display them to the screen and then copy and paste them into an SQL Query window. The problem is that those SQL Statements are not always formatted to be very readable. Sometimes they might even be on a single line. This requires a bunch of time going through and reformatting the SQL statement, making it legible for debugging.
I have used this tool, SQLinFORM, several times in the past, but I keep forgetting about it because I don't have to use it very often.
http://www.sqlinform.com/
I just copy and paste the SQL into the window and click Format.
It does a great job formatting code quickly and even has some options for how you would like to see the output. I then select the output and paste it into my SQL Query window.
If you use it often they do have a version for sale.
Linux System Discovery
Over the last couple of weeks I have been working on doing some in depth “System Discovery” work for a client.
The client came to us after a major employee restructuring, during which they lost ALL of the technical knowledge of their network.
The potentially devastating business move on their part turned into a very intriguing challenge for me.
They asked me to come in and document what services each of their 3 Linux servers provided.
As I dug in I found that their network had some very unique, intelligent solutions:
- A reliable production network
- Thin Client Linux printing stations, remotely connected via VPN
- Several Object Oriented PHP based web applications
Several open source products had been combined to create robust solutions.
It has been a very rewarding experience to document the systems and give ownership of the systems, network and processes back to the owner.
The documentation I provided included:
- A high level network diagram as a quick reference overview for new administrators and developers
- An overall application and major network, server and node object description
- Detailed per server/node description with connection documentation, critical processes, important paths and files, and dependencies
- Contact Information for the people and companies that the systems rely on.
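A first pass at the per-server details can be collected with a short script and then annotated by hand. The sketch below is not the process used on this engagement, just one possible starting point. The function names (`section`, `run_if_present`, `server_snapshot`) are made up, and the choice of commands (hostname, uname, df, ss, ps, crontab) is an assumption about what is installed on a typical Linux server.

```shell
#!/bin/sh
# Print a plain-text snapshot of one server as raw material for documentation.

section() { printf '\n== %s ==\n' "$1"; }

# Run a command only if it is installed, and never abort the snapshot on failure.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "($1 exited with an error)"
    else
        echo "($1 not installed)"
    fi
}

server_snapshot() {
    section "Host";                run_if_present hostname; run_if_present uname -a
    section "Disks";               run_if_present df -h
    section "Listening ports";     run_if_present ss -tln
    section "Processes";           run_if_present ps -eo pid,user,args
    section "Cron (current user)"; run_if_present crontab -l
}

# Usage: save one file per server, then compare and annotate them
#   server_snapshot > "$(hostname).txt"
```

The output only says what is running; why it is running, and who to call about it, still has to come from people.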
As a business owner myself, I have tried to help the client recognize that even when they use an outside consultant, it is VERY important that they maintain details of their critical business processes INSIDE of their company. There might not be anything in business that is as rewarding as giving ownership of a “lost” system back to a client.