pjmac123

RAID 5: Idiot's Guide to Replacing a Faulty Disk (Using Intel SRCSATAWB)


Hi folks,

I'm new to RAID, so please forgive my naivety, but I have just had the weekend from hell at work and am looking for some advice on how I can make life easier next time I have a RAID failure.

I'll cut a long story short: last week, one of the disks in the RAID array on our domain controller failed and I ended up losing ALL OF OUR DATA (nearly 500GB worth of docs for over 100 staff and 600 students). Thankfully I have recovered all data from a backup, so we have only lost one day's worth of work, but even so, the whole point of having RAID is that when there is a disk failure we can replace the disk without any data loss.

Before I replaced the disk, I scoured the web high and low and could not find any reasonable guide on how to replace a failed disk and rebuild the array, so I had to go it alone. From within the 'Intel RAID Web Console 2' I chose the failed disk and clicked 'prepare for removal'. I powered off the server, replaced the disk and after a reboot, made the disk 'online' - at this point all our server data became corrupt. I ended up having to 'initialise' the array, and then recovered the (virtual) disk from an Acronis backup (thank goodness for Acronis!).

Could somebody please help by posting an idiot's guide to the correct procedure for replacing a disk without any data loss? Specific instructions for the Intel SRCSATAWB would be ideal, but failing that, any general instructions would be better than nothing. I dread the thought of another disk failing while I still lack the knowledge to fix it.

Many thanks in advance,

Phil.


The best advice I have ever received is to clone all of the drives in the RAID array before attempting a rebuild. This way, if your rebuild fails, you can try again. Sorry for such general advice, but I have no specific advice for that controller. Sounds like a freak occurrence (the worst kind!).


Hi there,

I think that what you encountered is so-called "silent corruption".

The problem with rotating-parity RAID levels (5 and 6 are the most common) is that the parity information on the drives gets written when a stripe is modified, and it is not normally checked again until it is needed.

Great, but what if a drive started having very high raw error rates and a lot of the stripes that got written are "bad"?

There is no way of knowing what is good and what is bad until a drive fails, and at that point you need the parity information to recover the whole array.
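To make that concrete, here is a minimal sketch of how a RAID 5 rebuild works on a single stripe. RAID 5 parity is a bytewise XOR across the data blocks of a stripe; the 4-disk layout, the block contents and the `xor_blocks` helper are made up for illustration and have nothing to do with how any particular controller is implemented.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 4-disk RAID 5: three data blocks plus parity.
data = [b"DOCS", b"MAIL", b"HOME"]
parity = xor_blocks(data)

# Healthy case: the disk holding b"MAIL" dies and its block is rebuilt
# from the surviving data blocks and the parity block.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]

# Silent corruption: another disk had been returning bad data for a while.
corrupt_block = b"HOMX"
rebuilt_bad = xor_blocks([data[0], corrupt_block, parity])

# The rebuild completes without any error, but the result is wrong.
assert rebuilt_bad != data[1]
```

The point of the second case is that the rebuild has no way to notice the problem: XOR happily produces an answer from whatever the surviving disks return.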

There isn't exactly a way to prevent that from happening. ZFS has background "self-healing", which should resolve such inconsistencies on the fly; with a common RAID controller you need to schedule data checks on a regular basis (but not daily!).
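Roughly speaking, a scheduled data check walks every stripe and compares the stored parity with parity recomputed from the data. The sketch below shows the idea only; `find_inconsistent_stripes` and the in-memory `stripes` list are hypothetical stand-ins, and a real controller does this in firmware against the disks.

```python
def xor_bytes(blocks):
    """Bytewise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def find_inconsistent_stripes(stripes):
    """Return indexes of stripes whose stored parity does not match their data.

    `stripes` is a list of (data_blocks, stored_parity) tuples.
    """
    return [
        index
        for index, (data_blocks, stored_parity) in enumerate(stripes)
        if xor_bytes(data_blocks) != stored_parity
    ]

stripes = [
    ([b"AAAA", b"BBBB", b"CCCC"], xor_bytes([b"AAAA", b"BBBB", b"CCCC"])),
    ([b"DDDD", b"EEEE", b"FFFF"], xor_bytes([b"DDDD", b"EEEE", b"FFFF"])),
]
# Simulate a block that went bad on disk after its parity was written.
stripes[1] = ([b"DDDD", b"EEEX", b"FFFF"], stripes[1][1])

print(find_inconsistent_stripes(stripes))  # [1]
```

Note that a parity mismatch only says the stripe is inconsistent, not which block is the bad one. ZFS can go further because it keeps a checksum for each block.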

The solution to the problem is called backup, and you learned that lesson the hard way (unfortunately).

What I do in such cases is use a hardware controller that monitors the drives' SMART data and ejects them long before they turn bad.

I have had excellent experience with Areca controllers, which haven't had a single issue in a number of systems over more than 3 years.

I have even had to use desktop-class drives in a highly loaded enterprise environment (overnight emergency build, parts availability issues, etc.), and the drives get ejected, or should I say rejected, by the controller long before they degrade to a state where they are unusable.
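If your controller does not do this for you, the same idea can be approximated in a script. This is only a sketch under assumptions: the attribute names follow smartmontools' naming, but the limits, the serial numbers and the `drives_to_replace` helper are invented for the example and are not what Areca or any other controller actually uses.

```python
# Attribute names follow smartmontools' naming; the limits are made-up
# examples, not values used by any real controller.
WARN_LIMITS = {
    "Reallocated_Sector_Ct": 10,
    "Current_Pending_Sector": 1,
    "Offline_Uncorrectable": 1,
}

def drives_to_replace(smart_by_serial, limits=WARN_LIMITS):
    """Map each suspect drive's serial number to the attributes over the limit.

    `smart_by_serial` maps a serial number to a dict of raw SMART values.
    """
    flagged = {}
    for serial, attrs in smart_by_serial.items():
        reasons = [
            name for name, limit in limits.items()
            if attrs.get(name, 0) >= limit
        ]
        if reasons:
            flagged[serial] = reasons
    return flagged

# Hypothetical readings for two drives.
readings = {
    "SN-EXAMPLE-0001": {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 0},
    "SN-EXAMPLE-0002": {"Reallocated_Sector_Ct": 24, "Current_Pending_Sector": 3},
}
print(drives_to_replace(readings))
```

How you collect the raw values (for example from `smartctl` output) depends on your system, so that part is left out.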

What I do is just double-check that the drive's serial number (SN) matches the failed one, and if it does, I proceed.

As RAID is best serviced offline, I also plan some downtime for the system where applicable. When we are dealing with a 24/7 environment, we keep a map of all drives and their serial numbers, so if we need to identify a drive by its serial number, we are good to go.
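The drive map does not need to be anything fancy. As a sketch, assuming a simple bay-to-serial table (the bay numbers, serial numbers and both helper functions below are hypothetical), the check before pulling a drive could look like this:

```python
# Hypothetical map of drive bays to serial numbers, recorded at build time.
BAY_MAP = {
    0: "SN-EXAMPLE-0001",
    1: "SN-EXAMPLE-0002",
    2: "SN-EXAMPLE-0003",
    3: "SN-EXAMPLE-0004",
}

def bay_of(serial, bay_map):
    """Return the bay holding `serial`, or raise if it is missing or ambiguous."""
    matches = [bay for bay, known in bay_map.items() if known == serial]
    if len(matches) != 1:
        raise LookupError(f"serial {serial!r} matched {len(matches)} bays")
    return matches[0]

def safe_to_pull(failed_serial, label_serial):
    """True only if the serial on the drive label matches the failed drive."""
    return failed_serial.strip().upper() == label_serial.strip().upper()

failed = "SN-EXAMPLE-0003"          # serial reported as failed by the controller
print(bay_of(failed, BAY_MAP))      # 2

# Compare against what is printed on the physical drive before removing it.
print(safe_to_pull(failed, " sn-example-0003 "))  # True
print(safe_to_pull(failed, "SN-EXAMPLE-0002"))    # False
```

Pulling the wrong disk from a degraded RAID 5 takes the whole array down, which is why the label check is worth the extra minute.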

But everything I stated above is either theoretical, advisory or pure speculation, therefore don't take the info for the one and only truth. Always look for a second opinion :)

Cheers,

SV

