blakerwry

Linux Soft Raid 5 Crash...


Hahaha... I tried to restore my gf's computer from the backup that was stored on the server, and it failed... the backup file was unreadable, like the rest of the files recovered after hde died... So I tried to restore from a Ghost image set that was made on CD-RW... guess what, one of the CD-RW discs has failed.

These discs were brand new Verbatims: opened, burned, tested, and then stored at room temperature inside a CD binder in a desk drawer. This lack of reliability is killing me.

Luckily my gf's HDD was still within warranty by about 2 weeks.


Jeez....I'm sorry man. I know how that goes....one thing goes wrong and nothing seems to work with you.

A few notes....

/etc/raidtab is only read when creating an array, not while starting it. The persistent-superblock option in your raidtab causes raidtools to write the array config out to each individual drive during array creation. Subsequent starts rely on the info on the drives. It's a good idea to keep the raidtab file in sync just in case a superblock happens to get out of sync, but it's not critical.

Also, I've been through a few bad cables and one drive failure with software RAID, and I've never had to mkraid --force an existing array; relying solely on the hot tools has always done the trick. If they don't seem to be working for you, there's a good chance the problem is elsewhere. mkraid --force should always be a final act of desperation, and only when you're absolutely sure you have an up-to-date raidtab.

Wish I could offer more help......good luck.

-Chris


Yeah, during the initial testing I did, I found that having an up-to-date raidtab was very important. That's why mine was ALWAYS up to date.

Additionally, I had to resort to mkraid --really-force because the hot tools only work when the array is started, and my array wouldn't start. With the exception of when hde dropped, there was no time when I would have been able to use the hot tools. If I had thought of it at the time, I would have tried it.


I'm wondering why you're getting all these failures... Could it be something environmental, like a really hot, damp apartment, or noisy power lines coming into the apartment? Maybe some strange shakes from train tracks nearby? Very strange.

I've been trying to stay on top of this thread, because I've been considering software raid 5 on my server, but I'm not sure just what exactly is the cause and solution of your problem... Could you maybe write up an executive summary? :-)

Cheers,

Mitch


Executive Summary:

Linux software RAID 5

WHY RAID 5:

RAID 5 increases availability by tolerating the failure of a single disk without downtime or loss of data, and it can additionally improve data access performance under certain conditions. RAID 5 is particularly suited to creating larger arrays with redundancy because of its low capacity overhead compared with other RAID levels. It performs best in read operations, but does little to improve write operations. For a file or web server that sees an overwhelming amount of reads vs writes, RAID 5 is ideal.
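To put a number on the "low capacity overhead" point, here is a minimal Python sketch. It assumes equal-size disks and the usual textbook layouts (a full copy on every disk for RAID 1, one disk's worth of parity for RAID 5); the function name `usable_fraction` is made up for illustration and is not part of any RAID tool.

```python
from fractions import Fraction


def usable_fraction(level: int, n_disks: int) -> Fraction:
    """Fraction of raw capacity left for data, assuming equal-size disks."""
    if level == 0:
        return Fraction(1)                     # striping only, no redundancy
    if level == 1:
        return Fraction(1, n_disks)            # every disk holds a full copy
    if level == 5:
        if n_disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return Fraction(n_disks - 1, n_disks)  # one disk's worth of parity
    raise ValueError(f"level {level} is not covered by this sketch")


# The parity overhead shrinks as the array grows.
for n in (3, 4, 6):
    print(n, "disks:", usable_fraction(5, n), "of raw capacity usable in RAID 5")
```

A two-disk mirror keeps half of the raw capacity, while RAID 5 keeps (n-1)/n, which is why it gets more attractive for larger arrays.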

ADVANTAGES OF HOST BASED RAID:

Software RAID 5 is a cost effective way of implementing RAID5. The performance of modern PC hardware makes host based RAID5 a viable alternative to expensive RAID controllers with dedicated hardware.

Hardware RAID controllers present a RAID array to the OS as a single large disk. The controller manages the array and individual disks. This has some disadvantages. Mainly the ability of the OS to manipulate a single disk is usually lost. The OS can no longer poll a disk for SMART information or run tests on an individual drive. It cannot see if a disk has failed or alert you if something is wrong. In order to do these things with a hardware controller it must come with software that interfaces directly with the controller which in turn communicates with an individual disk. The software that comes with many hardware RAID controllers is pretty good, but is not typically as mature and tested as that which comes with an operating system. The software is also usually proprietary which means that if a feature is desired it usually has to come from the manufacturer. This makes the software package as critical as the hardware that it comes with.

One advantage of RAID implemented by the OS is that you can use any standard utility that communicates directly with a disk (for monitoring, diagnostics, recovery or any other purpose) and it will work fine.

DISADVANTAGES OF HOST BASED RAID:

Because it lacks a dedicated processor, host based RAID relies on the main CPU to compute the extra information necessary for RAID. In RAID levels 1 and 0 this is relatively minor. In RAID 5, XOR calculations must be made to produce the necessary parity information; this can take a noticeable chunk of CPU time and interrupts. However, on modern PCs this effect is small and is often outweighed by the cost of an expensive hardware solution. Additionally, a faster CPU can usually be purchased for less than a RAID card, and it offers more versatility.
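For anyone wondering what the XOR work actually is, here is a small Python sketch of the idea: the parity block is the byte-wise XOR of the data blocks in a stripe, and any single missing block is the XOR of everything that survives. This is only an illustration of the principle, not how the kernel's md driver is written; `xor_blocks` and the sample blocks are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) walks the blocks column by column (one byte from each).
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 4-disk array: three data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Pretend the disk holding data[1] died: rebuild it from the survivors.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

Every write has to keep that parity block current, which is where the extra CPU time goes.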

A more important effect of host based RAID is increased bus usage. The redundancy information stored on a RAID array can mean anywhere from 1/n to n times (where n is the number of disks) more data transferred over a computer's buses, compared to a hardware controller, which manages this directly and transfers the extra information locally. This can lead to bottlenecks with software RAID that are not experienced with hardware controllers. This is very system and hardware specific, so it is important not to generalize; it requires specific knowledge of the hardware being used.

WHY LINUX:

Linux offers the ability to use host based RAID 5 for free. It also comes with a suite of tools to manipulate and manage RAID arrays. The performance of Linux RAID arrays is on par with that offered by Windows, but is unencumbered by licensing.

Linux also offers other features that may lead someone to choose it as their operating system.

FINDINGS:

Linux software RAID 5 can offer good performance, excellent management, and recovery features that make it an attractive option. Unfortunately the documentation is lacking in some areas and the software is still somewhat immature.

In my experience, Linux's software RAID 5 failed to perform as well as it should. The inability to mount a dirty degraded array was a surprise. Additionally, the array re-syncing when it was not supposed to was also a shock. This may be a defect in the documentation, or the documentation may simply be outdated.

FUTURE/ACTIONS:

As Linux becomes more popular, its software RAID features get used more. This will lead to better maturity and better documentation. For most home users I think Linux's software RAID will operate "good enough"; although some people do not like being beta testers, there are certainly those who do not mind.

Remember, RAID is not a backup. It is for data availability. I expect that most home users can stand a day to a week of downtime of an array without loss of productivity. For these people I believe Linux's software RAID 5 will meet or exceed the pseudo-hardware RAID controllers targeted at them.

For an office, an hour without access to a critical array can easily mean hundreds of dollars of lost productivity. For these types of situations I would recommend a mid to high end hardware RAID controller.


I think my specific situation was an unfortunate chain of coincidences.

I had a disk fail.. this was my first in-service failure after more than 20 HDDs.

No big deal. Then, when recovering the data, a disk dropped out of the array for no apparent reason.. unfortunate, but it happens. There does not appear to be anything wrong with this disk. The disk that died was an OEM unit purchased from newegg.com about 4 months ago and has been in use for the last 2-3 months.

I believe the disk that dropped, hde, was a retail unit purchased locally.

My gf's computer... it is on another outlet with a different surge suppressor. It's in the same room, so probably on the same circuit. We seem to have very stable power here; maybe it's because we are within 2-3 blocks of a police station. To my knowledge there have been no power spikes or drops lately. Her HDD is approx 1 year old and is an OEM unit purchased through newegg.com.

The backup of her computer was made monthly and stored on the server. When recovering data from the server I prioritized backups last. My music was first, then my ISO images, videos, misc items, and then backups last. I was able to recover my music collection, but the rest died. I have hard copies of most of my ISO images, and my videos are legal downloads or backup copies of things I own. I should be able to recreate just about everything with time.

The second backup is an older Ghost image created a few months after the computer was built. It has all of the programs, drivers and settings configured. The main purpose of this backup was to restore in case the OS got toasted, and then the monthly backup could be applied to make the system as good as new. The fact that a CD-RW disc had failed is simply a coincidence. I tried to read the disc back on 4 different readers. None could get more than 3/4 of the way through the image file. I tried filling the missing parts of the disc with zeroes, but Ghost would not accept the image file as valid when it got to that point.
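For reference, the zero-filling step can be sketched in Python roughly as below. This is a simplified illustration, not the tool that was actually used: `read_block` is a hypothetical callable that returns the bytes at a given offset or raises `OSError` for an unreadable region, and the 2048-byte block size is an assumption based on the usual data CD sector size.

```python
def copy_with_zero_fill(read_block, total_size, block_size=2048):
    """Copy total_size bytes, substituting zeroes for blocks that fail to read.

    read_block(offset, length) is a hypothetical reader: it returns bytes,
    or raises OSError when that region of the disc cannot be read.
    Returns the salvaged image and the offsets of the blocks that were lost.
    """
    image = bytearray()
    bad_offsets = []
    for offset in range(0, total_size, block_size):
        length = min(block_size, total_size - offset)
        try:
            chunk = read_block(offset, length)
        except OSError:
            chunk = bytes(length)        # keep later data at its original offset
            bad_offsets.append(offset)
        image += chunk
    return bytes(image), bad_offsets
```

Padding keeps everything after the bad spot at the right offset, but the padded blocks are still wrong data, so a format that validates its contents (as Ghost apparently does) can still refuse the file at that point.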

The failed disk is still operable, and a large part of the My Documents folder has been recovered using GetDataBack for NTFS. Unfortunately there are still some files missing and the process is very slow. I will have to have my gf go through the remains to see if she can find some of the stuff she's lost.

Before I had the server as a place to store backups, I was in the habit of writing My Documents folder to CD-R every few months. I kept all the backups in case a disc died or I had deleted a file and needed it back later. I think it might be a good idea to pick that habit back up to make sure I always have a backup of my most critical information.

Unfortunately it is rather cumbersome to backup the whole array. I recently got a DVD burner. I have put a few things that were convenient to DVD. I think I should try and keep more stuff archived. Just to have a second backup.

I was copying data over to my girlfriend's computer and her DM+9 died too....

At this point I would begin watching out for falling tree branches. Oh ya, avoid thunderstorms...


Yeah yeah... I also want to add a point about the immaturity of Linux software RAID.

I thought this was a motherboard problem or a controller problem, but I have switched motherboards and controllers and am experiencing the same issue.

With Linux's software RAID 5 I experience data corruption when raiding drives across 2 or more controllers. When all drives are on a single controller there is no data corruption. This only happens in RAID 5 (I've tried 0 and 1, but not 4). Errors are random, but repeatable if copying is done multiple times, and can be tested by computing md5 sums of files copied to the RAID 5 array.
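If anyone wants to repeat the md5 test, something along these lines should do it. It is a rough Python sketch rather than the exact procedure used here; `find_mismatches` and the directory arguments are made-up names, and it simply compares every file under a source directory with the copy at the same relative path on the array.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(source_dir, copy_dir):
    """List relative paths whose copy is missing or has a different MD5 sum."""
    source_dir, copy_dir = Path(source_dir), Path(copy_dir)
    mismatches = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = copy_dir / rel
        if not dst.is_file() or md5_of(src) != md5_of(dst):
            mismatches.append(rel)
    return mismatches
```

One caveat: reading a file back right after copying it may be served from the page cache instead of the disks, so it is safer to use a data set larger than RAM or to remount the array before checking.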


FWIW, I had a software RAID5 from 4 SCSI drives on 2 separate SCSI controllers and never had corruption problems, so I'm not sure the problem is with the RAID software itself (however, I do agree that the tools leave a LOT to be desired when something goes wrong).


I suspect the stability of the soft-RAID component is entirely dependent on the kernel version being used. Both the controller drivers and the soft-RAID software differ from version to version.

Thank you for your time,

Frank Russo


That's entirely possible. But I've experienced it in 2.4 as well as 2.6 kernels. Since the drives all work fine individually or in RAID 0, I had a hard time pinpointing the problem.

Right now I'm testing RAID 4. It seems to be substantially faster than Linux's RAID 5 implementation (writes went up ~10MB/sec and reads went up 30MB/sec) even though they use the same driver. There's still room for improvement, and speeds seem to be very CPU/FSB dependent.

I have a 3ware card on the way, hopefully it will show better performance.


From my experience, writing data to the array works fine; it's the subsequent reads that somehow corrupt the data.

If the volume is mounted read-only, the data can be read and reread any number of times without being corrupted. When it is mounted read-write, the data gets corrupted after a couple of reads at the latest.

So it seems it also has something to do with filesystem handling.

My old system had the corruption issue. It was running Red Hat 7.3 with kernel 2.4.17, using a Via 686 southbridge, 3 Promise Ultra100 TX2 cards (pdc20268 chip) and an onboard Promise pdc20265R chip. It worked, but with corruption.

My new system is based on Fedora Core 4 with kernel 2.6.11. The funny thing is it doesn't work with more than 2 Promise cards, so buying an extra Promise Ultra100 TX2 as a replacement for the old onboard pdc20265R chip was a waste (the new motherboard doesn't come with any onboard PATA ports other than the standard 2 supported by the nForce2 chipset).

For what it's worth, the bug seems to be gone, at least in kernel 2.6.11. So far I've tested with a six-disk array spread over 3 controllers of 2 different brands (1 x onboard nForce2 Gigabit MCP and 2 x Promise Ultra100 TX2 (pdc20268 chip)) and no corruption yet, but I'm still testing. I'm going to add a 3rd brand of card, based on an old SiL 0649 (ATA100) chip.

Does anybody know what has changed in the kernel that could have had an impact on the corruption issue?

Best Regards

Theis


I _think_ a friend of mine had the same problem a while ago.

He commented out the goto and it worked.

I dunno what the problem is with mddev->recovery_cp being != MaxSector though.

It may be just a bug, but I guess it has a meaning standing there :)

My old system had the corruption issue, it was running Redhat 7.3 with kernel 2.4.17 using a Via 686 southbrigde, 3 Promise Ultra100 TX2 cards (pdc20268 chip) and an onboard Promise pdc20265R chip. - It worked but with corruption.

Via 686 is the mother-of-all corruption :)


I had something similar happen to me two years ago and I posted the problem and fix in SR.

I remember someone later told me to install Fedora Core over it to reconstruct the array. It worked when I had a problem again.

