blakerwry

Linux Soft Raid 5 Crash...


Sorry, that should be:

mdadm --assemble --force /dev/md0 /dev/hde1 /dev/hdf1 /dev/hdg1 /dev/hdh1
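Before forcing the assembly it may be worth looking at each member's superblock first. A minimal sketch that only prints the commands for review instead of running them; `plan_assemble` is a made-up helper name, and the device names are the ones from this thread:

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the inspection and assembly
# commands so they can be reviewed before touching the disks.
plan_assemble() {
    md="$1"
    shift
    for dev in "$@"; do
        # --examine prints the md superblock stored on each member
        echo "mdadm --examine $dev"
    done
    echo "mdadm --assemble --force $md $*"
}

plan_assemble /dev/md0 /dev/hde1 /dev/hdf1 /dev/hdg1 /dev/hdh1
```

Run as root, the printed `--examine` lines let you compare the members before the `--force` line is used.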

Darn documentation isn't consistent, and mdadm doesn't seem to be available on my Gentoo box, so I can't check it out personally.


Yeah, I noticed the documentation on mdadm is actually pretty poor (in my opinion)... but the same is true for Linux software RAID in general. There was a lot of confusion about how the config file should be laid out, especially regarding failed disks.

In any event, the new HDD seems to be working perfectly, and the other HDDs seem the same. Unfortunately, I am still unable to get the array to start in degraded mode.

I recompiled my kernel, but am unsure what to do after compiling. I have never compiled a kernel before; I followed the steps below, which I found on a newsgroup.

1) login as root

2) make sure the kernel source is available under /usr/src/linux

3) change directory to there

4) you may choose between:

   the old-fashioned configuration: make config

   the menu-driven configuration: make menuconfig

   the X Window configuration (under X only): make xconfig

-- I had to choose "make oldconfig", as none of the other options worked on my box

5) configure your new kernel --it kind of configured itself

6) exit and save the configuration

7) type: make dep clean

this will enter the configuration information into the compilation files and remove all previously compiled files

8) type: make bzImage modules modules_install

this will compile the kernel and the kernel_modules and install them

9) copy the System.map file to your boot directory (/boot ?)

10) copy the bzImage file from arch/i386/boot to the same directory

11) if necessary edit your lilo.conf in /etc --Here's where I got stuck. I use GRUB, and originally it loads this kernel: /boot/vmlinuz-2.6.5-1.358 and this ramdisk file: /boot/initrd-2.6.5-1.358.img, but I don't see where a new kernel or ramdisk file was created for me to add to my grub.conf.

12) run lilo

13) reboot

--Here's where I got stuck. I use GRUB, and originally it loads this kernel: /boot/vmlinuz-2.6.5-1.358 and this ramdisk file: /boot/initrd-2.6.5-1.358.img, but I don't see where a new kernel or ramdisk file was created for me to add to my grub.conf.

Are you sure raid is built into the kernel and is not being loaded as a module? After a reboot, type "lsmod" as root. If md is there, then it's a module and you don't need to recompile your kernel, just re-make the modules.
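The lsmod check can be scripted. A minimal sketch, assuming the usual three-column lsmod layout; `module_listed` is a made-up helper name and the sample text stands in for real `lsmod` output:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named module appears in
# lsmod-style output read from stdin (the header line is skipped).
module_listed() {
    awk -v m="$1" 'NR > 1 && $1 == m { found = 1 } END { exit !found }'
}

# Sample input; on the real box this would be: lsmod | module_listed raid5
sample='Module                  Size  Used by
raid5                  15232  0
xor                    12936  1 raid5'

if printf '%s\n' "$sample" | module_listed raid5; then
    echo "raid5 is loaded as a module"
else
    echo "raid5 is not in the module list (built in, or not loaded)"
fi
```

Note that a missing entry does not prove the driver is built in; it may simply not be loaded.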

Note, compiling 2.6 is different from 2.4, and your instructions are based on 2.4. I'm assuming you patched the raid.c file first to comment out the "goto abort" line.

Now:

cd /usr/src/linux

make

make modules_install

cp -a /usr/src/linux/arch/i386/boot/bzImage /boot/vmlinuz-2.6.5-XXX

vi /boot/grub/menu.lst and add a line for the new kernel. Keep your old lines in case you need to revert.

shutdown -r now
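For the menu.lst step, here is a rough sketch of what the added entry could look like, printed by a small made-up helper (`grub_stanza`). The `(hd0,0)` boot partition and the `/dev/hda2` root device are pure assumptions; copy the real values from your existing entry:

```shell
#!/bin/sh
# Hypothetical helper: print a GRUB (legacy) menu entry for a new kernel.
# Arguments: title, kernel file name, root device, optional initrd file name.
grub_stanza() {
    echo "title $1"
    # ASSUMPTION: /boot is its own partition, the first one on the first disk
    echo "    root (hd0,0)"
    echo "    kernel /$2 ro root=$3"
    if [ -n "$4" ]; then
        echo "    initrd /$4"
    fi
}

# ASSUMPTION: /dev/hda2 is only a placeholder for the real root device
grub_stanza "Linux 2.6.5 (patched md)" vmlinuz-2.6.5-XXX /dev/hda2
```

If the existing entry has an initrd line, the new kernel will most likely need a matching one, passed as the fourth argument.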

Good luck!


yeah, it seems to be in the kernel

Module                  Size  Used by
ipv6                  184288  16
parport_pc             19392  0
lp                      8236  0
parport                29640  2 parport_pc,lp
autofs4                10624  0
sunrpc                101064  1
sis900                 14596  0
r8169                  12164  0
sg                     27552  0
scsi_mod               91344  1 sg
reiserfs              187604  0
raid5                  15232  0
xor                    12936  1 raid5
dm_mod                 33184  0
ohci_hcd               14748  0
ehci_hcd               21896  0
button                  4504  0
battery                 6924  0
asus_acpi               8472  0
ac                      3340  0
ext3                  102376  1
jbd                    40216  1 ext3

raid5 and xor apparently are loaded as modules... I'm also surprised to see ext3 listed as a module along with reiserfs; I thought those would be built into the kernel.


RAID Reconstructor by http://www.runtime.org

Runtime's RAID Reconstructor will help you to recover data from a broken

RAID Level 5 Array consisting of 3 to 14 drives

RAID Level 0 Array (Striping) consisting of 2 drives

Even if you do not know the RAID parameters, such as drive order, block size and direction of rotation, RAID Reconstructor will analyze your drives and determine the correct values. You will then be able to create a copy of the reconstructed RAID in an image file or on a physical drive.

Because one drive is redundant in RAID 5, it is sufficient to have one less than the original number of drives (N) in the array. RAID Reconstructor can recalculate the original data from the N-1 drives.
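The N-1 claim comes down to XOR parity. A toy sketch with one byte per "drive" (the byte values are made up for illustration, and `xor_rebuild` is a hypothetical helper name):

```shell
#!/bin/sh
# XOR all arguments together. Given the surviving blocks of a stripe
# (data and parity alike), the result is the missing block.
xor_rebuild() {
    acc=0
    for block in "$@"; do
        acc=$(( acc ^ block ))
    done
    echo "$acc"
}

# One stripe across 4 drives: three data bytes plus their parity.
d0=74; d1=195; d2=31
parity=$(xor_rebuild "$d0" "$d1" "$d2")

# Pretend the drive holding d1 has failed and rebuild it from the rest.
recovered=$(xor_rebuild "$d0" "$d2" "$parity")
echo "original d1=$d1 recovered=$recovered"
```

This prints `original d1=195 recovered=195`. It also shows why losing a second drive is fatal: with two unknowns in the stripe the XOR can no longer be solved.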

I don't know if it will work for you, but it could be useful if it supports Linux RAID.

raid5 and xor apparently are loaded as modules... I'm also surprised to see ext3 listed as a module along with reiserfs; I thought those would be built into the kernel.

You probably have the kernel configured to boot into a ramdisk, which then loads the modules for filesystems and such. In your grub.conf, you will see "root=ram0 realroot=/dev/hda" or something to that effect. This is not unusual. A monolithic kernel is preferred for situations like this, but building one from scratch is probably outside the scope of what we can do here (on a web board).
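Whether a driver ends up built in or as a module is decided in the kernel .config. A small sketch for checking one option, assuming the 2.6-era option name CONFIG_MD_RAID5; `config_state` is a made-up helper and the two sample lines stand in for a real .config:

```shell
#!/bin/sh
# Hypothetical helper: report how a kernel option is set in a .config
# read from stdin: "y" (built in), "m" (module) or "n" (not set).
config_state() {
    awk -F= -v opt="$1" '
        $1 == opt { state = $2 }
        END { print (state == "" ? "n" : state) }
    '
}

# Sample fragment; on the real box this would be:
#   config_state CONFIG_MD_RAID5 < /usr/src/linux/.config
printf 'CONFIG_BLK_DEV_MD=y\nCONFIG_MD_RAID5=m\n' | config_state CONFIG_MD_RAID5
```

With the sample input this prints `m`, which matches raid5 showing up in lsmod. Setting such options to `y` in the configurator is what moves toward the monolithic kernel mentioned above.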

Frank

What exactly does this check 'do'?

I'm not sure, but from what I've read, if a disk fails and the filesystem is not cleanly unmounted, then it sets this flag, which triggers the goto abort line so the RAID array won't start next time. That may be an overcautious policy, as people who have had similar problems and disabled the goto abort found their RAID would indeed come up without apparent errors. In this case, he just wants to mount it one more time in order to do a backup, so it may be worth the risk.


I am able to compile the kernel, but I am unable to boot it... ugh... RAID Reconstructor looks neat, although I'm a little afraid to hook up the disks to a Windows machine, as Windows might try to write IDs or something to them...


Mount /boot and edit the grub.conf

Make a copy of your existing lines and change the title of the copy to "Emergency backup bootable doohickey". In the copy, edit the names of the kernel and initrd to kernel_blah_old and initrd_blah_old.

In the root of /boot, cp your kernel and initrd to kernel_blah_old and initrd_blah_old.

Reboot; you will see 2 entries: the original kernel and the "Emergency backup bootable doohickey".

Test the "Emergency backup bootable doohickey".
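The copy step for the fallback entry could be sketched like this. `backup_boot_files` is a made-up helper, the `-old` suffix is just one naming choice, and the demo runs in a scratch directory rather than the real /boot:

```shell
#!/bin/sh
# Hypothetical helper: copy a kernel and its initrd to "-old" names so
# the boot loader can keep a known-good fallback entry.
backup_boot_files() {
    boot_dir="$1"
    version="$2"
    cp -p "$boot_dir/vmlinuz-$version" "$boot_dir/vmlinuz-$version-old" &&
    cp -p "$boot_dir/initrd-$version.img" "$boot_dir/initrd-$version-old.img"
}

# Demonstration in a scratch directory instead of the real /boot.
demo=$(mktemp -d)
echo kernel > "$demo/vmlinuz-2.6.5-1.358"
echo initrd > "$demo/initrd-2.6.5-1.358.img"
backup_boot_files "$demo" 2.6.5-1.358 && ls "$demo"
rm -rf "$demo"
```

On the real machine the first argument would be /boot, and the fallback entry in grub.conf has to point at the copied file names.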

Then do the following once that works

cd /usr/src/linux
make mrproper
make xconfig

Change anything you want, save (to .config) and exit

make
make modules_install
make install

Reboot and try the new kernel

Ping me about the parts that don't make sense.

Thank you for your time,

Frank Russo


awesome!!!!! thanks everyone. I recompiled the kernel and just by booting up with the new kernel I have access to the array. Going to get everything I can off of here ;- )

I was bidding on a 3ware hardware RAID controller on ebay, but got outbid in the last half hour or so... I'm thinking that it may be a better idea for me to go hardware RAID if I can get a good deal.


Nice work Jeff Poulin! And others, too! I actively followed this thriller and was hoping for the best. And blakerwry, great job! :) :) Happy end.

BTW, I have good experience with 3ware raid card (4-channel pata) with Linux. Hardware raid is the way to go if possible.


Stinker! I must have some awful karma right now...

While recovering data, /dev/hde dropped out of the array... I tried to recreate the array and something didn't work right. It looks like mkraid used the wrong algorithm (2 instead of the 0 I had specified in the config file), and it is also resyncing the array, which at this point probably means eating all my data...

when I try to mount md0 it is detected as a reiserfs partition, but will not mount....
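For comparing the "algorithm N" number the kernel reports with the layout name in the config file, a small lookup may help. This assumes the numbering used by the Linux md RAID 5 driver as I understand it; `raid5_layout_name` is a made-up helper:

```shell
#!/bin/sh
# Map the md RAID 5 "algorithm N" number (as shown in /proc/mdstat)
# to the layout name used for parity-algorithm in /etc/raidtab.
# ASSUMPTION: numbering as in the Linux md RAID 5 driver.
raid5_layout_name() {
    case "$1" in
        0) echo left-asymmetric ;;
        1) echo right-asymmetric ;;
        2) echo left-symmetric ;;
        3) echo right-symmetric ;;
        *) echo unknown; return 1 ;;
    esac
}

echo "configured: $(raid5_layout_name 0)"
echo "created as: $(raid5_layout_name 2)"
```

The layout decides where parity and data land in each stripe, so an array recreated with a different layout will read back with its blocks in the wrong order.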


Yikes. hde and hdf are on the same IDE bus, right? I wonder what's up with that. Sorry you weren't able to salvage everything, but glad you were able to at least get something (hope it was the important half ;)). When you did the mkraid, did you use the --really-force option? If not, your array may still be salvageable. You may want to give up at this point, but just in case you want to keep trying, what would happen if you set hde to be the only disk on its cable (i.e. if it's a WD drive, remove the jumpers) and tried to start the array in degraded mode (hde, hdg, hdh only)? What unfortunate luck that hde would drop out while doing a backup. :(

Nice work Jeff Poulin! And others, too! I actively followed this thriller and was hoping for the best. And blakerwry, great job! :) :) Happy end.

I'll second that. Now I just hope it won't happen to my own softraid5 array at home...

I was bidding on a 3ware hardware RAID controller on ebay, but got outbid in the last half hour or so... I'm thinking that it may be a better idea for me to go hardware RAID if I can get a good deal.

Using only one drive per channel in software RAID would probably have prevented the bad drive from taking down the array, so if you have the channels available, software RAID could still be an option.


yeah, that's possible and it's something I've been thinking about. The thing is, when hdf died, it froze the computer.. hard lock. Which left the array dirty. With the drive dead it also left the array degraded.

Since the softRAID5 didn't want to start a dirty degraded array, I don't know if it would have been any different.

When hde dropped out of the array the disk was still responding to smart commands and seemed to be operating fine.. it just dropped out of the array...

I tried stopping the array, but it would not work.. what's kind of funny is that the array still showed as online, but with only 2 drives out of 4 operational...

I couldn't tell which drive it was coming from; it sounded like a couple were reading heavily after hde had dropped.

I probably should have tried hotadding hde back in instead of rebooting.

I thought it was odd that hdf dropped and then the next time I use the array the hdd on the same data cable and power lead dies.... (2 HDDs were on a power lead and the other 2 HDDs were on another power lead, nothing else was used on these leads)

I decided to disconnect hdf from power and data along with replacing the data cable with a new one I had bought but not installed yet.

I let the computer resync the array and ran a non-destructive fsck; it reported that I would need to run with the --rebuild-tree option to fix 3 errors. So I bit the bullet and ran the full fsck, and it appears all my data is back... I'm continuing to try to salvage the next most important part of the data now. Hope it's readable.
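An array that is online with fewer members than it should have (like the 2-of-4 state described above) shows up in /proc/mdstat as a pair such as [4/2]. A minimal sketch that lists such arrays; `degraded_arrays` is a made-up helper, and the sample text is illustrative, not copied from the actual machine:

```shell
#!/bin/sh
# Hypothetical helper: read /proc/mdstat-style text on stdin and print
# the name of every array whose [total/active] pair shows missing members.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            # Strip the brackets and split "total/active"
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print name
        }
    '
}

# Illustrative sample; on the real box: degraded_arrays < /proc/mdstat
printf '%s\n' \
    'md0 : active raid5 hdh1[3] hdg1[2]' \
    '      360182016 blocks level 5, 64k chunk, algorithm 0 [4/2] [__UU]' |
    degraded_arrays
```

With the sample input this prints `md0`. A RAID 5 array that is down two members cannot serve correct data, so seeing it still listed as active is a warning sign rather than reassurance.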


;-(

This is terrible... I was copying data over to my girlfriend's computer and her DM+9 died too....

SMART shows some weird errors that weren't there last time I checked the drive... the disk is at least responding and does have a FS on it. I can't mount it in Linux, but maybe I can recover the data using a recovery util... the sad thing is that the backup was on the server... I think I may be able to recover the backup file, but I'm not sure how old it is or if it even works (I think the server problems started while creating the monthly backup).


nope, after hde dropped all data was jumbled up... it's funny listening to my mp3s as a single file now contains bits and pieces of several tracks ;-)

