poodel

Silent filesystem corruption

I've got SERIOUS problems with my brand new fileserver. I had some .rar archives that I got checksum errors from when I tried to unpack them. Not always on the same file, and not even every time. Also, handling was a bit slow on the unpack process.

I did a copy of the file structure, and now: no problems.

Now I checksummed about 260GB worth of other files, files that were copied to the filesystem from another disk in the same machine. 9 broken DVD images!!!

I had the Areca board do a volume check, which came back with no errors. xfs_check says nothing about the affected filesystems either.

Between the initial copying and now I've added a disk and grown the RAID set and the filesystem... if that broke something, that's a very bad thing. The entire point of this exercise was to be able to grow over time.

Any ideas?

Hardware:

Areca 1220 with 4 x Seagate 7200.10 750GB in RAID5.

2 x 500GB in software RAID1.

Asus P5B-V, Core2Duo, 2GB RAM etc

Software:

Debian Etch testing with 2.6.18.3 AMD64 kernel

Here's a suggestion - could you run a memory checker for your main DRAM (memtest86)? It's unlikely that this is it, but it's relatively easy to do, and can eliminate one possibility. As for Areca's on-board memory - don't know how to test that ...

Vlad

I ran memtest86 for one pass yesterday, plus a few hours of "memtest", but I'll run some more. "memtest" gave me errors once, but then I could not reproduce them.

I get this issue so much at work... bad memory is most often the cause. A CPU getting too hot can cause it too. Rarely is it bad SATA cables, as they seem to either work or not; at least I haven't really seen them cause silent corruption.

You can run memtest overnight (http://www.memtest.org), but it's not 100% foolproof. With lots of memory it becomes less reliable, but if you've got 1, 2 or 4GB, 24 hours should be enough to say that a memory issue is very unlikely. BTW, test 6 is by far the best at finding errors; if you like to gamble, always put your money on test 6. If you've got a DIMM that is horribly broken, any test will find it.

Anyway, you can also run prime95 and set it up to use lots of RAM; make sure you run enough instances to stress all cores. That's probably the easiest test that exercises CPU and memory together. HPL is great for bigger systems but harder to set up.

If you want to test the drives/controller, my favorite test is:

md5sum bigfile ; while true ; do cp bigfile bigfile1 ; md5sum bigfile1 ; rm -f bigfile1 ; done

(bigfile should be something like 5-10GB, unless you have a system with around 32GB of memory, in which case make bigfile larger than that; you never want the kernel to be able to serve the file from cache.)
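For anyone who wants the loop above to stop by itself, here is a minimal sketch of the same copy-and-checksum idea wrapped in a function. The name copy_verify and the ".copy" suffix are made up for illustration, and it assumes GNU md5sum is installed. It is only a sketch, not a replacement for the one-liner.

```shell
#!/bin/sh
# copy_verify FILE PASSES
# Copies FILE to FILE.copy PASSES times and compares the MD5 of each copy
# against the MD5 of the original. Stops at the first mismatch.
# Returns 0 if every pass matched, 1 on a mismatch, 2 if cp itself failed.
# (copy_verify and the .copy suffix are hypothetical names for this sketch.)
copy_verify() {
    src=$1
    passes=$2
    # Reference checksum of the original: first field of md5sum's output.
    ref=$(md5sum < "$src" | cut -d' ' -f1)
    i=1
    while [ "$i" -le "$passes" ]; do
        cp "$src" "$src.copy" || return 2
        sum=$(md5sum < "$src.copy" | cut -d' ' -f1)
        rm -f "$src.copy"
        if [ "$sum" != "$ref" ]; then
            echo "pass $i: MISMATCH (expected $ref, got $sum)"
            return 1
        fi
        echo "pass $i: ok"
        i=$((i + 1))
    done
    return 0
}
```

You would call it as something like "copy_verify bigfile 100". The same caveat applies as above: bigfile has to be larger than RAM or the reads come from cache. Also keep in mind that the reference checksum is itself read through the same possibly faulty path, so a mismatch only tells you that two reads disagreed, not which one was wrong.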

Oh yeah, and make sure you turn off ECC where possible, since it can hide errors a lot of the time. Also run mce_check if you have Opterons, in case you have a faulty CPU.

How far is your processor overclocked, or rather... is your processor overclocked?

Frank

No, no overclocking... I read somewhere that the correct PCI bus speed is critical to the RAID cards.

Thanks for the md5sum tip, I'll run that overnight.

I also ran some more "memtest" before and managed to reproduce the memory error, so then I ripped out one of the memory modules... so far no errors. I'll mix it up with some prime95 runs overnight.

Summary:

It turned out to be the memory after all. The problem was that none of the regular memtest programs would indicate any problem; I had to use "unrar" or "cksfv" to get a fairly repeatable error.

The solution was to increase the memory voltage to 2.0V (up from the default 1.8V?), and then it worked.

It's a bit scary that memory can do that... how would I notice if memory goes bad on me again in the future... perhaps this is what ECC memory is for?
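One way to notice this kind of thing later, independent of ECC, is to keep a checksum manifest of files that are not supposed to change and re-verify it now and then. Below is a minimal sketch, assuming GNU coreutils (sha256sum, sort -z) and GNU findutils (xargs -r). The function names make_manifest and verify_manifest are invented for this example, and sha256sum is my choice here; md5sum would work the same way.

```shell
#!/bin/sh
# make_manifest DIR MANIFEST
# Records a SHA-256 checksum for every regular file under DIR, with paths
# relative to DIR. Keep MANIFEST outside DIR so it does not list itself.
# (make_manifest / verify_manifest are hypothetical names for this sketch.)
make_manifest() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$2"
}

# verify_manifest DIR MANIFEST
# Re-reads every file listed in MANIFEST and compares checksums.
# Prints only the files that fail; exit status is non-zero if any file fails.
verify_manifest() {
    ( cd "$1" && sha256sum --quiet -c ) < "$2"
}
```

Usage would be along the lines of "make_manifest /data/dvd /root/dvd.sha256" once after copying, then "verify_manifest /data/dvd /root/dvd.sha256" from cron (both paths are just examples). A failure does not say whether the disk, controller or RAM is at fault, but it does tell you that something changed or was read back wrong.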

Replace the memory with a brand-name version such as OCZ, Crucial, etc. This is not good, as having to overvolt shows that the memory is a bit weak to begin with.

Cheers, Wizard

I tested with two Geil modules as well as Corsair... I think it's just my motherboard that can't handle the default voltage. I'm contemplating a possible switch with my P5W HW mainboard, since it can handle ECC memory.

Actually, most faster DDR2 modules are specified for a higher voltage than the DDR2 standard (1.8V).

I think some of the fastest modules even require 2.5V to work at their specified clock speed and latencies. OCZ in particular sells these "high voltage" kits.

Well... the error came back, and now it seems unfixable. It's also the weirdest error I've ever seen on a PC. Copying a large file produces a copy that is not identical to the original, and cksfv reports different errors every time. This is driving me mad!

  • It doesn't have anything to do with the ARECA card... I can reproduce the same error on the motherboard SATA connectors
  • It doesn't seem to be memory related. No memory tuning has any relevant effect, and no memtester finds any errors
  • Loading the CPU while performing the disk activities (cp, cksfv, unrar) seems to cure it temporarily
  • I can reproduce it on EXT3 or XFS, so it's not that.
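When a copy comes out different from the original, it can help to look at where the two files differ (single flipped bits, whole blocks, regular offsets), since that sometimes hints at which component is involved. A minimal sketch using cmp from diffutils follows; the function names diff_offsets and count_diffs are made up for this example.

```shell
#!/bin/sh
# diff_offsets ORIGINAL COPY [MAX]
# Lists up to MAX (default 20) differing bytes. Each output line from cmp -l is:
#   <1-based byte offset> <octal byte in ORIGINAL> <octal byte in COPY>
# (diff_offsets / count_diffs are hypothetical names for this sketch.)
diff_offsets() {
    max=${3:-20}
    cmp -l "$1" "$2" | head -n "$max"
}

# count_diffs ORIGINAL COPY
# Prints the total number of differing bytes (0 means the files are identical
# over their common length).
count_diffs() {
    cmp -l "$1" "$2" | wc -l
}
```

For example, "cp bigfile bigfile1; count_diffs bigfile bigfile1" followed by "diff_offsets bigfile bigfile1" on a bad copy. If the two files have different lengths, cmp also prints an EOF message on stderr.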

Is SMP seriously broken in my kernel? Is my CPU bad? It has never crashed or indicated anything beyond these read errors.

Do you by any chance use kernel 2.6.19?

There was a serious data corruption flaw concerning filesystems in kernel 2.6.19, which was fixed by Linus Torvalds, the master himself :), in 2.6.20!

Sorry for the follow-up post, but I don't seem to be able to edit my previous post...

More information about this data corruption bug can be found here.

Edited by Elandril

2.6.18.3, but I've now compiled myself a 2.6.20.3, which is a much better kernel for the Intel 965 chipset. The disk problem still seemed to be around, though now I'm not sure whether my tests were just hitting one genuinely broken file.

The problem is not around right now, but if it reappears, I'm tossing the motherboard and PSU.

Elandril,

Great reading, thank you!
