KrazeyKami

Computer crash when processing large amounts of data


Hi all,

A short introduction to my tech level: I’ve been a system manager for the past 8 years, and have been playing around with computers for nearly 20 years.

Until now, I had never faced a problem with my own computer that I couldn’t solve, or at least diagnose. That’s why I need your help and a fresh look at things.

First, my setup:

Motherboard: ASUS P6T SE

Motherboard BIOS: v.0808

CPU: Intel Core i7-920 @ 2.66 GHz (Stock)

CPU Heatsink: Prolimatech Megahalems + 2 Cooler Master 120mm fans

CPU Idle Temp: 30-35 degrees per core

CPU Load Temp: 45-50 degrees per core (Prime95)

Memory Part Number: 6 GB OCZ Gold PC3-8500U + 6 GB OCZ Gold PC3-10700U (Both OCZ3G1333LV6GK @ 1066 MHz, Stock)

Memory Voltage: 1.65 V

Video Card(s): Asus nVidia GTX295

Sound Card: Sound Blaster Fatal1ty X-Fi

PSU Model Number: Cooler Master 1000W Real PowerPro

Hard Drive(s): Intel X25-M SSD 80 GB (OS/Boot), 4x 2TB HD204UI Samsung Spinpoint F4 (all on ICH10 / SATA2)

Optical Drive(s): GGW-H20L Blu-ray RW

Other Cooling: Cooler Master Stacker 831 case with 6x Cooler Master 120mm fans

Operating System: Windows 7 Professional x64

This machine is running 24x7, rock solid. I never had any issues with performance or stability.

Now, for the problem:

Until a few weeks ago, I didn’t have the 4x 2TB drives in it.

I bought the drives and built a RAID5 array using the ICH10R, and that’s when the problems began:

When copying (or downloading) large amounts of data to the array, thus creating high I/O on any of the 2TB drives, my computer suddenly reboots. There are no blue screens (although that option is switched on), and the only entries in Event Viewer are "System rebooted unexpectedly" (Event ID 6008) and a possible cause of power failure (Event ID 41).

No error codes, no nothing.
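For anyone who wants to pull those two events without clicking through Event Viewer, here is a minimal sketch. It assumes the Windows built-in `wevtutil` tool is on the PATH; the function names and the default count of 10 are made up for illustration, and the script only reads the System log.

```python
import subprocess

def build_event_query(event_ids=(41, 6008), count=10):
    """Build a wevtutil command listing the newest matching System events."""
    id_filter = " or ".join(f"EventID={i}" for i in event_ids)
    xpath = f"*[System[({id_filter})]]"
    return [
        "wevtutil", "qe", "System",  # query events from the System log
        f"/q:{xpath}",               # XPath filter on the event IDs
        f"/c:{count}",               # maximum number of events to return
        "/rd:true",                  # newest events first
        "/f:text",                   # plain-text output
    ]

def recent_unexpected_shutdowns(count=10):
    """Run the query (Windows only) and return the raw text output."""
    cmd = build_event_query(count=count)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Calling `print(recent_unexpected_shutdowns())` on the affected machine should list the most recent Event ID 41 / 6008 entries with their timestamps, which makes it easier to line the reboots up against the copy jobs.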

I decided to break up the array and see what happens when I copy 1.5 TB of data from one drive to the next: the computer reboots again. The drives are now single SATA2 drives, each with a 2 TB partition. After 10-80 minutes, the system reboots without notice or error.

So, I started to systematically remove drives, and test with the other drives.

Regardless of which drive is the source or the destination (I tried all combos, in all directions), the system reboots when a large amount of data is being moved. This happens not just when copying existing data, but also when downloading (and at the same time repairing files (.PAR) and extracting).

I tried the following:

  • Checked for overheating: all values are well below 40 degrees C (drives included);
  • Even put an active cooler on my southbridge (the ICH10);
  • Memory checks: ran 3 different programs to test my memory, overnight, for hours and hours, multiple passes, 0 errors;
  • Removed all other hardware, except for the absolute minimum necessary;
  • Swapped / replaced power cords and SATA cables, and even rotated the drives’ positions on the SATA connectors;
  • Reset the BIOS settings;
  • Reinstalled Windows 7 (deleted the entire 80GB partition on the SSD and reinstalled, no other tweaks, and started the copy jobs right after install) to rule out faulty software and / or drivers;
  • Reformatted / recreated the drives / partitions: tried both MBR and GPT partitioning, and different block sizes;
  • Turned off Write Back Cache, to further rule out a problem with my RAM;
  • Calculated the PSU needs: I tried multiple programs, even a paid one, and counted manually. Granted, on a full Direct3D load (games on high etc.), my GFX card needs around 450 Watts, which makes a grand total of 950 Watts. However, the problems occur while idle in Windows, so the consumption of my GFX card is at most 100 Watts, making a total of (roughly) 600 Watts, well within the limits of my 1000W PSU;
  • CPU check with Prime95: runs for days, stable, without a single error.
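The PSU arithmetic in the list above can be written out as a small sketch. It uses only the wattages quoted in the post; the variable and function names are made up, and it assumes the rest of the system draws its full-load figure even during a copy, which if anything overestimates the load.

```python
# Rough power budget using only the figures quoted in the post.
GPU_FULL_LOAD_W = 450    # GFX card under full Direct3D load
TOTAL_FULL_LOAD_W = 950  # grand total at full load
GPU_DESKTOP_W = 100      # GFX card while idle in Windows
PSU_RATED_W = 1000       # rated output of the PSU

def rest_of_system_w(total_w, gpu_w):
    """Everything except the graphics card."""
    return total_w - gpu_w

def headroom_w(load_w, rated_w=PSU_RATED_W):
    """Rated output minus estimated load."""
    return rated_w - load_w

rest = rest_of_system_w(TOTAL_FULL_LOAD_W, GPU_FULL_LOAD_W)
copy_load = rest + GPU_DESKTOP_W

print(f"Rest of system: {rest} W")
print(f"Estimated load while copying: {copy_load} W")
print(f"Headroom while copying: {headroom_w(copy_load)} W")
print(f"Headroom at full load: {headroom_w(TOTAL_FULL_LOAD_W)} W")
```

This only covers total wattage. It says nothing about how the load is spread across the individual rails.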

The facts:

  • The problem occurs only when copying / downloading a large amount of data;
  • The system runs flawlessly under high load (playing games, watching movies, running programs etc.);
  • I never had this problem before, but then again, I never had the space to start downloading 250 GB of data, or to copy 1.5 TB of data to other drives (I didn’t even have 1.5 TB ^^).

I am able to reproduce a “fast” reboot error:

I created 3 separate batch files, which basically tell Robocopy to copy data from:

  • Drive D: to Drive E:
  • Drive D: to Drive F:
  • Drive D: to Drive G:

  • When I run these scripts separately, each runs for a while, but the system still reboots / crashes after an hour or so.
  • When I start these scripts all at once, it reboots within 5 minutes.
  • Remember that I already cloned the drives, so I could rotate the source drive and systematically remove / switch a destination drive, thus trying all the different combos (and checking whether one of the drives might be faulty).
  • This also makes me doubt whether it’s the sheer size of the data that causes the problem, because within 5 minutes not even 100 MB is being addressed, and still it reboots.
  • This makes me think that the PSU might be the problem. As soon as these drives are actively called upon, I can imagine a sudden increase in load on the 12V rail that overloads it. Although my PSU has 6x 12V rails, I'm not too convinced this works as well as people say; there are many discussions on the web about the use of 6x 12V rails. Could it be that my 12V rail is maxed out? Consider that it's powering the mobo, the CPU (4/6 pins, can't remember), the GFX card (both 6 and 8 pins), 5 drives (SSD + 4x 2 TB), an optical BD-RW, and of course the onboard devices (sound card) and USB devices (headset, webcam).
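The “fast” reproduction (three copies from D: running at once) could be scripted roughly like this. It is a sketch, not the original batch files: the root paths and the `/E` switch (copy subdirectories, including empty ones) are assumptions, and the function names are made up.

```python
import subprocess

SOURCE = "D:\\"
DESTINATIONS = ["E:\\", "F:\\", "G:\\"]

def build_robocopy_command(source, destination):
    """One Robocopy job; /E copies subdirectories, including empty ones."""
    return ["robocopy", source, destination, "/E"]

def build_all_commands(source=SOURCE, destinations=DESTINATIONS):
    """One command per destination drive, all reading from the same source."""
    return [build_robocopy_command(source, d) for d in destinations]

def run_in_parallel(commands):
    """Start every job at once (Windows only) and wait for all of them.

    Returns the exit codes; Robocopy reports failures with codes of 8 or higher.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]

# Printing instead of running keeps this safe to try on any machine.
for cmd in build_all_commands():
    print(" ".join(cmd))
```

On the affected machine, `run_in_parallel(build_all_commands())` would start all three copies together, and `run_in_parallel(build_all_commands()[:1])` would run a single job for comparison.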

I am all out of ideas. If there is something that I haven’t checked / tried, please tell me. I think I wrote down everything I tried thus far; maybe I missed something, but I’ve been testing and trying for 3 weeks now.

For now my conclusion / suspects are:

  • The motherboard. Either a chip in the ICH10 was fried, or the SMBus got a dent;
  • The motherboard (or ICH10 / SMBus) is just not capable of processing such large amounts of data;
  • My PSU, mainly the 12V rail. It could be that my 12V rail is already close to fully loaded, and when the extra drive operations kick in, the load fluctuates and tips a bit above the maximum, thus crashing my computer. Looking at the symptoms (sudden reboots without any errors), this might be a more plausible cause than all the other things I tried. And yet, looking at my hardware setup, I cannot imagine I reached the 12V max.
    However, I will be testing this week by taking the PSU from a second desktop and connecting my hard drives to that power supply. Or maybe I will even swap my GTX295 for an el cheapo GFX card, and see if the computer stays stable during copying…

I’ll post the results shortly. In the meantime, if anyone has seen this problem before, or has other ideas / solutions to try, please let me know here and I’ll try them.

P.S.:

According to the PSU calculator, this is the recommended wattage / amperage for my setup at a full 100% load; below that is a table of the power my PSU can handle:

psu.JPG

Afaik, I can add the 12V rails together, so my PSU can handle max. 128 A on the 12V rails?

This is my PSU: http://www.coolermaster.com/product.php?product_id=2519

Does this mean my PSU should have more than enough power?

I'm still going to try a different PSU or GFX card to be sure, but according to the above I think all should be covered... Any thoughts?
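A quick sanity check on adding the rails together, as a sketch using only the 128 A sum and the 1000 W rating mentioned in this post (the names are made up). The PSU's real combined 12V limit is printed on its label and is not quoted in this thread, so the second figure is only an upper bound that assumes the entire rated output is available on 12V.

```python
RAIL_VOLTAGE_V = 12
SUM_OF_RAIL_LIMITS_A = 128  # per-rail limits simply added together
PSU_RATED_W = 1000          # rated total output of the PSU

# Power implied if every 12V rail could run at its limit at the same time.
implied_w = RAIL_VOLTAGE_V * SUM_OF_RAIL_LIMITS_A

# Upper bound on combined 12V current if ALL rated power went to the 12V rails.
max_combined_a = PSU_RATED_W / RAIL_VOLTAGE_V

print(f"Sum of rail limits implies {implied_w} W")
print(f"Rated output allows at most {max_combined_a:.1f} A combined on 12V")
```

Since 12 V x 128 A comes to more than the PSU's rated 1000 W, the per-rail limits cannot all be reached at once. The figure that matters is the combined 12V limit, not the sum of the rails.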

Many thanks in advance for thinking with me.

Kind regards,

Kami.


Well, for starters, onboard RAID can cause problems, which generally manifest themselves under heavy load (large continuous file transfers). This is why you see a lot of people moving to add-on cards, which offload the processing and also tend to be more reliable. Another trusted option is software RAID in Unix/Linux, which is dead reliable but not always as fast as its hardware-RAID kin. If you have played with a Synology, QNAP, or other NAS unit, you've played with software RAID.

Have you physically checked the heat output of the north and southbridge chipsets during or after one of these problem power cycles, to see if either is getting abnormally hot? You mention checking the temps and putting an active cooler on the southbridge, but did you ever touch it to see if it was scalding hot? If it ever reached that point, it could have been damaged. My vote is still on the chipset being unreliable under RAID5 though.

Now for your power supply concerns: yours should be perfectly capable of handling your loads. The PSU calculator is probably overestimating based on the startup requirements, which are the highest power draw a drive sees. You only experience that when you first turn on the system though; during normal operation, if the drives don't spin down, they use a fraction of that. For the longest time I was running 7 drives on a 330W power supply with no ill effects.


I wouldn't personally bother with PSU calculators or the like, just get a power meter from your local electronics retailer for $5/£5 and it gives you an exact readout of how much is being used by whatever you plug into it.

I'd agree with your analysis, in that it's probably faulty hardware on the motherboard. The chipset has no issues handling the data throughput, I can easily send a constant 700MB/sec through mine for hours at a time without a hitch. It won't be the RAID since you've tested it with RAID off.

There's a couple things you haven't mentioned trying: switching the controller between AHCI, RAID, and IDE modes in the BIOS, and also updating to the latest drivers for the SATA controller.

You mention this problem didn't happen before you got the 4x 2TB hard drives. What did you have in the system before? Just the SSD?

I would suggest testing a couple of things specifically before concluding whether it's the motherboard or the drives: a continuous write test to a hard drive and a continuous read test off the SSD. Basically, a) get dd for Windows and have it write a random data stream to a particular drive at full speed (data generated in RAM, not copied from another drive), and b) get a benchmarking program that lets you continuously read the sequential raw data off your SSD over and over.

a) will help us figure out whether it's reading from or writing to the drive that's the problem, and therefore whether test b) will be of any use

b) will help us figure out whether the new hard drives are the problem, or whether it happens with the SSD as well
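If dd or a benchmarking tool isn't at hand, the same two tests can be roughly approximated with a short script. This is a sketch with made-up function names, and it works on an ordinary file rather than the raw device, so it is not a full substitute: the write test sends data generated in RAM to a file on the target drive, and the read test reads a file back sequentially, over and over. Reads of a file that fits in memory may be served from the OS cache instead of the disk, so the test file should be much larger than the installed RAM.

```python
import os

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per write/read call

def continuous_write(path, total_bytes, chunk_size=CHUNK_SIZE):
    """Write total_bytes of random data to path; return the bytes written."""
    chunk = os.urandom(chunk_size)  # generated in RAM once, then reused
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_size, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # push the data out to the drive before returning
    return written

def continuous_read(path, passes=1, chunk_size=CHUNK_SIZE):
    """Read path sequentially `passes` times; return the total bytes read."""
    total = 0
    for _ in range(passes):
        with open(path, "rb") as f:
            while True:
                data = f.read(chunk_size)
                if not data:
                    break
                total += len(data)
    return total
```

For example, `continuous_write(r"E:\stress.bin", 200 * 1024**3)` would write about 200 GiB to one of the new drives, and `continuous_read(r"C:\some_large_file.bin", passes=50)` would hammer the SSD with reads. Both paths are hypothetical.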

Also, I might suggest getting a Linux live CD and attempting the same file copy from within Linux.


Thank you both for your contribution.

I used the RAID5 setup, and it was very stable (and fast!), but I decided to take it apart to try to identify the problem.

At first I thought it was RAID related, but after removing the RAID and making the disks single SATA drives (as written), the problem persisted.

See my post about my experience with the ICH10R and RAID5 here: http://forums.overclockers.co.uk/showthread.php?t=18224121

But, in the end, I've found the culprit:

The X58 mobo can't handle 12 GB of RAM. I removed 6 GB, and the machine is rock-solid again. It seems that with the BIOS on Auto settings, the QPI voltage is too low for 12 GB of RAM and needs to be entered manually.

Thanks all for your replies.


Interesting - particularly that RAM was the culprit, yet it passed 3 different RAM tests. Gotta say though, there are plenty of X58 boards that are fine with 12GB RAM; maybe you got a dud, or just a bad combination of components with low tolerances?

I take it you've put the 12GB back in with higher QPI voltage and that sorted things?
