Skouperd

What am I missing, why does my RAID keep on failing?


Evening everybody. I have a serious problem on my hands and have now absolutely run out of options, so I would really appreciate some input from you guys as I am properly stuck.

  1. I've built a file server for a company; the server specs are as follows:
    1. NORCO 4U rackmount chassis with 10 hot-swappable bays
    2. Intel Core i7-4820K
    3. Gigabyte GA-X79-UP4
    4. 32GB Corsair RAM (4x8GB)
    5. Corsair CX750W PSU
    6. Intel RT3WB080 RAID controller
    7. 120GB SSD (OS drive)
    8. Intel i340-T4 quad 1Gb NIC
    9. Asus GT610 GPU
    10. 8x Western Digital Red WD40EFRX 4TB drives

The SSD was connected straight to the motherboard, while the 4TB drives were connected to the RAID controller. The SSD acts as the OS drive, while the 8x4TB drives were set up in RAID 6. Over a period of about 8 months, I've already lost more than 10 of the 4TB drives on this setup.
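(For context, the RAID 6 arithmetic on that array, just the standard two-parity-drive calculation:)

```python
# Standard RAID 6 arithmetic for the array described above: two drives' worth
# of capacity goes to parity, and any two drives can fail simultaneously.
drives = 8
drive_size_tb = 4
usable_tb = (drives - 2) * drive_size_tb
print(f"Usable capacity: {usable_tb} TB, tolerates 2 concurrent drive failures")
# -> Usable capacity: 24 TB, tolerates 2 concurrent drive failures
```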

The drive failures seem completely random. The NORCO chassis contains two bays with 5 drives each, and the failures were not tied to a specific bay, nor were they specific to the top drives or the bottom drives. They really were random. Sometimes the drives will run fine for weeks, and then all of a sudden 2 (or even 3) drives will fail in one weekend.

The SMART information was not completely out of the norm, but there were slightly elevated read/reallocation error counts on the drives that I had to replace.
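(If anyone wants to see the raw counters I'm talking about, this is roughly how they can be pulled; a minimal sketch assuming smartmontools is installed, with example device names. Drives sitting behind a RAID controller may need smartctl's -d option, e.g. -d sat.)

```python
# Minimal sketch: dump the SMART counters most relevant to these failures
# (reallocated/pending sectors, raw read errors, cable CRC errors).
# Device names are placeholders; adjust for your system.
import subprocess

DRIVES = ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"]   # placeholders
WATCH = ("Raw_Read_Error_Rate", "Reallocated_Sector_Ct",
         "Current_Pending_Sector", "Offline_Uncorrectable",
         "UDMA_CRC_Error_Count")

for dev in DRIVES:
    result = subprocess.run(["smartctl", "-A", dev], capture_output=True, text=True)
    print(f"--- {dev} ---")
    for line in result.stdout.splitlines():
        if any(attr in line for attr in WATCH):
            print(line)
```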

As a process of elimination, I replaced the SAS cables (SAS cables fanning out to 4 SATA connectors which plug into the back of the drive bays). This did not solve anything; the drives continued popping out of the array and I continued replacing them with new ones.

At this stage, I was getting really desperate to figure out what was causing the issues and to preserve the data (recovering from our backups took ages), so I removed 4 of the drives and re-allocated 3 of them to 3 other servers as internal storage (nearline backups). I took the fourth drive off the RAID controller and connected it straight to the motherboard (via a SATA cable to the back of the Norco backplane). This was done about 6 months ago, and to date, not a single one of these four drives has failed. The four drives that remained in the array, however, kept on failing fast.

Discussing the symptoms (random failures, slightly more frequent over weekends), the suggestion was that it might be the PSU. (I live in South Africa and our mains power is really unreliable.) So the next step of elimination was the possibly dodgy PSU, which I replaced with a new one. Unfortunately, the drives in the array kept on popping out.

I then had a massive UPS installed that is powering the whole rack. This did not help either and the drives kept on popping. (Pick an error message the drives showed, I’ve had them all…)

OK, then it must obviously be the RAID controller, so I replaced the RAID controller with a new one (again an Intel RT3WB080). This did not resolve anything either.

OK, despite having replaced pretty much all the drives with new ones by now, I decided maybe the WD Red drives just aren't good enough. I had two drives at my supplier, and they agreed to credit me the cost of the two Red drives; I then took 2x WD 4TB SAS drives instead.

For starters, I only realised after installing the SAS drives that this Intel RAID controller is not suitable for SAS drives, so I took another RAID controller that I had close by (a RocketRAID 2722), which is a really cheap controller but has been running in another server for close to a year without any issues.

In any event, one SAS drive seemed to be completely dead, as it didn't even register (I still need to test it on another SAS system). The second SAS drive was accepted and I rebuilt the array (again RAID 6). Truth be told, with a new RAID controller, a new PSU and new SAS cables, 3 of the 4 drives popped within 24 hours of building the array.

I figured the only thing that had not been changed yet was the backplane of the Norco chassis, so I then ditched the Norco chassis and rebuilt the server into one of my old no-name server chassis, connecting all 4 drives straight to the RAID controller (i.e. no backplanes between the drives and the controller). Guess what: it still failed within 24 hours.

I am really, really running out of options here and would appreciate any suggestions. In summary, here is what I've done:

  1. Replaced the SAS cables
  2. Replaced the RAID controller (twice)
  3. Removed the Norco chassis (i.e. no backplanes to worry about)
  4. Replaced the PSU
  5. Put the whole setup on a high-end UPS

The only thing that might be slightly out of the norm is that the last failure happened at exactly the same time that our backups kicked off. Most (though not all) of the drive drops also happened over weekends, but I've never heard of a backup program being the cause of drives popping out of an array.
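(To check that backup/weekend pattern properly, one thing I'm considering is lining up the dropout timestamps against the backup window, something along these lines. The log path, keywords, window and timestamp format below are placeholders, not my actual setup:)

```python
# Rough sketch: flag logged drive errors that fall inside the backup window.
# The log path, keywords, window and timestamp format are all placeholders.
from datetime import datetime, time

LOG_FILE = "/var/log/raid-events.log"               # hypothetical controller log export
KEYWORDS = ("drive offline", "link reset", "removed from array")
BACKUP_START, BACKUP_END = time(22, 0), time(4, 0)  # example 22:00-04:00 window

def in_backup_window(t):
    if BACKUP_START <= BACKUP_END:
        return BACKUP_START <= t <= BACKUP_END
    return t >= BACKUP_START or t <= BACKUP_END     # window wraps past midnight

with open(LOG_FILE) as log:
    for line in log:
        if not any(k in line.lower() for k in KEYWORDS):
            continue
        # assumes each line starts with "YYYY-MM-DD HH:MM:SS"
        stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        tag = "DURING BACKUP" if in_backup_window(stamp.time()) else "outside backup"
        print(f"{tag}: {line.strip()}")
```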

I have not replaced the CPU, RAM, GPU, NIC or motherboard, as none of these, in my mind, is directly connected to the drives, but if you reckon one of them may be the problem, please let me know.

Please, any suggestions will really be appreciated.

Kind regards

Skouperd

Going over the failures, the two items that stick out to me as possible causes are abnormally high vibration levels in the NORCO chassis or a really bad power supply. You mention you replaced the PSU, but was it with the same model that shipped with the NORCO? Have you measured the 12 V and 5 V rail voltages? Also, I'm unfamiliar with the case, but how are the drives mounted? Is there any vibration isolation mechanism?

On the power front: if the rail voltages are greatly skewed, that can kill drives pretty effectively.
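If you want hard numbers rather than a multimeter spot check, something along these lines can log the rails over a weekend. This is only a sketch: it assumes lm-sensors with JSON output (`sensors -j`) and that the board's monitoring chip actually exposes the 12 V / 5 V inputs, which not all boards do.

```python
# Sketch only: sample the motherboard voltage sensors once a minute and flag
# anything more than 5% from nominal. Assumes lm-sensors with JSON output
# (`sensors -j`); which rails are actually exposed depends on the board.
import json, subprocess, time

NOMINAL = {"+12V": 12.0, "+5V": 5.0, "+3.3V": 3.3}   # sensor labels vary by chip

def read_voltages():
    raw = subprocess.run(["sensors", "-j"], capture_output=True, text=True).stdout
    readings = {}
    for chip in json.loads(raw).values():
        for label, fields in chip.items():
            if not isinstance(fields, dict):
                continue                              # skip the "Adapter" string
            for key, value in fields.items():
                if key.startswith("in") and key.endswith("_input"):
                    readings[label] = value
    return readings

while True:
    for label, volts in read_voltages().items():
        nominal = NOMINAL.get(label)
        if nominal and abs(volts - nominal) / nominal > 0.05:
            print(f"{time.ctime()}  {label} out of spec: {volts:.2f} V")
    time.sleep(60)
```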

Vibration is the other area of concern. You have two factors working against you right now. First, the case might not be suitable for enterprise conditions. Second, the WD Red NAS drives are designed to operate in groups of 8 or fewer (you are covered in that regard), but WD doesn't suggest using them in a rack-mount environment. Unlike their high-end enterprise HDDs, they don't have vibration/rotational sensors. In a small setup like a NAS sitting on a desk, a drive only needs to worry about the 7 other HDDs in the case. In a rack environment, though, it needs to worry about every other device bolted to the rack, as well as the resonance of the chassis itself being bolted to a rigid structure.

Since these failures are happening over the span of months and are fairly random (but still clustered on that one chassis), I'm guessing you might be dealing with the vibration concern.


Might also be bad backplanes, although if you're getting truly random failures I would suspect vibration, power, or controller/drive firmware incompatibility issues.

Which exact model Norco chassis do you have? Got a link?


Thank you very much for the information. To answer some of your questions:

This is the Norco chassis that I've got:

http://www.directron.com/rpc450th.html

and this is what the backplanes look like (from the back):

http://ep.yimg.com/ay/directron/norco-rpc-450th-4u-server-rackmount-chassis-with-10-hot-swappable-drive-bays-3x-5-25-inch-drive-bays-8.gif

The chassis did not ship with a PSU; I bought a Corsair CX750W PSU:

http://www.corsair.com/en/cx750-80-plus-bronze-certified-power-supply

I replaced the PSU with the same model some time ago (as my thoughts were also on the PSU), but neither of these two PSUs shows any fluctuations on any of its rails, nor gives me issues when I run it in any other computer. (The one I replaced is now used in another gaming rig and is running smoothly.)

I've eliminated the backplanes by rebuilding the file server into a chassis that does not require any backplanes (i.e. SAS/SATA cables go directly from the controller to the drives). However, the drives crashed within 24 hours in the new build. (That was the 2 new SAS drives and 3 of the WD Red drives.)

I've since returned the two SAS drives, as one is completely dead (looks like it was dead on arrival) and the second one is giving me issues: it does not register every time, and when it does, I seem to be stuck at 2TB only. (The second drive looks like it took a knock or something, as there is minor damage to the outside of the drive.)

Regarding rack mounting: even though these WD Red drives are not meant to be rack mounted, I should add that the rack contains only the following devices:

1. A 24-port switch (no vibration, as it doesn't even have a fan)

2. Three "servers"; however, none of these has any spinning drives, as they are fully equipped with SSDs. The only vibration in these would be from the fans (120mm PSU fan, a normal Intel LGA 2011 CPU cooler, and one low-RPM 120mm exhaust fan)

3. A KVM switch (no vibration, as it doesn't have a fan installed)

4. Keyboard, mouse and screen (vibration only when the keyboard is used, which is rare considering we remote-desktop into the servers)

5. The main file server (which is the one giving the issues and the only one with spinning drives)

I should add that I removed 3 of the 8 drives from the file server and installed one into each of the 3 servers mentioned in point 2 above. Also, to reduce vibration in the rack itself, I actually mounted the file server at the bottom (less vibration, if any, the closer it is to the floor). The UPS etc. are all outside the rack.

Given the above contents of the rack, together with the placement of the server, I am not sure that the vibration caused by the other equipment is enough to damage the drives. I also checked just now: the rack seems quite solid, with no visible vibration. (I touched it lightly and honestly cannot feel any vibration on it.)

I had a long chat with my supplier as well when I returned the two SAS drives, and he is predominantly of the opinion that I am just exceptionally, unnaturally, super unlucky and keep getting all the really bad hard drives. However, even I am getting the feeling that surely one person cannot be that unlucky, and that something else is at play here.

Has anybody ever had issues with software causing drive failures? The data stored on these drives is super sensitive (bank data), and as such we have various encryption software installed on the servers.

Thank you very much for taking the time to read this, but more importantly for taking the time to respond.

Kind regards

Skouperd


Might have skipped this step, but where are you located and where are you buying your drives? How are they packaged and what is the condition upon arrival? I wonder if the drives you are getting are being hammered in shipment.

"Has anybody ever had issues with software causing drive failures?"

Absolutely.

Is this a 4-post rack or a 2-post rack? If it's a 2-post, those are generally not designed to hold servers, and hence vibration is often a severe problem with them.

Kevin's question regarding drive batches is a definite consideration as well.


I am located in South Africa, so any drives that arrive here are subject to at least a 3,000 km journey from the factory. However, despite the distance, these drives are sourced from a very reputable dealer who sources directly from the two main importers. I would guess that about 80% of all drives in South Africa come via these two importers, so they obviously know how to ship them, and the dealer I am buying from is one of the biggest in the country as well. I should add that I must have bought in excess of 100 drives in the last 5 years from these 3 companies and, barring this server, never really had any problems. (Sure, the occasional drive failure, but nothing like this.)

The rack is a 4-post rack in a restricted-access server room (only 2 people have access to the room: me and one other person).

You mentioned software could be causing drive failures; would you mind giving me some examples of software which has caused such failures, apart from the obvious ones, being RAID controller drivers and RAID controller firmware?

I have since moved the drives out of the rack into a separate computer just to run some tests, but I think the damage has already been done, as the SATA drives keep failing. (I received the two new SAS drives, and this time round they seem to be holding up well.)
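(For anyone wanting to run similar out-of-array tests, a minimal sketch along these lines is one option; it is not exactly what I ran, the device names are examples, and badblocks -w destroys all data on the drive:)

```python
# Destructive burn-in sketch for drives pulled out of the array: a full
# write/read surface scan with badblocks, followed by a SMART extended
# self-test. badblocks -w WIPES the drive; device names are examples only.
import subprocess

DRIVES = ["/dev/sdb", "/dev/sdc"]   # the pulled drives in the test machine

for dev in DRIVES:
    subprocess.run(["badblocks", "-wsv", dev], check=True)       # destructive surface scan
    subprocess.run(["smartctl", "-t", "long", dev], check=True)  # queue extended self-test
    # check the self-test result later with: smartctl -l selftest <device>
```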

Again, thank you very much for the feedback, very much appreciated.


Unfortunately I am not allowed to talk about the specifics.

If the drives all work fine outside of the chassis, even when connected to the same RAID controller and power supply and whatnot, have you considered changing the chassis? Supermicro and Chenbro might be worthwhile alternatives.

At this point, if the downtime on these servers is costing the business this much, it's time to weigh the cost of homebrewing servers against going to a proper system integrator or VAR, possibly one who is an authorized reseller for one of the big OEMs, so you can get some proper engineering resources on it...
