Evening everybody, I have a serious problem on my hands and I have run out of options, so I would really appreciate some input from you guys as I am really stuck now.
I've built a file server for a company. The server specs are as follows:
NORCO 4U, Rackmount with 10 Hot swappable bays
Intel i7-4820k
Gigabyte GA-X79-UP4
32GB Corsair RAM (4x8GB)
Corsair CX750W PSU
Intel RT3WB080 RAID controller
120GB SSD (OS drive)
Intel i340-T4, quad 1Gb NIC
Asus GT610 GPU
8x Western Digital RED WD40EFRX drives
The SSD was connected straight to the motherboard, while the 4TB drives were connected to the RAID controller. The SSD acts as the OS drive, while the 8x4TB drives were set up in RAID 6. Over a period of about 8 months, I’ve already lost more than 10 of the 4TB drives on this setup.
The drive failures seem completely random. The NORCO chassis contains two bays with 5 drives each. The failures were not limited to one of the two bays, nor specifically to the top or the bottom drives. Sometimes the drives will run fine for weeks, and then all of a sudden 2 or 3 drives will fail in one weekend.
SMART information was not completely out of the norm, but there was a slightly elevated Read Allocation Errors count on the drives that I’ve had to replace.
As a process of elimination, I replaced the SAS cables (SAS cables fanning out to 4 SATA connectors which plug into the back of the drive bays). This did not solve anything: the drives continued popping out of the array and I continued replacing them with new ones.
At this stage, I was getting really desperate to figure out what was causing the issues and to preserve the data (recovering from our backups took ages), so I removed 4 of the drives. I re-allocated 3 of them to 3 other servers, where they are used for internal storage (nearline backups). I disconnected the fourth drive from the RAID controller and connected it straight to the motherboard (via a SATA cable to the back of the Norco backplane). This was done about 6 months ago, and to date not a single one of these four drives has failed. The four drives that remained in the array, however, kept on failing fast.
When discussing the symptoms (random crashes, slightly more frequent over weekends), the suggestion was that it may be the PSU (I live in South Africa and our power supply is really unreliable). So, as the next step of elimination, I replaced the PSU with a new one. Unfortunately, the drives in the array kept on popping out.
I then had a massive UPS installed that is powering the whole rack. This did not help either and the drives kept on popping. (Pick an error message the drives showed, I’ve had them all…)
Ok, then it must obviously be the RAID controller, so I replaced the RAID controller with a new one, again an Intel RT3WB080. This did not resolve anything either.
Ok, despite having replaced pretty much all the drives with new ones by now, I decided that maybe the WD Red drives just aren’t good enough. I had two drives at my supplier, and they agreed to credit me the cost of the two Red drives, so I took 2 x WD 4TB SAS drives instead.
For starters, I only realised after installing the SAS drives that this Intel RAID controller is not suitable for SAS drives, so I took another RAID controller that I had close by (a RocketRaid 2722), which is a really cheap controller but has been running in another server for close to a year without any issues.
In any event, one of the SAS drives seemed to be completely dead, as it doesn't even register (I still need to test it on another SAS system). The second SAS drive was accepted and I rebuilt the array (again RAID 6). Truth be told, even with a new RAID controller, a new PSU, and new SAS cables, 3 of the 4 drives popped within 24 hours of building the array.
I figured the only thing that had not been changed yet was the backplane of the Norco chassis, so I removed the Norco chassis and built the system into one of my old no-name server chassis. I connected all 4 drives straight to the RAID controller (i.e. no backplanes between the drives and the RAID controller). Guess what, it still failed after 24 hours.
I am really, really running out of options here and will appreciate any suggestions. In summary, here is what I’ve done:
Replaced the SAS cables
Replaced the RAID controller (twice)
Removed the Norco chassis (i.e. no backplanes to worry about)
Replaced the PSU
Installed the whole setup onto a high end UPS.
The only thing that might be slightly out of the norm is that the last failure happened at exactly the same time that our backups kicked off. Most (not all) of the drive failures also happened over weekends, but I've never heard of a backup program causing drives to pop?
I have not replaced the CPU, RAM, GPU, NIC, or motherboard, as none of these is, in my mind, directly connected to the drives, but if you reckon that one of them may be the problem, then please let me know.
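To test the backup theory properly, the drop times from the controller log could be lined up against the backup schedule. Below is a minimal sketch in Python, assuming the timestamps are copied out by hand. The function name, the two-hour window, and the sample timestamps are placeholders made up for illustration, not values from my logs.

```python
from datetime import datetime, timedelta


def drops_near_backups(drop_times, backup_starts, window_hours=2.0):
    """Return the drive-drop timestamps that fall within
    window_hours after the start of any backup job."""
    window = timedelta(hours=window_hours)
    return [
        d for d in drop_times
        if any(timedelta(0) <= d - b <= window for b in backup_starts)
    ]


# Placeholder timestamps for illustration only (hypothetical, not real log data).
backup_starts = [
    datetime(2015, 1, 3, 22, 0),
    datetime(2015, 1, 10, 22, 0),
]
drop_times = [
    datetime(2015, 1, 3, 22, 40),   # 40 minutes after a backup started
    datetime(2015, 1, 7, 9, 15),    # midweek, no backup running
    datetime(2015, 1, 10, 23, 5),   # 65 minutes after a backup started
]

hits = drops_near_backups(drop_times, backup_starts)
print(f"{len(hits)} of {len(drop_times)} drops fell inside a backup window")
```

If most of the real drops land inside a backup window, that would point towards sustained load on the array (or on whatever feeds it) rather than towards the drives themselves.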
Please, any suggestions will really be appreciated.