jpiszcz

Velociraptor premature failure rate (bad drives, premature to market?)

I don't want to get too excited yet, but after disabling NCQ I was able to write over the entire RAID10 array without it crashing!
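(For anyone wanting to do the same: with Linux software RAID you can turn NCQ off per drive through sysfs by dropping the queue depth to 1. Device names below are examples; on the 3ware card you would use the card's own tools instead.)

cat /sys/block/sdb/device/queue_depth       # >1 means NCQ is active
echo 1 > /sys/block/sdb/device/queue_depth  # as root; repeat for each member disk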

I will let it run a few more times before making any further comments though.

writing to raid10
dd: writing `file2': No space left on device
1430328+0 records in
1430327+0 records out
1499806973952 bytes (1.5 TB) copied, 3914.51 s, 383 MB/s
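(The test itself is nothing fancy -- one big sequential write until the array fills up. A sketch; bs=1M and the mount point are my assumptions:)

cd /mnt/raid10                  # mount point of the array (example)
dd if=/dev/zero of=file2 bs=1M  # runs until "No space left on device"
rm file2                        # clean up between runs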

When using NCQ on these drives in RAID, it is broken on the 3ware card just as it is when you do the same thing with software RAID in Linux.

NCQ+Velociraptor => bad in a RAID configuration; in non-RAID it may be OK (I have not tested).


I spoke too soon -- turning off NCQ helped dramatically, and it worked three times!

writing to raid10
dd: writing `file2': No space left on device
1430328+0 records in
1430327+0 records out
1499806973952 bytes (1.5 TB) copied, 3914.51 s, 383 MB/s
Fri Dec 5 06:00:25 EST 2008

writing to raid10
dd: writing `file2': No space left on device
1430328+0 records in
1430327+0 records out
1499806973952 bytes (1.5 TB) copied, 4063.25 s, 369 MB/s
Fri Dec 5 07:08:11 EST 2008

writing to raid10
dd: writing `file2': No space left on device
1430328+0 records in
1430327+0 records out
1499806973952 bytes (1.5 TB) copied, 3926.71 s, 382 MB/s
Fri Dec 5 08:13:41 EST 2008

Then it crashed again. With NCQ enabled, it would not even complete one test.

So basically: a new system, new PSU, new cables, it's on a new APC UPS, and the problem persists whether the disks are on a RAID card or in SW RAID -- it does not matter. Velociraptors have problems. I think it's time for me to get regular 1TiB disks and be done with it.

Justin.


I don't understand your persistence in running the same OS and hardware on these drives. How hard is it to boot the manufacturer's diagnostic utility or MHDD in DOS and do a proper check in another system (or the same system, if you don't have another)?

And, oh yes, I just realized that the kibibyte/mebibyte/tebibyte crap comes with X and GNOME. I haven't used any GUI on Linux for so long that I had forgotten it comes by default. Hence the widespread use of that crap on the net.

And yeah, the file-sizes column shows, for files: 5 bytes, 59.2 KB, 1.7 MB. Go figure, if you can, which one is bigger with a cursory look. I can't believe 15 years have passed since we were trying to make the GUI boot in Red Hat instead of ncurses! Looks like nothing has changed. It took me 3 days to make Add/Remove Software work properly just to figure out what is installed on my system. Apparently an internet connection is needed, and it is a royal PITA to make a local repository work. Something as simple as typing rpm -qa|sort|xargs rpm -qil took me countless hours of reading to replicate through the GUI. Yes, I am a text guy; GUI is not for me. GNOME looks like Windows 1.1 days anyway (KDE looks cool, though). Never mind my ranting, I just got frustrated with this Fedora 10 crap. :)
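(For the record, that pipeline is plain rpm and needs no GUI at all; broken down:)

rpm -qa                            # list every installed package
rpm -qa | sort                     # same list, alphabetized
rpm -qa | sort | xargs rpm -qil    # package info plus file list for each one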


The drives may work fine in Windows, but that is not the OS I need them to work in; it is widely known that the Raptors+NCQ are broken in Linux, and that is the OS I need to use the disks in.
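(You can verify whether NCQ is actually active on a given drive from the kernel log or sysfs; device names are examples:)

dmesg | grep -i ncq                    # e.g. "ata3.00: ... NCQ (depth 31/32)"
cat /sys/block/sdb/device/queue_depth  # >1 means NCQ is being used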

In any event, I am back on my good old Raptor 150s for now, and the 300s have been put into another system. I will be performing the same testing I did earlier on both systems; I want to see if I can reproduce on the 150s any of the problems that I had on the 300s.

The drives may work fine in Windows, but that is not the OS I need them to work in; it is widely known that the Raptors+NCQ are broken in Linux, and that is the OS I need to use the disks in.

I am sorry, but you complained about failing drives. The topic title reads: "Velociraptor premature failure rate (bad drives, premature to market?)" and you have RMA'd several times so far across 12 disks.

So it is not the drives' fault, and you replaced disks for nothing. It is not going to make any difference if you try 12 more disks or change 24 more controllers. Better to try something else; they are not irreplaceable. And nobody is asking you to change your OS; we were trying to figure out whether it is the drives' fault or something else's (proper troubleshooting).


The drives may work fine in Windows, but that is not the OS I need them to work in; it is widely known that the Raptors+NCQ are broken in Linux, and that is the OS I need to use the disks in.

I am sorry, but you complained about failing drives. The topic title reads: "Velociraptor premature failure rate (bad drives, premature to market?)" and you have RMA'd several times so far across 12 disks.

So it is not the drives' fault, and you replaced disks for nothing. It is not going to make any difference if you try 12 more disks or change 24 more controllers. Better to try something else; they are not irreplaceable. And nobody is asking you to change your OS; we were trying to figure out whether it is the drives' fault or something else's (proper troubleshooting).

Since it's not the disks but Linux (I guess), where do you recommend as the best place to sell them? eBay? They all pass SMART short/long tests. Either that, or maybe I can use them in a Windows box.

Since it's not the disks but Linux (I guess), where do you recommend as the best place to sell them? eBay? They all pass SMART short/long tests. Either that, or maybe I can use them in a Windows box.

If I were you, I would try a different and more recent Linux distro, stick to a time-tested filesystem like ext3, and see if I could resolve the issue. You are trying to find the lowest common denominator; once you find that and verify everything works to your liking, you can experiment with different hardware/software. FreeBSD/OpenBSD would be good options to give a spin for testing purposes as well.
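As a sketch of what I mean (device names are examples; adjust for your disks):

mdadm --create /dev/md0 --level=10 --raid-devices=6 /dev/sd[b-g]1  # plain md RAID10
mkfs.ext3 /dev/md0                        # time-tested filesystem, default options
mount /dev/md0 /mnt/test
dd if=/dev/zero of=/mnt/test/file bs=1M   # same fill test as before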

Were they all packed properly? Newegg, Mwave, ZipZoomfly, Allstarshop -- just about every major retailer out there is notorious for improperly handling and packing OEM-packaged products. I would strongly suspect that the way most retailers pack things is causing a significant jump in failure rates.

Hint: most retailers do a single layer, maybe two layers, three if you're lucky, of the large-bubble bubble-wrap around a drive. The drive, so packed, then goes into the bottom of the box and is topped off with foam peanuts. This is entirely INadequate protection, as it results in less than 2" of bubble wrap in each dimension. Plus, OEM-packed products are not actually wrapped in bubble wrap until they hit the packaging stage, which means they go through the entire warehouse in just the ESD bag/clamshell... eek.

Does it really matter when the heads are parked?

Anyway the RMAs from WD have zero bubble wrap. Just the egg-carton type of things and a tiny cardboard box.

I don't have a VR300, but none of the other Raptors I own made it through their 5 years of warranty...


Well, after 7 weeks, one of the Velociraptor 150s completely failed:

Code 0225 - too many errors found
Firmware 4.01V01

So these may be lemon pie after all. The other drive is working, though.

Well, after 7 weeks, one of the Velociraptor 150s completely failed:

Code 0225 - too many errors found
Firmware 4.01V01

So these may be lemon pie after all. The other drive is working, though.

I gave up and bought RE3s; I should be putting a new RAID together by the end of this week.

Does it really matter when the heads are parked?
Where did I say anything about heads?
Anyway the RMAs from WD have zero bubble wrap. Just the egg-carton type of things and a tiny cardboard box.
Yes, but the egg cartons-- which are better than bubble wrap-- are at least 2" thick on each of the 6 faces of the drive, right? That's key-- 2" of non-displaceable padding on each side.

Typical retailers utterly fail to do this properly on most OEM-packed drives...


It looks like I am not alone: there is another Linux user having the exact same problem, and he has replaced a few(?) drives as well. He has been lucky, though -- none of the drives have dropped out of his array (although he is using RAID-10).

From bruno@XXXXXXXX Wed Jan 14 08:33:51 2009
Date: Wed, 14 Jan 2009 14:31:54 +0100
From: Bruno Friedmann <bruno@XXXXXXXX>
To: Justin Piszcz <jpiszcz@XXXXXXXX>
Subject: Re: [smartmontools-database] WDC WD3000GLFS-01F8U0 ( Velociraptor )

Justin Piszcz wrote:
> Here are the problems I had with mine:
> http://forums.storagereview.net/index.php?...hl=velociraptor
>
> Many bad sectors as well-- yes-- also did you ever see those errors I
> see in the logs? Timeouts? etc?

sd 2:0:0:0: [sdb] 586072368 512-byte hardware sectors (300069 MB)
sd 2:0:0:0: [sdb] Write Protect is off
sd 2:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
         res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
ata3.00: status: { DRDY }
ata3: soft resetting link
ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

This seems to be the same as yours... and that's not good news for us... Do you have the same firmware?

The only big difference is that we have never been hit by an mdadm failure event:

mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Wed Oct 8 17:49:43 2008
     Raid Level : raid10
     Array Size : 879095808 (838.37 GiB 900.19 GB)
  Used Dev Size : 293031936 (279.46 GiB 300.06 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Wed Jan 14 14:30:25 2009
          State : clean
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0
         Layout : near=1, offset=2
     Chunk Size : 1024K
           UUID : c9299f99:5c59456d:06ceadf1:d66ecc3e
         Events : 0.4222

    Number   Major   Minor   RaidDevice   State
       0       8      17        0         active sync   /dev/sdb1
       1       8      33        1         active sync   /dev/sdc1
       2       8      49        2         active sync   /dev/sdd1
       3       8      65        3         active sync   /dev/sde1
       4       8      81        4         active sync   /dev/sdf1
       5       8      97        5         active sync   /dev/sdg1

But we do get, under heavy load, the syslog timeout errors and restarts of the SATA bus...

We just sent someone to pick up 3 new drives so that we have spares here, and we will change the two defective drives tonight. I really hope the new ones have newer firmware...

Thanks a lot for the link... it gives us a real picture of the trouble we will (or will not) have to encounter in the near future. I will keep you informed about the new drives.

What is "funny" is that we have another array of 6 Raptors, but with the old 150s, and that one is simply working nicely.

^^ What is funny is that I also have 12 150s and have run them in all sorts of RAID configurations too! Never a problem. Velociraptors are no good! :(


Well, RAID1, RAID0, and RAID10 are much less taxing on disks. I had that problem with my Areca ARC-1680: it would drop one of the ES.2s out immediately if I ran RAID5 or RAID6 on the original 1.41 (or was it 1.42?) firmware... they released 1.43 or 1.44 before it was really stable. Then Seagate rev'd the firmware on the drives from SN03 to SN04 to SN05 to AN05, and finally now things look to be rock solid.

I assume you've kept current with firmware and driver updates-- I see you have for the 3ware, but what about for the drives too?
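(Checking the drive-side firmware revision only takes a second; device name is an example:)

smartctl -i /dev/sdb | grep -i firmware   # prints the drive's firmware version
hdparm -I /dev/sdb | grep -i firmware     # same information via ATA IDENTIFY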

Well, RAID1, RAID0, and RAID10 are much less taxing on disks. I had that problem with my Areca ARC-1680: it would drop one of the ES.2s out immediately if I ran RAID5 or RAID6 on the original 1.41 (or was it 1.42?) firmware... they released 1.43 or 1.44 before it was really stable. Then Seagate rev'd the firmware on the drives from SN03 to SN04 to SN05 to AN05, and finally now things look to be rock solid.

I assume you've kept current with firmware and driver updates-- I see you have for the 3ware, but what about for the drives too?

I had opened up many cases with WD, and their answer every time was to RMA -- never to provide me a firmware upgrade. Maybe they only do that with their enterprise-class disks?


Gentlemen,

What is your current status on these drives? I'm working with a customer that has five Velociraptors configured in RAID 6. Drive failures aplenty... sometimes two in the same day (thank God for RAID 6). We replaced/RMA'd drives for a while, but have gotten to the point where I think the RAID controller (Intel SRCSASJV) is crying wolf. At this point, when a drive fails, we just rebuild it. I'm going to be on-site over the weekend to pull drives one at a time and test them with the Lifeguard Diagnostic in a 2nd system, so we'll see what comes of that... but I've been down this road before with these drives.

This RAID controller is a SAS RAID controller that talks SCSI to the drives. So the error I'm getting is a sense code that translates to Medium Error - Record Not Found, which seems to correspond to "IDNF" in the SATA world.

As mentioned, I'll know for sure over the weekend whether this is *actually* a physical drive problem, but I really didn't think so until I came across this forum posting. At this point we've replaced the RAID controller and cabling, so I feel those can be ruled out. The only other possibility could be the drive enclosure, but I think that's a stretch. Power is fine... the server has dual PSUs connected to a UPS, and no power fluctuations or other issues are logged in the UPS.
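One more thing I may try: if smartmontools' controller pass-through supports this card (the SRCSASJV is LSI MegaRAID-based as far as I know -- that part is my assumption), I could read each member drive's SMART error log in place from a Linux boot disc, something like:

# N is the drive's number behind the controller (0, 1, 2, ...)
smartctl -i -d megaraid,0 /dev/sda        # identify drive 0 through the card
smartctl -l error -d megaraid,0 /dev/sda  # its ATA error log; look for UNC/IDNF entries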

Also, I've kept the RAID controller firmware up to date.

Thanks,

Frank

Gentlemen,

What is your current status on these drives? I'm working with a customer that has five Velociraptors configured in RAID 6. Drive failures aplenty... sometimes two in the same day (thank God for RAID 6). We replaced/RMA'd drives for a while, but have gotten to the point where I think the RAID controller (Intel SRCSASJV) is crying wolf. At this point, when a drive fails, we just rebuild it. I'm going to be on-site over the weekend to pull drives one at a time and test them with the Lifeguard Diagnostic in a 2nd system, so we'll see what comes of that... but I've been down this road before with these drives.

This RAID controller is a SAS RAID controller that talks SCSI to the drives. So the error I'm getting is a sense code that translates to Medium Error - Record Not Found, which seems to correspond to "IDNF" in the SATA world.

As mentioned, I'll know for sure over the weekend whether this is *actually* a physical drive problem, but I really didn't think so until I came across this forum posting. At this point we've replaced the RAID controller and cabling, so I feel those can be ruled out. The only other possibility could be the drive enclosure, but I think that's a stretch. Power is fine... the server has dual PSUs connected to a UPS, and no power fluctuations or other issues are logged in the UPS.

Also, I've kept the RAID controller firmware up to date.

Thanks,

Frank

They are sitting in a box; I no longer use them. Apparently they work well in Windows, from what I hear. What OS are you running?


My client is running Windows Server 2008 x64. But bear in mind that Windows is "talking" to the RAID controller, which abstracts the hard drives as a single disk entity. Not sure if that matters... maybe the RAID controller is running a Linux kernel of some sort.

My client is running Windows Server 2008 x64. But bear in mind that Windows is "talking" to the RAID controller, which abstracts the hard drives as a single disk entity. Not sure if that matters... maybe the RAID controller is running a Linux kernel of some sort.

That could be a possibility. After 8-10 RMAs, I got sick of it and gave up; there is no solution. WD does not treat the Velociraptor as an enterprise drive: when I had problems, they never offered me a firmware update; their answer was always to RMA the drive. After spending several weeks (or months) doing this, I went with RE3 drives and have not had a problem since.


Just one more thought... which variety of the Velociraptor are you running?

http://www.westerndigital.com/en/products/...asp?DriveID=495

I have the 3.5" backplane-ready, 300 GB model (WD3000HLFS). Just wondering if there might be some commonality there.

One of the reasons we went with these drives is that they're listed in WD's "Enterprise" section of their products page, so it's unfortunate that they're not being more proactive in getting this issue straightened out.

Just one more thought... which variety of the Velociraptor are you running?

http://www.westerndigital.com/en/products/...asp?DriveID=495

I have the 3.5" backplane-ready, 300 GB model (WD3000HLFS). Just wondering if there might be some commonality there.

One of the reasons we went with these drives is that they're listed in WD's "Enterprise" section of their products page, so it's unfortunate that they're not being more proactive in getting this issue straightened out.

I think there are 3 kinds, but I went through (RMA'd) 2-3 of the HLFS (enterprise) drives as well as the regular desktop ones; they were both bad, for me anyway.

I am on my 3rd Velociraptor now. I've had my 300GB Velociraptor die twice :(. Is this bad luck, or is this drive known to have issues?

I had numerous ones die, and I stopped using them in favor of RE3s -- so far, no failures. Judge for yourself: try a different model (non-Velociraptor) and see if your problems go away. Raptor 150s work well if you are looking for 10k SATA drives.

I am on my 3rd Velociraptor now. I've had my 300GB Velociraptor die twice :(. Is this bad luck, or is this drive known to have issues?

I had numerous ones die, and I stopped using them in favor of RE3s -- so far, no failures. Judge for yourself: try a different model (non-Velociraptor) and see if your problems go away. Raptor 150s work well if you are looking for 10k SATA drives.

I consolidated from two 150GB Raptors to one 300GB Velociraptor; I hate having more hard drives cluttering my drive bays. I might go the RE3 route, or Western Digital's 640GB drive, though I hate to give them the business. I feel a little irked.


My first drive crashed after 52 days of use; Linux remounted it read-only after the crash.

After fsck repaired many files, many files were still corrupted, so a fresh install was needed.

The smartctl long and short tests did not show any errors.

That drive was replaced by WD without any problems. (Good service, I thought :D )

I do not have NCQ enabled in my BIOS.
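(The tests in question were the standard smartctl self-tests; for reference, with the device name as an example:)

smartctl -t short /dev/sdb     # short electrical/mechanical test, ~2 minutes
smartctl -t long /dev/sdb      # full surface read scan
smartctl -l selftest /dev/sdb  # show self-test results
smartctl -l error /dev/sdb     # show the ATA error log quoted below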

Error 61 occurred at disk power-on lifetime: 1262 hours (52 days + 14 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 fa 3a 98 ea  Error: UNC 8 sectors at LBA = 0x0a983afa = 177748730

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --   ----------------  --------------------
c8 00 08 fa 3a 98 0a 08   49d+17:02:47.225  READ DMA
27 00 00 00 00 00 00 08   49d+17:02:47.212  READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 08   49d+17:02:47.202  IDENTIFY DEVICE
ef 03 46 00 00 00 00 08   49d+17:02:47.167  SET FEATURES [set transfer mode]
27 00 00 00 00 00 00 08   49d+17:02:47.163  READ NATIVE MAX ADDRESS EXT

The second drive crashed again, after 50 days.

Here too, smartctl and WD's own test did not show any errors.

fsck repaired many files, and the system was still able to run.

This time WD would not replace the disk, as their Data Lifeguard Diagnostics did not show any errors. I asked for a firmware update, but was told that there is none for the WD1500HLFS.

(Normal response time from WD, but no new firmware -- not such good service from WD :( )

I decided to restore to a different system and kept this drive for testing.

Error 96 occurred at disk power-on lifetime: 1215 hours (50 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 0d 1e cf ef  Error: UNC 8 sectors at LBA = 0x0fcf1e0d = 265231885

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --   ----------------  --------------------
c8 00 08 0d 1e cf 0f 08   49d+17:02:47.284  READ DMA
27 00 00 00 00 00 00 08   49d+17:02:47.284  READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 08   49d+17:02:47.275  IDENTIFY DEVICE
ef 03 46 00 00 00 00 08   49d+17:02:47.268  SET FEATURES [set transfer mode]
27 00 00 00 00 00 00 08   49d+17:02:47.268  READ NATIVE MAX ADDRESS EXT

This same drive crashed again 57 days later (I had to power-cycle the system to connect a floppy drive for the WD test).

I had put some load on it to simulate normal server load.

Error 111 occurred at disk power-on lifetime: 2583 hours (107 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 08 95 f9 89 e3  Error: IDNF at LBA = 0x0389f995 = 59373973

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time   Command/Feature_Name
-- -- -- -- -- -- -- --   ----------------  --------------------
ca 00 08 95 f9 89 03 08   49d+17:02:44.190  WRITE DMA
ca 00 08 1d f8 89 03 08   49d+17:02:44.190  WRITE DMA
ca 00 30 25 4c 87 03 08   49d+17:02:44.190  WRITE DMA
27 00 00 00 00 00 00 08   49d+17:02:44.190  READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 08   49d+17:02:44.182  IDENTIFY DEVICE

This happened 8 days ago; I entered a ticket at WD right away and asked for a response 3 days ago, but still no answer.

(Very bad service from WD :angry: )

As you can see, all my drives crashed after being powered on for 49 days, 17 hours, 2 minutes and a few seconds.

Searching the internet shows many more people with this 49.710-day firmware problem.

And WD still does not know about this????
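That figure is exactly what a 32-bit millisecond counter wrapping around looks like: 2^32 ms comes to 49 days, 17 hours, 2 minutes and 47 seconds, which matches the Powered_Up_Time (49d+17:02:4x) stamped on every error above:

echo 'scale=3; 2^32 / 1000 / 86400' | bc   # -> 49.710 days
# 2^32 ms = 4294967296 ms = 4294967 s = 49d 17h 02m 47s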

http://gpi-storage.blogspot.com/2009/01/ti...in-western.html had a good story about his problem, but the page has been removed. It was still available in the Google cache, but now even that has been removed. Luckily I made a copy of it and will post it in the next message.

Rob.

