Velociraptor premature failure rate (bad drives, premature to market?)


116 replies to this topic

#51 bamboe32

bamboe32

    Member

  • Member
  • 2 posts

Posted 19 June 2009 - 08:24 AM

Here's a copy from http://gpi-storage.b...in-western.html


Ticking time bomb error in Western Digital’s 2.5” 10K RPM SATA VelociRaptor Product
In our testing of Western Digital's latest 10K RPM SATA product, the 2.5 inch VelociRaptor, we have discovered a very serious issue that everyone needs to be aware of. We found that after running almost 50 days continuously, the drive will throw Time Limited Error Recovery (TLER) errors. These errors will cause a RAID volume to fail, possibly causing a loss of data. We have been able to confirm this issue with Western Digital's Support. This issue occurs when TLER is enabled (it is normally enabled on all Western Digital RAID Edition products and is also enabled on the VelociRaptor [7 seconds for Read and Write commands]). See below for more information about TLER.

Details of the failure

When the continuous power-on hours hit 49.7 days, an internal time keeper in the drive's firmware wraps. When this time keeper wraps, any active Read, Write, and Flush commands will prematurely hit a TLER timeout. Because this time keeper wraps on all of the VelociRaptors in a system at the same time (assuming they were all powered on at the same time), any RAID volumes on these drives will fail. In our testing, our system RAID volume failed and we were not able to recover because all of the VelociRaptors failed together. Data is not lost, but the RAID controller will think that all of the drives failed because of the incorrect TLER timeout. All RAID configurations, including RAID 5 and RAID 6, will fail because multiple drives fail. If a system with VelociRaptor drives stays powered on, it will fail again every 49.7 days.

We have also been told by Western Digital that a short-term workaround is to power cycle any systems with VelociRaptor drives every 30-45 days to prevent the internal firmware time keeper from wrapping at 50 days. A simple reset or restart doesn't work; the system must be completely powered off to reset the internal firmware time keeper. Western Digital is also currently working on a fix and should have something soon to resolve the issue. We have been reassured by Western Digital that this issue only exists on the VelociRaptor and not on their other products, including their RAID Edition line.
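The 49.7-day figure is consistent with a 32-bit counter of elapsed milliseconds overflowing. WD never published the firmware internals, so the following is only a plausible sketch of the failure mode, not the actual mechanism:

```python
# Sketch of a 32-bit millisecond uptime counter wrapping (assumed mechanism;
# WD has not published the actual firmware internals).

MS_32BIT = 2**32  # counter rolls over after 4,294,967,296 ms

def elapsed_ms(start: int, now: int) -> int:
    """Naive elapsed-time calculation that ignores counter wrap-around."""
    return now - start  # wildly wrong once `now` has wrapped back past zero

# 2**32 ms expressed in days:
days_to_wrap = MS_32BIT / (1000 * 60 * 60 * 24)
print(f"counter wraps after {days_to_wrap:.3f} days")  # ~49.710 days

# A TLER timer started just before the wrap...
start = MS_32BIT - 100          # command issued 100 ms before rollover
now = (start + 200) % MS_32BIT  # 200 ms later the counter reads 100

# ...computes a nonsensical elapsed time instead of 200 ms, so the 7-second
# TLER limit appears exceeded and the drive falsely reports an error.
print(elapsed_ms(start, now))   # -4294967096 instead of 200
```

This also explains why a full power-off is needed: only removing power resets the counter to zero, while a warm restart leaves the drive (and its time keeper) running.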

Background information about TLER

TLER, or Time Limited Error Recovery, was created to help SATA and IDE drives work with RAID controllers when drive errors occur. Normally, SATA and IDE drives run extensive error recovery procedures to try to recover data when there is an error. Sometimes these procedures take multiple seconds per sector of data (512 bytes). If the errors affect an area larger than a single sector, these recovery procedures may take 10, 20, 30 seconds or longer. Most RAID controllers will not give a drive that much time to recover from an error. The RAID controller will typically drop or fail the drive and cause the RAID volume to go into recovery, using parity information from another drive (RAID 5 or RAID 6 configurations) or mirror drive information (RAID 1) to recover. Windows in a non-RAID configuration will also error out if the drive takes too much time trying to recover; this could cause a Blue Screen of Death or other issues. TLER helps manage error recovery times to allow the host or RAID controller to find alternate ways of dealing with drive errors. TLER has helped SATA and IDE drives compete with SCSI and other enterprise-class drives in enterprise applications. Western Digital recommends using TLER for any RAID configuration, and it can also be used in normal desktop applications. A utility is available from Western Digital to enable/disable or change the TLER settings.
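The interaction described above can be sketched as a toy model. The 7-second TLER limit is the value quoted earlier in this post; the 8-second controller timeout is an assumed example value, as real controllers vary:

```python
# Toy model of why TLER keeps a drive in a RAID array. The 7 s TLER limit
# is from the post above; the 8 s controller timeout is an assumed example.

TLER_LIMIT_S = 7.0          # VelociRaptor default for read/write (per post)
CONTROLLER_TIMEOUT_S = 8.0  # assumed typical RAID controller patience

def drive_survives(recovery_time_s: float, tler_enabled: bool) -> bool:
    """True if the controller keeps the drive in the array."""
    if tler_enabled:
        # Drive gives up after TLER_LIMIT_S and reports an error; the
        # controller rebuilds that sector from parity/mirror instead.
        recovery_time_s = min(recovery_time_s, TLER_LIMIT_S)
    return recovery_time_s < CONTROLLER_TIMEOUT_S

# A 30-second deep-recovery attempt drops the drive without TLER...
print(drive_survives(30.0, tler_enabled=False))  # False
# ...but with TLER the drive errors out at 7 s and stays in the array.
print(drive_survives(30.0, tler_enabled=True))   # True
```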

#52 jpiszcz

jpiszcz

    Member

  • Member
  • 578 posts

Posted 19 June 2009 - 08:26 AM

My first drive crashed after being used for 52 days; Linux mounted it as read-only after the crash.
Even after fsck repaired many files, many were still corrupted, so a fresh install was needed.
The smartctl long and short tests did not show any errors.
This drive was replaced by WD without any problems. (Good service, I thought :D )

I do not have NCQ enabled in my BIOS.


Error 61 occurred at disk power-on lifetime: 1262 hours (52 days + 14 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 fa 3a 98 ea Error: UNC 8 sectors at LBA = 0x0a983afa = 177748730

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 fa 3a 98 0a 08 49d+17:02:47.225 READ DMA
27 00 00 00 00 00 00 08 49d+17:02:47.212 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 08 49d+17:02:47.202 IDENTIFY DEVICE
ef 03 46 00 00 00 00 08 49d+17:02:47.167 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 08 49d+17:02:47.163 READ NATIVE MAX ADDRESS EXT


The second drive crashed after 50 days as well.
Here too, smartctl and WD's own test did not show any errors.
fsck repaired many files, and the system was still able to run.
This time WD would not replace the disk, as their Data Lifeguard Diagnostics did not show any errors.
I asked for a firmware update, but was told that there is none for a WD1500HLFS
(normal response time from WD, but no new firmware; not such good service from WD :( ).
I decided to restore to a different system and kept this one for testing.

Error 96 occurred at disk power-on lifetime: 1215 hours (50 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 0d 1e cf ef Error: UNC 8 sectors at LBA = 0x0fcf1e0d = 265231885

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 0d 1e cf 0f 08 49d+17:02:47.284 READ DMA
27 00 00 00 00 00 00 08 49d+17:02:47.284 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 08 49d+17:02:47.275 IDENTIFY DEVICE
ef 03 46 00 00 00 00 08 49d+17:02:47.268 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 08 49d+17:02:47.268 READ NATIVE MAX ADDRESS EXT



This same drive crashed again 57 days later (I had to power cycle the system to connect a floppy drive for the WD test).
I put some load on it to simulate normal server load.
Error 111 occurred at disk power-on lifetime: 2583 hours (107 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 08 95 f9 89 e3 Error: IDNF at LBA = 0x0389f995 = 59373973

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 08 95 f9 89 03 08 49d+17:02:44.190 WRITE DMA
ca 00 08 1d f8 89 03 08 49d+17:02:44.190 WRITE DMA
ca 00 30 25 4c 87 03 08 49d+17:02:44.190 WRITE DMA
27 00 00 00 00 00 00 08 49d+17:02:44.190 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 08 49d+17:02:44.182 IDENTIFY DEVICE



This happened 8 days ago. I entered a ticket at WD right away and asked for a response 3 days ago, but still no answer.
(Very bad service from WD :angry: )

As you can see, all my drives crashed after being powered on for 49 days, 17 hours, 2 minutes and a few seconds.
Searching the internet shows many more people with this 49.710-day firmware problem.
And WD still does not know about this????
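The Powered_Up_Time stamps in the SMART error logs above (49d+17:02:47) line up with the rollover point of a 32-bit millisecond counter almost to the second; a quick sanity check:

```python
# Sanity check: does 49 days 17:02:47 from the SMART error logs match the
# rollover point of a 32-bit millisecond counter?

uptime_s = 49 * 86400 + 17 * 3600 + 2 * 60 + 47  # 49d+17:02:47 in seconds
wrap_s = 2**32 / 1000                            # 4,294,967.296 s

print(uptime_s)  # 4294967
print(wrap_s)    # 4294967.296

# The logged failure time matches 2**32 ms to within the fractional second,
# consistent with a 32-bit millisecond time keeper wrapping.
assert abs(wrap_s - uptime_s) < 1
```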

http://gpi-storage.b...in-western.html had a good story about this problem, but the page has been removed. It was still available in Google cache, but now even that has been removed. Luckily I made a copy of it and will post it in the next message.

Rob.



I await the post!!

#53 geshel

geshel

    Member

  • Member
  • 210 posts

Posted 19 June 2009 - 04:21 PM

Your link to the blog with that information doesn't work. (Oops, just read your comment about that.)

Edited by geshel, 19 June 2009 - 04:22 PM.

#54 Michal Soltys

Michal Soltys

    Member

  • Member
  • 29 posts

Posted 23 June 2009 - 07:07 AM

Nice. Enterprise drives, my ass... This should be posted on the front page of SR. Although at least it doesn't turn your disk into a brick (like the recent Seagate firmware problems).

Justin - you've had problems with that (particular thing) since the middle of last year, IIRC? (Recalling lengthy threads from related Linux mailing lists.)

#55 jpiszcz

jpiszcz

    Member

  • Member
  • 578 posts

Posted 23 June 2009 - 08:16 AM

Nice. Enterprise drives, my ass... This should be posted on the front page of SR. Although at least it doesn't turn your disk into a brick (like the recent Seagate firmware problems).

Justin - you've had problems with that (particular thing) since the middle of last year, IIRC? (Recalling lengthy threads from related Linux mailing lists.)


Yes-- even to the point of buying a 3ware controller for the drives and they locked up on that too, and that was after I pretty much completely replaced all of the components in the machine. Definitely a drive/firmware issue.

#56 continuum

continuum

    Mod

  • Mod
  • 3,597 posts

Posted 23 June 2009 - 12:14 PM

We have also been told by Western Digital that a short term work around is to power cycle any systems with the VelociRaptor drives every 30-45 days to avoid the internal firmware time keeper from wrapping at 50 days

So Seagate's not the only one screwing up with internal counters in their drives. Lovely.

(at least Seagate can keep their healthy disks from dropping from arrays!)

#57 Michal Soltys

Michal Soltys

    Member

  • Member
  • 29 posts

Posted 25 June 2009 - 12:57 PM

(at least Seagate can keep their healthy disks from dropping from arrays!)


Speaking of which - weren't there two (or three, depending on perspective) separate firmware issues with Seagate recently?

http://techreport.co...ussions.x/15954 (timeouts)
http://techreport.co...ussions.x/16232 (bricking)
http://stx.lithium.c...b...ding&page=1 (bricking after fw update to correct bricking :P)

after quick googling

#58 Cirrus Telecom

Cirrus Telecom

    Member

  • Member
  • 3 posts

Posted 10 September 2009 - 03:32 PM

Has anyone else been able to resolve this issue? We have 80 of these drives and have been going round and round with the Blade manufacturer and chipset manufacturer.

Configuration is 40 Supermicro Blades w/ 2 WD3000BLFS each in RAID1. Blades use the LSI 1068E embedded chipset.

About every 49 days and 16 hours, roughly half of our blades go into a degraded status or blue screen. The RAID manager will just say that the drive "failed", with no other information. We have to pop out one of the drives and reseat it, and the RAID will rebuild. So far, no data loss anywhere.

I am on the phone with WD now, in Level 2 tech support. I tried resolving this with them by e-mail. Their e-mail tech support is extremely bad.

I will update when I get more information.

#59 Cirrus Telecom

Cirrus Telecom

    Member

  • Member
  • 3 posts

Posted 10 September 2009 - 03:45 PM

WD says they are sending me a utility that sets the TLER to unlimited.

#60 Cirrus Telecom

Cirrus Telecom

    Member

  • Member
  • 3 posts

Posted 10 September 2009 - 06:43 PM

The "utility" is actually new firmware.

Current part numbers:

WD3000BLFS-01YBU0
WD1500BLFS-01YBU0
WD740BLFS-01YBU0
WD3000HLFS-01G6U0
WD1500HLFS-01G6U0
WD740HLFS-01G6U0

The new firmware version number is 04.04v02.

New part numbers (with this firmware version):
WD3000BLFS-01YBU1
WD1500BLFS-01YBU1
WD740BLFS-01YBU1
WD3000HLFS-01G6U1
WD1500HLFS-01G6U1
WD740HLFS-01G6U1


Dear Valued WD Customer:

As a result of product evaluation consistent with Western Digital's quality
systems and our commitment to provide the highest quality products, an
update is being made to the WD VelociRaptor product family.

Description of Change:

Performance enhancement: Sequential read and write
Compatibility enhancements: Double status FIS, TLER timer and counter synchronization

These changes do not affect the form or fit of the drive but do positively
affect the function of the drive.

Details of Firmware Changes:

Performance improvements to sequential read and write:
Firmware release includes a function to allow the drive to stay in the
sequential mode if there was no other activity. This particularly benefits
certain SATA RAID environments.

Double status FIS:

The drive sent two status FIS (C001h and 5001h) after COMRESET.
Although the drive behavior did not violate the SATA specification it still
created issues for some SATA host bus adapters. This firmware will only
post 5001h when the drive is ready.

TLER:

Corrected an issue where after 49 days of continuous operation the drive
could falsely report an error to the Host. This issue does not result in data
loss and only occurs once every 49 days of continuous operation and only
if a read/write operation is in progress when the counter rolls to zero. The
new firmware resolves this issue but as an alternate workaround, the drive
can be power cycled before 49 days elapses to avoid the potential error.

Field Update Utility/Binary available July 24, 2009.
Manufacturing Implementation Date: August 21, 2009
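Until that firmware is applied, the power-cycle workaround can at least be monitored. Below is a hedged sketch for Linux that compares system uptime against the wrap point; the 45-day warning threshold is an assumption based on WD's 30-45 day suggestion, and system uptime matches drive power-on time only if the drives power on with the host:

```python
# Hedged sketch (Linux-only): warn when uptime approaches the 49.7-day
# counter wrap on unpatched VelociRaptor drives. WARN_DAYS is an assumed
# threshold based on WD's 30-45 day power-cycle suggestion.

WRAP_DAYS = 2**32 / 1000 / 86400  # ~49.710 days
WARN_DAYS = 45.0

def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """System uptime in seconds. Note: this tracks host uptime, which
    matches drive power-on time only if the drives power on with the host."""
    with open(path) as f:
        return float(f.read().split()[0])  # first field is uptime in seconds

def days_until_wrap(uptime_seconds: float) -> float:
    """Days of continuous power-on time left before the counter wraps."""
    return WRAP_DAYS - uptime_seconds / 86400

def should_warn(uptime_seconds: float) -> bool:
    return uptime_seconds / 86400 >= WARN_DAYS

# Example with a hypothetical 46-day uptime:
print(should_warn(46 * 86400))               # True
print(f"{days_until_wrap(46 * 86400):.2f}")  # ~3.71 days left
```

In practice one would call `read_uptime_seconds()` from cron and alert when `should_warn()` fires, leaving time to schedule a full power-off (a restart alone does not reset the drive's internal counter, per WD's guidance above).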

#61 6_6_6

6_6_6

    Member

  • Member
  • 590 posts

Posted 12 September 2009 - 12:59 PM

Hmm... Interesting situation.

What would you choose if you had only 2 options?

1. Get Seagate and pray the servers do not crash.

2. Get WD and power cycle the servers every 49 days.


I have some SCSI servers with 2000 days of uptime on them. They are in a remote datacenter. Hmm... This makes it even more complicated for me to choose, if I had to.

#62 6_6_6

6_6_6

    Member

  • Member
  • 590 posts

Posted 12 September 2009 - 01:11 PM

Wow, Justin...

I can't believe you were bitten by this insidious/heinous/sneaky bug almost a year ago!

#63 continuum

continuum

    Mod

  • Mod
  • 3,597 posts

Posted 15 September 2009 - 03:07 PM

Wow, Justin...

I can't believe you were bitten by this insidious/heinous/sneaky bug almost a year ago!

Only took WD a year to own up to it? Is that better or worse than Seagate?? :P

Man that sucks though. WD's still... yeah, no, I think I'm covered under NDA. Sigh.

#64 jpiszcz

jpiszcz

    Member

  • Member
  • 578 posts

Posted 19 September 2009 - 07:54 AM

Hi,

Can I get a copy of the firmware update?

I have 12 of these drives; it would be nice to use them.

Justin.

#65 moheban79

moheban79

    Member

  • Member
  • 3 posts

Posted 20 September 2009 - 05:06 PM

I have also had numerous problems with these VelociRaptor WD3000GLFS drives. I used to get RAID access errors on my 680i system that were only fixed temporarily by turning on TLER on both drives and running that utility. After 6 months the problem came back, and I assumed it was the 680i's fault. So I plugged them into my new X58 EVGA board, and now only one of the drives successfully posts at startup. The other one only sometimes registers at POST and most of the time does not, despite swapping power and port cables. Do I just have a bad drive? Or is this a firmware problem, as was mentioned before?

In any case, I am so pissed at WD I don't think I could ever support their company again...

#66 moheban79

moheban79

    Member

  • Member
  • 3 posts

Posted 20 September 2009 - 05:10 PM

Double post edited... please ignore

Edited by moheban79, 20 September 2009 - 05:11 PM.

#67 vesastef

vesastef

    Member

  • Member
  • 3 posts

Posted 21 September 2009 - 07:50 AM

Hi all,

It would be wonderful to have that firmware!

I have 3 VelociRaptor WD3000GLFS drives, and it is impossible to use them reliably in a RAID setup with a 3ware card (with NO warning on the WD website), so as a last shot I would try reflashing their firmware...
For now I've only wasted a lot of money.

Stefano

#68 moheban79

moheban79

    Member

  • Member
  • 3 posts

Posted 21 September 2009 - 08:19 AM

Hi all,

It would be wonderful to have that firmware!

I have 3 VelociRaptor WD3000GLFS drives, and it is impossible to use them reliably in a RAID setup with a 3ware card (with NO warning on the WD website), so as a last shot I would try reflashing their firmware...
For now I've only wasted a lot of money.

Stefano


I found this link in another forum on HardOCP claiming this is that firmware. Here is the link: Velociraptor Firmware. It's too late for one of my drives - may it rest in peace, or in pieces...

#69 6_6_6

6_6_6

    Member

  • Member
  • 590 posts

Posted 21 September 2009 - 08:34 AM

Why are you bothering with firmware and all? Did you people take your drives back to WD, dead or alive? Didn't they replace them? They are covered by warranty, and these are clearly faulty drives.

WTF does "reboot your system every 49 days" even mean? I was laughing at Seagate, but this is even more ridiculous.

#70 6_6_6

6_6_6

    Member

  • Member
  • 590 posts

Posted 21 September 2009 - 08:41 AM

I had 2 WDs brought to me recently. Both were locked up. Different people. One guy said he booted, and the hard disk was not recognizable. Upon investigation, we saw both drives had their master passwords set. We managed to recover one, but the other one just would not do a secure erase and password reset. I remember a bug like this in the past with WD firmware too. I wonder if this is a common occurrence or an isolated bug. These people definitely did not set their passwords themselves. It happened by itself, just like the Seagates not booting one day.

#71 vesastef

vesastef

    Member

  • Member
  • 3 posts

Posted 21 September 2009 - 08:51 AM

Why are you bothering with firmware and all? Did you people take your drives back to WD, dead or alive? Didn't they replace them? They are covered by warranty, and these are clearly faulty drives.

WTF does "reboot your system every 49 days" even mean? I was laughing at Seagate, but this is even more ridiculous.



I've already contacted WD to try to swap my GLFS drives for HLFS ones, because when I purchased them, only the GLFS was in production.
They rejected my request because my disks are not "made for RAID", suggesting I buy RE3s, and bla bla bla...
Now I read here that the "unofficial RAID edition" HLFS (I could place an emoticon here...) has its own weird troubles... So the firmware flashing could make sense...

Stefano

#72 jpiszcz

jpiszcz

    Member

  • Member
  • 578 posts

Posted 21 September 2009 - 08:58 AM

Why are you bothering with firmware and all? Did you people take your drives back to WD, dead or alive? Didn't they replace them? They are covered by warranty, and these are clearly faulty drives.

WTF does "reboot your system every 49 days" even mean? I was laughing at Seagate, but this is even more ridiculous.



I've already contacted WD to try to swap my GLFS drives for HLFS ones, because when I purchased them, only the GLFS was in production.
They rejected my request because my disks are not "made for RAID", suggesting I buy RE3s, and bla bla bla...
Now I read here that the "unofficial RAID edition" HLFS (I could place an emoticon here...) has its own weird troubles... So the firmware flashing could make sense...

Stefano


I had problems with HLFS and BLFS drives; I have not had time to flash them yet, though.

#73 6_6_6

6_6_6

    Member

  • Member
  • 590 posts

Posted 21 September 2009 - 10:12 PM

What is HLFS, GLFS and BLFS?

#74 jpiszcz

jpiszcz

    Member

  • Member
  • 578 posts

Posted 23 September 2009 - 08:16 AM

What is HLFS, GLFS and BLFS?


http://www.hardforum...d.php?t=1345966

3 models:
BLFS - no heatsink, SATA connector located at the back of the 2.5" HDD itself
HLFS - heatsink, SATA connector relocated to the correct location for a 3.5" HDD
GLFS - heatsink, SATA connector located at the back of the 2.5" HDD itself

Correction to my earlier post, I have HLFS (2) & GLFS (10) drives, not BLFS.

#75 jpiszcz

jpiszcz

    Member

  • Member
  • 578 posts

Posted 23 September 2009 - 03:58 PM

What is HLFS, GLFS and BLFS?


http://www.hardforum...d.php?t=1345966

3 models:
BLFS - no heatsink, SATA connector located at the back of the 2.5" HDD itself
HLFS - heatsink, SATA connector relocated to the correct location for a 3.5" HDD
GLFS - heatsink, SATA connector located at the back of the 2.5" HDD itself

Correction to my earlier post, I have HLFS (2) & GLFS (10) drives, not BLFS.


Bad news...


