Storage Forums: Velociraptor premature failure rate (bad drives, premature to market?) - Storage Forums


Velociraptor premature failure rate (bad drives, premature to market?) I have RMA'd several times so far across 12 disks.

#51 User is offline   bamboe32 

  • Member
  • Group: Member
  • Posts: 2
  • Joined: 19-June 09

Posted 19 June 2009 - 08:24 AM

Here is a copy from http://gpi-storage.blogspot.com/2009/01/ti...in-western.html


Ticking time bomb error in Western Digital’s 2.5” 10K RPM SATA VelociRaptor Product
In our testing of Western Digital’s latest 10K RPM SATA product, the 2.5 inch VelociRaptor, we have discovered a very serious issue that everyone needs to be aware of. We found that after running almost 50 days continuously the drive will throw Time Limited Error Recovery (TLER) errors. These errors will cause a RAID volume to fail, possibly causing a loss of data. We have been able to confirm this issue with Western Digital’s Support. This issue occurs when TLER is enabled (it is normally enabled on all Western Digital RAID Edition products and is also enabled on the VelociRaptor [7 seconds for Read and Write commands]). See below for more information about TLER.

Details of the failure

When the continuous power-on time hits 49.7 days, an internal time keeper in the drive’s firmware wraps. When this time keeper wraps, any active Read, Write, and Flush commands will prematurely hit a TLER timeout. Because this time keeper wraps at the same time on all of the VelociRaptors installed in a system together (all powered on at the same time), any RAID volumes on these drives will fail. In our testing, our system RAID volume failed and we were not able to recover because all of the VelociRaptors failed together. Data is not lost, but the RAID controller will think that all of the drives failed because of the incorrect TLER timeout. All RAID configurations, including RAID 5 and RAID 6, will fail because multiple drives will fail. If the system with VelociRaptor drives stays powered on, it will fail again every 49.7 days.
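
The 49.7-day figure is what you would get if the time keeper were an unsigned 32-bit counter of milliseconds. That is an assumption, not something the write-up states; it only gives the 49.7 days. A minimal sketch of the arithmetic under that assumption (the function names are made up for illustration):

```python
# Assumption (not confirmed by the source): the firmware time keeper is an
# unsigned 32-bit counter that ticks once per millisecond.
COUNTER_BITS = 32
TICK_MS = 1
MS_PER_DAY = 86_400_000


def wrap_period_ms(bits: int = COUNTER_BITS, tick_ms: int = TICK_MS) -> int:
    """Milliseconds until a counter of the given width rolls over to zero."""
    return (1 << bits) * tick_ms


def format_powered_up(ms: int) -> str:
    """Format milliseconds in the 'DDd+HH:MM:SS.mmm' style smartctl uses."""
    days, rem = divmod(ms, MS_PER_DAY)
    hours, rem = divmod(rem, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{days}d+{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


if __name__ == "__main__":
    period = wrap_period_ms()
    print(f"{period / MS_PER_DAY:.4f} days")  # 49.7103 days
    print(format_powered_up(period))          # 49d+17:02:47.296
```

Under this assumption the wrap lands at 49 days, 17 hours, 2 minutes and about 47.3 seconds of continuous power-on time.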

We have also been told by Western Digital that a short-term workaround is to power cycle any systems with the VelociRaptor drives every 30-45 days to keep the internal firmware time keeper from wrapping at 50 days. A simple reset or restart doesn’t work. The system must be completely powered off to reset the internal firmware time keeper. Western Digital is also currently working on a fix and should have something soon to resolve the issue. We have been reassured by Western Digital that this issue only exists on the VelociRaptor and not on their other products, including their RAID Edition product.

Background information about TLER

TLER, or Time Limited Error Recovery, was created to help SATA and IDE drives work with RAID controllers when drive errors occur. Normally, SATA and IDE drives will run extensive error recovery procedures to try to recover data when there is an error. Sometimes these procedures will take multiple seconds per sector of data (512 bytes). If the errors affect an area larger than a single sector, these recovery procedures may take 10, 20, 30 seconds or longer. Most RAID controllers will not give a drive that much time to recover from an error. The RAID controller will typically drop or error a drive and put the RAID volume into recovery, using parity information from another drive (RAID5 or RAID6 configurations) or mirror drive information (RAID1) to recover. Windows in a non-RAID configuration will also error if the drive takes too much time trying to recover. This could cause a Blue Screen of Death or other issues. TLER helps manage error recovery times to allow the host or RAID controller to find alternate ways of dealing with drive errors. TLER has helped SATA and IDE drives compete with SCSI and other enterprise class drives in enterprise applications. Western Digital recommends using TLER for any RAID configuration, and it can also be used in normal desktop applications. A utility is also available from Western Digital to enable/disable or change the TLER settings.
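
The interaction described above can be sketched as a toy decision model. The 7-second TLER limit comes from the write-up; the 8-second controller timeout and all function names are hypothetical, chosen only to illustrate the idea, and real controllers differ.

```python
from typing import Optional

TLER_LIMIT_S = 7.0          # from the write-up: 7 seconds for Read and Write
CONTROLLER_TIMEOUT_S = 8.0  # hypothetical value, for illustration only


def drive_response_time(recovery_needed_s: float,
                        tler_limit_s: Optional[float]) -> float:
    """Seconds before the drive answers; None means TLER is disabled."""
    if tler_limit_s is None:
        return recovery_needed_s
    return min(recovery_needed_s, tler_limit_s)


def controller_outcome(recovery_needed_s: float,
                       tler_limit_s: Optional[float],
                       controller_timeout_s: float = CONTROLLER_TIMEOUT_S) -> str:
    """Classify what the RAID controller sees for one troubled command."""
    response = drive_response_time(recovery_needed_s, tler_limit_s)
    if response > controller_timeout_s:
        # The drive stayed silent too long, so the controller gives up on it.
        return "drive dropped from array"
    if recovery_needed_s <= response:
        # The drive finished its own recovery before any limit was reached.
        return "data recovered by drive"
    # TLER cut recovery short; the drive reports the error in time.
    return "error reported, controller uses parity or mirror"


if __name__ == "__main__":
    print(controller_outcome(30.0, TLER_LIMIT_S))  # error reported in time
    print(controller_outcome(30.0, None))          # drive dropped from array
    print(controller_outcome(2.0, TLER_LIMIT_S))   # data recovered by drive
```

The point of the sketch is only that a bounded recovery time lets the controller repair one bad sector from redundancy instead of losing the whole drive.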



#52 User is offline   jpiszcz 

  • Member
  • Group: Member
  • Posts: 578
  • Joined: 15-January 06

Posted 19 June 2009 - 08:26 AM

bamboe32, on Jun 19 2009, 09:16 AM, said:

My first drive crashed after being used for 52 days; Linux mounted it as RO after the crash.
After fsck repaired many files, many files were still corrupted, so a fresh install was needed.
The smartctl long and short tests did not show any errors.
This drive was replaced by WD without any problems. (good service I thought :D )

I have no NCQ enabled in my BIOS.


Error 61 occurred at disk power-on lifetime: 1262 hours (52 days + 14 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 fa 3a 98 ea Error: UNC 8 sectors at LBA = 0x0a983afa = 177748730

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 fa 3a 98 0a 08 49d+17:02:47.225 READ DMA
27 00 00 00 00 00 00 08 49d+17:02:47.212 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 08 49d+17:02:47.202 IDENTIFY DEVICE
ef 03 46 00 00 00 00 08 49d+17:02:47.167 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 08 49d+17:02:47.163 READ NATIVE MAX ADDRESS EXT


The second drive crashed again after 50 days.
Here too, smartctl and WD's own test did not show any errors.
fsck repaired many files, and the system was still able to run.
This time WD would not replace the disk, as their Data Lifeguard Diagnostics did not show any errors.
I asked for a firmware update, but was told that there is none for a WD1500HLFS
(normal response time from WD, but no new firmware, not so good service from WD :( )
I decided to restore to a different system and kept this drive for testing.

Error 96 occurred at disk power-on lifetime: 1215 hours (50 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 0d 1e cf ef Error: UNC 8 sectors at LBA = 0x0fcf1e0d = 265231885

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 0d 1e cf 0f 08 49d+17:02:47.284 READ DMA
27 00 00 00 00 00 00 08 49d+17:02:47.284 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 08 49d+17:02:47.275 IDENTIFY DEVICE
ef 03 46 00 00 00 00 08 49d+17:02:47.268 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 08 49d+17:02:47.268 READ NATIVE MAX ADDRESS EXT



This same drive crashed again 57 days later (had to power cycle the system to connect a floppy drive for the WD test)
Did put some load on it to simulate normal server load.
Error 111 occurred at disk power-on lifetime: 2583 hours (107 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 08 95 f9 89 e3 Error: IDNF at LBA = 0x0389f995 = 59373973

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 08 95 f9 89 03 08 49d+17:02:44.190 WRITE DMA
ca 00 08 1d f8 89 03 08 49d+17:02:44.190 WRITE DMA
ca 00 30 25 4c 87 03 08 49d+17:02:44.190 WRITE DMA
27 00 00 00 00 00 00 08 49d+17:02:44.190 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 08 49d+17:02:44.182 IDENTIFY DEVICE



This happened 8 days ago. I entered a ticket at WD right away and asked for a response 3 days ago, but still no answer.
(Very bad service from WD :angry: )

As you can see all my drives crashed after being powered on for 49 Days 17 hours 2 minutes and a few seconds.
Searching on the internet shows many more people with this 49.710 days firmware problem.
And WD still does not know about this ????

http://gpi-storage.blogspot.com/2009/01/ti...in-western.html had a good story about this problem, but the page has been removed. It was still available in Google cache, but now even that has been removed. Luckily I made a copy of it and will post it in the next message.

Rob.
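
For anyone who wants to check the numbers in the logs above: a small sketch that converts the Powered_Up_Time of the last command in each of the three error logs to milliseconds and compares it against 2^32 ms. The 32-bit millisecond counter is an assumption (nothing from WD in this thread gives the counter width), and the helper names are made up.

```python
import re

# Assumption: the firmware time keeper is a 32-bit millisecond counter.
WRAP_MS = 2 ** 32

# Matches smartctl's Powered_Up_Time format, e.g. "49d+17:02:47.225".
STAMP = re.compile(r"(\d+)d\+(\d{2}):(\d{2}):(\d{2})\.(\d{3})")

# Timestamp of the last command before each error, copied from the logs above.
LAST_COMMANDS = {
    "Error 61": "49d+17:02:47.225",
    "Error 96": "49d+17:02:47.284",
    "Error 111": "49d+17:02:44.190",
}


def stamp_to_ms(stamp: str) -> int:
    """Convert a 'DDd+HH:MM:SS.mmm' string to milliseconds."""
    match = STAMP.fullmatch(stamp)
    if match is None:
        raise ValueError(f"unrecognised Powered_Up_Time: {stamp!r}")
    days, hours, minutes, seconds, millis = map(int, match.groups())
    total_minutes = (days * 24 + hours) * 60 + minutes
    return total_minutes * 60_000 + seconds * 1000 + millis


def ms_before_wrap(stamp: str) -> int:
    """Milliseconds between the logged command and the assumed wrap point."""
    return WRAP_MS - stamp_to_ms(stamp)


if __name__ == "__main__":
    for name, stamp in LAST_COMMANDS.items():
        print(f"{name}: {stamp} is {ms_before_wrap(stamp)} ms before 2^32 ms")
```

All three failing commands were issued within about 3 seconds of the assumed wrap point, two of them within 100 ms.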



I await the post!!

#53 User is offline   geshel 

  • Member
  • Group: Member
  • Posts: 210
  • Joined: 31-December 08

Posted 19 June 2009 - 04:21 PM

Your link to the blog w/ that information doesn't work. (oops, just read your comment about that).

This post has been edited by geshel: 19 June 2009 - 04:22 PM


#54 User is offline   Michal Soltys 

  • Member
  • Group: Member
  • Posts: 29
  • Joined: 22-June 07

Posted 23 June 2009 - 07:07 AM

Nice. Enterprise drives my ass..., this should be posted on the front page of SR. Although at least it doesn't turn your disk into a brick (like with the recent Seagate firmware problems).

Justin - you've had problems with that (particular thing) since the middle of last year iirc? (recalling lengthy threads from related Linux mailing lists).

#55 User is offline   jpiszcz 

  • Member
  • Group: Member
  • Posts: 578
  • Joined: 15-January 06

Posted 23 June 2009 - 08:16 AM

Michal Soltys, on Jun 23 2009, 08:07 AM, said:

Nice. Enterprise drives my ass..., this should be posted on the front page of SR. Although at least it doesn't turn your disk into a brick (like with the recent Seagate firmware problems).

Justin - you've had problems with that (particular thing) since the middle of last year iirc? (recalling lengthy threads from related Linux mailing lists).


Yes, even to the point of buying a 3ware controller for the drives, and they locked up on that too. That was after I had replaced pretty much all of the components in the machine. Definitely a drive/firmware issue.

#56 User is offline   continuum 

  • Mod
  • Group: Mod
  • Posts: 3,271
  • Joined: 31-December 01

Posted 23 June 2009 - 12:14 PM

Quote

We have also been told by Western Digital that a short-term workaround is to power cycle any systems with the VelociRaptor drives every 30-45 days to keep the internal firmware time keeper from wrapping at 50 days
So Seagate's not the only one screwing up with internal counters in their drives. Lovely.

(at least Seagate can keep their healthy disks from dropping from arrays!)

#57 User is offline   Michal Soltys 

  • Member
  • Group: Member
  • Posts: 29
  • Joined: 22-June 07

Posted 25 June 2009 - 12:57 PM

continuum, on Jun 23 2009, 07:14 PM, said:

(at least Seagate can keep their healthy disks from dropping from arrays!)


Speaking of which - weren't there two (or three, depending on perspective) separate firmware issues recently with Seagate?

http://techreport.co...ussions.x/15954 (timeouts)
http://techreport.co...ussions.x/16232 (bricking)
http://stx.lithium.com/stx/board/message?b...ding&page=1 (bricking after fw update to correct bricking :P)

after quick googling

#58 User is offline   Cirrus Telecom 

  • Member
  • Group: Member
  • Posts: 3
  • Joined: 10-September 09

Posted 10 September 2009 - 03:32 PM

Has anyone else been able to resolve this issue? We have 80 of these drives and have been going round and round with the Blade manufacturer and chipset manufacturer.

Configuration is 40 Supermicro Blades w/ 2 WD3000BLFS each in RAID1. Blades use the LSI 1068E embedded chipset.

About every 49 days and 16 hours we'll have about half of our blades go into a degraded status or blue screen. The RAID manager will just say that the drive "failed" with no other information. We have to pop out one of the drives and reseat and the RAID will rebuild. So far, no data loss, anywhere.

I am on the phone with WD now in Level 2 tech support. I tried resolving this with them in e-mail. Their e-mail tech support is extra extra bad. It's extremely bad.

I will update when I get more information.

#59 User is offline   Cirrus Telecom 

  • Member
  • Group: Member
  • Posts: 3
  • Joined: 10-September 09

Posted 10 September 2009 - 03:45 PM

WD says they are sending me a utility that sets the TLER to unlimited.

#60 User is offline   Cirrus Telecom 

  • Member
  • Group: Member
  • Posts: 3
  • Joined: 10-September 09

Posted 10 September 2009 - 06:43 PM

The "utility" is actually new firmware.

Current part numbers:

WD3000BLFS-01YBU0
WD1500BLFS-01YBU0
WD740BLFS-01YBU0
WD3000HLFS-01G6U0
WD1500HLFS-01G6U0
WD740HLFS-01G6U0

The new firmware version number is 04.04v02.

New part numbers (with this firmware version):
WD3000BLFS-01YBU1
WD1500BLFS-01YBU1
WD740BLFS-01YBU1
WD3000HLFS-01G6U1
WD1500HLFS-01G6U1
WD740HLFS-01G6U1


Dear Valued WD Customer:

As a result of product evaluation consistent with Western Digital's quality
systems and our commitment to provide the highest quality products, an
update is being made to the WD VelociRaptor product family.

Description of Change:

Performance enhancement: Sequential read and write
Compatibility enhancements: Double status FIS, TLER timer and counter synchronization

These changes do not affect the form or fit of the drive but do positively
affect the function of the drive.

Details of Firmware Changes:

Performance improvements to sequential read and write:
Firmware release includes a function to allow the drive to stay in the
sequential mode if there was no other activity. This particularly benefits
certain SATA RAID environments.

Double status FIS:

The drive sent two status FIS (C001h and 5001h) after COMRESET.
Although the drive behavior did not violate the SATA specification it still
created issues for some SATA host bus adapters. This firmware will only
post 5001h when the drive is ready.

TLER:

Corrected an issue where after 49 days of continuous operation the drive
could falsely report an error to the Host. This issue does not result in data
loss and only occurs once every 49 days of continuous operation and only
if a read/write operation is in progress when the counter rolls to zero. The
new firmware resolves this issue but as an alternate workaround, the drive
can be power cycled before 49 days elapses to avoid the potential error.

Field Update Utility/Binary available July 24, 2009.
Manufacturing Implementation Date: August 21, 2009
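
For drives still on the old firmware, the power-cycle workaround from the notice can be tracked with a small reminder script. This is only a sketch: the 49-day limit is from the notice, the 4-day safety margin is a hypothetical choice (it gives a 45-day interval, the top of the 30-45 day window quoted earlier in the thread), and the power-on timestamp has to be recorded by hand, since a reboot or reset does not restart the drive's counter.

```python
from datetime import datetime, timedelta

# From the WD notice: power cycle before 49 days of continuous operation.
WD_LIMIT = timedelta(days=49)
# Hypothetical safety margin, chosen for illustration only.
SAFETY_MARGIN = timedelta(days=4)


def power_cycle_deadline(powered_on_at: datetime,
                         margin: timedelta = SAFETY_MARGIN) -> datetime:
    """Latest time by which the system should be fully powered off and on."""
    return powered_on_at + WD_LIMIT - margin


def needs_power_cycle(powered_on_at: datetime, now: datetime,
                      margin: timedelta = SAFETY_MARGIN) -> bool:
    """True once the deadline for the next full power cycle has passed."""
    return now >= power_cycle_deadline(powered_on_at, margin)


if __name__ == "__main__":
    # powered_on_at must be the last full power-on, not the last reboot.
    powered_on_at = datetime(2009, 7, 1)
    print(power_cycle_deadline(powered_on_at))
    print(needs_power_cycle(powered_on_at, datetime(2009, 8, 20)))
```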
