bamboe32
Member
  1. Here is a copy from http://gpi-storage.blogspot.com/2009/01/ti...in-western.html

Ticking time bomb error in Western Digital's 2.5″ 10K RPM SATA VelociRaptor product

In our testing of Western Digital's latest 10K RPM SATA product, the 2.5 inch VelociRaptor, we have discovered a very serious issue that everyone needs to be aware of. We found that after running almost 50 days continuously, the drive will throw Time Limited Error Recovery (TLER) errors. These errors will cause a RAID volume to fail, possibly causing a loss of data. We have been able to confirm this issue with Western Digital's support. The issue occurs when TLER is enabled (it is normally enabled on all Western Digital RAID Edition products and is also enabled on the VelociRaptor, with a 7-second limit for read and write commands). See below for more information about TLER.

Details of the failure

When the continuous power-on time hits 49.7 days, an internal time keeper in the drive's firmware wraps. When this time keeper wraps, any active read, write, and flush commands will prematurely hit the TLER timeout. Because this time keeper wraps at the same moment on all VelociRaptors installed in a system together (all powered on at the same time), any RAID volumes on these drives will fail. In our testing, our system RAID volume failed and we were not able to recover it, because all of the VelociRaptors failed together. Data is not lost, but the RAID controller will think that all of the drives failed because of the incorrect TLER timeout. All RAID configurations, including RAID 5 and RAID 6, will fail, because multiple drives fail at once. If a system with VelociRaptor drives stays powered on, it will fail again every 49.7 days.

We have also been told by Western Digital that a short-term workaround is to power cycle any system with VelociRaptor drives every 30-45 days, to keep the internal firmware time keeper from wrapping at 50 days. A simple reset or restart does not work; the system must be completely powered off to reset the internal firmware time keeper. Western Digital is currently working on a fix and should have something soon to resolve the issue. We have been reassured by Western Digital that this issue exists only on the VelociRaptor and not on their other products, including the RAID Edition line.

Background information about TLER

TLER, or Time Limited Error Recovery, was created to help SATA and IDE drives work with RAID controllers when drive errors occur. Normally, SATA and IDE drives run extensive error recovery procedures to try to recover data when an error occurs. These procedures can take multiple seconds per sector of data (512 bytes), and if the error affects an area larger than a single sector, they may take 10, 20, 30 seconds or longer. Most RAID controllers will not give a drive that much time to recover from an error: the controller will typically drop the drive and put the RAID volume into recovery, rebuilding the data from parity information on the other drives (RAID 5 or RAID 6) or from the mirror drive (RAID 1). Windows in a non-RAID configuration can also error out if a drive takes too long to recover, which can cause a Blue Screen of Death or other issues. TLER limits the drive's error recovery time so that the host or RAID controller can find alternate ways of dealing with a drive error. TLER has helped SATA and IDE drives compete with SCSI and other enterprise-class drives in enterprise applications.
Western Digital recommends using TLER for any RAID configuration, and it can also be used in normal desktop applications. A utility is also available from Western Digital to enable, disable, or change the TLER settings.
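For what it's worth, on drives that expose TLER through SCT Error Recovery Control, newer versions of smartmontools can read and change the same setting from Linux, without needing WD's utility. A minimal sketch, assuming a smartctl build with SCT ERC support and the drive at /dev/sda (both assumptions, check your own setup):

    # Read the current TLER / SCT ERC limits (reported in deciseconds)
    smartctl -l scterc /dev/sda

    # Set the read and write recovery limits to 7.0 seconds (70 deciseconds),
    # the factory TLER values quoted above for the VelociRaptor
    smartctl -l scterc,70,70 /dev/sda

    # Disable TLER entirely (let the drive retry as long as it wants)
    smartctl -l scterc,0,0 /dev/sda

On many drives the setting does not survive a power cycle, so it would have to be reapplied at every boot.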
  2. My first drive crashed after being used for 52 days; Linux remounted it read-only after the crash. After fsck repaired many files, many others were still corrupted, so a fresh install was needed. The smartctl long and short self-tests did not show any errors. This drive was replaced by WD without any problems (good service, I thought). I do not have NCQ enabled in my BIOS. From the smartctl error log:

    Error 61 occurred at disk power-on lifetime: 1262 hours (52 days + 14 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    40 51 08 fa 3a 98 ea  Error: UNC 8 sectors at LBA = 0x0a983afa = 177748730

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
    -- -- -- -- -- -- -- --  ----------------  --------------------
    c8 00 08 fa 3a 98 0a 08  49d+17:02:47.225  READ DMA
    27 00 00 00 00 00 00 08  49d+17:02:47.212  READ NATIVE MAX ADDRESS EXT
    ec 00 00 00 00 00 00 08  49d+17:02:47.202  IDENTIFY DEVICE
    ef 03 46 00 00 00 00 08  49d+17:02:47.167  SET FEATURES [set transfer mode]
    27 00 00 00 00 00 00 08  49d+17:02:47.163  READ NATIVE MAX ADDRESS EXT

The second drive crashed after 50 days. Here too, smartctl and WD's own test did not show any errors. fsck repaired many files, and the system was still able to run. This time WD would not replace the disk, because their Data Lifeguard Diagnostics did not show any errors. I asked for a firmware update, but was told that there is none for a WD1500HLFS (normal response time from WD, but no new firmware; not such good service from WD). I decided to restore to a different system and kept this drive for testing.

    Error 96 occurred at disk power-on lifetime: 1215 hours (50 days + 15 hours)
    When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    40 51 08 0d 1e cf ef  Error: UNC 8 sectors at LBA = 0x0fcf1e0d = 265231885

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
    -- -- -- -- -- -- -- --  ----------------  --------------------
    c8 00 08 0d 1e cf 0f 08  49d+17:02:47.284  READ DMA
    27 00 00 00 00 00 00 08  49d+17:02:47.284  READ NATIVE MAX ADDRESS EXT
    ec 00 00 00 00 00 00 08  49d+17:02:47.275  IDENTIFY DEVICE
    ef 03 46 00 00 00 00 08  49d+17:02:47.268  SET FEATURES [set transfer mode]
    27 00 00 00 00 00 00 08  49d+17:02:47.268  READ NATIVE MAX ADDRESS EXT

This same drive crashed again 57 days later (I had to power cycle the system to connect a floppy drive for the WD test). I put some load on it to simulate normal server load.

    Error 111 occurred at disk power-on lifetime: 2583 hours (107 days + 15 hours)
    When the command that caused the error occurred, the device was active or idle.
    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    10 51 08 95 f9 89 e3  Error: IDNF at LBA = 0x0389f995 = 59373973

    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
    -- -- -- -- -- -- -- --  ----------------  --------------------
    ca 00 08 95 f9 89 03 08  49d+17:02:44.190  WRITE DMA
    ca 00 08 1d f8 89 03 08  49d+17:02:44.190  WRITE DMA
    ca 00 30 25 4c 87 03 08  49d+17:02:44.190  WRITE DMA
    27 00 00 00 00 00 00 08  49d+17:02:44.190  READ NATIVE MAX ADDRESS EXT
    ec 00 00 00 00 00 00 08  49d+17:02:44.182  IDENTIFY DEVICE

This happened 8 days ago. I entered a ticket at WD right away and asked for a response 3 days ago, but there is still no answer (very bad service from WD). As you can see, all my drives crashed after being powered on for 49 days, 17 hours, 2 minutes and a few seconds. Searching the internet shows many more people with this 49.710-day firmware problem. And WD still does not know about this???? http://gpi-storage.blogspot.com/2009/01/ti...in-western.html had a good story about this problem, but the page has been removed. It was still available in the Google cache, but now even that has been removed. Luckily I made a copy of it, and will post it in the next message. Rob.
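For what it's worth, that wrap point lines up exactly with a 32-bit millisecond counter overflowing: 2^32 ms is 49 days, 17 hours, 2 minutes and about 47.3 seconds, which matches the Powered_Up_Time values in the error logs above. A quick check with plain shell arithmetic (the 32-bit millisecond counter is my assumption; WD has not confirmed what the internal time keeper actually is):

    # 2^32 milliseconds, expressed as days+hh:mm:ss
    ms=$((2**32))        # 4294967296 ms
    s=$((ms / 1000))     # 4294967 whole seconds (plus 296 ms)
    printf '%dd+%02d:%02d:%02d\n' $((s/86400)) $((s%86400/3600)) $((s%3600/60)) $((s%60))
    # prints: 49d+17:02:47  -- the same Powered_Up_Time at which every drive failed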