Velociraptor premature failure rate (bad drives, premature to market?)


116 replies to this topic

#1 jpiszcz

jpiszcz

  • Member
  • 578 posts

Posted 09 November 2008 - 05:11 PM

The VR300GB reviews on NewEgg:
http://www.newegg.co...N82E16822136260

Sort by lowest rating. I find those reviews very accurate: out of 12 drives that I purchased all at the same time (bad idea), I have had numerous failures, and in almost every instance the cause is bad sectors. I have a similar configuration with Raptor150s, and it has worked fine, with only 1 or 2 failures in the past 2-4 years.

For the Velociraptor 300s, I have had a total of:

5 failures from the original disks
1 failure from the RMA I received from WD
6 total failures out of 12 drives

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   10%	  3059		 586068047
# 2  Short offline	   Completed: read failure	   60%	  3058		 586068047
# 3  Conveyance offline  Completed without error	   00%	  3053		 -
# 4  Short offline	   Completed without error	   00%	  3053		 -
# 5  Extended offline	Completed without error	   00%	  3053		 -
# 6  Selective offline   Completed without error	   00%	  3052		 -
# 7  Extended offline	Completed: read failure	   10%	  3049		 586068047
# 8  Short offline	   Completed without error	   00%	  3024		 -
# 9  Short offline	   Completed without error	   00%	  3001		 -
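
For anyone who wants to pull the failing entries out of a self-test log like the one above, here is a minimal Python sketch. It assumes the column layout shown in this post and that every test description is two words (as in "Short offline"). `parse_self_test_log` and `failed_tests` are made-up helper names, not part of smartmontools.

```python
import re
from typing import List, NamedTuple, Optional


class SelfTest(NamedTuple):
    num: int
    test: str
    status: str
    remaining_pct: int
    lifetime_hours: int
    first_error_lba: Optional[int]


# One row of the self-test table: "# N  <test + status>  NN%  <hours>  <LBA or ->"
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<body>.+?)\s+(?P<rem>\d+)%"
    r"\s+(?P<hours>\d+)\s+(?P<lba>\d+|-)\s*$"
)


def parse_self_test_log(text: str) -> List[SelfTest]:
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # header lines and anything else are skipped
        words = m.group("body").split()
        # Assumption: the test description is exactly the first two words.
        test, status = " ".join(words[:2]), " ".join(words[2:])
        lba = m.group("lba")
        rows.append(SelfTest(int(m.group("num")), test, status,
                             int(m.group("rem")), int(m.group("hours")),
                             None if lba == "-" else int(lba)))
    return rows


def failed_tests(rows: List[SelfTest]) -> List[SelfTest]:
    # A row counts as failed when its status text mentions a failure.
    return [r for r in rows if "failure" in r.status]


# Three rows copied from the log above.
SAMPLE = """\
# 1  Extended offline    Completed: read failure       10%      3059         586068047
# 3  Conveyance offline  Completed without error       00%      3053         -
# 7  Extended offline    Completed: read failure       10%      3049         586068047
"""

for row in failed_tests(parse_self_test_log(SAMPLE)):
    print(row.lifetime_hours, row.test, row.first_error_lba)
```

Feeding it the full text of `smartctl -a` should work too, since lines that do not look like table rows are ignored.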

Another one:
[140278.271138] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[140278.271148] ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[140278.271149]		  res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[140278.271154] ata3.00: status: { DRDY }
[140278.271160] ata3: hard resetting link
[140278.576071] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[140278.601864] ata3.00: configured for UDMA/133
[140278.601871] end_request: I/O error, dev sdc, sector 586067067
                                                        ^^^^^^^^^
                                                        Bad sector above.

[140278.601876] md: super_written gets error=-5, uptodate=0
[140278.601880] raid1: Disk failure on sdc3, disabling device.
[140278.601881] raid1: Operation continuing on 1 devices.
[140278.601908] ata3: EH complete
[140278.612277] sd 2:0:0:0: [sdc] 586072368 512-byte hardware sectors (300069 MB)
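
Rather than marking bad sectors by hand, the kernel messages can be scanned automatically. The sketch below is only an illustration: it assumes the exact message wording seen in this log ("end_request: I/O error, dev ..., sector ..." and "raid1: Disk failure on ..."), which may differ on other kernel versions, and the function names are invented for this example.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Patterns taken from the message wording in the log above.
IO_ERROR = re.compile(r"end_request: I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)")
MD_FAIL = re.compile(r"raid1: Disk failure on (?P<member>\w+), disabling device")


def io_errors_by_device(log_text: str) -> Dict[str, List[int]]:
    """Map each device name to the list of sectors reported with I/O errors."""
    errors = defaultdict(list)
    for line in log_text.splitlines():
        m = IO_ERROR.search(line)
        if m:
            errors[m.group("dev")].append(int(m.group("sector")))
    return dict(errors)


def failed_md_members(log_text: str) -> List[str]:
    """List the RAID1 member partitions that md kicked out of an array."""
    return [m.group("member") for m in MD_FAIL.finditer(log_text)]


# Lines copied from the log above.
SAMPLE_LOG = """\
[140278.601871] end_request: I/O error, dev sdc, sector 586067067
[140278.601876] md: super_written gets error=-5, uptodate=0
[140278.601880] raid1: Disk failure on sdc3, disabling device.
"""

print(io_errors_by_device(SAMPLE_LOG))
print(failed_md_members(SAMPLE_LOG))
```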

A week later:

	1 mdadm monitor  6:00pm	 (1K) Fail event on /dev/md2

[ .. ]

[35216.712178] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[35216.722031] ata3.00: configured for UDMA/133
[35216.722040] end_request: I/O error, dev sdc, sector 310170509
                                                       ^^^^^^^^^
                                                       Bad sector above.
[35216.722048] raid1: Disk failure on sdc3, disabling device.
[35216.722049] raid1: Operation continuing on 1 devices.
[35216.722072] ata3: EH complete
[35216.722509] sd 2:0:0:0: [sdc] 586072368 512-byte hardware sectors (300069 MB)
[35216.729432] md: recovery of RAID array md2

A day after that:

Fail event on /dev/md2:p34.internal.lan


Oct 26 09:00:33 p34 kernel: [89217.234246] end_request: I/O error, dev sdc, sector 309961068
                                                                                   ^^^^^^^^^
                                                                                   Bad sector above.
Oct 26 09:00:33 p34 kernel: [89217.234259] raid1: Disk failure on sdc3, disabling device.
Oct 26 09:00:33 p34 kernel: [89217.234260] raid1: Operation continuing on 1 devices.
Oct 26 09:00:33 p34 kernel: [89217.234276] ata3: EH complete
Oct 26 09:00:33 p34 kernel: [89217.244372] sd 2:0:0:0: [sdc] 586072368 512-byte hardware sectors (300069 MB)

I am really getting tired of RMA'ing disks to WD. I even opened a case with them in June/July and asked about premature failure rates; they said they had done testing and there was no such problem. However, given all of the comments on NewEgg and all my failures in only 5-6 months, I am not sure I would recommend Velociraptors to anyone.

Comments? Does anyone else have a stack of VR300s? What has been your experience with them?

Justin.

#2 continuum

continuum

  • Mod
  • 3,540 posts

Posted 09 November 2008 - 07:40 PM

You bought them from Newegg? OEM I assume? :scared:

Were they all packed properly? Newegg, Mwave, Zipzoomfly, Allstarshop, just about every major retailer out there is notorious for improperly handling and packing OEM packaged products. I would strongly suspect that the way most retailers out there pack things is causing a significant jump in failure rates.


Hint: most retailers put a single layer, maybe two layers, three if you're lucky, of large-bubble bubble wrap around a drive. The wrapped drive then goes into the bottom of the box and is topped off with foam peanuts. This is entirely INadequate protection, as it results in less than 2" of bubble wrap in each dimension. Plus, OEM-packed products are not actually wrapped in bubble wrap until they hit the packaging stage, which means they go through the entire warehouse in just the ESD bag/clamshell... eek.


I don't have a huge database of Velociraptors here, but we have a few and haven't noticed anything out of the ordinary. That said, my sample of Velociraptors is very small by my standards, and it's somewhat affected by other peculiarities of the products we design/build.

#3 jpiszcz

jpiszcz

  • Member
  • 578 posts

Posted 10 November 2008 - 04:05 AM

You bought them from Newegg? OEM I assume? :scared:

Were they all packed properly? Newegg, Mwave, Zipzoomfly, Allstarshop, just about every major retailer out there is notorious for improperly handling and packing OEM packaged products. I would strongly suspect that the way most retailers out there pack things is causing a significant jump in failure rates.


OEM, yes. When I ordered, I placed 12 separate orders of 1 drive each, so each drive was shipped in its own box, and no drive was DOA (at least to begin with).

Justin.

#4 continuum

continuum

  • Mod
  • 3,540 posts

Posted 10 November 2008 - 04:49 PM

Still, if you did it all at the same time, then they're all from the same batch, and if they were all from a single vendor, they were all subject to that vendor's handling (or lack thereof). I would strongly suspect a mishandled batch, poor handling by the vendor (yes, 12 separate drives being tossed individually into a box before packing is just as likely as 12 drives in a single order each being tossed separately), etc.

Mishandling does not always cause DOA. And each drive being shipped individually means each drive can still be improperly packed...

#5 troubleshooter

troubleshooter

  • Member
  • 2 posts

Posted 24 November 2008 - 10:38 AM

I'm curious: how many hours of runtime (Power_On_Hours), as reported by SMART (smartctl -a), were on the drives when they failed?

#6 jpiszcz

jpiszcz

  • Member
  • 578 posts

Posted 24 November 2008 - 10:44 AM

I'm curious: how many hours of runtime (Power_On_Hours), as reported by SMART (smartctl -a), were on the drives when they failed?


The most recent one is below, at 899 hours. It's cute, too: notice how there are no bad sectors in the SMART statistics, but scroll all the way down and you will see what happens if you try to use the disk: the filesystem becomes corrupted and there are a ton of disk errors.

This is the 9th drive to fail; it failed last Friday. I submitted an RMA for it, and I am also ordering a 3ware RAID controller to better deal with these problems.

=== START OF INFORMATION SECTION ===
Device Model:	 WDC WD3000HLFS-01G6U0
Serial Number:	[snip]
Firmware Version: 04.04V01
User Capacity:	300,069,052,416 bytes
Device is:		Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:	Mon Nov 24 10:40:44 2008 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
										was completed without error.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever 
										been run.
Total time to complete Offline 
data collection:				 (4800) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine 
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		(  59) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x303f) SCT Status supported.
										SCT Feature Control supported.
										SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   200   199   051	Pre-fail  Always	   -	   0
  3 Spin_Up_Time			0x0003   198   198   021	Pre-fail  Always	   -	   3083
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   22
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   099   099   000	Old_age   Always	   -	   899
 10 Spin_Retry_Count		0x0012   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0012   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   22
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   13
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   22
194 Temperature_Celsius	 0x0022   122   115   000	Old_age   Always	   -	   25
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0012   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   200   200   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0
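
To check the attribute table above programmatically, for example to read Power_On_Hours or to confirm that every bad-sector counter really is zero, something like the following Python sketch could be used. It assumes the ten-column layout printed here (ID#, name, flag, value, worst, thresh, type, updated, when_failed, raw value). `parse_attributes` and `sector_counters` are illustrative names, not smartmontools functions.

```python
from typing import Dict


def parse_attributes(text: str) -> Dict[str, dict]:
    """Parse rows of the 'Vendor Specific SMART Attributes' table by attribute name."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split(None, 9)  # the raw value may itself contain spaces
        if len(parts) == 10 and parts[0].isdigit() and parts[2].startswith("0x"):
            attrs[parts[1]] = {
                "id": int(parts[0]),
                "value": int(parts[3]),
                "worst": int(parts[4]),
                "thresh": int(parts[5]),
                "raw": parts[9].strip(),
            }
    return attrs


def sector_counters(attrs: Dict[str, dict]) -> Dict[str, int]:
    """Raw values of the attributes that normally reflect bad sectors."""
    names = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")
    return {n: int(attrs[n]["raw"].split()[0]) for n in names if n in attrs}


# Four rows copied from the table above.
SAMPLE_TABLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       899
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
"""

attrs = parse_attributes(SAMPLE_TABLE)
print("Power_On_Hours:", attrs["Power_On_Hours"]["raw"])
print("Sector counters:", sector_counters(attrs))
```

On this drive all three counters come back as 0, which is the mismatch being described: the attributes look clean while the error log further down is full of errors.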

SMART Error Log Version: 1
ATA Error Count: 268 (device log contains only the most recent five errors)
		CR = Command Register [HEX]
		FR = Features Register [HEX]
		SC = Sector Count Register [HEX]
		SN = Sector Number Register [HEX]
		CL = Cylinder Low Register [HEX]
		CH = Cylinder High Register [HEX]
		DH = Device/Head Register [HEX]
		DC = Device Command Register [HEX]
		ER = Error register [HEX]
		ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 268 occurred at disk power-on lifetime: 889 hours (37 days + 1 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 34 cf f3 a3
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 00 00 00 00 08  11d+07:49:52.019  FLUSH CACHE EXIT
  ca 00 08 30 6d 2f 0e 08  11d+07:49:51.942  WRITE DMA
  35 00 00 30 69 2f 0e 08  11d+07:49:51.940  WRITE DMA EXT
  35 00 00 30 65 2f 0e 08  11d+07:49:51.938  WRITE DMA EXT
  35 00 00 30 61 2f 0e 08  11d+07:49:51.936  WRITE DMA EXT

Error 267 occurred at disk power-on lifetime: 889 hours (37 days + 1 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 00 28 4b 00 e0  Error: IDNF at LBA = 0x00004b28 = 19240

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 00 28 4b 00 0e 08  11d+07:49:26.705  WRITE DMA EXT
  35 00 00 d0 91 fb 0d 08  11d+07:49:26.395  WRITE DMA EXT
  35 00 00 d0 8d fb 0d 08  11d+07:49:26.393  WRITE DMA EXT
  35 00 00 d0 89 fb 0d 08  11d+07:49:26.391  WRITE DMA EXT
  35 00 00 d0 85 fb 0d 08  11d+07:49:26.389  WRITE DMA EXT

Error 266 occurred at disk power-on lifetime: 889 hours (37 days + 1 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 00 98 17 dc e0  Error: IDNF at LBA = 0x00dc1798 = 14423960

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 00 00 98 17 dc 0d 08  11d+07:49:05.358  WRITE DMA EXT
  ca 00 10 c0 0e 00 00 08  11d+07:49:05.309  WRITE DMA
  35 00 00 f0 d2 d6 0d 08  11d+07:49:05.303  WRITE DMA EXT
  35 00 00 f0 ce d6 0d 08  11d+07:49:05.301  WRITE DMA EXT
  35 00 00 f0 ca d6 0d 08  11d+07:49:05.299  WRITE DMA EXT

Error 265 occurred at disk power-on lifetime: 866 hours (36 days + 2 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 01 ba 48 40  Error: UNC at LBA = 0x0048ba01 = 4766209

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  20 00 01 01 ba 48 00 08  10d+08:31:15.085  READ SECTOR(S)
  20 00 01 00 ba 48 00 08  10d+08:31:08.857  READ SECTOR(S)
  27 00 00 00 00 00 00 08  10d+08:31:08.845  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08  10d+08:31:08.838  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08  10d+08:31:08.838  SET FEATURES [Set transfer mode]

Error 264 occurred at disk power-on lifetime: 866 hours (36 days + 2 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 ff b9 48 40  Error: UNC at LBA = 0x0048b9ff = 4766207

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  20 00 01 ff b9 48 00 08  10d+08:31:01.825  READ SECTOR(S)
  27 00 00 00 00 00 00 08  10d+08:31:01.813  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08  10d+08:31:01.807  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08  10d+08:31:01.807  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 00 08  10d+08:31:01.798  READ NATIVE MAX ADDRESS EXT
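
The "Error: ... at LBA = ..." values in the entries above can be rebuilt from the register dump. The sketch below is a hedged illustration of the usual 28-bit layout (low nibble of DH, then CH, CL, SN) and of three error-register bits (ABRT 0x04, IDNF 0x10, UNC 0x40). It reproduces the numbers printed above, but it is not smartctl's own code, and `decode_error_row` is an invented name.

```python
from typing import List, Tuple

# Error register bits relevant to the entries above.
ERROR_BITS = {0x04: "ABRT", 0x10: "IDNF", 0x40: "UNC"}


def decode_error_row(row: str) -> Tuple[List[str], int]:
    """Decode an 'ER ST SC SN CL CH DH' row into (error flags, 28-bit LBA)."""
    er, st, sc, sn, cl, ch, dh = (int(tok, 16) for tok in row.split()[:7])
    # 28-bit LBA: bits 27..24 from the low nibble of DH, then CH, CL, SN.
    lba28 = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    flags = [name for bit, name in ERROR_BITS.items() if er & bit]
    return flags, lba28


# Register rows copied from errors 266 and 265 above.
print(decode_error_row("10 51 00 98 17 dc e0"))
print(decode_error_row("40 51 01 01 ba 48 40"))
```

Note that this only recovers 28 bits. The failing commands for errors 266 and 267 are WRITE DMA EXT, a 48-bit command, so the true sector may be larger than the value shown in this summary log.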

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	   899		 -
# 2  Extended offline	Completed without error	   00%	   893		 -
# 3  Short offline	   Completed without error	   00%	   892		 -
# 4  Short offline	   Interrupted (host reset)	  90%	   889		 -
# 5  Extended offline	Completed without error	   00%	   871		 -
# 6  Short offline	   Completed without error	   00%	   870		 -
# 7  Extended offline	Interrupted (host reset)	  30%	   866		 -
# 8  Extended offline	Completed without error	   00%	   858		 -
# 9  Short offline	   Completed without error	   00%	   857		 -
#10  Short offline	   Completed without error	   00%	   842		 -
#11  Extended offline	Completed without error	   00%	   823		 -
#12  Short offline	   Completed without error	   00%	   822		 -
#13  Short offline	   Completed without error	   00%	   822		 -
#14  Short offline	   Completed without error	   00%	   818		 -
#15  Short offline	   Completed without error	   00%	   794		 -
#16  Short offline	   Completed without error	   00%	   771		 -
#17  Short offline	   Completed without error	   00%	   747		 -
#18  Short offline	   Completed without error	   00%	   723		 -
#19  Extended offline	Completed without error	   00%	   701		 -
#20  Short offline	   Completed without error	   00%	   676		 -
#21  Short offline	   Completed without error	   00%	   652		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

p34:~#

Nov 24 01:04:11 p34 kernel: [749803.050746] ata1.00: configured for UDMA/133
Nov 24 01:04:11 p34 kernel: [749803.050758] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Nov 24 01:04:11 p34 kernel: [749803.050761] sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Nov 24 01:04:11 p34 kernel: [749803.050765] Descriptor sense data with sense descriptors (in hex):
Nov 24 01:04:11 p34 kernel: [749803.050767]		 72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
Nov 24 01:04:11 p34 kernel: [749803.050774]		 0d dc 17 98 
Nov 24 01:04:11 p34 kernel: [749803.050776] sd 0:0:0:0: [sda] Add. Sense: Recorded entity not found
Nov 24 01:04:11 p34 kernel: [749803.050791] ata1: EH complete
Nov 24 01:04:11 p34 kernel: [749803.052931] sd 0:0:0:0: [sda] 586072368 512-byte hardware sectors (300069 MB)
Nov 24 01:04:11 p34 kernel: [749803.055061] sd 0:0:0:0: [sda] Write Protect is off
Nov 24 01:04:11 p34 kernel: [749803.059299] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 24 01:04:32 p34 kernel: [749824.418104] ata1.00: configured for UDMA/133
Nov 24 01:04:32 p34 kernel: [749824.418115] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Nov 24 01:04:32 p34 kernel: [749824.418119] sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Nov 24 01:04:32 p34 kernel: [749824.418123] Descriptor sense data with sense descriptors (in hex):
Nov 24 01:04:32 p34 kernel: [749824.418124]		 72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
Nov 24 01:04:32 p34 kernel: [749824.418131]		 0e 00 4b 28 
Nov 24 01:04:32 p34 kernel: [749824.418138] sd 0:0:0:0: [sda] Add. Sense: Recorded entity not found
Nov 24 01:04:32 p34 kernel: [749824.418151] ata1: EH complete
Nov 24 01:04:32 p34 kernel: [749824.420284] sd 0:0:0:0: [sda] 586072368 512-byte hardware sectors (300069 MB)
Nov 24 01:04:32 p34 kernel: [749824.422418] sd 0:0:0:0: [sda] Write Protect is off
Nov 24 01:04:32 p34 kernel: [749824.426658] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 24 01:04:58 p34 kernel: [749849.757166] ata1.00: configured for UDMA/133
Nov 24 01:04:58 p34 kernel: [749849.757175] ata1: EH complete
Nov 24 01:04:58 p34 kernel: [749849.757351] sd 0:0:0:0: [sda] 586072368 512-byte hardware sectors (300069 MB)
Nov 24 01:04:58 p34 kernel: [749849.767482] xfs_force_shutdown(sda,0x2) called from line 1056 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff803b67f3
Nov 24 01:04:58 p34 kernel: [749849.767503] sd 0:0:0:0: [sda] Write Protect is off
Nov 24 01:04:58 p34 kernel: [749849.771767] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Nov 24 01:05:03 p34 kernel: [749854.805736] Filesystem "sda": xfs_log_force: error 5 returned.
Nov 24 01:05:07 p34 kernel: [749859.583192] ata1: hard resetting link
Nov 24 01:05:08 p34 kernel: [749859.888177] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov 24 01:05:08 p34 kernel: [749859.910650] ata1.00: configured for UDMA/133
Nov 24 01:05:08 p34 kernel: [749859.910659] ata1: EH complete
Nov 24 01:05:08 p34 kernel: [749859.910733] sd 0:0:0:0: [sda] 586072368 512-byte hardware sectors (300069 MB)
Nov 24 01:05:08 p34 kernel: [749859.910777] sd 0:0:0:0: [sda] Write Protect is off
Nov 24 01:05:08 p34 kernel: [749859.910850] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
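
The hex dumps after "Descriptor sense data" in the log above carry the failing LBA too. As a rough sketch, assuming the standard SCSI descriptor-format sense layout (8-byte header, then an "information" descriptor of type 0x00 with an 8-byte big-endian value), the sector can be extracted like this. The function name is made up for the example.

```python
from typing import Optional, Tuple


def lba_from_descriptor_sense(hex_text: str) -> Tuple[int, Optional[int]]:
    """Return (sense key, information field) from descriptor-format sense bytes."""
    data = bytes.fromhex("".join(hex_text.split()))
    if data[0] & 0x7F not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")
    sense_key = data[1] & 0x0F
    end = min(8 + data[7], len(data))  # byte 7 = additional sense length
    offset = 8
    while offset + 2 <= end:
        desc_type, desc_len = data[offset], data[offset + 1]
        if desc_type == 0x00 and desc_len == 0x0A:
            valid = bool(data[offset + 2] & 0x80)
            info = int.from_bytes(data[offset + 4:offset + 12], "big")
            return sense_key, (info if valid else None)
        offset += 2 + desc_len
    return sense_key, None


# The two sense dumps from the log above, each joined onto one line.
first = "72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00 0d dc 17 98"
second = "72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00 0e 00 4b 28"

for dump in (first, second):
    key, lba = lba_from_descriptor_sense(dump)
    print(f"sense key 0x{key:x}, LBA {lba} (0x{lba:08x})")
```

Sense key 0x0b corresponds to the "Aborted Command" text in the log. The low 24 bits of the two decoded values (0xdc1798 and 0x004b28) match the LBAs that smartctl printed for errors 266 and 267.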

#7 troubleshooter

troubleshooter

  • Member
  • 2 posts

Posted 24 November 2008 - 03:34 PM

Might you still have a record of the Power_On_Hours from the previous eight failed drives?

#8 jpiszcz

jpiszcz

  • Member
  • 578 posts

Posted 24 November 2008 - 03:46 PM

Might you still have a record of the Power_On_Hours from the previous eight failed drives?


Checking..

20080618-raptor300
20080715-raptor300
20080925-raptor300
20081018-raptor300
20081029-raptor300
20081111-raptor300
20081121-raptor300
sdd
sdl

Of those, I have the following info (I don't have it for all of them):

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 50% 270 586070865


SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 10% 3049 586068047

(but began erroring shortly after, even though it says OK)
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 3720 -

(but began erroring shortly after, even though it says OK)
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 818 -


SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Selective offline Interrupted (host reset) 90% 2312 -
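
To put the five LifeTime(hours) figures above side by side, here is a small Python sketch. It treats the hours of each drive's last logged self-test as a stand-in for "hours at failure", which is only an approximation, and `summarize` is a hypothetical helper.

```python
from statistics import median
from typing import Dict, List

# LifeTime(hours) values from the five self-test logs above.
# Assumption: the last logged self-test is close to when each drive failed.
HOURS_AT_FAILURE = [270, 3049, 3720, 818, 2312]


def summarize(hours: List[int]) -> Dict[str, float]:
    """Basic spread of the power-on hours, with the median also in days."""
    mid = median(hours)
    return {
        "count": len(hours),
        "min_h": min(hours),
        "median_h": mid,
        "max_h": max(hours),
        "median_days": mid / 24,
    }


for key, value in summarize(HOURS_AT_FAILURE).items():
    print(f"{key}: {value}")
```

With only five data points this is nowhere near a failure-rate estimate. It just shows that the failures are spread from a few hundred to a few thousand hours rather than clustered at one age.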

#9 jpiszcz

jpiszcz

  • Member
  • 578 posts

Posted 25 November 2008 - 04:22 AM

Here is what it looks like if you try to keep using the drive:

[843251.690450] sd 0:0:0:0: [sda] 586072368 512-byte hardware sectors (300069 MB)
[843251.690609] sd 0:0:0:0: [sda] Write Protect is off
[843251.690611] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[843251.691055] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[843262.667482] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[843262.667488] ata1.00: irq_stat 0x40000001
[843262.667494] ata1.00: cmd c8/00:08:00:00:64/00:00:00:00:00/ef tag 0 dma 4096 in
[843262.667495]		  res 51/40:08:00:00:64/00:00:00:64:00/ef Emask 0x9 (media error)
[843262.667500] ata1.00: status: { DRDY ERR }
[843262.667503] ata1.00: error: { UNC }
[843262.688510] ata1.00: configured for UDMA/133
[843262.688519] ata1: EH complete
[843262.695693] sd 0:0:0:0: [sda] 586072368 512-byte hardware sectors (300069 MB)
[843262.696166] sd 0:0:0:0: [sda] Write Protect is off
[843262.696169] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[843262.696767] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[843289.797086] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[843289.797093] ata1.00: irq_stat 0x40000001
[843289.797099] ata1.00: cmd ca/00:08:b0:15:63/00:00:00:00:00/ef tag 0 dma 4096 out
[843289.797101]		  res 51/10:08:b0:15:63/00:00:00:63:00/ef Emask 0x81 (invalid argument)
[843289.797106] ata1.00: status: { DRDY ERR }
[843289.797109] ata1.00: error: { IDNF }
[843289.816289] ata1.00: configured for UDMA/133
[843289.816299] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
[843289.816304] sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
[843289.816309] Descriptor sense data with sense descriptors (in hex):
[843289.816312]		 72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
[843289.816340]		 0f 63 15 b0 
[843289.816350] sd 0:0:0:0: [sda] Add. Sense: Recorded entity not found
[843289.816358] end_request: I/O error, dev sda, sector 258151856
[843289.816363] Buffer I/O error on device sda, logical block 32268982
[843289.816366] lost page write due to I/O error on sda
[843289.816378] ata1: EH complete
[843289.816566] sd 0:0:0:0: [sda] 586072368 512-byte hardware sectors (300069 MB)
[843289.816735] sd 0:0:0:0: [sda] Write Protect is off
[843289.816740] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[843289.817090] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[843326.116002] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[843326.116008] ata1.00: irq_stat 0x40000001
[843326.116015] ata1.00: cmd 35/00:f0:48:93:70/00:01:11:00:00/e0 tag 0 dma 253952 out
[843326.116016]		  res 51/10:f0:48:93:70/00:01:11:00:00/e0 Emask 0x81 (invalid argument)
[843326.116021] ata1.00: status: { DRDY ERR }
[843326.116024] ata1.00: error: { IDNF }
[843326.136418] ata1.00: configured for UDMA/133
[843326.136435] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
[843326.136439] sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
[843326.136444] Descriptor sense data with sense descriptors (in hex):
[843326.136447]		 72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
[843326.136457]		 11 70 93 48 
[843326.136461] sd 0:0:0:0: [sda] Add. Sense: Recorded entity not found
[843326.136466] end_request: I/O error, dev sda, sector 292590408
[843326.136509] ata1: EH complete
[843326.136687] sd 0:0:0:0: [sda] 586072368 512-byte hardware sectors (300069 MB)
[843326.136856] sd 0:0:0:0: [sda] Write Protect is off
[843326.136860] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[843326.137203] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[843326.149337] Aborting journal on device sda.
[843326.149356] ext3_abort called.
[843326.149358] EXT3-fs error (device sda): ext3_journal_start_sb: Detected aborted journal
[843326.149360] Remounting filesystem read-only

And the most interesting thing of all?
No bad sectors according to SMART, which is clearly wrong.

ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   199   199   051	Pre-fail  Always	   -	   5629
  3 Spin_Up_Time			0x0003   198   198   021	Pre-fail  Always	   -	   3083
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   22
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   099   099   000	Old_age   Always	   -	   916
 10 Spin_Retry_Count		0x0012   100   253   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0012   100   253   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   22
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   13
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   22
194 Temperature_Celsius	 0x0022   120   115   000	Old_age   Always	   -	   27
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0012   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   200   200   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	   915		 -
# 2  Extended offline	Completed without error	   00%	   900		 -
# 3  Short offline	   Completed without error	   00%	   899		 -
# 4  Extended offline	Completed without error	   00%	   893		 -
# 5  Short offline	   Completed without error	   00%	   892		 -
# 6  Short offline	   Interrupted (host reset)	  90%	   889		 -
# 7  Extended offline	Completed without error	   00%	   871		 -
# 8  Short offline	   Completed without error	   00%	   870		 -
# 9  Extended offline	Interrupted (host reset)	  30%	   866		 -
#10  Extended offline	Completed without error	   00%	   858		 -
#11  Short offline	   Completed without error	   00%	   857		 -
#12  Short offline	   Completed without error	   00%	   842		 -
#13  Extended offline	Completed without error	   00%	   823		 -
#14  Short offline	   Completed without error	   00%	   822		 -
#15  Short offline	   Completed without error	   00%	   822		 -
#16  Short offline	   Completed without error	   00%	   818		 -
#17  Short offline	   Completed without error	   00%	   794		 -
#18  Short offline	   Completed without error	   00%	   771		 -
#19  Short offline	   Completed without error	   00%	   747		 -
#20  Short offline	   Completed without error	   00%	   723		 -
#21  Extended offline	Completed without error	   00%	   701		 -



Bad sectors:

[841267.274757] end_request: I/O error, dev sda, sector 112543520
[841278.219322] end_request: I/O error, dev sda, sector 539496464
[841290.327943] end_request: I/O error, dev sda, sector 115180888
[841317.473642] end_request: I/O error, dev sda, sector 116934480
[841343.329247] end_request: I/O error, dev sda, sector 541069328
[841383.309632] end_request: I/O error, dev sda, sector 121927680
[841394.428172] end_request: I/O error, dev sda, sector 119869488
[841417.895584] end_request: I/O error, dev sda, sector 123480240
[841451.875687] end_request: I/O error, dev sda, sector 123850312
[841520.063722] end_request: I/O error, dev sda, sector 130176280
[841543.122986] end_request: I/O error, dev sda, sector 132245520
[841603.324793] end_request: I/O error, dev sda, sector 136860808
[841674.873006] end_request: I/O error, dev sda, sector 140783224
[841685.421472] end_request: I/O error, dev sda, sector 138674176
[841707.431138] end_request: I/O error, dev sda, sector 547360784
[841734.654434] end_request: I/O error, dev sda, sector 143324440
[841796.002163] end_request: I/O error, dev sda, sector 146543032
[841807.870695] end_request: I/O error, dev sda, sector 150736968
[841843.411151] end_request: I/O error, dev sda, sector 151716656
[841872.122914] end_request: I/O error, dev sda, sector 153396096
[841924.013988] end_request: I/O error, dev sda, sector 155310608
[841984.347570] end_request: I/O error, dev sda, sector 159002216
[842039.238961] end_request: I/O error, dev sda, sector 164892744
[842122.793866] end_request: I/O error, dev sda, sector 167878344
[842142.793103] end_request: I/O error, dev sda, sector 168296448
[842153.791869] end_request: I/O error, dev sda, sector 136
[842200.960651] end_request: I/O error, dev sda, sector 171962104
[842245.417221] end_request: I/O error, dev sda, sector 174691496
[842259.974116] end_request: I/O error, dev sda, sector 174585760
[842274.639224] end_request: I/O error, dev sda, sector 177053392
[842286.759675] end_request: I/O error, dev sda, sector 177458672
[842301.022746] end_request: I/O error, dev sda, sector 176623352
[842359.364140] end_request: I/O error, dev sda, sector 181149656
[842448.535608] end_request: I/O error, dev sda, sector 190926336
[842521.499963] end_request: I/O error, dev sda, sector 195301448
[842538.816849] end_request: I/O error, dev sda, sector 196720728
[842554.579943] end_request: I/O error, dev sda, sector 198450448
[842590.576004] end_request: I/O error, dev sda, sector 198221184
[842606.621194] end_request: I/O error, dev sda, sector 198429680
[842640.517090] end_request: I/O error, dev sda, sector 204072304
[842662.076440] end_request: I/O error, dev sda, sector 136
[842673.549161] end_request: I/O error, dev sda, sector 207031696
[842694.826365] end_request: I/O error, dev sda, sector 565973008
[842731.578605] end_request: I/O error, dev sda, sector 211822168
[842756.834250] end_request: I/O error, dev sda, sector 213221352
[842872.917110] end_request: I/O error, dev sda, sector 222983952
[842964.248373] end_request: I/O error, dev sda, sector 231349720
[843007.001137] end_request: I/O error, dev sda, sector 235314848
[843018.653874] end_request: I/O error, dev sda, sector 232941816
[843289.816358] end_request: I/O error, dev sda, sector 258151856
[843326.136466] end_request: I/O error, dev sda, sector 292590408
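
For anyone who would rather summarize a log like the one above than eyeball it, here is a minimal Python sketch. It only assumes the `end_request: I/O error, dev ..., sector ...` line format shown in this post; the function name `summarize_io_errors` and the summary fields are made up for illustration. It reports how many errors each device logged, the range of sectors hit, and which sectors failed more than once.

```python
import re
from collections import Counter

# Matches kernel lines such as:
# [841383.309632] end_request: I/O error, dev sda, sector 121927680
LINE_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+end_request: I/O error, "
    r"dev (?P<dev>\w+), sector (?P<sector>\d+)"
)

def summarize_io_errors(log_text):
    """Return per-device statistics for end_request I/O errors in log_text."""
    sectors = {}
    for match in LINE_RE.finditer(log_text):
        sectors.setdefault(match.group("dev"), []).append(int(match.group("sector")))

    summary = {}
    for dev, secs in sectors.items():
        counts = Counter(secs)
        summary[dev] = {
            "errors": len(secs),
            "distinct_sectors": len(counts),
            "min_sector": min(secs),
            "max_sector": max(secs),
            # Sectors that failed more than once
            "repeated": sorted(s for s, n in counts.items() if n > 1),
        }
    return summary

if __name__ == "__main__":
    # Four lines copied from the log above
    sample = (
        "[842142.793103] end_request: I/O error, dev sda, sector 168296448\n"
        "[842153.791869] end_request: I/O error, dev sda, sector 136\n"
        "[842200.960651] end_request: I/O error, dev sda, sector 171962104\n"
        "[842662.076440] end_request: I/O error, dev sda, sector 136\n"
    )
    print(summarize_io_errors(sample))
```

This is only a reading aid. Whether a wide spread of sectors points at the media, the controller, or something else still has to be established by testing.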

#10 datestardi

datestardi

Posted 25 November 2008 - 07:16 AM

Jpiszcz, something doesn't seem right with the data you're posting, as if something other than the drives in your system (or the test methodology) is causing the errors.

Can you move the drives to a known working system and retest them?

If you test them with Data Lifeguard (you can use a DOS CD if you're not running Windows), what is the result?

http://support.wdc.c...s...612&lang=en

How are the errors manifesting themselves, other than in your diagnostic tests? (If you're not seeing real-world data errors, that would say a lot I think.)

The customer reviews for the Velociraptor at Newegg are outstanding. I've actually never seen higher.

Edited by datestardi, 25 November 2008 - 07:19 AM.

#11 jpiszcz

jpiszcz

    Member

  • Member
  • 578 posts

Posted 25 November 2008 - 07:18 AM


They are real-world errors, and the system is not the issue. Recall that it is the same system I used with Raptor 150s for 2-3 years with no problems. Soon I will have a 3ware RAID card, and I cannot wait to see how it handles these drives. If they are crap on the RAID card as well, I will go with new disks; it's more or less a last-ditch effort.

Newegg? Sort by lowest rating and see how many reports of bad drives, bad sectors, etc. there are: quite a few.


Justin.

#12 datestardi

datestardi

Posted 25 November 2008 - 07:20 AM

They are real world errors, the system is not the issue,

Can you try Data Lifeguard?

http://support.wdc.c...s...612&lang=en

Edited by datestardi, 25 November 2008 - 07:21 AM.

#13 jpiszcz

jpiszcz

    Member

  • Member
  • 578 posts

Posted 25 November 2008 - 08:06 AM

They are real world errors, the system is not the issue,

Can you try Data Lifeguard?

http://support.wdc.c...s...612&lang=en


If I have time I can try to give this a spin before I replace the drive.

Justin.

#14 6_6_6

6_6_6

    Member

  • Member
  • 590 posts

Posted 25 November 2008 - 08:19 AM

No bad sectors according to smart, which is clearly wrong.


How do you know?

You don't seem to have bad sectors but you have other problems.

Probably corrupt firmware/corrupt controller/broken motherboard. Uncorrectables / ID Not Founds / DMA errors, etc... So many of them... clearly something wrong.

If I were you, I would stop wasting time and do as suggested:

Plug this drive into a working system. Drop to DOS or boot from the manufacturer's CD and run their diagnostics program there. If you drop to DOS, try running MHDD. You would have a better idea of whether there are bad sectors or IDNFs/UNCs, etc.

#15 jpiszcz

jpiszcz

    Member

  • Member
  • 578 posts

Posted 25 November 2008 - 08:23 AM


How do I know? On 60-70% of the disks, they started having bad sector errors. On the other 30-40% they report uncorrectable sectors to the OS (when I turned on TLER). All of these disks are being used in either RAID1 or RAID6-based configurations.

It appears that when you enable TLER, the drive reports the error to the OS and does not track it in SMART. For the last few disks, TLER had been enabled, and the error and its reporting were the same.

Without TLER, you see the offline_uncorrectable and pending sector counts creep up, etc. Very interesting!
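
One way to check this kind of observation is to log the relevant SMART counters over time and see whether they move when the OS reports an error. The sketch below is a hedged example, not a finished tool: it assumes smartmontools is installed and that the attribute table has the usual ten-column `smartctl -A` layout. The helper names `parse_smart_attributes` and `read_smart` are invented here, and the sample rows in the comment are illustrative, not taken from these drives.

```python
import subprocess

# Attribute names as printed by smartctl for IDs 5, 197 and 198
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

def parse_smart_attributes(text, names=WATCHED):
    """Extract raw values of the selected attributes from `smartctl -A` text."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in names:
            values[fields[1]] = int(fields[9])
    return values

def read_smart(device):
    """Run `smartctl -A` on a device (needs root) and parse its output."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_smart_attributes(result.stdout)

# Illustrative rows (made up) that parse_smart_attributes would accept:
#   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always   -   0
# 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always   -   3
```

Calling `read_smart("/dev/sda")` from a cron job and appending the result to a file would show whether the pending and uncorrectable counts stay flat with TLER enabled while the OS keeps logging errors.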

#16 Atamido

Atamido

    Member

  • Member
  • 288 posts

Posted 25 November 2008 - 01:38 PM

Still, if you bought them all at the same time, then they're all from the same batch, and if they all came from a single vendor, they were all subject to the vendor's handling (or lack thereof). I would strongly suspect a mishandled batch or poor handling by the vendor.

I remember a company I worked for where a bunch of systems shipped close together experienced a high drive failure rate. Internally they said it was due to mass-in-air-de-pallet-ization, a word they coined to indicate that there was trouble on the plane used to ship the pallet of drives, causing the pallets to fall apart. The idea being that the sudden hard jolts to all of the drives caused a high failure rate.

Officially, the company had no drive problem.

#17 continuum

continuum

    Mod

  • Mod
  • 3,540 posts

Posted 25 November 2008 - 09:06 PM

We've had similar issues with shipping problems, poor packaging, or both...

does not track it in SMART

SMART only predicts about 50% of drive failures if memory serves...

Sorry, I don't have anything else worthwhile to add at the moment. Good luck! Definitely try a few in a different system at least; when you go scream at WD you'll have more evidence to throw at 'em.

#18 6_6_6

6_6_6

    Member

  • Member
  • 590 posts

Posted 26 November 2008 - 12:50 AM

No bad sectors according to smart, which is clearly wrong.


How do you know?

You don't seem to have bad sectors but you have other problems.

Probably corrupt firmware/corrupt controller/broken motherboard. Uncorrectables / ID Not Founds / DMA errors, etc... So many of them... clearly something wrong.

If I were you, I would stop wasting time and do as suggested:

Plug this drive into a working system. Drop to DOS or boot from the manufacturer's CD and run their diagnostics program there. If you drop to DOS, try running MHDD. You would have a better idea of whether there are bad sectors or IDNFs/UNCs, etc.


How do I know? On 60-70% of the disks, they started having bad sector errors. On the other 30-40% they report uncorrectable sectors to the OS (when I turned on TLER). All of these disks are being used in either RAID1 or RAID6-based configurations.

It appears when you enable TLER it reports the error to the OS and does not track it in SMART, for the last few disks, TLER had been enabled and the error and reporting was the same.

Without TLER, you see the offline_uncorrectable and pending sectors creep up etc, very interesting!


Where is it? I don't see pending sector count anywhere. They all show 0 above. I have no idea what TLER is. But if you are having so many drives failing, how come you never ran the manufacturer's utility and MHDD on these drives in a different stable system and how come you are forwarding us everything from the OS? This way, you can at least eliminate all the other unknowns in the equation and narrow it down to the drive. You have lots of IDNFs. I doubt you have something wrong with the media itself... probably firmware corruption, or motherboard/controller/OS/carbon footprint... well... sorry...

Proper troubleshooting for me:

1. Move drive to a different working stable system.

2. Run manufacturer's utility from CD/DOS. Record SMART values.

3. Do a Short DST... Long DST... Zero-fill / Full erase.

4. Run manufacturer's utility again. Check SMART values.

5. Run MHDD from DOS and see how drive surface is sector by sector.
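
Steps 2 and 4 amount to a before/after comparison of the SMART values. As a small sketch of that comparison, assuming the two sets of values have been written down as plain name-to-raw-value dictionaries (the function name `diff_smart` and the numbers in the usage example are hypothetical):

```python
def diff_smart(before, after):
    """Return {attribute: (old, new)} for every attribute whose raw value changed.

    Attributes present in only one snapshot are reported with None on the
    missing side.
    """
    changed = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changed[name] = (old, new)
    return changed

if __name__ == "__main__":
    # Hypothetical values recorded in step 2 and step 4
    before = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 3}
    after = {"Reallocated_Sector_Ct": 3, "Current_Pending_Sector": 0}
    print(diff_smart(before, after))
```

In that made-up example, pending sectors dropping to zero while the reallocated count rises by the same amount would suggest the zero-fill forced the drive to remap them. Unchanged counters alongside continued OS errors would point away from the media.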

#19 jpiszcz

jpiszcz

    Member

  • Member
  • 578 posts

Posted 26 November 2008 - 04:17 AM

This particular disk seems to be OK in another system, BUT I zeroed the disk out before I tested it there; next time I will not zero it out. In addition, the system is old (ICH5) and does not support NCQ or AHCI; however:

WD tools -> Check, Short+Extended = OK
Spinite -> 10 hour test = OK

Currently, I am testing the next drive with its mirrored disk (RAID1) and running lots of I/O-bound processes against the RAID1.

1. I have disabled all SMART-related testing on the disk.
2. I have disabled hddtemp testing on the disk.

I will note one interesting thing: no matter what system the disk was connected to, it kept grinding away continually, even when idle or in the BIOS, almost as if it were stuck on some kind of internal offline test, even though I had disabled all relevant offline tests on the disk.

I want to find the root cause of this problem as much as everyone else. For this specific disk, the test results so far point to a false positive, but it still acts quite weird (the constant accesses, etc.). With the new one I will continue to run disk benchmarks (but no SMART/hddtemp stuff) for a while and see if the problem recurs.

Justin.

#20 datestardi

datestardi

Posted 26 November 2008 - 06:48 AM

Spinite -> 10 hour test = OK


Spinite? Or Spinrite?

Jpiszcz, I wouldn't let Spinrite touch my drives, except as a last resort data recovery before the trash bin - it can do a lot of things that you can't imagine. (I'm not saying Spinrite is bad - I'm saying it can do things to the drive that drive manufacturers don't expect, and so cause drives to behave in ways users might not expect.)

I hope you haven't been using it from the beginning... unfortunately, if you have, that could explain a lot of things.

From Wikipedia:

"SpinRite is declared by its developers to have certain unique features[3], such as disabling of disk write caching, disabling of [sector] auto-relocation.... Another important feature is direct hardware-level access, whereby the drive's internal controller interacts directly with the program, rather than through the operating system. This, in turn, allows dynamic head repositioning, whereby, when reading a faulty sector, the reading head is deliberately moved backwards and forwards many times, by varying amounts, in the hope that each time it returns to the sector, it may come to rest in a slightly different position....

It should be noted that certain claims made by SpinRite's makers have proved controversial. The program's claimed ability to "refresh" ageing drives has met with particular scepticism, while its "recovery" of sectors marked as damaged by the file system controller is considered by some to be undesirable and ultimately counter-productive...."

http://en.wikipedia.org/wiki/SpinRite

#21 jpiszcz

jpiszcz

    Member

  • Member
  • 578 posts

Posted 26 November 2008 - 06:58 AM

No, it was the first time I used it.

Justin.

#22 datestardi

datestardi

Posted 26 November 2008 - 07:04 AM

No, it was the first time I used it.

Good. I'd suggest not using it on your other drives, and see if they too seem to be stuck in an "internal offline test."

#23 datestardi

datestardi

Posted 26 November 2008 - 07:18 AM

P.S. The more things you do to your drives (turning TLER on/off, disabling SMART testing, disabling all relevant offline tests on the disk, disabling hddtemp testing on the disk), the more I think your situation can be affected by "user error".

You should be able to stick with the manufacturer's defaults and test the drives. Obviously, if the drive as manufactured thinks it needs to perform an "offline test", then it probably *needs* to perform the offline test, and you shouldn't be disabling it... same with SMART. WD knows more about the drive than you do, so their defaults are likely much better than your choices... I'm not talking about TLER here. (And I'm only trying to be helpful.)

#24 jpiszcz

jpiszcz

    Member

  • Member
  • 578 posts

Posted 26 November 2008 - 07:36 AM


Agreed in this case, but I will mention that for the past 5-10 years I've always monitored disk temperatures, graphed them, and run daily short SMART tests and weekly long SMART tests, and I never had any problems with any other type of disk. But as I mentioned earlier, to rule this out 100% I will not do anything special other than use the drives. In addition, I'll be placing them all on a 9650SE controller, which should help rule out any further issues. The compatibility list on the 3ware site lists all variants of the Velociraptor as compatible with the board. Besides that, so far there are no issues with the new disk in RAID1 and no issues with the other disks in RAID6. This is typical, though: there are usually "problems" every 1-2 weeks, so I will have to wait a bit.

Justin.

#25 jpiszcz

jpiszcz

    Member

  • Member
  • 578 posts

Posted 05 December 2008 - 04:51 AM

I swapped out my power supply, changed ALL cables, and bought a $1000 RAID controller with BBU, and the drives are still having problems. Writing to them in a RAID10 configuration locks up the card:

Error 1 occurred at disk power-on lifetime: 3708 hours (154 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 00 00 8d b8 40

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 80 18 00 6c 91 19 08	  00:57:50.230  WRITE FPDMA QUEUED
  61 80 18 00 cd b7 19 08	  00:57:50.148  WRITE FPDMA QUEUED
  61 80 e8 80 cc b7 19 08	  00:57:50.147  WRITE FPDMA QUEUED
  61 80 18 00 cc b7 19 08	  00:57:50.147  WRITE FPDMA QUEUED
  61 80 e8 80 cb b7 19 08	  00:57:50.146  WRITE FPDMA QUEUED
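
The ER and ST bytes in that register dump can be decoded bit by bit. The sketch below uses the classic ATA task-file bit names (the same ones that show up in libata messages such as `{ DRDY }`). Several of these bits have been renamed or made obsolete across ATA revisions, so treat the labels as a rough guide. The function names are made up for this example.

```python
# Classic ATA error-register bits (names vary between ATA revisions)
ERROR_BITS = {
    0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
    0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "AMNF",
}

# Classic ATA status-register bits
STATUS_BITS = {
    0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
    0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR",
}

def decode_bits(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in table.items() if value & bit]

def decode_error_line(line):
    """Decode the first two hex bytes (ER, ST) of an 'ER ST SC SN CL CH DH' row."""
    er, st = (int(field, 16) for field in line.split()[:2])
    return decode_bits(er, ERROR_BITS), decode_bits(st, STATUS_BITS)

if __name__ == "__main__":
    # Register row from the error log above
    print(decode_error_line("10 51 00 00 8d b8 40"))
    # (['IDNF'], ['DRDY', 'DSC', 'ERR'])
```

So the drive flagged ERR with IDNF (ID not found) on this command, which is the error type 6_6_6 mentioned earlier in the thread.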

Latest 3ware BIOS/Firmware etc

DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0)
DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0)
DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0)
DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0)

E=1019 T=19:57:26	 : Drive removed
task file written out : cd dh ch cl sn sc ft
					  : 61 59 B8 8E 00 80 80
E=1019 T=19:57:26 P=Bh: Hard reset drive
P=Bh: HardResetDriveWait
  task file read back : st dh ch cl sn sc er
					  : 50 00 00 00 01 01 01
E=1019 T=19:57:26 P=B : Soft reset drive
E=0207 T=19:57:26 P=B : ResetDriveWait
E=1019 T=19:57:26 P=B : Inserting Set UDMA command
E=1019 T=19:57:26 P=B : Check power mode, active
E=1019 T=19:57:26 P=B : Check drive swap, same drive
E=1019 T=19:57:26 P=B : Check power cycles, initial=57, current=57
E=1019 T=19:57:26 P=Bh: exitCode = 0
Retrying chain
DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0)
DcbMgr::WriteSegment(map=0x4B7E38, segID=0x32, events=20, error=0x0)

Hm, the last thing I will try, I suppose, is disabling NCQ to see if the problem recurs.

Justin.

Edited by jpiszcz, 05 December 2008 - 04:52 AM.



