Peter Gavin

WD 500GB RE2 drives

So, over the course of the last couple of years I've purchased a total of 11 of these drives. Right now 10 are in a Linux md RAID6 array, and one I keep as a cold spare (though it's currently on its way to be RMAed).

This will be the 5th or 6th drive I've RMAed since I started buying them. Luckily they come with a 5-year warranty, but that's sort of where my question lies: how can WD keep shipping drives that fail at a rate of 50% within the warranty period? I mean, seriously, shouldn't drives this bad come with a 1-year warranty?

Oh, and even worse: when I get the spare back from RMA, I'm going to have to send one of the drives in the RAID back, because SMART is showing 18 uncorrectable sectors. I can't think of any reason this should happen. The drives are in a decent Lian Li case on shock absorbing mounts, with plenty of fans. hddtemp says they're running at about 35-40°C, which is cool enough, I suppose.
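For reference, here's roughly how I pull those counters out, as a minimal sketch: the heredoc stubs in sample attribute lines, and on a real system you'd pipe `smartctl -A /dev/sdX` into the function instead.

```shell
# Flag any non-zero pending/uncorrectable sector counts in smartctl -A
# output. Sketch only: the heredoc below stubs in sample attribute
# lines; on a live disk, pipe `smartctl -A /dev/sdX` in instead.
check_sectors() {
  awk '$2 ~ /Current_Pending_Sector|Offline_Uncorrectable/ && $NF+0 > 0 {
         print $2 "=" $NF; bad = 1
       }
       END { exit !bad }'   # non-zero exit status when nothing suspicious
}
check_sectors <<'EOF'
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       20
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       16
EOF
```

With the stubbed input above it prints `Current_Pending_Sector=20` and `Offline_Uncorrectable=16`; a clean disk produces no output and a non-zero exit, which makes it easy to drop into a cron job.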

Has anyone else had this bad of a reliability record with these drives (or any other drive, for that matter)? I've been throwing around the idea of building a new RAID using 1.5TB drives so I can run fewer of them (saving energy, I hope). But I've read that the really big drives are even worse! What should I do?

What exact chassis and what power supply are you using? Are these all from the same batch initially, same shipment?

How are they failing?

Show smart stats for each disk.

How old are they?

What exact chassis and what power supply are you using? Are these all from the same batch initially, same shipment?

How are they failing?

Show smart stats for each disk.

How old are they?

First things first. I miscounted/misremembered a bit. There are actually 9 of these drives in the case. There is another, smaller WD drive serving as the OS disk.

Well, I'll have to post the SMART logs when I get home tonight, but this is the case:

http://www.newegg.com/Product/Product.aspx...N82E16811112140

and this is the PSU (750W thermaltake):

http://www.newegg.com/Product/Product.aspx...N82E16817153038

The whole thing is plugged into an APC UPS (by itself).

The first 4 I bought were from the same batch & shipped at the same time. I think I only have 2 of the original 4 left. The rest have been bought one or two at a time, and I've always RMA'd one at a time, so by now they're pretty mixed.

I'll start with the last disk that failed, since I remember it best. That disk first exhibited signs of failure by not being completely detected by the BIOS. I say completely because the BIOS would detect it, but not display the model or serial number for the disk like it usually did. Linux detected it fine (but had similar issues reading the drive information from it), and it continued to work like that for several months. Data-wise the disk operated fine (no data corruption or anything like that, as far as I could see). It finally gave out this way a couple weeks ago. I had another one fail in a similar way, but it died more suddenly.

Other disks just had lots of unreadable/uncorrectable sectors. The drives didn't fail outright, just gradually got worse and worse.


I've had 14 of these under heavy load almost non-stop for a year, without any troubles. Oh, once the controller timed one out, but it was fine after a reboot and that was > 6 months ago.

and it continued to work like that for several months
Uh, as soon as BIOS detection starts failing, the disk is considered dead as a doornail... :o

For drives that start accumulating bad sectors... how long does it normally take before they fail completely?


Just another data point: I've had 8 of these drives running 24/7 for about 13 months with no problems at all.

4 x WD3201ABYS-01B9A0 in Linux software RAID10

4 x WD5001ABYS-01YNA0 attached to a Areca ARC-1120 (RAID10)

Bad power supply? Drives running too hot for too long? (The disks in my Linux RAID are running at 33C right now.) Maybe too many start/stop cycles?

and it continued to work like that for several months
Uh, as soon as BIOS detection starts failing, the disk is considered dead as a doornail... :o

Of course, I'm just waiting for the last RMA to come back.

For drives that start accumulating bad sectors... how long does it normally take before they fail completely?

Not long. I expect this disk to be really dead in a couple weeks tops.

Oh, here's the output of smartctl -a for the disk with bad sectors:

Device Model:     WDC WD5002ABYS-01B1B0
Serial Number:    WD-WMASY5920733
Firmware Version: 02.03B02
User Capacity:    500,107,862,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Apr 17 19:40:52 2009 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121)	The previous self-test completed having
				the read element of the test failed.
Total time to complete Offline 
data collection: 		 (9480) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 112) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
 3 Spin_Up_Time            0x0027   238   238   021    Pre-fail  Always       -       1066
 4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       53
 5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
 9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1219
10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       51
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       38
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       53
194 Temperature_Celsius     0x0022   112   101   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       20
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       16
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      1202         976768983
# 2  Short offline       Completed: read failure       90%      1178         976768983
# 3  Short offline       Completed: read failure       90%      1154         976768983
# 4  Short offline       Completed: read failure       90%      1130         976768983
# 5  Short offline       Completed: read failure       90%      1106         976768983
# 6  Short offline       Completed: read failure       90%      1082         976768983
# 7  Extended offline    Completed: read failure       90%      1059         976768983
# 8  Short offline       Completed: read failure       90%      1058         976768983
# 9  Short offline       Completed: read failure       90%      1034         976768983
#10  Short offline       Completed: read failure       90%      1010         976768983
#11  Short offline       Completed: read failure       90%       986         976768983
#12  Short offline       Completed: read failure       90%       962         976768983
#13  Short offline       Completed: read failure       90%       938         976768983
#14  Short offline       Completed: read failure       10%       916         976768983
#15  Extended offline    Completed: read failure       90%       891         976768983
#16  Short offline       Completed: read failure       90%       890         976768983
#17  Short offline       Completed: read failure       90%       866         976768983
#18  Short offline       Completed: read failure       50%       842         976768984
#19  Short offline       Completed without error       00%       819         -
#20  Short offline       Completed without error       00%       795         -
#21  Short offline       Completed without error       00%       771         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

As you can see, this disk suddenly started having bad sectors just over 2 weeks ago (smartd runs a short test every night, and a long test once a week). I'm not entirely sure how old the disk is, but it's newer than most of the rest of them, since it's the WD5002ABYS model.
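For the curious, that test schedule comes from a smartd.conf line along these lines; this is a sketch, and the device name and mail address are placeholders for whatever your setup uses.

```shell
# /etc/smartd.conf (sketch; device and address are placeholders)
# -a        monitor all SMART attributes and self-test results
# -s (...)  schedule: Short test nightly at 02:00, Long test Saturdays at 03:00
# -m        mail this address when a test fails or an attribute trips
/dev/sda -a -s (S/../.././02|L/../../6/03) -m root@localhost
```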

This is the Model line for all the disks as dumped by hdparm -i, minus the serial numbers:

Model=WDC WD5002ABYS-01B1B0                   , FwRev=02.03B02
Model=WDC WD5000ABYS-01TNA0                   , FwRev=12.01C01
Model=WDC WD5000ABYS-01TNA0                   , FwRev=12.01C01
Model=WDC WD5002ABYS-01B1B0                   , FwRev=02.03B02
Model=WDC WD5002ABYS-01B1B0                   , FwRev=02.03B02
Model=WDC WD5000YS-01MPB0                     , FwRev=07.02E07
Model=WDC WD5000YS-01MPB0                     , FwRev=07.02E07
Model=WDC WD5000ABYS-01TNA0                   , FwRev=12.01C01
Model=WDC WD5000YS-01MPB0                     , FwRev=07.02E07


Couple of weeks? I would suspect you have either a power or a vibration problem. Vibration control in that chassis does not look particularly good.

I would try to run matching firmwares, too.

At this point it's sort of the shooting-in-the-dark stage. If you just lay the drives individually on a desk, not touching each other, and they all run fine for a few months, you can be pretty sure it's a vibration problem...
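A quick way to see how mixed the firmwares actually are, as a sketch: the heredoc stubs in a few of the model lines posted above, whereas on the live box you'd loop `hdparm -i` over the real devices.

```shell
# List the distinct firmware revisions in a set of hdparm -i "Model="
# lines. Stubbed input below; on a live system you'd feed it with e.g.
#   for d in /dev/sd[a-i]; do hdparm -i "$d" | grep Model; done
list_firmwares() { grep -o 'FwRev=[^ ,]*' | sort -u; }
list_firmwares <<'EOF'
Model=WDC WD5002ABYS-01B1B0 , FwRev=02.03B02
Model=WDC WD5000ABYS-01TNA0 , FwRev=12.01C01
Model=WDC WD5000YS-01MPB0   , FwRev=07.02E07
Model=WDC WD5002ABYS-01B1B0 , FwRev=02.03B02
EOF
```

Anything more than one line of output means the array is running mixed firmware.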

Oh, here's the output of smartctl -a for the disk with bad sectors:

...

SMART overall-health self-assessment test result: PASSED

...

5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - *0*

...

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 1202 976768983
# 2 Short offline Completed: read failure 90% 1178 976768983
# 3 Short offline Completed: read failure 90% 1154 976768983
# 4 Short offline Completed: read failure 90% 1130 976768983
# 5 Short offline Completed: read failure 90% 1106 976768983
# 6 Short offline Completed: read failure 90% 1082 976768983
# 7 Extended offline Completed: read failure 90% 1059 976768983
# 8 Short offline Completed: read failure 90% 1058 976768983
# 9 Short offline Completed: read failure 90% 1034 976768983
#10 Short offline Completed: read failure 90% 1010 976768983
#11 Short offline Completed: read failure 90% 986 976768983
#12 Short offline Completed: read failure 90% 962 976768983
#13 Short offline Completed: read failure 90% 938 976768983
#14 Short offline Completed: read failure 10% 916 976768983
#15 Extended offline Completed: read failure 90% 891 976768983
#16 Short offline Completed: read failure 90% 890 976768983
#17 Short offline Completed: read failure 90% 866 976768983
#18 Short offline Completed: read failure 50% 842 976768984

...

As you can see, this disk suddenly started having bad sectors just over 2 weeks ago (smartd runs a short test every night, and a long test once a week).

Sorry, you've lost me - I'm not seeing what you're saying. Yes, there are read errors, but there are (apparently) no bad sectors. (Unless of course I'm misreading your test data.)

I'm not acquainted with the technology of the RE2... but the most obvious guess (in my view at least) for read errors when you don't have bad sectors is vibration.

Oh, here's the output of smartctl -a for the disk with bad sectors:

5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - *0*

Sorry, you've lost me - I'm not seeing what you're saying. Yes, there are read errors, but there are (apparently) no bad sectors. (Unless of course I'm misreading your test data.)

I'm not acquainted with the technology of the RE2... but the most obvious guess (in my view at least) for read errors when you don't have bad sectors is vibration.

You skipped a line:

198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 16

The sector doesn't need to have been remapped to be considered bad (AFAIK).
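Incidentally, a pending sector can sometimes be cleared by overwriting it, which forces the drive to either rewrite it in place or remap it if the media really is bad. That's destructive to the data in that sector, so this sketch only echoes the command instead of running it; the device name is a placeholder, and the LBA is the one from the self-test log above.

```shell
# Overwriting a pending sector forces the drive to rewrite or remap it.
# DESTRUCTIVE to that sector's data, so we only echo the dd command
# here; drop the `echo` only after the array has been backed up.
LBA=976768983    # LBA_of_first_error from the SMART self-test log
DEV=/dev/sdX     # placeholder device name
echo dd if=/dev/zero of="$DEV" bs=512 count=1 seek="$LBA" oflag=direct
```

After a real write, a re-run of the short self-test shows whether the sector was silently fixed or moved into Reallocated_Sector_Ct.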

The sector doesn't need to have been remapped to be considered bad (AFAIK).

Let's use some logic:

There are three potential causes to your RE2 drives' common errors:

1) the drives are bad by design - this would affect all users (and doesn't seem to be reflected by the comments here);

2) there are problems in your system (e.g., vibration, power) that are causing the errors - this would affect you alone; or

3) there is something else common to your drives (e.g., shipping, handling damage) that is causing the errors.

Hybrid causation is possible (i.e. a design flaw could interact with your specific system problems). Experientially, 2) and 3) would be considered the most probable. To determine that the disks are bad, you would need to eliminate the "2)" and "3)" possibilities (e.g., by removing the disks from your system and testing them individually).

Your drives have no bad sectors, but they have read errors. (Offline uncorrectable errors mean that uncorrectable read errors have occurred during offline self-tests. We know that. It's interesting that 18 read errors have occurred, but only 16 offline uncorrectable errors have been recorded.)

An offline uncorrectable read error does not mean there are bad sectors. The causes for offline uncorrectable read errors include:

1) an offtrack position of the head (caused e.g. by vibration or a servoing error) during reading of the data;

2) an offtrack position of the head (caused e.g. by vibration or a servoing error) during original writing of the data;

3) a bad sector due to a media defect (e.g., a thermal asperity, a defect in the media recording layer, or a defective servo sector preceding the data sector).

Vibration problems in multi-disk setups can easily cause read errors (please watch the video):

http://blogs.sun.com/brendan/entry/unusual_disk_latency

The drives are in a decent Lian Li case on shock absorbing mounts

P.S. I hope it's the case that is on shock absorbing mounts, and not the drives themselves. If you've elastically mounted the drives themselves, I'll venture to guess we can pinpoint the problem (elastic or undamped mounting could wreak havoc on a servo system, depending on the respective resonance frequencies)....

The drives are in a decent Lian Li case on shock absorbing mounts

P.S. I hope it's the case that is on shock absorbing mounts, and not the drives themselves. If you've elastically mounted the drives themselves, I'll venture to guess we can pinpoint the problem (elastic or undamped mounting could wreak havoc on a servo system, depending on the respective resonance frequencies)....

I have RE3 drives in a Lian Li case using the Lian Li 4-in-3 modules and have not had any issues. What type of mount setup do you have? Do you have any special settings enabled on the drives (PM2, etc.)? Has anything changed recently, or had these worked for a long time and then all of a sudden...? Have you experienced any system instability, or just drive instability? Is your system on a UPS? Are *all* drives affected in one way or another? Have you tried replacing the PSU?

I have RE3 drives in a Lian Li case using the Lian Li 4-in-3 modules and have not had any issues

Here's some research into soft-mounting a cage that contains e.g. 3 operating HDDs (see FIG. 4):

http://www.dtc.umn.edu/publications/reports/2005_08.pdf

I believe everything depends on the damping in such a scenario - if there's any looseness in the mounting (<< 1 mm), vibrations will be basically undamped, and may cause a loss of throughput (as in the paper) and even read errors.


Soft mounts are generally fine for one or two drives, especially if they're single disks and not in a RAID setup.

Seagate has a white paper with their research too, and vibration does cause a performance hit. I'm not 100% sure this is the cause of the OP's problem, but my experience with large arrays, as well as extensive customer and chassis-vendor development work with both Chenbro and AIC, confirms that vibration is a serious issue with large arrays... hence why this smells like a possible vibration-influenced issue to me.

And indeed, it could be the PSU or something, but again, it doesn't necessarily smell that way...


Since the end of 2006 I have bought and used about 20 drives of the WD5000YS series.

From the earlier batch bought in 2006, about 5 of 10 died. These were all drives with S/N WCANU < 1130000.

Everything bought after that is still under heavy load, without a hitch.

Since these are European S/Ns, I don't know if that is of any use.

I've noticed everything from the RE3 series on is noticeably cooler and less prone to RMAing.

cheers,

Don


I don't have anything to say about the RE series, but I've replaced 2x 500AAKS (the SE series) in the last 2 months. These are crammed into an old P4 Dell case with 2 SCSI U320 drives, and the temperature readings are off the chart at 100°C, which I have read can be false, since that's boiling temperature and would cause a quick death...

This is all in an mdadm RAID 1. Without smartd I would never have known; I don't know how I'll ever trust a Windows RAID again. I get an alert when a SMART test fails or has a read error, and the next day the drive is replaced. What more could you ask for?
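mdadm can do the same kind of alerting for array events, alongside smartd. A sketch of the relevant config, assuming the monitor daemon (`mdadm --monitor --scan`) is running and the mail address is a placeholder:

```shell
# /etc/mdadm/mdadm.conf (sketch; address is a placeholder)
# With `mdadm --monitor --scan` running, mail is sent on events such
# as DegradedArray or FailSpare, complementing smartd's per-disk alerts.
MAILADDR root@localhost
```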

