Brian

Thoughts on Reliability Database Improvements

How can the RD be made more effective? For those who used to use it a lot, please tell us what you like/dislike so we can make it better.

A few thoughts:

  • Make the results viewable to the public
  • Add a "top 10 rated drives" feature
  • Add a "top 10 most-rated drives" feature
  • Report on monthly drive reliability trends
  • Allow anyone to add a drive to the database (requiring admin approval)

On that last point - integrating such a feature into the database might be more likely to get a response than having a forum thread asking for the same info. Make it easy, an online form, and put it where people who've just failed to find their drive in the reliability database will see it.

The first point is a little dangerous - I think I preferred the old "this drive is more reliable than 76% of other drives that met a minimum number of reports". Otherwise Joe Public will apply his own flawed reasoning to the raw results and come to the wrong conclusion.
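
As a rough illustration of that "more reliable than X% of other drives" figure, here is a minimal sketch. The `DriveStats` fields, the `min_reports` cut-off of 10 and the sample numbers are all assumptions made up for the example; nothing here describes how the old database actually computed it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DriveStats:
    model: str
    reports: int   # total user reports for this model (hypothetical field)
    failures: int  # reports that recorded a failure (hypothetical field)

    @property
    def failure_rate(self) -> float:
        return self.failures / self.reports


def reliability_percentile(target: DriveStats,
                           all_drives: List[DriveStats],
                           min_reports: int = 10) -> Optional[float]:
    """Percentage of other qualifying drives with a higher failure rate.

    Returns None if the target (or every other drive) has too few reports.
    """
    if target.reports < min_reports:
        return None
    peers = [d for d in all_drives
             if d is not target and d.reports >= min_reports]
    if not peers:
        return None
    worse = sum(1 for d in peers if d.failure_rate > target.failure_rate)
    return 100.0 * worse / len(peers)


# Made-up sample data, purely for illustration.
drives = [
    DriveStats("Model A", reports=50, failures=5),
    DriveStats("Model B", reports=40, failures=2),
    DriveStats("Model C", reports=30, failures=9),
    DriveStats("Model D", reports=20, failures=4),
    DriveStats("Model E", reports=3, failures=0),  # below the cut-off
]
```

With this data, Model B comes out ahead of all three qualifying peers, while Model E gets no figure at all because three reports are too few to rank.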

We'll work on a submission form for the database pages. First order of business is to get the database updated with the drives you guys will submit reports for ;)

I like the idea of a submission form but would say that it would have to be parsed by a 'drive enforcer' to make sure all the data is accurate, complete and follows the same format. Having many people input data is just a study in chaos.

As for reliability, one item that I would be interested in is more granular data on the failures. In production environments the definition of failure is both subjective and varies over time. For example: [read|write] timeout errors (recoverable); [read|write] timeout errors (non-recoverable); high media error count (personally I would say >0 ;) ); or SMART data for SATA/IDE drives. These should be queryable fields, as opposed to the comment section on the drives. A bricked drive (non-operational, doesn't spin up, et al.) happens but is relatively rare. Most of what I see are soft errors that would flag a drive for replacement: if I see a drive get two recoverable read errors in a row in a RAID scrub, it gets pulled and replaced as failed.

Another item would be how the drive was run (power conditioned/on a UPS or not while it was in operation), and whether it was a static install or a drive that was moved/transported. Besides the initial shipment for first install, I have seen numerous drives fail after being shipped or transported.

No illusions here that the data set would be statistically significant (like the old database) but it may help shed some light on general end-user habits and what people would class as a failure.
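
To make the idea of queryable failure fields a bit more concrete, here is a minimal sketch of what a structured report might look like. The `FailureKind` categories simply mirror the examples listed above; the class names, field names and sample reports are hypothetical and not part of any existing database schema.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class FailureKind(Enum):
    TIMEOUT_RECOVERABLE = "read/write timeout (recoverable)"
    TIMEOUT_UNRECOVERABLE = "read/write timeout (non-recoverable)"
    HIGH_MEDIA_ERRORS = "high media error count"
    SMART_WARNING = "SMART attribute warning"
    DEAD = "non-operational (does not spin up)"


@dataclass
class FailureReport:
    model: str
    kind: FailureKind
    on_ups: Optional[bool] = None       # None means "not reported"
    transported: Optional[bool] = None  # moved/shipped after first install
    comment: str = ""                   # free text stays, but is no longer the only data


def count_by_kind(reports: Iterable[FailureReport]) -> Counter:
    """Tally reports per failure category so they can be queried."""
    return Counter(r.kind for r in reports)


# Made-up reports, purely for illustration.
reports = [
    FailureReport("Model A", FailureKind.TIMEOUT_RECOVERABLE, on_ups=True),
    FailureReport("Model A", FailureKind.TIMEOUT_RECOVERABLE, transported=True),
    FailureReport("Model B", FailureKind.DEAD, on_ups=False),
]
counts = count_by_kind(reports)
```

Keeping `on_ups` and `transported` optional means older or incomplete reports can still be stored without guessing at values nobody supplied.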

I like the idea of a submission form but would say that it would have to be parsed by a 'drive enforcer' to make sure all the data is accurate, complete and follows the same format. Having many people input data is just a study in chaos.

Totally agree, an admin would approve it. I just want it to be friendly for people to request new drives. The forums might work okay, or a contact form. We'll see what people respond to.

As for reliability, one item that I would be interested in is more granular data on the failures. In production environments the definition of failure is both subjective and varies over time. For example: [read|write] timeout errors (recoverable); [read|write] timeout errors (non-recoverable); high media error count (personally I would say >0); or SMART data for SATA/IDE drives. These should be queryable fields, as opposed to the comment section on the drives. A bricked drive (non-operational, doesn't spin up, et al.) happens but is relatively rare. Most of what I see are soft errors that would flag a drive for replacement: if I see a drive get two recoverable read errors in a row in a RAID scrub, it gets pulled and replaced as failed.

It's certainly not terribly difficult to add some standardized reasons for failure. Let's definitely discuss this further when we come around to a revision of this tool. Your other suggestions make sense too; you're just looking for more data points on how and why a drive failed.

Are there plans to include external drives in the database? If so, they have additional points of failure, and it would be good to separate actual drive failures from failures of bridge chips / power supplies, in case the drive owner bothers to check which it was.

As an example, I got a WD Elements 500 GB drive which seemed to hang completely when it reached ~40 °C (according to SMART). I opened it up and tried the drive alone - it works flawlessly. Then I tried another drive in the enclosure, and it had the same problems. If I kept it actively cooled or on a cold window shelf, it remained cool enough to function. Or I could use it for only 10 minutes so it didn't have time to heat up too much. So in that case it was clearly the USB-SATA bridge chip failing (the location and temperature of the power supply didn't seem to make a difference). The drive itself is still in use.
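
The swap test described above boils down to a small decision table. The sketch below is only an illustration of that reasoning; the function name and the returned labels are made up for the example.

```python
def locate_fault(drive_works_bare: bool,
                 other_drive_works_in_enclosure: bool) -> str:
    """Attribute an external-drive failure using a two-way swap test.

    drive_works_bare: the original drive works when connected directly.
    other_drive_works_in_enclosure: a known-good drive works in the enclosure.
    """
    if drive_works_bare and not other_drive_works_in_enclosure:
        return "enclosure"  # bridge chip, power supply, or similar
    if not drive_works_bare and other_drive_works_in_enclosure:
        return "drive"
    if not drive_works_bare and not other_drive_works_in_enclosure:
        return "both or undetermined"
    return "fault not reproduced"


# The case in the post: drive fine on its own, second drive also fails inside.
result = locate_fault(drive_works_bare=True,
                      other_drive_works_in_enclosure=False)
```

A report form could ask these two yes/no questions and leave the attribution blank when the owner never ran the test.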

That's a great point. We're reviewing external drives and yes, I think it makes sense to include them in the database. We need to make a few changes to better accommodate SSDs first. When we do that, we'll make space for the externals too and try to select a few common failure points to include.

I look at the external HDD concept mostly the opposite way. I suspect that most failures are of the drive itself, and only a limited number of drives are shipped in external enclosures. I'd like to see a table of which actual HDD is in each model of "external drive", so that we can check the (probably much bigger) data sample for the actual drive used.
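
A minimal sketch of such a table, and of pooling the bare-drive data through it, might look like this. Every name and number below is a placeholder invented for the example, and the one-to-many mapping is an assumption so that an enclosure can list more than one internal drive.

```python
# Hypothetical mapping: external product -> bare drive models found inside it.
ENCLOSURE_CONTENTS = {
    "External Model X": {"Bare Drive A"},
    "External Model Y": {"Bare Drive A", "Bare Drive B"},
}

# Hypothetical per-drive totals: model -> (reports, failures).
BARE_DRIVE_REPORTS = {
    "Bare Drive A": (120, 6),
    "Bare Drive B": (45, 9),
}


def pooled_reports(enclosure: str) -> tuple:
    """Sum reports and failures over every bare drive known for an enclosure."""
    drives = ENCLOSURE_CONTENTS.get(enclosure, set())
    known = [BARE_DRIVE_REPORTS[d] for d in drives if d in BARE_DRIVE_REPORTS]
    reports = sum(r for r, _ in known)
    failures = sum(f for _, f in known)
    return reports, failures
```

An enclosure with no known contents simply yields `(0, 0)`, which a page could show as "internal drive unknown" rather than as a clean record.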

Edited by Kremmen

In some cases we'll be able to get to the internal drive, in some cases we won't. In any case it does void the warranty, so we'll have to see what the manufacturers will let us get away with.

In some cases we'll be able to get to the internal drive, in some cases we won't. In any case it does void the warranty, so we'll have to see what the manufacturers will let us get away with.

Those external drives that SMART works on can just be interrogated through software. Those that don't pass SMART commands through the interface are probably better off not being purchased anyhow. :)

Hopefully, the manufacturers would just tell you. I guess maybe not, if they're throwing their oldest and cheapest drives into the external enclosures?

I like that, interrogation via software...is that like drive water-boarding?

We know from experience that external drives often start off as one thing, but then supplies get low or there's an economic factor, and they switch to another drive type. Hitachi did this a while back; I have a 7K200 in an external, which is actually better than the 5400 RPM drives they started with. But yes, the point is valid, and we'll at least report on the drive that's in our review model and endeavor to track changes and update the review with user-reported data on the drives being used, etc.

What I'd like to see is an easier way to search for the drives. I've got two main ideas:

1) Search by entering a model number. Especially if you've not bought the drive yourself (e.g. the disc is in a pre-built PC), it's easier to get to a model number such as WD20EADS. I'd rather just type in that number than try to work out which category of green power drives it belongs to.

2) (maybe a bit ambitious) have a mini-application that can be downloaded and automatically read and report back the drive models in the user's computer.

On Linux it's as easy as running hdparm - for one of the dev servers where I work, it outputs information like this:

costello:/dev # hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: OCZ-VERTEX
Serial Number: 99TVKC8FP0N7ID510HJ1
Firmware Revision: 1.31

costello:/dev # hdparm -I /dev/sdb1
/dev/sdb1:
ATA device, with non-removable media
Model Number: WDC WD3000HLFS-01G6U0
Serial Number: WD-WXLY08152721
Firmware Revision: 04.04V01

I'm not a Windows programmer but I presume something very similar has to be available somewhere.
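
To give a feel for how small the Linux side of such a mini-application could be, here is a minimal sketch that pulls the identity fields out of `hdparm -I` text like the output above and matches the result against a list of known models. It only parses text that has already been captured; the function names and the `KNOWN_MODELS` list are made up for the example, and the simple substring match is an assumption about how database entries might be keyed.

```python
FIELDS = ("Model Number", "Serial Number", "Firmware Revision")

# Sample text taken from the hdparm output quoted in this post.
SAMPLE = """\
/dev/sdb1:
ATA device, with non-removable media
Model Number: WDC WD3000HLFS-01G6U0
Serial Number: WD-WXLY08152721
Firmware Revision: 04.04V01
"""

# Hypothetical database keys, for illustration only.
KNOWN_MODELS = ["WD20EADS", "WD3000HLFS", "OCZ-VERTEX"]


def parse_hdparm_identity(text: str) -> dict:
    """Extract the model, serial and firmware lines from `hdparm -I` output."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in FIELDS:
            info[key] = value.strip()
    return info


def match_known_models(reported: str, known: list) -> list:
    """Return known model names contained in the reported model string."""
    reported = reported.upper()
    return [m for m in known if m.upper() in reported]


identity = parse_hdparm_identity(SAMPLE)
matches = match_known_models(identity["Model Number"], KNOWN_MODELS)
```

The same matching function would also serve idea 1: a user typing `wd20eads` into a search box would hit the `WD20EADS` entry regardless of case.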

Search is going to be key. We'll definitely take this into consideration as we figure out the upgrade path for the database.

Incidentally, we're upgrading some of the core components of the site first, but the RD is very high on the to-do list.
