Theis

Failure Trends In A Large Disk Drive Population

Hi SR Readers

I just received a link to this PDF from a good friend of mine, and thought it might be of interest to some of you.

It's a study by some Google employees on the failure rates of disk drives deployed within their infrastructure.

Happy reading :)

Best Regards

Theis

In general, there are around four ways an HDD will fail: firmware zone corruption, electronic failure, mechanical failure, and logical corruption. Unfortunately, S.M.A.R.T. only handles a subset of the "mechanical failure" category (mainly media failure and thermal failure), and does not protect against single-incident catastrophic mechanical failures (head crash, spindle/servo motor failure, stiction).

I give props to the Google crew for their SMART correlation analysis. I would have loved to see a quick postmortem on all of the failed drives to determine the cause, with a breakdown into those four categories.

Great info, thanks.

Thank you for your time,

Frank Russo

Side note: logical corruption may lead to data loss, but it does not necessarily mean that the HDD has failed.

Interesting article. I was surprised about the lack of correlation found between elevated temperature and expected drive reliability. The graph only went up to 50 C, though, and I would expect a much stronger correlation once the temperature reached beyond typical max temp specs.

Interesting article. I was surprised about the lack of correlation found between elevated temperature and expected drive reliability. The graph only went up to 50 C, though, and I would expect a much stronger correlation once the temperature reached beyond typical max temp specs.

I guess the reason is that they simply don't have drives hotter than 50C in their datacenters...

So obviously the analysis of the real-world data will only show drives used within spec.

Interesting article. I was surprised about the lack of correlation found between elevated temperature and expected drive reliability. The graph only went up to 50 C, though, and I would expect a much stronger correlation once the temperature reached beyond typical max temp specs.

I'd guess that the temperatures differ because of the different drive models and types. From the PDF it appears that they used SATA and PATA drives between 5400 and 7200 RPM. I would guess that the 5400 RPM drives run cooler, but as less expensive drives with presumably cheaper parts they would also have higher failure rates.

3.2 Manufacturers, Models, and Vintages

Failure rates are known to be highly correlated with drive models, manufacturers and vintages [18].

This is congruent with users' experiences like those in StorageReview's database, and the polar opposite of the position of the "insiders" who've posted here claiming that all brands are the same. Do a little searching and you'll discover who those shills are at SR.

This is congruent with users' experiences like those in StorageReview's database

That rather useless database you mean?

http://news.bbc.co.uk/1/hi/technology/6376021.stm

"lower temperatures are associated with higher failure rates" + "hard drives which are three years old and older were more likely to suffer a failure when used in warmer environments" + 7200rpm drives run warmer than 5400rpm => ...

To me it appears 5400rpm drives are less reliable due to cheaper components (for example, ball bearings). If you cut cooling on these drives (most of which are probably quite old now, possibly 3+ years), they will become even more unreliable.

Newer drives tend to run hotter as they use more power, and they are more reliable because there have been constant efforts to improve them (unlike 5400rpm drives, which have already been condemned to death). They are more reliable despite running hotter, not because they are running hotter. Assuming that reducing cooling would increase the life of newer 7200rpm drives is certainly "a bit daring"... or actually quite foolish.

The positive effect would be less money spent on cooling fans and on the electricity powering them. (And for non-server environments it would also reduce system noise.) But to expect an increase in reliability... Pretty much a no-brainer.

I don't think Google's study was worthless. I just think the conclusions drawn about the heat-reliability relation go further than what I would conclude from a sample consisting of many different drive models. No matter how big (or HUGE) the sample, if there's more than one model, there's no way of deriving a meaningful relation between temperature and reliability, as there would be more variables than those two alone.

If they want to publish some meaningful information on this heat-reliability relation, they should publish it for sub-samples consisting of single drive models. Instead they claim the information is "proprietary"... not even publishing the data under pseudonyms (like "sample model 1", "sample model 2"). It's quite hard to trust them if they can't publish the proof. Do they have the proof?
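
To show what I mean by mixing models, here is a minimal Python sketch. Every number in it is made up by me for illustration and none of it comes from the Google paper; the model names are just the pseudonyms suggested above, and the function name is my own. It assumes two models where one is cheap, cool-running and failure-prone, and the other is newer, warm-running and more reliable.

```python
from collections import defaultdict

# Hypothetical counts, invented purely for illustration (NOT Google's data):
# (model, temperature band, drives observed, drives failed)
OBSERVATIONS = [
    ("sample model 1", "cool", 900, 90),
    ("sample model 1", "warm", 100, 12),
    ("sample model 2", "cool", 100, 2),
    ("sample model 2", "warm", 900, 27),
]


def failure_rates(rows, key):
    """Return failed/observed per group, where key(model, band) picks the group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [drives, failed]
    for model, band, drives, failed in rows:
        group = key(model, band)
        totals[group][0] += drives
        totals[group][1] += failed
    return {group: failed / drives for group, (drives, failed) in totals.items()}


# Pooled over all models: group by temperature band only.
pooled = failure_rates(OBSERVATIONS, lambda model, band: band)

# Stratified: group by (model, band), i.e. one sub-sample per drive model.
per_model = failure_rates(OBSERVATIONS, lambda model, band: (model, band))

for group, rate in pooled.items():
    print("pooled", group, f"{rate:.1%}")
for group, rate in per_model.items():
    print("per model", group, f"{rate:.1%}")
```

With these invented counts the pooled figures make the warm drives look more reliable, while inside each single model the warm drives fail more often. That is only a toy case, but it is the reason I'd want to see per-model sub-samples before trusting a pooled temperature curve.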

Google's study did confirm SMART accuracy to be around 50%, which is in line with previous studies conducted with smaller samples.

I don't think Google's study was worthless. I just think the conclusions drawn about the heat-reliability relation go further than what I would conclude from a sample consisting of many different drive models. No matter how big (or HUGE) the sample, if there's more than one model, there's no way of deriving a meaningful relation between temperature and reliability, as there would be more variables than those two alone.

FDBs (fluid dynamic bearings) have a fairly wide temperature threshold. They perform extremely poorly (in terms of platter vibration) when under or over temperature. In a DC, drives are rarely under temperature (not that many cold boots), and intelligent controllers will usually fail a drive once it has exceeded its temperature threshold. If you stay within the threshold, there shouldn't be a problem.

Frank

3.2 Manufacturers, Models, and Vintages

Failure rates are known to be highly correlated with drive models, manufacturers and vintages [18].

This is congruent with users' experiences like those in StorageReview's database, and the polar opposite of the position of the "insiders" who've posted here claiming that all brands are the same. Do a little searching and you'll discover who those shills are at SR.

It's most definitely NOT the polar opposite. All makes have produced some bad batches and occasionally entire models that were bad. However, as a rule (the generalization!), most drives from any maker will be just fine.

Their temperature data is within standard drive operating limits, which partially explains the lack of elevated failures... although I would have expected to see a few more as well.

They also do not address vibration at all, which is not surprising since most of Google's server systems are much closer to regular desktops than what most companies use.

Can't see the article links yet, so I will refrain from comment there.
