
RAID 5 with bad / fixed sectors



#1 MarkDS

  • Member
  • 2 posts

Posted 19 February 2013 - 11:05 AM

Hi all,

I've been googling for a long time and contacted hardware vendors for explanations and solutions, but nobody could give me a consistent answer, so I'm trying here, this being the top result for "storage forum" on Google :)

I have a Synology 4-bay NAS with four WD 3TB drives. One by one, three of my drives started to develop bad or pending sectors. After a LOT of research I bought HDD Regenerator, took the server offline, attached the drives to a spare computer, ran the program, put the drives back into the NAS and booted, fingers crossed.
The system booted, I ran an extended SMART test on all drives, and they all test OK now (magnetic errors can apparently be repaired by HDD Regenerator or SpinRite).
Since those 4 drives are in a software RAID 5 setup and around 500 sectors were bad across the drives, I'm guessing something is now wrong under the hood, but I just don't see it yet.
I'm thinking the RAID array should protect me from data loss, but I'm looking for a way to have it check all data that is on the volumes/RAID array.
The question is, I don't know at what level this takes place.
Should I find some way to have the RAID array verify and fix any errors that are present? Or do I need to do a filesystem check, and will that fix whatever sectors were changed/reactivated? Anything else I'm missing?
Basically, I'm thinking either the RAID array or the filesystem doesn't know that some sectors have changed, and I want to fix that before it becomes an issue.

Thanks for all input and for your time reading through this!

Mark

Edited by MarkDS, 19 February 2013 - 11:07 AM.

#2 sub.mesa

  • Member
  • 25 posts

Posted 19 February 2013 - 11:49 AM

I'm thinking the RAID array should protect me from data loss

RAID protects against failed disks. The problem is that it offers no protection against bad sectors. From the RAID engine's view, a drive with a bad sector is a defective drive. And that is one of the major shortcomings of most RAID systems: they treat hard drives in a very binary way. Either your hard drive works fine without bad sectors, or it gets kicked out after a few bad sectors.

Even worse, on top of the RAID layer sits a very old-fashioned filesystem that belongs to a different era. Today's common filesystems are outdated in the sense that they offer almost no protection for your files: no protection for metadata, no protection against corruption, no protection against misordered writes, virtually nothing.

In your case I'm not sure how those bad sectors were actually fixed. If you used SpinRite, as you mentioned, you should have recovered the original contents of the bad sectors and thus have no damage. But I suspect HDD Regenerator simply overwrites bad sectors with zeroes; in that case the data corruption is permanent. You can run a filesystem check (fsck) on your filesystem and see what damage it reveals. If no files are flagged as bad, that only means the metadata structure is intact; it is still very possible, even likely, that the files themselves are corrupted to varying degrees. You may notice this as corrupt archives or 'bleeps' and 'artifacts' in audio/video files. Corruption like this is potentially very dangerous.
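
An fsck only checks the filesystem's own bookkeeping, so the only way to find silently damaged file contents is to compare the files against checksums taken before the damage, or against a backup. A minimal sketch of that idea in Python; the manifest.txt file and its "hash  path" line format are just placeholder assumptions for illustration:

    import hashlib
    import sys

    def sha256_of(path, bufsize=1 << 20):
        # Stream the file so multi-GB files do not need to fit in memory.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while True:
                chunk = f.read(bufsize)
                if not chunk:
                    break
                h.update(chunk)
        return h.hexdigest()

    def verify(manifest_path):
        # Each manifest line: "<sha256 hex>  <path>", written before the drives went bad.
        bad = 0
        with open(manifest_path) as manifest:
            for line in manifest:
                expected, path = line.rstrip("\n").split("  ", 1)
                try:
                    actual = sha256_of(path)
                except OSError as err:
                    print("UNREADABLE", path, err)
                    bad += 1
                    continue
                if actual != expected:
                    print("CORRUPT", path)
                    bad += 1
        return bad

    if __name__ == "__main__":
        sys.exit(1 if verify("manifest.txt") else 0)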

If you want a solution that is highly tolerant of bad sectors and makes it virtually impossible to lose data to them, then ZFS is your man. It offers protection against the dangers you are exposed to at this very moment, and it would have prevented the damage caused by bad sectors in your case. If you want to give your data good protection, migrating to ZFS is a possibility; you can always keep the current setup as a backup target or sell it off.

#3 continuum

  • Mod
  • 3,572 posts

Posted 21 February 2013 - 05:04 PM

The problem is that it offers no protection against bad sectors. From the RAID engine's view, a drive with a bad sector is a defective drive.

Not quite... a parity or RAID1 setup will still have good data to rebuild a stripe affected by a bad sector. Technically speaking they do tolerate bad sectors. However, the rest of your post is more or less correct:
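
For anyone wondering how that rebuild works: RAID5 parity is just the byte-wise XOR of the data chunks in a stripe, so any single unreadable chunk can be recomputed from the surviving chunks plus the parity. A toy Python sketch of the arithmetic only, not of how a real controller implements it:

    from functools import reduce

    def xor_blocks(blocks):
        # RAID5 parity is the byte-wise XOR of the data chunks in a stripe.
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    # A 4-drive stripe: three data chunks plus one parity chunk.
    d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
    parity = xor_blocks([d0, d1, d2])

    # The drive holding d1 reports a bad sector in this stripe: its chunk can be
    # recomputed from the surviving data chunks and the parity.
    rebuilt = xor_blocks([d0, d2, parity])
    assert rebuilt == d1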

And that is one of the major shortcomings of most RAID systems: they treat hard drives in a very binary way. Either your hard drive works fine without bad sectors, or it gets kicked out after a few bad sectors.

Most RAID setups give drives only a short timeframe (a few seconds) to recover from sector errors, then require the drive to move on and mark the sector as bad. Proper drives will do this; desktop drives usually don't (since desktop users are fine with extended error detection and recovery attempts).

And yes, a proper filesystem with built-in protection against things like this and against bit-rot, such as ZFS, is something the rest of the world needs to hurry up and start using. *sigh*

#4 MarkDS

  • Member
  • 2 posts

Posted 04 March 2013 - 06:00 AM

Not quite... a parity or RAID1 setup will still have good data to rebuild a stripe affected by a bad sector. Technically speaking they do tolerate bad sectors. However, the rest of your post is more or less correct:

Most RAID setups give drives only a short timeframe (a few seconds) to recover from sector errors, then require the drive to move on and mark the sector as bad. Proper drives will do this; desktop drives usually don't (since desktop users are fine with extended error detection and recovery attempts).

And yes, a proper filesystem with built-in protection against things like this and against bit-rot, such as ZFS, is something the rest of the world needs to hurry up and start using. *sigh*


Continuum, any idea how I can instruct the software RAID to do a parity check or something and basically make the data consistent again across the 4 drives? At the moment I have one drive that is working again but has some sectors that are no longer "in line" with the correct data on the 3 other drives... The drives are in a Synology NAS box, so ZFS is out of the question...

greets

#5 dietrc70

  • Member
  • 106 posts

Posted 07 March 2013 - 05:37 PM

Check the Synology documentation to see if there is a "verify and fix" or "verify" option.
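
Synology boxes run Linux md software RAID underneath, so if DSM exposes a data scrubbing / parity consistency check, that is what it drives. With SSH access the same scrub can also be started through the md sysfs interface; a rough sketch, assuming the data array is md2 (confirm with /proc/mdstat) and noting that "repair" blindly rewrites parity from the data, so "check" is the safer first step:

    import time

    MD = "/sys/block/md2/md"   # assumed array; confirm the right one with: cat /proc/mdstat

    def scrub(action="check"):
        # "check" only counts inconsistent stripes; "repair" rewrites parity from the data.
        with open(MD + "/sync_action", "w") as f:
            f.write(action)
        # Poll until the array returns to idle, then report the mismatch count.
        while open(MD + "/sync_action").read().strip() != "idle":
            time.sleep(60)
        print("mismatch_cnt:", open(MD + "/mismatch_cnt").read().strip())

    if __name__ == "__main__":
        scrub("check")   # must run as root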

#6 sub.mesa

  • Member
  • 25 posts

Posted 11 March 2013 - 01:19 PM

Not quite... a parity or RAID1 setup will still have good data to rebuild a stripe affected by a bad sector. Technically speaking they do tolerate bad sectors.

But, can you provide at least one controller/firmware combination that actually does this? To date, nobody has been able to confirm to me a single actual product that uses redundancy to recover an unreadable sector. Only ZFS does that.

But you are right, it is at least theoretically possible. However, conventional RAID can never know whether the parity or the mirrored copy is correct or stale. RAID cannot distinguish between corrupt data and valid data; it lacks the facilities required to make that determination, such as checksums. It can only recognise that the data and parity are not in sync. Virtually all RAID controllers will then rebuild the parity, blindly assuming that the data is good and the parity is bad.
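
To make that concrete: parity alone only tells you that a stripe is inconsistent, while a per-block checksum stored with the metadata (the ZFS approach) tells you which copy is wrong, after which the parity can be used to rebuild it. A self-contained toy illustration in Python (the 4-byte chunks are obviously nothing like real stripe sizes):

    import hashlib
    from functools import reduce

    def xor_blocks(blocks):
        # RAID5-style parity: byte-wise XOR of equal-length chunks.
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)
    checksums = [hashlib.sha256(d).digest() for d in data]   # stored out-of-band, ZFS-style

    # Silent corruption: one chunk gets overwritten (say, zero-filled by a repair tool).
    data[1] = b"\x00\x00\x00\x00"

    # Plain RAID can only see that data and parity disagree, not which chunk is wrong.
    print("stripe inconsistent:", xor_blocks(data) != parity)

    # With per-chunk checksums the bad chunk is identified, then rebuilt from parity.
    for i, chunk in enumerate(data):
        if hashlib.sha256(chunk).digest() != checksums[i]:
            others = [c for j, c in enumerate(data) if j != i]
            data[i] = xor_blocks(others + [parity])
            print("chunk", i, "was corrupt; reconstructed:", data[i])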

Most RAID setups give drives only a short timeframe (a few seconds) to recover from sector errors

Well, by convention this timeout value cannot be less than 10 seconds. TLER is typically set at 7 seconds, to cope with even the strictest controllers that employ 10-second timeouts. If the hard drive does not return the requested data within 10 seconds, it is detached and marked as failed. This pretty much means such RAIDs are extremely sensitive and virtually incompatible with modern disks, which by design produce bad sectors due to insufficient ECC error correction.

require the drive to move on and mark the sector as bad. Proper drives will do this; desktop drives usually don't

Mark the sector as bad? You mean mark it as a Current Pending Sector in the SMART output? All drives do this; consumer drives simply spend more time on recovery before giving up, typically 120 seconds. Any good implementation should be able to cope with this, since it is easy to send a reset command and move on. Only primitive firmware RAID systems appear to have problems with such drives. In general this means you need TLER drives for old-fashioned RAID controllers, while modern software RAID implementations on Linux and BSD platforms, as well as ZFS, do not require special disks with TLER support and work just fine with ordinary consumer drives.

In fact, the TLER feature can be dangerous and is little more than an ugly hack. Assume you have a RAID5 where one drive has completely failed. You are now running degraded, which is basically a RAID0. In this situation, having lost your redundancy, you are at the mercy of bad sectors, and it is extremely common to encounter them during the rebuild of the RAID5 onto a new disk: one or more of the remaining members will hit bad sectors. If you have TLER disks and have lost your redundancy, this pretty much means data corruption or even a failed array, since many controllers kick out disks that return I/O errors on bad sectors.

Without TLER, the drive's own recovery methods stay intact. That means that in a degraded condition you still have a last line of defence, one that TLER would otherwise have killed.

a proper filesystem with built-in protection against things like this and against bit-rot, such as ZFS, is something the rest of the world needs to hurry up and start using. *sigh*

It pleases me to read this. I have helped many people with broken RAIDs, both hardware RAID like Areca and driver-based software RAID like Intel's. So many people lose their data due to incompetent software engineering. The whole TLER issue is just sad: basically an incompatibility between hardware and software. Even ordinary consumers deserve better protection for their data!

#7 continuum

  • Mod
  • 3,572 posts

Posted 18 March 2013 - 04:17 PM

But, can you provide at least one controller/firmware combination that actually does this?

I don't have any systems currently in production that run such configurations, so unfortunately no, I cannot.

And we discourage RAID5 in production systems due to the lack of parity protection during a rebuild; the unrecoverable bit error rate gets ugly when doing multiple-TB rebuilds. We strongly recommend RAID6 if you must do a parity RAID.
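
To put a rough number on that: taking the commonly quoted spec of one unrecoverable error per 10^14 bits read for consumer disks (10^15 for nearline), and naively assuming independent errors, the odds over a 4 x 3TB RAID5 rebuild come out something like this:

    # Rough odds of hitting at least one unrecoverable read error (URE) during a
    # RAID5 rebuild, assuming the quoted spec rates and independent errors
    # (real drives cluster their errors, so treat this as an order-of-magnitude guide).

    def p_ure(bit_error_rate, terabytes_read):
        bits = terabytes_read * 1e12 * 8
        return 1 - (1 - bit_error_rate) ** bits

    # Rebuilding a 4 x 3TB RAID5 means reading roughly 9TB from the three survivors.
    for ber in (1e-14, 1e-15):      # consumer vs nearline spec
        print("BER %.0e: P(at least one URE over 9TB) = %.0f%%" % (ber, 100 * p_ure(ber, 9)))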

TLER isn't quite as evil as you make it out to be; if it were, we would see a far, far higher failure rate in the prototype and production systems we build for our customers than we actually do. Modern hardware RAID controllers seem to handle things fairly well; ten years ago it was a great deal worse.

And let's not even get into the fact that the factory-spec'ed uncorrectable bit error rate is, for the most part, an order of magnitude WORSE on consumer disks than on nearline disks. ;)


