ZFS lover

How to use "desktop" drives in RAID without TLER/ERC/CCTL


I don't have a solution yet; I'm not trying to lead you on, just to stimulate conversation so we can come up with the solution together.

Short background:

Hard drive manufacturers are drawing a distinction between "desktop" grade and "enterprise" grade drives. The "desktop" grade drives can take a long time (~2 minutes) to respond when they find an error, which causes most RAID systems to label them as failed and drop them from the array. The solution provided by the manufacturers is for us to purchase the "enterprise" grade drives, at twice the cost, which report errors promptly enough that this isn't a problem. This "enterprise" feature goes by the names TLER, ERC, or CCTL, depending on the manufacturer.

The Problem:

There are three problems with this situation:

The first is that it flies in the face of the word Inexpensive in the acronym Redundant Arrays of Inexpensive Disks (RAID).

The second is that when a drive starts to fail, you want to know about it, as Miles Nordin wrote in a long thread:

I was mostly thinking of the google paper:

http://labs.google.com/papers/disk_failures.html

...

But the interesting result for TLER/ERC is on page 7 figure 7, where you see within the first two years the effect of reallocation on expected life is very pronounced, and they say "after their first reallocation, drives are over 14 times more likely to fail within 60 days than drives without reallocation counts, making the critical threshold for this parameter also '1'."

It also says drives which fail the 'smartctl -t long' test ..., which checks that every sector on the medium is readable, are "39 times more likely to fail within 60 days than drives without scan errors." So... this suggests to me that read errors are not so much things that happen from time to time even with good drives, and therefore there is not much point in trying to write data into an unreadable sector (to remap it) or to worry about squeezing one marginal sector out of an unredundant desktop drive (the drive's bad: warn the OS, recover the data, replace it).

The third is that other attributes of consumer grade drives are attractive, as r.g. wrote:

My issue is this: I *want* the attributes of consumer-level drives other than the infinite retries. I want slow spin speed for low vibration and low power consumption, am willing to deal with the slower transfer/access speeds to get it. I can pay for (but resent being forced to!) raid-rated drives, but I don't like the extra power consumption needed to get them to be very fast in access and transfers. I'm fine with whipping in a new drive when one of the existing ones gets flaky. I find that I may be in the curious position of being forced to pay twice the price and expend twice the power to get drives that have many features I don't want or need and don't have what I do need, except for the one issue which may (infrequently!) tear up whatever data I have built.

Possible Solutions:

For a while, Western Digital released a program (WDTLER.EXE) that made it possible to enable TLER on desktop grade drives. This no longer works.

Quindor created a heroic thread that attempts to identify which exact drives on the market are compliant with the ATA standard and allow a software command to enable ERC temporarily. A problem with this, discussed in the thread, is how to verify that it works. Just because a drive tells you that ERC is enabled doesn't necessarily mean that it's true.
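For reference, on a drive that does accept the command, the smartmontools sequence would look something like this (/dev/sdb is just a placeholder for whatever your drive is; the values are in tenths of a second, so 70 means 7.0 seconds):

smartctl -l scterc /dev/sdb            # show the drive's current ERC settings, if it supports SCT ERC at all
smartctl -l scterc,70,70 /dev/sdb      # request 7.0 second read and write recovery limits
smartctl -l scterc /dev/sdb            # read the values back as a (weak) confirmation

On most drives the setting is volatile, so it has to be reapplied after every power cycle, and as noted above a drive echoing the value back still doesn't prove it honors it.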

The best solution I've seen, described at length by qasdfdsaq in the same thread, is for the computer to compensate for the drive behavior like this:

Rather than relying on the drive to report an error within 7 seconds, or attempting to "fix" the drive so that it will, have the host treat any command that hasn't completed within 7 seconds as an error and cancel the operation.

What's nice about this solution is that it will work with any drive on the market.

We just need to figure out how to configure our controller or operating systems to behave this way. qasdfdsaq says that his Solaris system already does this by default.

According to SmallNetBuilder, the manufacturers of NAS boxes have already figured this out too:

The responses I received from Synology, QNAP, NETGEAR and Buffalo all indicated that their NAS RAID controllers don't depend on or even listen to TLER, CCTL, ERC or any other similar error recovery signal from their drives. Instead, their software RAID controllers have their own criteria for drive timeouts, retries and when a drive is finally marked bad.

These NAS boxes are all (to my knowledge) running Linux.

I run Linux and FreeBSD, so I'm interested in knowing how to configure those operating systems in such a way that I feel safe using "desktop" grade drives. Let's figure out the settings for all the common operating systems/controllers and post them here, so we can finally put this issue to rest and go back to using Inexpensive drives in our Redundant Arrays of Inexpensive Disks.

Edited by ZFS lover


I'm going to start using this thread to store my notes as I research this issue. Please feel free to chime in.

Linux:

This message implies that it's impossible to tell a drive to cancel its bad read operation:

You can set the ERC values of your drives. Then they'll stop processing their internal error recovery procedure after the timeout and stay responsive. Without an ERC timeout, the drive tries to correct the error on its own (not responding to any requests), mdraid assumes an error after a while and tries to rewrite the "missing" sector (assembled from the other disks). But the drive will still not react to the write request, as it is still doing its internal recovery procedure. Now mdraid assumes the disk to be bad and kicks it.

There's nothing you can do about this vicious circle except either enabling ERC or using RAID Edition disks (which have ERC enabled by default).

Evidence that the ATA ERC commands don't always work:

The ERC commands took, insofar as I was able to read back the values I had set. This didn't seem to help much with the issues I was having, however.

Hope for improvement:

As for the read errors/kicking drives from the array, I'm not sure why it gets kicked reading some sectors and not others, however I know there were changes to the md stuff which handled that more gracefully earlier this year. I had the same problem: on my 2.6.32 kernel, a rebuild of one drive would hit a bad sector on another and drop the drive, then hit another bad sector on a different drive and drop it as well, making the array unusable. However, with a 2.6.35 kernel it recovers gracefully and keeps going with the rebuild. (I can't find the exact patch, but Neil had it in an earlier email to me on the list; maybe a month or two ago?)

So again, I'd suggest trying a newer kernel if you're having trouble.

I'm not sure what patch he's referring to, but it might be bad block list management for md and RAID1.
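One knob I have found so far is the per-device command timeout in sysfs, which is how long the kernel's SCSI/libata layer waits before it starts resetting the drive; the default is 30 seconds. Raising it is the opposite of the abort-after-7-seconds idea from the first post (wait longer rather than give up sooner), but it should at least stop md from kicking a drive that is merely busy with its internal recovery. A rough sketch, assuming the disk shows up as /dev/sda:

cat /sys/block/sda/device/timeout          # current timeout in seconds (default 30)
echo 180 > /sys/block/sda/device/timeout   # give a desktop drive time to finish its own recovery

This is per-device and not persistent across reboots, so it would have to be reapplied from a boot script or a udev rule. I still need to verify how well this works in practice.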

Edited by ZFS lover


FreeBSD:

It looks like there's a tunable parameter for how long before a drive is dropped from a RAID array, and it's set by default to 4 seconds:

kern.geom.mirror.timeout: 4

This could be set to a high number to prevent consumer drives from dropping from the array when they hit a bad sector, but it doesn't solve the issue of poor performance while the drive is retrying.
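If that tunable really is the drop timeout (I haven't read the gmirror source to confirm it), raising it would look something like this:

sysctl kern.geom.mirror.timeout          # show the current value
sysctl kern.geom.mirror.timeout=120      # raise it for the running system

To make it persist across reboots the same setting would go in /etc/sysctl.conf or /boot/loader.conf; I haven't checked which of the two this particular knob honors.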


You guys are on the right track, but don't realise that you already got what you want!

Both Linux and FreeBSD can use normal desktop drives without TLER, and in fact you would not even want TLER in such a case, since TLER can be dangerous in some circumstances. Read on.

What is TLER/CCTL/ERC?

TLER (Time-Limited Error Recovery)

CCTL (Command Completion Time Limit)

ERC (Error Recovery Control)

These basically mean the same thing: limit the number of seconds the hard drive spends trying to recover a weak or bad sector. TLER and its variants are typically configured to 7 seconds, meaning that if the drive has not managed to recover that sector within 7 seconds, it gives up, forfeits recovery, and returns an I/O error to the host instead.

The behavior without TLER is that up to 120 seconds (20-60 is more common) may pass before a disk gives up on recovery. This plays havoc with all hardware RAID and Windows-based software/onboard/driver RAID: the RAID layer is typically configured to consider a disk that doesn't respond within 10 seconds as completely failed, which is bizarre to say the least! It smells like the vendors have some sort of arrangement that makes you buy HDDs at twice the price for what amounts to a simple firmware setting. Don't get ripped off; read on!

When do I need TLER?

You need TLER-capable disks when using any hardware RAID or any Windows-based software RAID; a bummer if you're on the Windows platform! This also means hardware RAID on any OS (FreeBSD/Linux) needs TLER disks, even when configured to run as a 'JBOD' array. There may be controllers with firmware that lets you set the I/O timeout limit, but I haven't heard of specific products, except some LSI 1068E in IR mode; reputable vendors like Areca (FW 1.43) certainly require TLER-enabled disks, or drives will drop out like flies whenever you hit a bad/weak sector that needs more than 10 seconds of recovery.

Basically, if you use a RAID platform that DEMANDS the disks to respond within 10 seconds, and will KICK OUT disks that do not respond in time, then you need TLER.

When don't I need TLER?

When using FreeBSD/Linux software RAID on an HBA, which is a RAID-less controller. An Areca HW RAID running in JBOD mode is still a RAID controller; it controls whether the disks are detached, not the OS. With a true HBA like the LSI 1068E (Intel SASUC8i) your OS has control over whether to detach the disk, and Linux/BSD won't, at least not for a simple bad sector. I'm not sure about Apple OS X, but since it borrows from FreeBSD I'd speculate it behaves the same way, perhaps tuned differently.

Why don't you want TLER even if your disks are capable?

If you don't need TLER, then you don't want TLER! Why? Well, because TLER is dangerous! Nonsense? Consider this:

1. You have a nice RAID5 array on hardware RAID; being a valuable customer, you paid the premium price for TLER-capable disks.

2. Now one of your disks dies; oh bummer! But hey, I have RAID5; I'm protected, RIGHT?

3. So I buy a new disk and replace the failed one! So easy, ha ha!

4. Oh no! A bad sector on one of the remaining member disks caused TLER to give up; now I get an I/O error while rebuilding my degraded array, the rebuild stops, and I've lost access to my data! Arrrgh!!

The danger of TLER is that once you have lost your redundancy, and a weak sector turns up that COULD be recovered, TLER forces the drive to STOP TRYING after 7 seconds. If the drive hasn't fixed it by then, TLER is a harmful property instead of a useful one.

TLER works best when you have a lot of redundancy and can swap disks easily, and want disks that show any sign of weakness to be kicked out and replaced ASAP, without causing hiccups that would be unacceptable on, say, a heavy-duty online transaction server. So TLER can be useful, but for consumers it is more like an interesting way for vendors to make some extra money from you poor souls!

What is Bit-Error Rate and how does it relate to TLER?

uBER, or Uncorrectable Bit-Error Rate, has been steady at about 1 error per 10^14 bits read, but capacities keep growing while the error rate stays the same. That means modern high-capacity hard drives are more likely to be affected by amnesia: sometimes they really cannot read a sector. This can be physical damage to the sector itself, or just a weak charge; no physical damage, but the sector is unreadable all the same.

So 2TB 512-byte sector disks have a relatively high chance of hitting an uncorrectable bit error. That makes them even more susceptible to dropping out of conventional Windows/hardware RAIDs, and is why the TLER feature has become more important. But I consider it rather a curse than a blessing.

So.. explain again please: why don't I need TLER on Linux/BSD?

Simple: the OS does not detach a disk that times out, but resets the interface and retries the I/O. Also, when using ZFS, it will rewrite a bad sector with the data reconstructed from redundancy, causing that bad sector to be instantly fixed/healed, since writing to a bad sector makes the disk perform a sector swap right away. In the SMART data, a "Current Pending Sector" (active bad sector) then becomes part of the "Reallocated Sector Count" (a retired bad sector which no longer causes harm and can no longer be seen or used by the host operating system).
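You can watch that happen in the SMART data; something like this (FreeBSD device name used as an example) run before and after a scrub should show the pending count drop and, possibly, the reallocated count rise:

smartctl -A /dev/ada0 | egrep 'Reallocated_Sector|Current_Pending_Sector'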

That includes ZFS?

Yes. ZFS is, of course, the most reliable and advanced filesystem you can use to store your files, right now. It's free, it's available, it's hot. So use it whenever you can.

Feel free to comment.


Cool, that's good news. So if I understand you correctly, I can use consumer grade drives without TLER/* on Linux and FreeBSD without setting any special parameters, and sleep well at night?

Here's a message I saw on the OpenSolaris forum about ZFS that caused me some concern:

Any timeouts in ZFS are annoyingly based on the "desktop" storage stack underneath it, which is unaware of redundancy and of the possibility of reading data from elsewhere in a redundant stripe rather than waiting 7, 30, or 180 seconds for it. ZFS will bang away on a slow drive for hours, bringing the whole system down with it, rather than read redundant data from elsewhere in the stripe, so you don't have to worry about drives dropping out randomly. Every last bit will be squeezed from the first place ZFS tried to read it, even if this takes years. However, you will get all kinds of analysis and log data generated during those years (assuming the system stays up enough to write the logs, which it probably won't: http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSFailmodeProblem ). Maybe it's getting better, but there's a fundamental philosophical position on which piece of code is responsible for what sort of blocking behind all this, IMHO.

Any thoughts?

So if I understand you correctly, I can use consumer grade drives without TLER/* on Linux and FreeBSD without setting any special parameters

Yes, a normal Samsung F4 works great, and it is the drive I recommend for a mass-storage ZFS home NAS. Do apply the firmware fix, though, which addresses a corruption issue on this drive. You also need aligned partitions on these 4K-sector disks. And the controller is very important as well; it has to be a non-RAID controller, called an HBA. I prefer the Intel SASUC8i since it can be flashed with IT-mode (non-RAID) firmware and works on a lot of OSes including Solaris, FreeBSD, Linux and Windows. I use the SuperMicro USAS-L8i, which is the same chip (LSI 1068E) but was cheaper than the Intel; it has a non-ATX bracket though, which is why I recommend the Intel SASUC8i.

You need to use a modern Linux or FreeBSD as well; older FreeBSD and Linux have bad timeout behavior just like current hardware RAID. On FreeBSD, anything from 8.x onwards should be fine; the storage stack is very advanced: both SCSI and ATA (what people call "IDE") are unified and share a lot of code in a clean stackable infrastructure, making things like TRIM-over-RAID possible, and the GEOM framework adds an abundance of pluggable and stackable storage modules which give you software RAID, compression, virtualization, encryption and a lot more.

Here's a message I saw on the OpenSolaris forum about ZFS that caused me some concern:

I can't say anything about Solaris and derived products, but on FreeBSD the ATA/CAM stack controls the timeouts and progressively increases them as they occur, before the disk is detached. This means your disk should not be detached due to a simple bad-sector timeout, as is common on desktop systems. It doesn't use a fixed timeout value; rather it keeps the initial timeout low, so it can report quickly to ZFS and anything that lies 'beyond' that disk in the GEOM framework, while retrying with a higher timeout value each time, until finally failing when the recovery time has expired.
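For the curious, on the FreeBSD versions I have looked at the CAM-level defaults are visible as sysctls (names may differ between releases):

sysctl kern.cam.ada.default_timeout    # per-command timeout in seconds for ada(4) disks
sysctl kern.cam.ada.retry_count        # how many times a command is retried before it is failed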

I also observed interesting behavior on a degraded RAID-Z where one disk had lost power. zpool status showed a very high write error count for that disk, even though I was only reading data. The only explanation is that ZFS tried to write to the failed device even though it could not read from it: because ZFS could not read from the device, it still tried to supply the failed disk with the data that could not be read from it, retrieved from the alternative (redundant) sources instead. Writing that data to a drive with a bad sector causes it to swap the bad sector, and all problems go away! This is the desired behavior for error control.

4. Oh no! A bad sector on one of the remaining member disks caused TLER to give up; now I get an I/O error while rebuilding my degraded array, the rebuild stops, and I've lost access to my data! Arrrgh!!
Let's not forget that with the size of disks today, a rebuild failure is far more likely than it used to be. With a BER of 1 in 10^15, things get scary considering how easy it is to end up with a 5TB or 10TB RAID 5 these days.

So not necessarily a major factor, but definitely a contributing factor. We much prefer to run RAID6 over RAID5 for this reason.
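To put a rough number on that (a back-of-envelope estimate, treating the spec-sheet rate as an independent per-bit probability, which is generous): a full read of a 10TB array is about 8 x 10^13 bits. At 1 error in 10^14 bits that is 0.8 expected unrecoverable reads, so the chance of a clean RAID 5 rebuild is roughly e^-0.8, about 45%. At 1 in 10^15 it is about 0.08 expected errors, or roughly a 92% chance of a clean rebuild. Either way, the extra parity of RAID 6 means a single unreadable sector during the rebuild is not fatal.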


OK, so we set up the software RAID with ZFS.

How do we align it?

I don't think you need to if you use whole disks. If you slice your disks, then you have to do the maths, but ZFS loves raw disks.
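For the whole-disk route on FreeBSD it really is a one-liner (pool and device names here are only examples):

zpool create tank raidz ada0 ada1 ada2 ada3    # whole disks, no partitioning or alignment maths
zdb -C tank | grep ashift                      # sanity check: on 4K-sector drives you want to see ashift=12

Note that the ashift value is about the pool's internal block alignment, which is a separate issue from partition alignment; if it comes out as 9 on a 512-byte-emulating 4K drive, there are tricks (gnop on FreeBSD) to force 12, but that is a topic for another post.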


Alignment can be achieved in two ways:

Option 1: don't use any partitions at all; just bare disks, like dilidolo said. This option is simple, but has some limitations:

limitation a: you don't have any real 'names' for your drives. GPT partitions allow you to give your disk a name like "samsung4" if you have 8 samsung disks. This makes managing and identifying your disks easier.

limitation b: if you need to replace a disk, it must be at least the same size. HDD sizes between brands and models can vary by a small amount; if your new disk is just 50 kilobytes smaller, you can't use it to replace a failed disk. For that reason, reserving 1MB - 100MB of space at the end of each disk gives you the ability to use slightly smaller replacement disks as well.

limitation c: if you use FakeRAID that stores metadata in the last sector, like Silicon Image, Promise, JMicron, and "Gigabyte" branded (JMicron) controllers, then you must avoid writing to that last sector or you could have corruption and boot issues. In this case, you should use partitions and reserve at least the last few sectors; for the reasons above, preferably more.

Option 2: use aligned partitions. When we say aligned we usually mean a multiple of 4KiB, but typically alignment is done with a 1MiB (1 binary megabyte) offset, also referred to as 2048 sectors, since each sector is 512 bytes; even current 4K disks like the Samsung F4 still report 512-byte sectors, being native 4K drives with 512-byte sector emulation. To create an aligned partition you should use a tool that supports it. I'm not sure FreeNAS does, but my ZFSguru project does, and manual commands work as well (see the sketch below). Either MBR or GPT partitions can be used, but I suggest GPT since it allows you to give names (labels) to your disks, identifying them for easy management.

The only limitation of option 2 is that portability of your ZFS pool would be limited to platforms that fully support GPT; Solaris does not support freebsd-zfs GPT partitions, sadly. So if you want to retain compatibility between Solaris and FreeBSD, your only option would be bare disks (option 1) or old-fashioned MBR partitions.
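For the manual-commands route on FreeBSD, a rough sketch with gpart would be as follows (disk names and labels are only examples; repeat the first two commands for each member disk, and add -s to the gpart add line if you want to leave slack at the end as described in limitation b):

gpart create -s gpt ada0
gpart add -b 2048 -t freebsd-zfs -l samsung1 ada0    # start at sector 2048 = 1MiB for 4K alignment
zpool create tank raidz gpt/samsung1 gpt/samsung2 gpt/samsung3 gpt/samsung4

The labels then show up under /dev/gpt/, which is what makes identifying a failed member so much easier.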

Cheers.

Edited by sub.mesa


And the controller is very important as well; it has to be a non-RAID controller, called an HBA. I prefer the Intel SASUC8i since it can be flashed with IT-mode (non-RAID) firmware and works on a lot of OSes including Solaris, FreeBSD, Linux and Windows. I use the SuperMicro USAS-L8i, which is the same chip (LSI 1068E) but was cheaper than the Intel; it has a non-ATX bracket though, which is why I recommend the Intel SASUC8i.

How about embedded non-RAID SATA ports on the motherboard? For example, I'm using the motherboard SATA ports on the HP MicroServer for both FreeBSD and Linux. I can't tell from their spec sheet what chip they're using. Are they sufficient?

I can't say anything about Solaris and derived products, but on FreeBSD the ATA/CAM stack controls the timeouts and progressively increases them as they occur, before the disk is detached. This means your disk should not be detached due to a simple bad-sector timeout, as is common on desktop systems. It doesn't use a fixed timeout value; rather it keeps the initial timeout low, so it can report quickly to ZFS and anything that lies 'beyond' that disk in the GEOM framework, while retrying with a higher timeout value each time, until finally failing when the recovery time has expired.

I also observed interesting behavior on a degraded RAID-Z where one disk had lost power. zpool status showed a very high write error count for that disk, even though I was only reading data. The only explanation is that ZFS tried to write to the failed device even though it could not read from it: because ZFS could not read from the device, it still tried to supply the failed disk with the data that could not be read from it, retrieved from the alternative (redundant) sources instead. Writing that data to a drive with a bad sector causes it to swap the bad sector, and all problems go away! This is the desired behavior for error control.

Awesome! Great information, thank you!


You guys are on the right track, but don't realise that you already got what you want!

...

Feel free to comment.

Some good points. However, there are a few missing bits of information.

#1: RAID doesn't really mean inexpensive disks. http://en.wikipedia.org/wiki/RAID

#2: Always use "Advanced Format" drives (4k sector). There's no reason to buy 512b sector drives anymore. This will improve your uBER to <1 in 10^15.

#3: Nearline (sometimes called enterprise) drives do have one physical advantage over desktop drives. They have an additional rotational vibration sensor that allows them to work in sets without the seeks from one drive causing seek errors (harmless, but performance-degrading) in another.

These days you can buy very nice low power nearline drives like the WD RE4-GP.


The answer is simple : 3rd party firmware for drives!

This has been done for many things in the past; now it is time for some folks to write or rewrite the firmware/microcode for the drives. Not all drives, just the popular and cheap models, say 1TB and up.

There are really no physical limits; the drive makers write the firmware so we can't easily use the drives in arrays, so we have to spend more just for a drive with a few 1s and 0s in the firmware that don't make the RAID controllers kick them out of the array.

Option #2: the first person to make a RAID controller, either software or hardware, that can account and compensate for the lack of TLER (or whatever it is being called) will do well for a period of time.

I once took some 750GB desktop drives and flashed the enterprise firmware, and guess what? The array ran without a drive failure for 15 months; not bad considering that before the flash I had to rebuild every other month. Same drives; a flash fixed me right up.

Regards,

Dave


This is a great thread. Very useful to me.

#1: RAID doesn't really mean inexpensive disks. http://en.wikipedia.org/wiki/RAID

That was the original name. And that was the original point: take cheap and nasty low-end drives and synthesize a system which has the performance characteristics (bandwidth and reliability) of higher-cost drives. At the time there were several tiers of disks with limited crossover: at least PC (ST506 evolving towards IDE), workstation (SCSI), departmental, and data centre.

It is fair to say that in 25 years circumstances have changed. But the point you are trying to quibble with is bang on.

The lack of TLER in consumer drives is clearly intended to create market segmentation. I cannot imagine any marginal cost in adding the feature to firmware. If the market were really competitive, then every manufacturer would offer it in their consumer drives.


Both Linux and FreeBSD can use normal desktop drives without TLER, and in fact you would not even want TLER in such a case, since TLER can be dangerous in some circumstances. Read on.

That's really useful to know. I use Linux.

I assume that there must be settings somewhere for how patient the driver should be but I don't know where they are.

Why don't you want TLER even if your disks are capable?

If you don't need TLER, then you don't want TLER! Why? Well, because TLER is dangerous! Nonsense? Consider this:

...

The danger of TLER is that once you have lost your redundancy, and a weak sector turns up that COULD be recovered, TLER forces the drive to STOP TRYING after 7 seconds. If the drive hasn't fixed it by then, TLER is a harmful property instead of a useful one.

Quite true. When there is no redundancy (i.e. no RAID or a degraded RAID), TLER should be turned off. One would hope that the RAID controller (firmware or software) would be smart enough to do this. But during normal running of a RAID system, TLER is definitely a plus. I would expect it to reduce the duration of the horrible latency bursts that might affect other parts of the computer system.
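On drives that accept the SCT ERC command, that switching can even be scripted (again /dev/sdb is only a placeholder, and ideally the RAID layer would do this for you):

smartctl -l scterc,0,0 /dev/sdb      # array degraded: disable the limit so the drive tries as long as it needs
smartctl -l scterc,70,70 /dev/sdb    # redundancy restored: back to 7.0 second read/write limits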


I was using desktop grade Seagate/Hitachi drives successfully for years on a hardware RAID 6 controller. I used to get a timeout error once every couple of months, it would rebuild pretty quickly and since I'm on RAID 6 it wasn't much of a concern.

I just recently switched to WD SE "enterprise grade" drives, and despite their having this TLER feature, right after a successful migration I've had several timeouts from multiple drives.

This whole TLER thing just feels like a bit of a placebo to me, after what I've experienced anyway. So much for "enterprise" grade.

Edited by E71


I was using desktop grade Seagate/Hitachi drives successfully for years on a hardware RAID 6 controller. I used to get a timeout error once every couple of months, it would rebuild pretty quickly and since I'm on RAID 6 it wasn't much of a concern.

I just recently switched to WD SE "enterprise grade" drives, and despite their having this TLER feature, right after a successful migration I've had several timeouts from multiple drives.

This whole TLER thing just feels like a bit of a placebo to me, after what I've experienced anyway. So much for "enterprise" grade.

I tend to agree with you. I buy RE's because they obviously perform better in RAID, but due to my budget limitations I have also used WD Blacks in RAID for years. They have not been any less reliable than the RE's.

In my experience if a drive doesn't respond for seven or more seconds, then that drive probably has a serious problem and is close to failure. I agree that TLER is important for keeping a high-availability array running even if a drive is becoming glitchy, but the emphasis on TLER implies that quality non-RAID drives routinely time out for extended periods. This is not the case, and Adaptec even lists the WD Black on my controller's compatibility list.

While TLER is worth having, I think the main benefits of enterprise drives are the superior vibration tolerance (which affects speed a great deal), the lower probability of uncorrectable errors, and the large command queue/RAID optimized firmware.

PS--The SE's are fairly new. Be sure to check your adapter's compatibility list and get the current firmware. Also check your cables.

Edited by dietrc70

