Eugene

TCQ, RAID, SCSI, and SATA


1) Is RAID 10 striping two pairs of mirrored drives or mirroring two pairs of striped drives? Just curious on this one. I don't think it would make a lot of difference.

Here's the answer: RAID 0+1 vs. RAID 1+0

The Guide contains a good, detailed discussion of the various flavors of RAID.

RAID 0+1 or 1+0 can provide higher performance (esp. writes; no parity calculation or write) than RAID 5, but at a higher equipment cost.
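For what it's worth, here is a tiny sketch of the difference on four drives (the drive numbering and stripe assignment are invented purely for illustration, not any controller's actual layout): RAID 1+0 stripes across mirrored pairs, while RAID 0+1 mirrors two whole stripe sets.

```python
# Hypothetical 4-drive layouts; drive/stripe numbering is illustrative only.

def raid10_copies(stripe):
    """RAID 1+0: stripes alternate between mirrored pairs (0,1) and (2,3)."""
    pair = stripe % 2
    return (2 * pair, 2 * pair + 1)      # both drives of the chosen mirror pair

def raid01_copies(stripe):
    """RAID 0+1: stripe set A is drives (0,1); its mirror, set B, is drives (2,3)."""
    member = stripe % 2
    return (member, member + 2)          # same position in each stripe set

for s in range(4):
    print(f"stripe {s}: RAID 1+0 on drives {raid10_copies(s)}, "
          f"RAID 0+1 on drives {raid01_copies(s)}")
```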


Oh man, this article broke me in two. At least I'm less ignorant about the true capabilities of RAID-0 with SATA II. And I was so juiced about the Promise TX4200 and two Seagate Barracuda 7200.7's or .8's with RAID-0 and NCQ.

What will it take, then, to vastly improve storage device performance for a single-user box? Forget solid-state, because I'm poor and I've heard it's not progressing.

Is there going to be a "re-test" when optimizations have been made to the hardware controllers, drives, and drivers (if any)?

I guess I had gotten my hopes up with this other article:

http://www.bjorn3d.com/read.php?cID=734

Thanks for your professional reviews.


At my previous job I did a lot of comparisons of caching, RAID, and disk-drive performance to estimate which configurations would provide the best application response for our product, which incorporated multiple databases.

One database used a single large file, and that was very much unfazed by the differing hardware I tossed underneath it.

The other database used many, many, many separate files. Transactions would flush to disk at start and finish. That guaranteed solidly written data, but at a massive expense in performance.

In the end, what I found mattered was spindle rpm; 15K drives have faster seek times, end of story. I had a 5400 rpm PATA-133 drive that was spanking the 15K rpm SCSI-160 drive, but it was lying about writing data to disk and caching it instead. After using the utility from MS to turn off write caching, I got the expected spindle-speed-limited performance on writes.

However, when flushing multiple files to disk, command queuing can make a big difference, since it can order the flushes sensibly; multiple small files being accessed simultaneously start to look more like a "multi-user" scenario.
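To make that concrete, here's a rough toy sketch (not any drive vendor's actual algorithm; the LBA numbers are invented) of why letting the drive reorder a queue of outstanding flushes helps: serving them in ascending address order covers far less head travel than serving them strictly in arrival order.

```python
# Toy model: head travel is the only cost; positions are invented LBAs.

def head_travel(requests, start=0):
    """Total distance the head moves when servicing requests in the given order."""
    pos, travel = start, 0
    for lba in requests:
        travel += abs(lba - pos)
        pos = lba
    return travel

arrival_order = [90_000, 1_000, 85_000, 5_000, 70_000, 12_000]
reordered = sorted(arrival_order)        # one simple "elevator"-style reordering

print("arrival order travel:", head_travel(arrival_order))
print("reordered travel:    ", head_travel(reordered))
```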

I'd be interested in seeing similar tests done with real RAID systems, though: RAID 5 on a real controller, with controllers that can do write-through and write-back caching and use battery-backed RAM for cache, so that they can respond to the filesystem immediately and work the disks as needed later.

But then, this was more an article about TCQ and NCQ, so it's going to be about I/Os rather than data throughput.


RankDisk, Command Queueing & I/O Reordering

From the description of Testbed3 (see the Testbed3 article):-

Introduction to IPEAK SPT

IPEAK SPT delivers a wide-ranging suite of utilities that assist in assessing both drive performance and workload characteristics. It consists of five primary components:

WinTrace32 - IPEAK SPT's fundamental tool is WinTrace32, a memory-resident background program that captures all OS and file system calls to a disk controller's driver. It permits the capture of any arbitrary workload and dumps the results into a raw "trace file" that may then be utilized by several other IPEAK components.

[snip]

RankDisk - In our view, this is the most exciting tool of all. RankDisk takes raw trace files generated through WinTrace32 and systematically plays back the requests from the controller on downwards. By doing so, this benchmark permits comparative evaluation of the storage subsystem (driver, controller, and drive) while avoiding WinBench 99's drawbacks.

If I read this correctly, WinTrace takes a simple linear trace of disc activity across all threads & processes, and RankDisk replays it in the same order, i.e. in a single thread.

If I am wrong, and RankDisk replays each process/thread's I/O in a separate thread then what follows is complete *******s, and should be completely ignored.

The Desktop DriveMark traces were recorded on a Maxtor DiamondMax D740X through an Intel 82801BA, neither of which I believe supports CQ.

The point of Command Queueing (TCQ or NCQ) in a disc drive is to allow I/Os to be serviced out of order. In the real (single-user) world, certain processes would therefore proceed faster because of CQ, but RankDisk replays the I/O requests in a fixed order, so it cripples their performance just as if the disc didn't support CQ.

An example of what I mean:

Assume two processes are performing disc accesses, one highly random and the other completely sequential. Both are completely I/O bound (i.e. very low CPU load).

Once one process issues an I/O, it has to wait for it to complete, so the other process then executes and has a chance to issue its own I/O. Therefore we should see a pattern of alternating random & sequential I/Os. Of course, many of the sequential I/Os will be fulfilled from the disc buffer (due to read-ahead), but the disc still has to perform the I/Os in the order requested. This is the pattern that will be recorded by WinTrace on a non-CQ drive.

Now consider the same pattern on a machine with command queueing. The random & sequential processes each issue their first I/O, and the disc will decide in what order to complete them. The difference comes when the processes issue their second I/Os. The disc will have to seek for the random I/O, but the sequential request will be fulfilled immediately from the read-ahead buffer. It can continue to accept & complete sequential requests as quickly as the command overhead will allow, until the buffer is exhausted. The second random I/O may still not have completed.

As far as I can tell, though, the benchmarks do not simulate this behaviour. The trace has alternating random & sequential I/Os. The first I/O from each thread is issued, and completes. The second I/Os are then issued, and the disc reorders them to complete the sequential I/O first. RankDisk then waits for the random I/O to complete before it issues the next sequential I/O, because that's the order in which they were recorded. The sequential I/O has been crippled.
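A back-of-the-envelope simulation of that example (the service times are made-up round numbers, and it assumes read-ahead hits cost almost nothing) shows the size of the effect: replaying the trace in strict recorded order forces the sequential stream to wait on every random seek, whereas issuing each stream's next I/O as soon as its own previous one completes lets the sequential stream race ahead.

```python
# Invented costs: a random I/O needs a seek, a sequential I/O hits read-ahead.
RANDOM_MS = 8.0
SEQ_MS = 0.1

def recorded_order(pairs):
    """Strictly alternating random/sequential I/Os, each waiting for the previous one."""
    elapsed = pairs * (RANDOM_MS + SEQ_MS)
    return elapsed, pairs, pairs            # time, random I/Os done, sequential I/Os done

def per_stream(duration_ms):
    """Each stream issues its next I/O as soon as its own previous one completes.
    Simplification: read-ahead hits are assumed not to interfere with the seeks."""
    return int(duration_ms / RANDOM_MS), int(duration_ms / SEQ_MS)

elapsed, rand_done, seq_done = recorded_order(100)
print(f"recorded order: {rand_done} random + {seq_done} sequential I/Os in {elapsed:.0f} ms")
rand2, seq2 = per_stream(elapsed)
print(f"per-stream issue in the same {elapsed:.0f} ms: {rand2} random + {seq2} sequential I/Os")
```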

It's a shame that this article didn't include WinBench results (even though the applications are so old), because that would have been a real-world test, rather than the synthetic Desktop DriveMarks.

Conclusion from the TCQ article (see http://www.storagereview.com/articles/2004...5TCQ_sp.html):-

2. Command queuing is meant to assist multi-user situations, not single-user setups. With the recent release of Intel's 9xx chipsets, pundits and enthusiasts everywhere have been proclaiming that command queuing is the next big thing for the desktop. Wrong. As evidenced by the disparities between the FastTrak S150 TX4 and TX4200 (otherwise identical except for the latter's added TCQ functionality), command queuing introduces significant overhead that fails to pay for itself performance-wise in the highly-localized, lower-depth instances that even the heaviest single-user multitasking generates. It is becoming clear, in fact, that the maturity and across-the-board implementation of TCQ in the SCSI world is one of the principal reasons why otherwise mechanically superior SCSI drives stumble when compared to ATA units. Consider that out of the 24 combinations yielded from the four single-user access patterns, one-to-four drive RAID0 arrays, and RAID1/10 mirrored arrays presented above, the non-TCQ S150 TX4 comes out on top in every case by a large margin. TCQ is only meant for servers, much like the technology mentioned just below.

I wonder if the comment above should instead say "command queuing introduces significant overhead, whilst Desktop DriveMark doesn't allow any performance benefits to be demonstrated"?

Conclusion from the Testbed3 article:-

More often than not, however, SCSI drives simply do not exhibit the proportional gains in desktop performance that one would expect from their superior spindle speeds, access times, and transfer rates because the manufacturer has invested its resources, time, and effort into crafting the drive into a server-destined design.

Once again, Desktop DriveMark doesn't allow the highly optimised command queueing in these drives to give any benefit. Is this the reason for this conclusion?

I suspect that the IOMeter tests which are used to give the Server DriveMarks generate the requests on-the-fly in such a way that the benefits of TCQ are fully realised.

Thoughts, anyone?

cheers, Martin

If I am wrong, and RankDisk replays each process/thread's I/O in a separate thread then what follows is complete *******s, and should be completely ignored.

WinTrace32 and RankDisk are perfectly capable of tracing and replaying I/Os with multiple queue depths. Here are a few examples of scenarios with high and low queue depths:

Average queue depths:

1) Exchange High Concurrency: 72.31 I/Os

2) Business Winstone 2004 Multi-tasking Test: 5.23 I/Os

3) NFS Underground: 1.85 I/Os


I had to use different hard drives or different controllers because it is impossible to disable NCQ on the Silicon Image Sil 3124-1.

WinTrace32 and RankDisk are perfectly capable of tracing and replaying I/Os with multiple queue depths.

Thank you for your reply. I thought this thread had died!

However, I have to point out that although queueing is a prerequisite for NCQ, it is I/O reordering within those queues which actually delivers any performance benefit.

SR claims that NCQ delivers no real benefit to the end user, which is refuted by your benchmarks.

The SR benchmarks replay their I/Os in a fixed order, so no reordering, so no performance benefit.

cheers, Martin


The quote you have provided from FemmeT actually contains the necessary response to your comments here. But I think you are missing what is being said, and how your comments are incorrect.

By being able to play back a trace including accurate reproduction of the queue depth, you are in fact enabling CQ to provide its benefits. Any time there is disk activity that produces a queue with a depth greater than one, this is captured in the trace. When this trace is played back, the queue depth is maintained to match the depth recorded during the trace. So if at a particular point during the capture the queue depth went up to 5, when it's played back those 5 I/Os will be issued in sequence without waiting for the first or subsequent I/Os to complete. This will therefore enable any CQ performance benefits to be realised in the hardware being benchmarked.

Further, by restricting playback to a queue depth of one when the trace records that that was all the queue depth reached at that time, you won't realise any performance benefits from CQ. This is accurate when benchmarking CQ-enabled hardware, because the CQ hardware would not have been able to apply its benefits in the circumstances in which the trace capture occurred.

The quote you have provided from FemmeT actually contains the necessary response to your comments here. But I think you are missing what is being said, and how your comments are incorrect.


No - I still believe that I have failed to get my point across, and you prove it with the statement that:-

By being able to play back a trace including accurate reproduction of the queue depth, you are in fact enabling CQ

(Of course)

to provide its benefits.

Which is where we start to disagree.

Any time there is disk activity that produces a queue with a depth greater than one, this is captured in the trace. When this trace is played back, the queue depth is maintained to match the depth recorded during the trace. So if at a particular point during the capture the queue depth went up to 5, when it's played back those 5 I/Os will be issued in sequence without waiting for the first or subsequent I/Os to complete. This will therefore enable any CQ performance benefits to be realised in the hardware being benchmarked.

When you say that they are issued in sequence, you are simply confirming my point.

There are five I/Os in the queue. Those presumably come from five different applications or threads.

The benchmark must watch which I/O of those five completes AND THEN ISSUE THE NEXT I/O FROM THAT THREAD, not whichever one it happened to record next when it captured the trace.

NCQ offers the possibility that some threads will run faster than others. The current benchmark says that every thread must run as slow as the slowest one, because it cannot issue I/Os into the queue in a different order than they were captured.
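Something like the following sketch is what I have in mind (the FakeDisk class and its methods are invented stand-ins, not IPEAK SPT's actual interfaces): the replayer tracks which thread each traced I/O came from and only holds a thread back behind its own previous I/O, never behind another thread's.

```python
from collections import deque
import itertools

class FakeDisk:
    """Invented stand-in for the storage stack: it completes whichever
    outstanding request is cheapest first, loosely imitating CQ reordering."""
    def __init__(self):
        self._cost = {}
        self._ids = itertools.count()

    def submit(self, cost_ms):
        handle = next(self._ids)
        self._cost[handle] = cost_ms
        return handle

    def wait_for_any(self, handles):
        done = min(handles, key=lambda h: self._cost[h])   # cheapest finishes first
        del self._cost[done]
        return done

def replay_per_thread(trace, disk):
    """trace: list of (thread_id, cost_ms) in recorded order."""
    queues = {}
    for tid, io in trace:
        queues.setdefault(tid, deque()).append(io)

    pending = {}                                  # outstanding handle -> thread id
    for tid, q in queues.items():                 # first I/O of every thread is outstanding
        pending[disk.submit(q.popleft())] = tid

    completion_order = []
    while pending:
        handle = disk.wait_for_any(pending)
        tid = pending.pop(handle)
        completion_order.append(tid)
        if queues[tid]:                           # issue *that thread's* next I/O
            pending[disk.submit(queues[tid].popleft())] = tid
    return completion_order

# Alternating random/sequential trace, as in the example earlier (costs invented).
trace = [("random", 8.0), ("sequential", 0.1)] * 5
print(replay_per_thread(trace, FakeDisk()))       # sequential I/Os are not held back
```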

cheers, Martin



Can I hope to gain at least 1% in write operations over a single drive, assuming I have four T7K250 250 GB disks in a RAID 5 array with a 128k stripe, controlled by a Promise EX8350? Single-user/desktop usage.

