Guest Eugene

And now... High-End DriveMark 2006 Results

The point of running traces is to isolate the disk subsystem. Isolating this subsystem has been a high priority in SR testing methodologies for some time.

Can of worms.

What is "the disk subsystem"?

HDD, naturally. Plus HBA? Plus Driver? Plus OS? Plus "editor's choice" of application(s)?

The SR DiskMark benchmarks are affected by all of that, plus "editor's choice" of application use / workload, plus characteristics of the drive used to record the trace.

From replaying the same trace on TB3 and TB4 we know that HBA + driver + OS affect DiskMark values for a particular drive by at least 5%.

That may not seem much, but it's still five times the apparent margin of error and thus not exactly negligible.

The actual impact of the other factors is still largely unknown, but most likely even more significant (just look at how much the ranking was perturbed by trace captures from TB3 vs. TB4).

Even if all these issues could be solved, the usefulness of the DriveMark value seems questionable: the ratio of the fastest drive (say A) to the slowest drive (B) is about 2:1 here, but I highly doubt that means I can do the same work with drive A in half the time it takes with drive B.

As far as I can tell, the drive idle time from the trace capture is not factored into the DriveMark value. So it's not a "real world" benchmark, sorry.

And just to throw some more fuel on the fire, I have to imagine that the buffer size of the drive on which the trace was recorded significantly affects the timing of requests as well. Consider a trace recorded on a drive with an 8MB buffer. If a request misses the cache, that particular drive will take a few milliseconds to fetch the data from the platter, and so the trace will record a delay before the next request is issued. However, if the same request is serviced instantly from a 16MB buffer drive because it's in the cache, then the next request won't have to wait. The trace adds an artificial delay and skews the results toward drives with the same buffer size as the reference drive.

Mmmm, life is complicated.


The trace doesn't record a delay as you have described. Whether the request is serviced by a cache hit or not, it is only recorded in the trace as a request for the particular data block(s) along with the current queue depth. The slower response of the drive with the smaller cache in your example doesn't change the capture or playback stage of the process. Well, except for the fact that the slower drive will play back the trace more slowly, which is the whole idea of the trace/playback method of benchmarking.
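To make the distinction concrete, here's a rough sketch of what the capture and playback stages amount to (the record fields and function names are my own illustration, not SR's actual tool):

```python
import time
from dataclasses import dataclass

# Hypothetical trace record -- field names are illustrative, not SR's format.
# Note there is no service-time or "delay" field: whether the original drive
# satisfied the request from cache or from the platter is simply not captured.
@dataclass
class TraceRecord:
    lba: int          # starting logical block address
    length: int       # number of blocks requested
    is_write: bool    # read or write
    queue_depth: int  # outstanding requests when this one was issued

def replay(trace, issue_io):
    """Replay a captured trace against the drive under test and time it.

    `issue_io` stands in for whatever actually submits the request. The real
    method also preserves request order and interarrival times; this sketch
    keeps only the ordering, for brevity.
    """
    start = time.perf_counter()
    for rec in trace:
        issue_io(rec.lba, rec.length, rec.is_write, rec.queue_depth)
    return len(trace) / (time.perf_counter() - start)  # IOs per second
```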

From replaying the same trace on TB3 and TB4 we know that HBA + driver + OS affect DiskMark values for a particular drive by at least 5%.

That may not seem much, but it's still five times the apparent margin of error and thus not exactly negligible.

I've seen this mentioned a couple of times, but I must have missed where this was revealed. Can somebody point me to it?

What is "the disk subsystem"?

HDD, naturally. Plus HBA? Plus Driver? Plus OS? Plus "editor's choice" of application(s)?

The SR DiskMark benchmarks are affected by all of that, plus "editor's choice" of application use / workload, plus characteristics of the drive used to record the trace.

The actual impact of the other factors is still largely unknown, but most likely even more significant (just look at how much the ranking was perturbed by trace captures from TB3 vs. TB4).

The "editor's choice" seems to have had a significant effect on the results. I think part of the problem here is the extremely long period (in computer terms) between benchmark refreshes. It's been 3 or 4 years since TB3 was put together, and SR have been benchmarking new drives, that have likely been tuned to perform for today's applications, on benchmarks based on older applications.

Having said that, for that possible explanation to ring true, you'd want to see newer drives benefitting under TB4 and the reverse for older drives. This doesn't seem to be the case.

So it starts to seem a case of whether the drives have been tuned better or worse for the particular applications used in a benchmark. This would explain why other review sites provide end results that aren't always consistent with SR's. While most other review sites don't use a testing methodology that matches SR's, many of them are still done well enough that their results are valid, even if they sometimes contradict SR's. It seems to me that to provide the most applicable, general-use benchmarks, you need to use as large an application set as possible. SR have always said that the one point they would consider conceding as inaccurate in their methodology is the applications used (my wording) - perhaps there was more to this than we previously expected?

Even if all these issues could be solved, the usefulness of the DriveMark value seems questionable: the ratio of the fastest drive (say A) to the slowest drive (B) is about 2:1 here, but I highly doubt that means I can do the same work with drive A in half the time it takes with drive B.

As far as I can tell, the drive idle time from the trace capture is not factored into the DriveMark value. So it's not a "real world" benchmark, sorry.


The DriveMarks have been designed as a way to measure the performance of the drives in 'real world' usage, but in isolation from all other factors. It's never been implied that a drive that's twice as fast in one of these benchmarks will result in the entire system running twice as fast. If a trace is taken of activity that only has disk activity during 20% of that time period, a drive that plays back that trace twice as fast obviously doesn't make any difference to the other 80% of that time period.

What might be interesting, along those lines, is the overall time elapsed during the trace captures. And in addition to that, the time it takes to play back the trace. This would help put the benchmarks into proper context. The DriveMark numbers are derived by simply dividing the total number of IOs in a trace by the number of seconds it takes to play back the trace, so the information is probably available, just not published. Eugene?
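As a back-of-the-envelope illustration of both points (all numbers below are invented, not SR's):

```python
# A DriveMark-style score: total IOs in the trace divided by playback time.
def drivemark(total_ios, playback_seconds):
    return total_ios / playback_seconds

# The same hypothetical 50,000-IO trace played back on two drives:
score_a = drivemark(50_000, 100.0)   # 500 IO/s
score_b = drivemark(50_000, 200.0)   # 250 IO/s -> drive A scores 2x drive B

# But if the disk was only busy 20% of the original one-hour capture period,
# halving the disk time does not halve the overall elapsed time:
capture_minutes = 60.0
disk_busy_fraction = 0.2
non_disk = capture_minutes * (1 - disk_busy_fraction)            # 48 min, unaffected
elapsed_b = non_disk + capture_minutes * disk_busy_fraction      # 60 min
elapsed_a = non_disk + capture_minutes * disk_busy_fraction / 2  # 54 min
# A 2:1 DriveMark ratio corresponds to only ~10% less wall-clock time here.
```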

From replaying the same trace on TB3 and TB4 we know that HBA + driver + OS affect DiskMark values for a particular drive by at least 5%.

That may not seem much, but it's still five times the apparent margin of error and thus not exactly negligible.

I've seen this mentioned a couple of times, but I must have missed where this was revealed. Can somebody point me to it?


http://forums.storagereview.net/index.php?...ndpost&p=212643

I'd be interested in seeing the actual numbers for that chart. It doesn't look like 'at least 5%' to me, with perhaps the exception of Gaming on the MAT. Certainly not the 8-9% CityK stated.

It would also be interesting to see the results on a larger sample of drives. The Raptor results seem essentially unaffected by the hardware platform, whereas there is some variation on the MAT. Is one the exception and one the standard? Perhaps there is variation in all SCSI drives, pointing to the controller rather than the drives?

I seem to have a growing wish list that Eugene certainly will never have time to complete. Sorry Eugene :unsure: Just trying to assist with my thoughts on what can be done to ensure the validity of the testing methodology while there appears to be an opportunity to do so (i.e. before the TB4 project is finalised).

I used an image viewer to read pixel locations and calculated the difference from there. For the MAT3300 hi-end it's 255 pixels vs. 244 pixels, or 4.51%.

From replaying the same trace on TB3 and TB4 we know that HBA + driver + OS affect DiskMark values for a particular drive by at least 5%.

That may not seem much, but it's still five times the apparent margin of error and thus not exactly negligible.


Hence, all those components are kept constant. The benchmarking software isolates the disk subsystem. Its application in a stable testbed isolates the disk.

Certainly not the 8-9% CityK stated.
I've no clue where I got that from now. Probably I was looking at the wrong colors, or it was a mental error.

And just to throw some more fuel on the fire, I have to imagine that the buffer size of the drive on which the trace was recorded significantly affects the timing of requests as well. Consider a trace recorded on a drive with an 8MB buffer. If a request misses the cache, that particular drive will take a few milliseconds to fetch the data from the platter, and so the trace will record a delay before the next request is issued. However, if the same request is serviced instantly from a 16MB buffer drive because it's in the cache, then the next request won't have to wait. The trace adds an artificial delay and skews the results toward drives with the same buffer size as the reference drive.

Mmmm, life is complicated.


The trace doesn't record a delay as you have described. Whether the request is serviced by a cache hit or not, it is only recorded in the trace as a request for the particular data block(s) along with the current queue depth. The slower response of the drive with the smaller cache in your example doesn't change the capture or playback stage of the process. Well, except for the fact that the slower drive will play back the trace more slowly, which is the whole idea of the trace/playback method of benchmarking.


Then what does this mean?

I presume that the new benchmarks still enforce the same order of submission of I/Os to the drives, even when CQ is enabled?

cheers, Martin


That's correct, request order and interarrival times are properly preserved.


Have you sorted out the Server DriveMarks 2002 IO/sec results yet? (They are really a score, not an IO/sec rate.)

How can a drive perform more IO/s than the rate it achieves with a load of 256 outstanding IOs? Every result of yours breaks this rule. Really, the value is IOs per 1.4 to 1.44 seconds, depending on the test (file server, web server).

Testbed 3: The StorageReview.com Server DriveMarks

The solution? Normalization. Multiply up the single I/O score by a coefficient that equalizes the average single I/O score with 64 I/O results

The correct method would be to calculate the rate based on a workload.

For example,

Calculate the time to complete the IOs for each load:
100 IOs @  1 IO load
200 IOs @  4 IO load
250 IOs @ 16 IO load
300 IOs @ 64 IO load
Take the total number of IOs completed and divide by the total time it would take. Now we have a statistically & mathematically correct result.
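
In code, treating the figures above as per-load IO/s rates and assuming an equal number of IOs at each load (both assumptions are mine), the calculation would look something like this:

```python
# Sketch of the proposed calculation. Assumptions (mine, not from the post):
# the listed figures are measured IO/s rates at each outstanding-IO load,
# and the workload issues the same number of IOs at every load.
rates = {1: 100, 4: 200, 16: 250, 64: 300}   # IO/s at loads of 1, 4, 16, 64
ios_per_load = 10_000                        # hypothetical workload size

total_ios = ios_per_load * len(rates)
total_seconds = sum(ios_per_load / r for r in rates.values())

weighted_rate = total_ios / total_seconds
# ~179 IO/s: the harmonic mean of the per-load rates, which by construction
# can never exceed the fastest per-load rate -- the property argued for above.
```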

You need to keep the denominator in seconds so the result remains IOs per second. Multiplying the 1 IO load rate by a coefficient of roughly 1.9 totally destroys the result as an IO/sec rate.

If this is your standard of mathematical skills, then no wonder people question the new DriveMark 2006 results.


If this is your standard of mathematical skills, then no wonder people question the new DriveMark 2006 results.
I think commentary like this is neither called for nor productive. If you want to jump on the bandwagon, then at least do so constructively.
Have you sorted out the Server DriveMarks 2002 IO/sec results yet? (They are really a score, not an IO/sec rate.)
It's an index score whose value is expressed in IO/s.
How can a drive perform more IO/s than the rate it achieves with a load of 256 outstanding IOs?
Once again, it's an index value -- one which has been derived by equally weighting the constituent elements (whose real IO/s values, BTW, are also given in the performance database for all to see). An index is simply a measure or benchmark -- SR could just as well have formed the index using a base multiple of 1000, or any other number for that matter.
The correct method ...
As I just described, an index can be formed by any number of different methods.
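Purely for illustration (the formula and numbers below are mine, not the actual DriveMark derivation), here is one trivial way an equal-weighted index can be formed:

```python
# Purely illustrative -- not the actual DriveMark formula. The point is only
# that an "index" is a derived score: equal-weight the constituents, then
# rescale by whatever base is convenient (IO/s-like units, 1000, anything).
def equal_weight_index(constituents, scale=1.0):
    return scale * sum(constituents) / len(constituents)

# Made-up per-load IO/s results for one drive:
fileserver = [100, 200, 250, 300]
score = equal_weight_index(fileserver)   # 212.5 -- a score, not a measured rate
```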
You need to keep the denominator in seconds so the result remains IOs per second. Multiplying the 1 IO load rate by a coefficient of roughly 1.9 totally destroys the result as an IO/sec rate.
This sentence would seem to indicate that you have completely failed to understand how the index has been derived.
If this is your standard of mathematical skills
I suggest you withhold such criticism when your own demonstrated understanding of some rather basic math is lacking.

I used an image viewer to read pixel locations and calculated the difference from there. For the MAT3300 hi-end it's 255 pixels vs. 244 pixels, or 4.51%.


I guess that would be +/- 2.26%

cheers, Martin

Not really; please note that the same drive shows a difference with the opposite sign for the Bootup DriveMark. In any case, the number of samples is far too small for any accurate and reliable analysis.

Guest Eugene
Not really; please note that the same drive shows a difference with the opposite sign for the Bootup DriveMark. In any case, the number of samples is far too small for any accurate and reliable analysis.


I've run a few more drives through the same setup; still not an all-encompassing sample by any means, but perhaps enough to affirm some loose conclusions?

TB3vTB4_2002.png

Guest Eugene
There ARE obvious candidates for drives whose performance I really don't understand in the new TB4 - such as the Deskstar. Maybe IBM knew what they were doing when they sold to Hitachi after all.

For those interested, regarding this very closely scrutinized outlier, it turns out that the Deskstar 7K400 has run through all our tests with its legacy ATA tagged command queuing enabled. The high-end chart released here as well as its office drivemark counterpart show all CQ-capable ATA drives with TCQ or NCQ enabled and disabled. With the Deskstar, only TCQ-enabled results have been presented thus far. As demonstrated by the Raptor WD740GD, TCQ has a detrimental effect on single-user performance. I'll update these graphs shortly to include non-CQ results for the Deskstar.

Guest Eugene
I'll update these graphs shortly to include non-CQ results for the Deskstar.

drivemark_high-end.png

change_high-end_2002-2006.png

As these revised graphs demonstrate, the Deskstar goes from being a big loser (as referenced by many in this thread) to a modest gainer. Note that since Testbed3 never featured any results for the 7K400 with TCQ enabled, there isn't an entry for the % change with TCQ active.

I'll update these graphs shortly to include non-CQ results for the Deskstar.

http://www.storagereview.com/benchimages/d...rk_high-end.png

http://www.storagereview.com/benchimages/c...d_2002-2006.png


Wouldn't it be handy to indicate whether it's a PATA or SATA drive?


Guest Eugene
It's SATA... all ATA drives listed are serial.


I'm confused. Does the SATA 7K400 support TCQ?


In the past, IBM publicized their drives' ATA-4 style tagged command queuing. This in effect disappeared with the 7K400. I remember, in fact, some readers lamenting that Hitachi seemed to have "removed" TCQ from their latest drive. With Testbed3 (and the Promise SATA150TX4), I informally attempted to gauge whether TCQ still existed as an unpublished feature but couldn't get results that confirmed its presence.

With Testbed4 and the SI3124-2 controller, however, I noticed IOMeter results that scaled as if CQ were present. Further, the Deskstar's relatively poor single-user showing suggested that the drive may in fact have some kind of TCQ ability enabled (recall that the Raptor WD740GD never does well in non-server tests with TCQ active).

I thus took standard measures to try and defeat TCQ with the Deskstar 7K400, hence the improved scores you see in this as well as the office thread.
