
An Overview of Testbed4


Guest Eugene

One of StorageReview's hallmarks has been our consistent testbeds that enable direct comparison of a wide variety of drives, not just those found within a given review. Our third-generation Testbed has carried us for more than 3.5 years. Testbed4's era now dawns. The hardware has been updated. Software has been revised. Temperature assessment has been overhauled. There are winners and there are losers. Join us as we take a look at SR's updated hard drive test suite and see how your favorite disk stacks up!

Testbed4 Overview


I find it a little unfortunate that no testing is done on more 'typical' machines, primarily for the office-use type of measurements. I think everyone would be very surprised to find someone running office apps on a machine like this.

A much more typical 'fast' office machine would be, at best, perhaps a 955X motherboard running the standard ATA/SATA interfaces, and since PIIX controllers are what control the majority of the world's IDE drives, it would have been nice to see figures on them.

The degree of change in some of these drives is, to say the least, interesting, perhaps indicating that the particular choice of hardware has a notable effect on the outcomes. That suggests some thought should be put into what type of hardware is used to test which benchmark.


Hi Eugene

First off, great article I've been looking forward to it.

Question:

Could you post the queue depth distribution for the game benchmarks ?

Best Regards

Theis

These three drives, however, remain constrained to their initial single-I/O score straight through a load of 32 outstanding operations. It is only the significantly heavier 64 and 128 queue depths that permit these drives an increase in I/Os per second delivered.

Does anyone know the cause for this?

Guest Eugene
These three drives, however, remain constrained to their initial single-I/O score straight through a load of 32 outstanding operations. It is only the significantly heavier 64 and 128 queue depths that permit these drives an increase in I/Os per second delivered.

Does anyone know the cause for this?


It's an artifact of the SI3124 driver.


Given the requirements of SR's Testbed4 for SCSI and SAS support, the choice of workstation-class hardware was a foregone conclusion. With dual core now on the desktop, dual Xeons are more akin to their desktop counterparts than ever before...

Guest Eugene
I find it a little unfortunate that no testing is done on more 'typical' machines, primarily for the office-use type of measurements. I think everyone would be very surprised to find someone running office apps on a machine like this.

This was discussed here way back when we were mulling over the initial hardware.

The principal requirements (because we're going to expand significantly into multi-drive arrays) were PCI-X, PCIe (>1x), and an NCQ-capable SATA controller built into the southbridge.

Satisfying all three requirements (at least back then) with a single board was impossible. Hence, we chose to emphasize the first two. Few consumer boards come with PCI-X. As a result, we found ourselves with the Xeon mobo. Anyway, check out that thread for more detail.

I find it a little unfortunate that no testing is done on more 'typical' machines, primarily for the office-use type of measurements. I think everyone would be very surprised to find someone running office apps on a machine like this.

A much more typical 'fast' office machine would be, at best, perhaps a 955X motherboard running the standard ATA/SATA interfaces, and since PIIX controllers are what control the majority of the world's IDE drives, it would have been nice to see figures on them.

The degree of change in some of these drives is, to say the least, interesting, perhaps indicating that the particular choice of hardware has a notable effect on the outcomes. That suggests some thought should be put into what type of hardware is used to test which benchmark.


In short: The benchmarks aren't "bootup takes 30 seconds, loading a level in FarCry takes 10 seconds, loading a Word document takes 2 seconds"... They're numbers that are used to compare one drive to another.

Yes, having only 256MB of memory will limit the overall speed of your computer; as will having a single-core, single-threaded 2.4GHz Celeron. And changing your video card will affect gaming performance significantly. But none of this matters in a hard drive review that produces 'abstract' scores.

What this new testing does is exaggerate the differences between drives, much like the U.S. EPA's fuel economy tests. (There's a reason it says 'these numbers are for comparison only' on car window stickers.) They're meant so that you know how, all other things being equal, one drive compares to another. If a Western Digital drive scores higher in the Desktop test than a Maxtor drive, then regardless of what computer you put it in, the Western Digital should be faster.

If you want 'pure' measurements, then just go by the drive seek times and max data transfer rates. They don't mean squat in terms of real-world performance, but they're nice hard numbers for you.
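To make that concrete, here's a tiny hypothetical illustration (the scores below are made up, not actual SR results): the absolute numbers depend on the testbed, but the ratio between two drives is what's meant to carry over to other systems.

    # Hypothetical scores, not actual SR results: the absolute values depend
    # on the testbed, but the ratio between drives is what transfers.
    drive_scores = {"Drive A": 520.0, "Drive B": 455.0}  # e.g. I/Os per second
    baseline = drive_scores["Drive B"]
    for name, score in drive_scores.items():
        print(f"{name}: {score / baseline:.2f}x the baseline")  # A: 1.14x, B: 1.00x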

P.S. Thanks Eugene, we FINALLY have new content! I was beginning to get worried. Hopefully we'll see drive reviews start pouring out now.

Guest Eugene
P.S. Thanks Eugene, we FINALLY have new content!  I was beginning to get worried.  Hopefully we'll see drive reviews start pouring out now.


At least one a week :)

P.S. Thanks Eugene, we FINALLY have new content!  I was beginning to get worried.  Hopefully we'll see drive reviews start pouring out now.


At least one a week :)


HALLELUJAH!!!


Thanks for all the details about your new testbed, Eugene. An obviously great system is now even better. B)

There is only one disappointment for me: the acoustic measurement distance changed from 18mm to 3mm. I know this is to allow readings with an SLM that is not sensitive enough to go farther back, possibly in an environment that is not quiet enough. But such a close distance is completely at odds with all known & accepted SPL measurement practices.

The main issue here is boundary / coupling effects which boost readings, particularly at the lower frequencies. You can hear this any morning while driving to work. The typical radio announcer/DJ speaks with his mouth close enough to the mic so that his voice is unnaturally chesty & deep sounding. This is caused by boundary / coupling effects. The very same effect applies to a mic placed so close to any noise source -- like a HDD. I hear it in audio recordings in my lab when HDDs and fans are placed too close to the mic.

An acoustics prof at the Univ of BC recommended I avoid getting any closer than about 1/2 meter when taking SPL measurements. (And also to stay at least a meter away from walls, especially corners where audio reflections reinforce each other.) Any closer, and the decibel reading cannot be trusted to be accurate. Taking noise measurements at 3mm would tend to compress the data range -- ie, the louder ones and quieter ones will both tend to be measured as louder, and the differences between them will tend to be smaller because all of them would have artificial boosts in the lower freq.
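To put a rough number on this: in a free field, the level from a point source drops by about 6 dB per doubling of distance. The little sketch below (the 40 dB(A) reading at 0.5m is just a made-up example, not an SR or SPCR measurement) shows how strongly the measurement distance alone skews the numbers -- and at 3mm, near-field and boundary effects mean even this simple relationship no longer holds, which is exactly why such readings are hard to compare.

    import math

    def spl_at_distance(spl_ref_db, r_ref_m, r_m):
        # Free-field, point-source approximation: SPL falls by
        # 20*log10(r/r_ref) dB, i.e. roughly 6 dB per doubling of distance.
        return spl_ref_db - 20.0 * math.log10(r_m / r_ref_m)

    # Hypothetical drive measured at 40 dB(A) at 0.5 m:
    print(round(spl_at_distance(40.0, 0.5, 1.0), 1))    # ~34.0 dB(A) at 1 m
    print(round(spl_at_distance(40.0, 0.5, 0.003), 1))  # ~84.4 dB(A) predicted at 3 mm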

Any serious acoustics handbook will at least mention this issue of measurement distance. Here is one reference on the web that I found in a quick search:

http://www.prosoundweb.com/studyhall/lastu...slm/slm_5.shtml

MikeC, editor, silentpcreview.com

One of StorageReview's hallmarks has been our consistent testbeds that enable direct comparison of a wide variety of drives, not just those found within a given review. Our third-generation Testbed has carried us for more than 3.5 years. Testbed4's era now dawns. The hardware has been updated. Software has been revised. Temperature assessment has been overhauled. There are winners and there are losers. Join us as we take a look at SR's updated hard drive test suite and see how your favorite disk stacks up!

Testbed4 Overview


I suppose Eugene has answered some of these questions in other threads, but restating the answers here would make a nice central reference point to link to.

1. I see repeated reference to 'second generation 300MB/s SATA interface'; has it been determined whether, for simplicity, the standard can be abbreviated as SATA-2 or eSATA? It would be nice to mention that the SATA-2 standard allows for port multiplication. And if you can get enough drives, and drivers for your controller to support it, will you do a comparison test of port multiplication (say 4 drives per port) vs. an 8-channel, single-drive-per-channel RAID setup?

2. I'm still going to ask, since the results for the Samsung Spinpoint are much better in the new suite of tests: is it worth relying on small percentage differences between drive scores as a reliable indicator of performance, when we know that controller/driver implementations can account for that much of a difference in a given setup? In other words, while it may be valid to say that it's a repeatable and reliable test of performance for this Testbed4 (assuming those who need that extra small amount of performance can actually notice a difference in the 'real world'), a caveat should be that a certain plus-or-minus performance difference could occur with any given drive in another system with different controllers and OS-dependent driver implementations, assuming no other parts of the system have any significant effect on the hard drives in actual usage -- would you not agree?

3. Any ETA on laptop drive tests? Before the end of the year...please?

4. Will there be both ATA and SATA interface tests for blade-server 2.5in drives (SATA is currently limited to those uses, as the vast majority of laptops do not support SATA 2.5in drives as yet) in RAID applications, and with port multiplication too? Not many laptop drives come in similar SATA versions at present, but that will likely change over the next year.

5. SATA drive specs from the manufacturers show substantial increases in power consumption vs. the ATA versions. If you can get your hands on a SATA version, I'd be interested in seeing any SR TB4 results here.

6. If we look at the specs the manufacturers supply, it appears that these are even less reliable than those supplied with 3.5in drives. Take a look at the chart on www.barefeats.com showing the new Hitachi 7.2k drive consuming the same amount of power as the Seagate Momentus 4200.2 4.2k/rpm drive... huh, can this be for real? Similarly, how is it that a 4.2k/rpm Hitachi Travelstar 4K120 could possibly have an average seek time of 11.0ms, when the 7.2k drives are at 10.5ms? Give us some SR laptop reviews so we can sort out this BS!

7. We already have two sites comparing the newest generation of 7.2k drives for laptops (see the SR thread '7k100 vs 7200.1, New comparative review'), and we know that in 2006 both faster CPUs and GPUs, along with 7.2k drives, will be standard for high-def video in performance laptops, which will outsell desktops for sure. So where are the SR test results for this new generation of faster laptop drives that consume the same amount of power as a 4.2k drive???

I find it a little unfortunate that no testing is done on more 'typical' machines


I pointed out here: http://forums.storagereview.net/index.php?showtopic=18634& that xbitlabs used a P3B-F until it was six years old, and uses i865 now if you are looking for "typical." I also lamented at the time that the Xeon platform did not represent what the "power user" was likely to use for the next few years, but after seeing how Nvidia still suffers problems with NCQ in its drivers: http://www.nforcershq.com/forum/thu-sep-15...m-vp510306.html as well as data corruption problems: http://forums.nvidia.com/lofiversion/index.php?t8171.html a whole freakin year after the introduction of the chipset, I guess picking Intel for the testbed turned out to be a pretty good move.


I'm surprised I have to do this; maybe when the testbed was being prepared, Tyan didn't have these boards. But check out the Tyan K8WE and the K8SE. I'm not sure if they've got NCQ support on the integrated SATA controllers, but they certainly fit all your other criteria. Spilled milk at this point, I suppose.


Hi Eugene,

One question about the test methodology.

How are the test drives run? You have the standard boot on the Raptor, and the test drive is plugged in as a 2nd drive?

If so, how do the test results represent a single-drive setup, where Windows is constantly messing with system files and the pagefile on the same single HDD?

Thanks for your input :)


Welcome to SR!

Short answer: the tests are playbacks of traces that were made on a single disk system.

Long answer: Eugene sets up a system (Testbed4), starts a trace, runs his benchmarks/apps/games/whatever, then stops the trace. The trace records all accesses to the hard disk; i.e. get this bit from there, wait 2.34 seconds, then get that bit and that bit from there and there (I'm paraphrasing!)

That becomes the "SR FooMark 2005". To test individual drives, he can just play back the trace on that drive, and it'll run through exactly the same set of disk accesses as before, without needing to interact with the current OS session on the testbed. Usually the gaps between commands are minimised, so that we're just looking at drive performance, not the speed of the rest of the system while the drive was waiting for instructions.

The main advantage is that the benchmark is repeatable, yet based on real world application usage. It's a shame you can't get this sort of benchmark for graphics cards or processors.
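For what it's worth, here's a minimal sketch of that record-then-replay idea in Python. It is not SR's actual capture/playback tool; the TraceEntry fields, the play_back function and the use of a disk-image file are all invented for illustration, but they capture the gist: record (offset, length, direction, think time), then replay the same accesses against any drive and time them.

    import time

    class TraceEntry:
        """One recorded disk access."""
        def __init__(self, offset, length, is_write, think_time):
            self.offset = offset          # byte offset on the disk
            self.length = length          # bytes transferred
            self.is_write = is_write      # True for a write, False for a read
            self.think_time = think_time  # host delay recorded before this I/O

    def play_back(trace, disk_image_path, honor_think_time=False):
        # Replay the recorded accesses against a drive (represented here by a
        # disk-image file so the sketch is harmless to run). With
        # honor_think_time=False the gaps between commands are minimised, so
        # the result reflects drive performance rather than how long the host
        # spent computing during the original capture.
        start = time.perf_counter()
        with open(disk_image_path, "rb+", buffering=0) as dev:
            for entry in trace:
                if honor_think_time:
                    time.sleep(entry.think_time)
                dev.seek(entry.offset)
                if entry.is_write:
                    dev.write(b"\x00" * entry.length)
                else:
                    dev.read(entry.length)
        elapsed = time.perf_counter() - start
        return len(trace) / elapsed  # I/Os per second for this drive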

I'm sure Eugene will correct me if the above is inaccurate, but that's my understanding of how it works.

Edited by Spod


Hi,

Thanks for the answer - I'm familiar with the methods; I've been a long-time SR reader :)

What you say is logical - the traces include everything, Windows and pagefile and whatever.

The reason I asked is that so far NCQ seems to even hurt desktop use, which is quite counter-intuitive. My guess was that the Windows system files plus the swapfile would lead to the most random accesses in a desktop system, where NCQ should help - but I guess Windows' precaching systems actually group together most of the frequently used files.

Just a train of thought I had - thanks for the input.

Cheers :)


System files are comparatively localised, as is the pagefile - at most, each is likely to be spread over maybe a gigabyte of disk. Multi-user / random seeks actually mean random access across the whole platter, which is only generated by having lots of users accessing their own little areas simultaneously. Even multi-tasking single users might only be accessing 5 or 6 areas of localised files.

Still, SR's tests do fully recreate the whole environment, including pagefile and OS file accesses, so any benefit that CQ might have is fully realised. The fact that it doesn't help single user scenarios is a reflection of real life, not the test methodology. It has to do with the overhead that CQ brings. SCSI CQ is the most efficient, NCQ is quite good, and TCQ is the least efficient. When there isn't a queue forming for the CQ algorithm to reorder, CQ is just an extra overhead.

The SR tests minimise the gaps between commands, but they don't allow commands to be re-ordered if one completed before the next started, when the trace was recorded. So CQ doesn't kick in artificially during playback.
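As a rough illustration of that last point, here's a hypothetical sketch (not SR's actual implementation, and the names are invented) of how a playback tool can preserve the recorded ordering: an I/O only becomes eligible for issue once every I/O that had already completed before it was originally issued has also completed in the replay, so the queue never gets artificially deeper than it was during capture.

    def build_dependencies(trace):
        # trace: list of (issue_time, complete_time) pairs from the capture.
        # For each I/O, return the indices of the I/Os it must wait for:
        # exactly those that had finished before it was originally issued.
        deps = []
        for i, (issue_i, _) in enumerate(trace):
            deps.append([j for j, (_, complete_j) in enumerate(trace)
                         if j != i and complete_j <= issue_i])
        return deps

    # Three overlapping I/Os recorded at a queue depth of two. I/O 2 was
    # issued only after I/O 0 had completed, so during replay it must not
    # be sent to the drive until I/O 0 has completed again.
    recorded = [(0.0, 1.0), (0.5, 2.0), (1.5, 2.5)]
    print(build_dependencies(recorded))   # [[], [], [0]]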

