How important is the hard drive cache?


  • This topic is locked
15 replies to this topic

#1 cas

cas

    Member

  • Member
  • 777 posts

Posted 13 May 2002 - 04:42 PM

Ever since drives first integrated their own controllers, there has been speculation of every kind regarding the value of the hard drive cache. Fairly recently, Western Digital released a drive with significantly more cache than others in its class, and the speculation has reached new levels of absurdity. I thought it was time to put my position on record.

First, let us remember that it wasn’t too long ago that hard drives had fixed and fairly simple geometries. A set of CHS values could describe a drive well enough to implement many valuable performance enhancements. Cylinder groups, found in Berkeley’s Fast File System, are a perfect example of this type of optimization, buried deep within the operating system.

A very simple example of how accurate geometry information leads to better performance is the buffer cache. Imagine that you have developed a program that parses a file by reading each character from beginning to end. Since you can’t really read 1 byte from a disk, a naïve implementation would read the file 512 bytes at a time, resulting in 20,480 reads for a 10MB file. Since there are other applications running that also require the attention of the disk, this could result in as many as 20k seeks. This would be very slow. The process goes a great deal faster if the IO system (which part depends on the OS and version) reads ahead, prefetching the file into the buffer cache. Subsequent reads can then come from the cache instead of the disk.
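
To make this concrete, here is a minimal C sketch of the naive parser described above. The file name is hypothetical, and note that stdio’s own buffering is itself a small-scale version of the read-ahead being discussed:

#include <stdio.h>

int main(void)
{
    unsigned char block[512];
    unsigned long reads = 0;

    /* "input.dat" stands in for the 10MB file in the example. */
    FILE *f = fopen("input.dat", "rb");
    if (!f)
        return 1;

    /* Read 512 bytes at a time, beginning to end. Without
       read-ahead somewhere below us, each of these calls could
       become a separate disk IO: 20,480 of them for 10MB. */
    while (fread(block, 1, sizeof block, f) > 0)
        reads++;

    printf("%lu reads\n", reads);
    fclose(f);
    return 0;
}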

Everything I have described can be done without accurate geometry information. The difference is in how the read-ahead is handled. The buffer cache of a modern system, mated to an LBA drive, will typically prefetch some fixed number of blocks following the user request. In this case, the OS has no idea whether or not the prefetch will reduce the number of IOs the drive can service.

Think of it this way. You are on a scavenger hunt, and you have a list of items to pick up all over the city. As you progress, you are given new lists. In some cases, items which you suspect may be on future lists are genuinely on the way. You can pick up these items without slowing down the process. If something is out of the way, you probably shouldn’t pick it up until you have been asked to. After all, though you anticipate that it might appear on a future list, you could be wrong.

In the case of buffer cache read-ahead, prefetching is typically “on the way” if the blocks all reside in the same track, or at least the same cylinder. If the heads have to be moved for a cylinder-to-cylinder seek, this may be out of the way. Of course, even blocks on the same track may be out of the way, if the delay forces the disk to wait a full revolution before it can satisfy the next IO.

Only software with accurate geometry information can make this determination. In modern LBA systems, this probably means the firmware in the drive. In some cases, even more detailed drive characterization in the form of head acceleration and other figures can lead to additional optimizations. Point one for the drive cache.

Even before hard drives had integrated controllers and caches, they had buffers. As an esteemed member of our community recently pointed out on another board, there is a subtle difference between a buffer and a cache. The issue is somewhat confused because, while a cache may be used as a buffer, not all buffers are caches.

Buffers are important because there are two essential characteristics which determine the transfer speed of a disk over its host interface: bandwidth and latency. If the device lacks sufficient buffer space to deal with the round-trip latency over the interface, the transfer speed will be limited to a value far below its nominal bandwidth. Try using xmodem-128 over a fast link, and you will see what I mean. This is also a serious problem for TCP connections over satellite or other high-latency links. Without support for large windows, the transfer speed is limited by the standard 64K window, not the bandwidth of the link. Point two for the drive cache (buffer).
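
The arithmetic behind this is the bandwidth-delay product. A minimal C sketch, with assumed, purely illustrative numbers:

#include <stdio.h>

int main(void)
{
    /* Assumed figures: a 100MB/s interface with a 1ms round trip
       per transfer window. Both numbers are for illustration. */
    double bandwidth = 100e6;  /* bytes per second */
    double rtt       = 1e-3;   /* seconds */

    /* The bandwidth-delay product: how many bytes must be buffered
       ("in flight") to keep the link busy between round trips. */
    double min_buffer = bandwidth * rtt;

    printf("minimum buffer: %.0f bytes (~%.0f KB)\n",
           min_buffer, min_buffer / 1024.0);
    return 0;
}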

So far, it would seem that I am reinforcing the importance of the drive cache. In fact, I feel that the significance of large drive caches has been seriously overstated.

As I have described, a sufficiently large buffer for transfer is essential. Increasing the size of the buffer beyond the level necessary to hide the round-trip latency, however, adds no value. All modern drives have caches (buffers) which are sufficiently large for this purpose.

Although I have nothing against large drive caches per se, one has to consider the opportunity costs. Will I see better performance from an 8MB drive cache, or from a system buffer cache that is 32MB larger? Even if we set aside the fact that adding commodity DRAM to your system is cheaper than increasing the drive’s built-in cache, are those 8MB really best used inside the drive?

Consider that modern PCs have bandwidth to their primary DRAM measured at somewhere around 2.5GB/s. The latency to this DRAM is measured in nanoseconds. The drive cache, however, can be accessed no faster than 320MB/s (theoretical, not measured), and in most cases much slower. Furthermore, an application may not simply reference this data. Rather, it must issue a system call, walk down the driver stack, wait for an interrupt, and walk back up the driver stack. This takes time, and loads the system.

But what of the advantages of accurate drive characterization? After all, I described scenarios where the firmware can clearly make better caching choices than an operating system hidden behind an LBA interface.

The assertion that all LBA sectors are mapped in order turns out to be sufficiently detailed for the vast majority of cases. As I observed here, “The difference between servicing IOs in fifo order, and servicing them in logically sorted order is substantial. The difference between servicing them in logically sorted order, and using some lower level knowledge is much less so.”
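
As a sketch of what “logically sorted order” means, here is a one-pass elevator sort in C. The request structure is invented for the example; the point is that sorting by LBA needs no geometry knowledge at all:

#include <stdlib.h>

typedef struct {
    unsigned long lba;     /* starting logical block address */
    unsigned int  count;   /* blocks requested */
} io_request;

static int by_lba(const void *a, const void *b)
{
    unsigned long x = ((const io_request *)a)->lba;
    unsigned long y = ((const io_request *)b)->lba;
    return (x > y) - (x < y);
}

/* Sort the pending queue so the heads sweep in one direction,
   instead of seeking back and forth in arrival (fifo) order. */
void schedule(io_request *queue, size_t pending)
{
    qsort(queue, pending, sizeof *queue, by_lba);
}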

The question is, are the advantages of accurate drive characterization sufficient to overcome the disadvantage of being farther from the processor? I believe that for most workloads, the answer is a resounding no!

So why do drives with large caches do so well in benchmarks?

The primary reason is that benchmarks work hard to isolate the drive from its surrounding system. Indeed, drives with larger caches do outperform those with smaller caches, when tested in isolation. When viewed at the system level however, the drive cache is essentially a tiny, mostly inclusive slice of the system buffer cache. How valuable would we consider the inclusive 512K L2 cache of a processor, if its L1 cache was 32MB?

Drive tests are convenient because we assume that they extrapolate fairly easily to application performance (not always true). Since we are considering opportunity costs and general system performance, however, direct application testing is required.

I, for one, would like to see more of this type of testing.

#2 russofris

russofris

    Member

  • Member
  • 2,120 posts

Posted 13 May 2002 - 05:42 PM

The primary reason is that benchmarks work hard to isolate the drive from its surrounding system.  Indeed, drives with larger caches do outperform those with smaller caches, when tested in isolation.  When viewed at the system level however, the drive cache is essentially a tiny, mostly inclusive slice of the system buffer cache.  How valuable would we consider the inclusive 512K L2 cache of a processor, if its L1 cache was 32MB?



While I wholeheartedly agree on all of your points, you actually reinforce a counterpoint for me. In applications where the storage subsystem is "isolated", a larger cache will yield better performance. Take, for example, a multitrack digital audio recorder.

Using Logic VS and a 1200BB, I can simultaneously record 16 tracks @ 44.1x16 while playing back 16 tracks of 44.1x16. If I add another track to either recording (or playback) things begin to stutter and go to hell. With the 1200JB, I can record 16 tracks while playing back 24.

The WD1200BB has RTR and WTR capability sufficient for the 16 x 24 situation, but since it cannot store enough "recording" information in cache to offset the amount of time required for reads/seeks, it fails for this application. This is the area where the JB excels, and where it was designed to excel.

In my opinion, the JB series was targeted as an AV drive. Only because of HW enthusiasts did it find itself in the consumer market. Historically, HDD manufacturers have always offered drives that were physically identical except for cache, and marketed them as "AV" drives. I own a WD Enterprise 9100E/AV with 2MB cache. The regular 9100 has only 512K. My Roland VS-1680 DAW has a Toshiba 2.5" AV drive w/4MB of cache.

The benefits of larger caches may very well be limited to increasing performance when streaming reads/writes to multiple files simultaneously. I was always under the impression that it was to reduce the performance impact of slow seeks by implementing a "lazy write", whereby the drive has more time to do reads before the write buffer is full.

To me, this isn't a true "performance" increase, but it certainly does improve the functionality of the drive under numerous conditions.

A good example of an AV drive would be the Seagate 73LP. It's 10,000rpm with a 4.6ms access time and 4MB cache, whereas the AV version has 4.7ms access and a 16MB cache. Specs from Seagate below.


The LCV:
  Internal Transfer Rate, ZBR (Mbits/sec):         399-671
  Internal Formatted Transfer Rate (Mbytes/sec):   38.4-63.9
  External Transfer Rate (Mbytes/sec),
    Ultra 8/16-bit / Ultra2 / Fibre Channel
    (per loop):                                    200 / 200 / 160
  Track-to-track Seek, Read/Write (msec):          0.4 / 0.6
  Average Seek, Read/Write (msec):                 4.7 / 5.2
  Average Latency (msec):                          2.99
  Spindle Speed (RPM):                             10041

The LC:
  Internal Transfer Rate, ZBR (Mbits/sec):         399-671
  Internal Formatted Transfer Rate (Mbytes/sec):   38.4-63.9
  External Transfer Rate (Mbytes/sec),
    Ultra 8/16-bit / Ultra2 / Fibre Channel
    (per loop):                                    200 / 200 / 160
  Track-to-track Seek, Read/Write (msec):          0.4 / 0.6
  Average Seek, Read/Write (msec):                 4.7 / 5.2
  Average Latency (msec):                          2.99
  Spindle Speed (RPM):                             10041



Thank you for your time,
Frank Russo

#3 cas

cas

    Member

  • Member
  • 777 posts

Posted 13 May 2002 - 10:31 PM

The benefits of larger caches may very well be limited to increasing performance when streaming reads/writes to multiple files simultaneously.  I was always under the impression that it was to reduce the performance impact of slow seeks by implementing a "lazy write", whereby the drive has more time to do reads before the write buffer is full.

As I described above, the fundamentals of caching are the same for both the host and the drive. All modern operating systems support lazy writing, even if the hard drive does not.

The difference is that the drive firmware tends to have a better understanding of the drive's characteristics.

Of course, the operating system has a better idea of what it needs. While this separation does make systems more modular, it makes some interesting techniques more difficult to implement.

To reuse the metaphor I introduced above: if you spend all day driving around town, you will eventually visit every neighborhood.

Properly applied to disks, this might make it possible to maintain your filesystem in a completely defragmented state, with essentially zero overhead.

To support the efficient implementation of these types of techniques, and to provide greater intelligence to OS caching software, I would be very interested to see drives export their physical characteristics in a uniform way. This would be treated as a heuristic, and could be ignored by any uninterested host.
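
To sketch what such an export might look like, here is a hypothetical record in C. Nothing like this exists in the ATA or SCSI specs; the struct and every field name are invented purely for illustration:

/* Hypothetical drive-exported characterization record. All names
   and the fixed zone maximum are invented for this sketch. */
struct drive_geometry_hint {
    unsigned long zone_count;            /* number of ZBR zones     */
    struct {
        unsigned long first_lba;         /* where the zone begins   */
        unsigned long sectors_per_track;
        unsigned long tracks_per_cylinder;
    } zone[16];                          /* illustrative maximum    */
    unsigned long full_seek_usec;        /* full-stroke seek time   */
    unsigned long track_switch_usec;     /* head/track switch time  */
};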

All I am asking for is a bit of flash space.

#4 Jeff Poulin

Jeff Poulin

    Member

  • Member
  • 544 posts

Posted 14 May 2002 - 12:29 AM

Cas, excellent post! Reading your post reminds me of my grad school days. :)

Anyway, I agree with you if the drive in question is used in a single-user workstation/desktop environment. 2MB of cache is probably more than enough for the read-aheads. However, the 1200JB is getting some press as a "SCSI replacement" (whatever that means). To me, that suggests it's suitable for small-department file servers (I'm trying to be realistic; there's no way I'd plop one in one of my production database servers). Okay, so if you take a departmental Samba server running EXT3 (which supports ordered queueing similar to TCQ), then is it possible for several simultaneous requests to gather enough cache data to make use of the 2MB-8MB range while it's seeking across the drive? Also (and this part is just a guess), given that platter densities are on the rise, might it be reasonable that more data is cached per cylinder?

#5 1_smack_for_2

1_smack_for_2

    Member

  • Member
  • 637 posts

Posted 14 May 2002 - 01:31 AM

Wouldn't it be better, then, to have a large cache, but with each part dedicated to a specific range of sectors/platters?

#6 russofris

russofris

    Member

  • Member
  • 2,120 posts

Posted 14 May 2002 - 01:58 AM

  All modern operating systems support lazy writing, even if the hard drive does not.


Yes, you are assuming that DVRs and DAWs have an advanced OS. My Roland does not, nor does the Amiga video booth down the road, nor does the Canon RIP at my brother's print shop, nor does the GE OpenMRI in Albany, NY (or is it Schenectady, NY?).

What you are talking about is along the lines of what DirectX did (and does). DX took this further (for the CD-ROM at least) and took caching and read-ahead away from the OS, putting it into the hands of the application. This eliminates a lot of the guesswork associated with read-ahead.

VirtualDub is another great example of this. It took write buffering away from the OS and uses its own scheme. The result is better capture performance on low-performing drives.

An app certainly knows what info it has to cache for itself better than the OS. Perhaps we need a new API? Does System.IO no longer seem appropriate?

Thank you for your time,
Frank Russo

#7 [ETA]MrSpadge

[ETA]MrSpadge

    Member

  • Member
  • 744 posts

Posted 14 May 2002 - 06:54 AM

What you are saying all sounds good and convincing.

But I have some practical experiences which speak for larger caches. The application I'm after is moving files from one partition to another on the same drive, mp3s for example. They are all about 3-5MB.
On my IBM 8 and 15GB Deskstars, both with 512k cache, it was rather slow. They read a bit, it seemed to be about 400-500k, there was a break, then they continued.
Obviously my Norton Commander showed only what amount of the file had been written. During the break the drive read the next part of the file.

Then I switched to a 30GB Maxtor with 2MB of cache. And suddenly it wrote about half of the file at once, a short break, then the rest. It went so MUCH faster!

Now, shouldn't it get even faster with a larger cache, when the entire file can be read/written without interruption?
Or maybe the gain in performance was just that high because the Maxtor was 7.2k rpm and had higher sequential transfer rates? I think it's a mix of both.

And something else comes to my mind regarding this issue. Maxtor used PC100 SDRAM for their drives' caches about 1 or 2 years ago. What do 6 more MB of SDRAM cost the drive manufacturer? $2? $5?
I guess it's worth the performance increase.

MrS

#8 Cliptin

Cliptin

Posted 14 May 2002 - 07:19 AM

cas, Thanks for your input and time.

It does seem that speculative read caching can only get more and more speculative, with a correspondingly lower probability of "hits", without higher knowledge from the OS. Research into drive scheduling algorithms in the OS would not hurt either.

Now, I'm not a firmware designer but I think there are still advancements to be made within the firmware algorithm itself. If not, drive advancement is up to the Mech. Es, Mat. Es & Chem. Es and the EEs and Comp Scis may as well go home. Or maybe it's just time for the shoe to be on the other foot.

To support the efficient implementation of these types of techniques, and to provide greater intelligence to OS caching software, I would be very interested to see drives export their physical characteristics in a uniform way. This would be treated as a heuristic, and could be ignored by any uninterested host.


This is actually some of what I had in mind. I'm surprised there is not more of this happening already. Our arguments seem contradictory because of your knowledge of the subsystems; I just assumed that it was already being done and could benefit from more room.

At the very least the OS should be able to tell the firmware that certain locations are more important and should not follow the regular retirement scheme.

cas, Could you talk a little bit about LBA and the CHS conversion? Does the OS "see" the drive as a long continuous string of data storage, something like fast tape?

#9 cas

cas

    Member

  • Member
  • 777 posts

Posted 14 May 2002 - 12:52 PM

Yes, you are assuming that DVRs and DAWs have an advanced OS.

Fair enough.

For certain types of devices, the available 'system' RAM may indeed be smaller than the drive cache of newer drives. In this case, it matters little whether the OS supports write-behind or not; the drive cache will represent a significant percentage of the total space available.

These systems may also have local DRAM that is slower or narrower than the drive interface itself.

Despite years of wishful thinking, and a fair amount of my own hard work, embedded devices still represent a minority of hard drive shipments.

An app certainly knows what info it has to cache for itself better than the OS.  Perhaps we need a new API?  Does System.IO no longer seem appropriate?

In theory apps do know better, but their programmers often lack the sophistication to outperform the OS. Although I am unfamiliar with the common language runtime, NtCreateFile, NtReadFile, and NtReadFileScatter, upon which all NT IO is based, support unbuffered IO.

In recent years, many UNIX implementations have started to offer unbuffered IO as well.

Serious IO-bound applications, like SQL Server, always use unbuffered IO.
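
For the curious, a minimal Win32 sketch of unbuffered IO along these lines; the file name is hypothetical:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* FILE_FLAG_NO_BUFFERING bypasses the system cache entirely,
       so the application takes over caching for itself. */
    HANDLE h = CreateFile("data.db", GENERIC_READ, FILE_SHARE_READ,
                          NULL, OPEN_EXISTING,
                          FILE_FLAG_NO_BUFFERING, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    /* Unbuffered IO requires sector-aligned buffers and transfer
       sizes; VirtualAlloc returns page-aligned memory, which
       satisfies the alignment requirement. */
    void *buf = VirtualAlloc(NULL, 65536, MEM_COMMIT, PAGE_READWRITE);
    DWORD got = 0;

    if (buf && ReadFile(h, buf, 65536, &got, NULL))
        printf("read %lu bytes straight from the device\n", got);

    if (buf)
        VirtualFree(buf, 0, MEM_RELEASE);
    CloseHandle(h);
    return 0;
}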

#10 cas

cas

    Member

  • Member
  • 777 posts

Posted 14 May 2002 - 12:58 PM

wouldn't it be better then to have a large cache but each dedicated to a specific range of sectors/platters?

No.

What you are describing is a direct-mapped cache. They are popular as memory caches because they are fast and easy to implement. Unfortunately, they offer poor performance relative to the fully associative caches implemented by operating systems and drive firmware.
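
A sketch of the difference, with an illustrative slot count:

/* Direct mapped: each block number maps to exactly one slot.
   Two hot blocks whose LBAs collide modulo SLOTS will evict each
   other forever, even if the rest of the cache sits idle. */
#define SLOTS 4096   /* illustrative cache size in lines */

unsigned direct_mapped_slot(unsigned long lba)
{
    return (unsigned)(lba % SLOTS);   /* placement is fixed */
}

/* Fully associative: any block may occupy any slot, so the
   replacement policy (LRU, say) decides what to evict, and
   placement never forces a miss. The cost is that a lookup must
   search all slots (or a hash of them) instead of computing an
   index, which is why memory caches rarely work this way. */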

#11 cas

cas

    Member

  • Member
  • 777 posts

Posted 14 May 2002 - 01:21 PM

MrSpadge"]
But I have some practical experiences whichspeak for larger caches. 
...
Or maybe the gain in performance was just that high because the maxtor was 7.2k rpm and had higher sequential transfer rates? 

Hmm. Hardly apples to apples, wouldn't you say?

Your host interface should be faster than your media rate. If you want to copy 3-5MB files quickly, the key is to read 5MB at once, and then write 5MB at once.

My system has enough memory to do this with 800MB files. It will be a very long time before any hard drive has that kind of cache (if ever).
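
A minimal C sketch of that copy strategy, assuming the file fits in memory:

#include <stdio.h>
#include <stdlib.h>

/* Copy src to dst by reading the whole file, then writing the
   whole file, so the heads make one sequential pass for each.
   A sketch only: real code would cap the buffer at available RAM. */
int copy_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    if (!in)
        return -1;

    fseek(in, 0, SEEK_END);
    long size = ftell(in);
    rewind(in);

    char *buf = malloc((size_t)size);
    if (!buf || fread(buf, 1, (size_t)size, in) != (size_t)size) {
        fclose(in);
        free(buf);
        return -1;
    }
    fclose(in);                         /* one sequential read done */

    FILE *out = fopen(dst, "wb");
    if (!out) {
        free(buf);
        return -1;
    }
    fwrite(buf, 1, (size_t)size, out);  /* one sequential write */
    fclose(out);
    free(buf);
    return 0;
}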

#12 cas

cas

    Member

  • Member
  • 777 posts

Posted 14 May 2002 - 01:45 PM

Cas, excellent post!  Reading your post reminds me of my grad school days. :)

Why thank you. Although I don’t know if reminding you of school is a good thing.

…is it possible for several simultaneous requests to gather enough cache data to make use of the 2MB-8MB range while it's seeking across the drive?  Also (and this part is just a guess), given that platter densities are on the rise, might it be reasonable that more data is cached per cylinder?

I am not arguing against the importance of disk caching. I am just questioning whether it is best done on the drive itself.

In my original post, I gave a very detailed example of how drive firmware can outperform OS caching in some cases, due to its more detailed knowledge of the drive. Allow me to give an example of what I mean, when I say that the operating system has a better idea of what it needs.

The file is entirely a construct of the operating system. A hard drive has no notion of a file. Sophisticated operating systems like NT, which cache files, not disk blocks, will never prefetch blocks beyond the end of a file. The drive firmware will, and as a result it will discard more useful blocks, reducing the cache hit ratio.

Further, the operating system is often given heuristics when a file is opened, to help tune its caching for that file. Flags like FILE_FLAG_RANDOM_ACCESS and FILE_FLAG_SEQUENTIAL_SCAN can increase the hit ratio of the OS cache. Again, since the drive has no notion of a file, it must handle all requests in a uniform way.
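
For example, a minimal Win32 sketch of passing the sequential hint (the wrapper function is mine, for illustration):

#include <windows.h>

/* Open a file with the sequential-scan hint: the cache manager
   can read ahead aggressively and discard pages behind the
   reader, raising the hit ratio for exactly this access pattern. */
HANDLE open_for_scan(const char *path)
{
    return CreateFile(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                      OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
}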

Of course, interplay between the caches can make things worse.

Imagine that you open a 60K file for sequential scan. When you read the first byte, let us suppose that the OS goes ahead and reads the entire file into the disk cache. Seeing a 60K read, the drive firmware prefetches another 24K (of some other file).

In this case, the data in the drive cache is identical to the data in the OS cache, except for the 24K that will never be used. The drive cache adds zero value.

#13 cas

cas

    Member

  • Member
  • 777 posts

Posted 14 May 2002 - 03:24 PM

Research into drive scheduling algorithms in the OS would not hurt either.

Programmers have been playing tricks with tapes, drums, and drives, for as long as the devices have been around. Check out Knuth’s sexy centerfold in The Art of Computer Programming volume 3.

The problem was that many of these techniques tied the software too tightly to the hardware. Even something as simple as a CHS change used to require recompilation of the UNIX kernel.

The move to logical block addressing has significantly enhanced the modularity of modern systems. It has, however, left some performance on the table.

The Freeblock scheduling project at CMU, linked above, explores the possibility of returning some of this information to the OS.

The industry has been slow to embrace these techniques because, as I mentioned before, LBA mapping is good enough for the lion’s share of common cases.

cas, Could you talk a little bit about LBA and the CHS conversion? Does the OS "see" the drive as a long continuous string of data storage, something like fast tape?

Correct.

Drives, arrays of drives, tapes, ramdisks, flash, you name it. They are all exposed as a linear run of bytes, up to 2^64 bytes long (for most OSes). This is the interface ‘standard’ filesystem drivers expect to see, so it is very common.
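
For a drive with truly fixed geometry, the old CHS-to-LBA mapping was a simple formula; a C sketch:

/* The classic fixed-geometry mapping from CHS to a linear block
   address. Real ZBR drives vary sectors-per-track across zones,
   which is exactly why the flat LBA view won out. */
unsigned long chs_to_lba(unsigned long c, unsigned long h,
                         unsigned long s,
                         unsigned long heads, unsigned long spt)
{
    /* Sectors are numbered from 1 in CHS addressing, hence s - 1. */
    return (c * heads + h) * spt + (s - 1);
}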

Because the devices themselves can have very different characteristics, there are some exceptions. Intel’s Persistent Storage Manager for example, interfaces with the OS at the filesystem level, rather than as a device. This allows for better performance on NOR flash, which is not particularly ‘disk like’.

#14 [ETA]MrSpadge

[ETA]MrSpadge

    Member

  • Member
  • 744 posts

Posted 14 May 2002 - 05:10 PM

OK cas, I think I missed the point before and got it now - no more complaining from my side ;-)

MrS

#15 Olaf van der Spek

Olaf van der Spek

    Member

  • Member
  • 1,958 posts

Posted 11 March 2003 - 10:40 AM

To solve one problem Cas mentioned, the drive could be told the maximum amount to read ahead for each request.

And I was wondering: if the drive did read ahead, is the data transferred to the system ASAP so the buffer/cache is freed, or is it transferred when the system happens to ask for it?

#16 Trinary

Trinary

    Member

  • Member
  • 1,115 posts

Posted 20 March 2003 - 03:50 PM

cas:

Very well-stated.

One of the items mentioned was that a disk does not understand files, which leads to firmware prefetching beyond end-of-file, something the OS will never do.

I seem to recall there was some work being done on so-called "object-oriented" disk drives, which would understand files, above and beyond simple clusters and sectors. Any idea where work in that area stands or how it might be progressing?
Trinary


