Big Buck Hunter

New long block standard


This has been discussed by IDEMA and all the HDD manufacturers for... well, a very long time now. The motivation to increase the block size keeps growing, but the problems that a larger block size causes remain.

So WHEN is it REALLY going to happen? I have a feeling it has become one of those features that has been promised for the "next version of Windows" for... a very long time (and across numerous versions of Windows as well). It's one of those things that is always N years away, no matter what year it currently is. There are lots of technologies like that in the storage industry; take holographic storage, for example: it's always a few years from wide-scale availability, but it never gets any closer than that. The only thing that changes is the promised capacity: 1 TB in just a few years, 10 TB in just a few years, etc. Soon they'll promise 100 TB in just a few years, yet not even 1 TB of holographic storage is available to the public.

I hope larger blocks will someday become reality. There needs to be BIOS, OS and software support for it first. Old software is not compatible, but of course not all software needs to be (as the OS can offer quite a bit of transparency); it would merely be some low-level I/O software that would become outdated, right? The need to make the BIOS and OS compatible remains, though. Is Vista compatible? If not, that alone means there's going to be a significant delay in implementation.


I read that sys-con.com article again. It claims the standard is now complete... yet IDEMA.org has the following list:

Long Data Block Standards

Approved Standards

There are currently no approved standards for this committee

Proposed Standards

blah blah blah blah blah

Maybe IDEMA.org just hasn't been updated? The sys-con.com article is quite new (dated April 30; presumably 2007, although the year isn't stated).


That sys-con.com article isn't very descriptive about the nature of the "new standard". Is it derived from the ATA/ATAPI-7 standard, which includes support for physical sectors larger than 512 bytes while retaining the current 512-byte LBA sector size? In practice that means write operations are performed the way they are with flash memory: read the whole physical sector, replace the part being overwritten, and write the result back (along with new ECC for the whole physical sector). This causes a severe reduction in write performance on old systems whose OS has no support for 4 KiB physical sectors, but it would provide full backwards compatibility and even boot support on BIOSes that don't support non-512-byte sectors. It would also make it possible to use 4 KiB sectors once the OS has booted, relying on the compatibility mode only to get past BIOS limitations, and thus suffer no performance loss after the OS's own drivers are loaded... IF the drivers support 4 KiB sectors, that is. And only IF the HDD is formatted specifically with 4 KiB sectors in mind (i.e. a cluster size of 4 KiB, 8 KiB, etc., and with no offset).
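For illustration, here is a tiny Python sketch of that read-modify-write path (the drive model, names and values are made up for this post, not taken from any standard):

LOGICAL = 512
PHYSICAL = 4096
RATIO = PHYSICAL // LOGICAL                   # 8 logical sectors per physical sector

def write_logical(disk, lba, data):
    # Emulated write of one 512-byte logical sector on a drive with 4 KiB physical sectors.
    assert len(data) == LOGICAL
    phys = lba // RATIO                       # which physical sector holds this LBA
    offset = (lba % RATIO) * LOGICAL          # where the logical sector sits inside it
    sector = bytearray(disk[phys])            # 1) read the whole physical sector
    sector[offset:offset + LOGICAL] = data    # 2) replace the part being overwritten
    disk[phys] = bytes(sector)                # 3) write it back, with new ECC for the whole sector

disk = [bytes(PHYSICAL) for _ in range(16)]   # a tiny 64 KiB "drive"
write_logical(disk, 3, b"\xaa" * LOGICAL)     # a lone 512-byte write forces the read-modify-write

An OS that only issues 4 KiB-aligned writes (LBA and length both multiples of 8) never triggers the read step, which is why alignment and cluster size matter so much here.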

Is this about ATA/ATAPI-7?

Is this about ATA/ATAPI-7?

Not that I am aware of, since ATA/ATAPI only define the interface, and not the geometry of the device you are using. I have an old 420 MB drive in NY that has an LLF of 528 bytes/block, and modern ATA adapters do not have a problem with it. Also, t13.org handles ATA/ATAPI.

I think 4K blocks will be a good thing: since most CPU architectures use 4K pages, it seems like it would make mapping memory to disk easier (sparc64 and a couple of other arches use 8K pages).

Frank


"Not that I am aware of, since ATA/ATAPI only define the interface, and not the geometry of the device you are using."

Don't blame me, I'm just a messenger. I read it from IDEMA.org -> Patents -> Long Data Block Standards -> Proposed Standards -> HGST ATA Standard Proposal. It links to a file named Colegrove4k_sector_paper_r0.doc (author Dan Colegrove from Hitachi GST).

Quoting preamble of that document: "The ATA/ATAPI-7 Disk drive interface standard includes support for drives with 4 kilobyte physical sectors. Before ATA/ATAPI-7 the disk drive interface supported only 512 byte sectors."

I think 4K blocks will be a good thing: since most CPU architectures use 4K pages, it seems like it would make mapping memory to disk easier (sparc64 and a couple of other arches use 8K pages).

I really don't understand the advantages. Why would a single request for one 4 KB block be cheaper than a single request for eight sequential half-KB blocks?

"Not that I am aware of, since ATA/ATAPI only define the interface, and not the geometry of the device you are using."

Don't blame me, I'm just a messenger. I read it from IDEMA.org -> Patents -> Long Data Block Standards -> Proposed Standards -> HGST ATA Standard Proposal. It links to a file named Colegrove4k_sector_paper_r0.doc (author Dan Colegrove from Hitachi GST).

Quoting preamble of that document: "The ATA/ATAPI-7 Disk drive interface standard includes support for drives with 4 kilobyte physical sectors. Before ATA/ATAPI-7 the disk drive interface supported only 512 byte sectors."

Neat doc. I'll ping Mr Landis and see if I can get the straight dope on the situation. If you find any other docs, toss em on this thread.

Frank


Additional info.

1: I believe that CD-ROMs use 2048 bytes per block. This may contradict the "Before ATA/ATAPI-7 the disk drive interface supported only 512 byte sectors" comment, depending on what the definition of "disk drive interface" is. Who knows.

2: From a Slashdot entry (I could not find the original question, though):

I can speak with some authority on this - I work for one of those aforementioned hard-drive manufacturers, and have been doing a small amount of work on this exact thing.

The easy answer is this: in order to do ECC-like data checking on a larger set of data (say, a group of eight 512-byte sectors), if you want to write sector three of that eight, you end up having to re-read the whole thing before you do anything else - thus basically giving you a 4,096-byte "sector" anyway.

The other half of that answer is this: do you know what the "real" storage capacity of a CD is, without all the error checking? It's a bit less than double. Even most of the enterprise folks wouldn't accept a 40% hit in data density in return for what works out to not that big an increase in reliability (data redundancy doesn't buy you that much unless that data is on different spindles). They'd just rather get the whole data space and do a RAID, especially since that's what they're going to do anyway.

3: Do we have a math major here who can show me how to apply Shannon's noisy-channel coding theorem to see whether larger block sizes actually improve ECC (or reduce the space needed by the current level of ECC)?

4: It appears that my old 420 MB drive is SCSI.

5: A good doc on file-size distribution on a Windows PC, in case the wasted-space argument comes up: http://research.microsoft.com/~lorch/papers/fast07-final.pdf . We're probably going to see more filesystems add tail packing and block suballocation to their feature sets pretty soon, so I do not believe this to be a concern. To find out how many files under Linux are below a certain size, use: find ~ -type f -size -512c | wc -l (replace 512 with a size and ~ with a directory). Does anyone have a Windows equivalent for this? (A cross-platform sketch follows this list.)

6: Do we now get 4096 bytes for our bootloader (GRUB/LILO)? Can DOS/Windows now have more than 4 primary partitions? How does this affect GPT or other partition schemes?
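Regarding the Windows equivalent asked about in point 5, here is a rough cross-platform Python stand-in for that find | wc -l one-liner (the 512-byte threshold and the home-directory default are just example values):

import os, sys

def count_small_files(root, max_size=512):
    # Count regular files under `root` that are smaller than `max_size` bytes.
    count = 0
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            try:
                if os.path.getsize(os.path.join(dirpath, name)) < max_size:
                    count += 1
            except OSError:
                pass   # skip files that vanish or can't be read
    return count

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else os.path.expanduser("~")
    print(count_small_files(root))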

I'm really looking forward to this change.

Frank


Well, doesn't Shannon's theorem actually _demand_ an infinite block size?

And thus, the shorter the real-world blocks, the worse the applyability (is this a word?)?

As a rule of thumb: just think about a random distribution of errors, an encoding that can correct one error in X bits, and another that can correct 8 errors in 8X bits.

It's obvious which one will be more effective.

(Just take the case of 8 errors: the second one will always be able to correct them, while the first one will fail for well over 50% of all possible error distributions.)
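To put rough numbers on that intuition, here is a small Python calculation (the codeword size and raw bit-error rate are arbitrary example values, and real drive ECC is far more sophisticated than this independent-bit-error model):

from math import comb

def p_uncorrectable(n_bits, t, p):
    # Probability that more than t of n_bits are in error, given independent bit-error rate p.
    p_ok = sum(comb(n_bits, k) * p**k * (1 - p)**(n_bits - k) for k in range(t + 1))
    return 1 - p_ok

X = 4096      # bits per short codeword (example value)
p = 1e-5      # raw bit-error rate (example value)

# Scheme A: eight short codewords, each correcting 1 error in X bits.
# The 4 KiB unit is lost if any one of the eight fails.
fail_short = 1 - (1 - p_uncorrectable(X, 1, p)) ** 8
# Scheme B: one long codeword correcting 8 errors in 8*X bits.
fail_long = p_uncorrectable(8 * X, 8, p)

print(f"eight 1-error codewords: {fail_short:.2e}")
print(f"one 8-error codeword:    {fail_long:.2e}")

With these example numbers the single long codeword is uncorrectable several orders of magnitude less often than the group of short ones, which is the usual argument for doing ECC over a 4 KiB block instead of eight separate 512-byte blocks.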

Well, doesn't Shannon's theorem actually _demand_ an infinite block size?

And thus, the shorter the real-world blocks, the worse the applyability (is this a word?)?

That would be "applicability" I think. As in, "applicable".

In simple terms, what does this (the transition to 4096 byte blocks) mean?

More efficient ECC?

More efficient accessing/storage/indexing of files?

Potential for more/bigger partitions or files?

Less command overhead?

Less fragmentation in certain filesystems? Or is that when you have larger sectors?

A simple yes/no answer will do; I don't expect to understand the mathematical justification for the answers just now. It sounds like it should be exciting, if only I could work out why!


More efficient ECC? We are led to believe that we will either get the same amount of ECC with more usable space, or better ECC with the same amount of usable space.

More efficient accessing/storage/indexing of files? This is the job of the file system. Most file systems already use 4K blocks (clusters).

Potential for more/bigger partitions or files? No, as this depends on the number of sectors a file system can address, and that is not changed by larger physical blocks.

Less command overhead? Yes. "Up to" 1/8th the chatter.

Less fragmentation in certain filesystems? Or is that when you have larger sectors? Not directly. This is more a file system problem than anything else. There will be some reduction in file fragmentation for some file systems, but it is more of a nice side effect than the intention of increased block sizes.
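A back-of-the-envelope way to see the ECC/usable-space trade-off mentioned in the first answer (the per-sector overhead figures below are assumptions for illustration only, not vendor numbers):

# Per-sector overhead = inter-sector gap + sync + address mark + ECC; the byte counts are assumed.
DATA_512, OVERHEAD_512 = 512, 65      # assumed overhead per 512-byte sector
DATA_4K,  OVERHEAD_4K  = 4096, 115    # assumed overhead per 4096-byte sector

eff_512 = DATA_512 / (DATA_512 + OVERHEAD_512)   # eight of these hold 4 KiB of user data
eff_4k  = DATA_4K  / (DATA_4K  + OVERHEAD_4K)    # one of these holds the same 4 KiB

print(f"format efficiency, 8 x 512 B sectors: {eff_512:.1%}")
print(f"format efficiency, 1 x 4 KiB sector:  {eff_4k:.1%}")

Dropping seven sets of per-sector overhead frees media space that can either be returned as capacity or spent on a longer ECC field covering the whole 4 KiB block, which is where the "same ECC with more space, or better ECC with the same space" choice comes from.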

Frank

From: Hale Landis <xxxx@xxxx.xxxx>
Reply-To: xxxx@xxxx.xxxx
Sent: Thursday, May 3, 2007 12:35 PM
To: "Frank Russo" <xxxx@xxxx.xxxx>
Subject: Re: New 4K block size



> We were having a discussion on the Storagereview message boards about the
> following article and were wondering if you had any further insight.

I'll reply to you and you can post any or all of this to the forum...

> How soon should we expect to see the 1st implementations arrive?

It's here today in some pre-production devices. But I don't expect this to
catch on any time soon. There are a lot of host software performance
issues that the OS people need to address, perhaps even develop new file
system architectures that can take full advantage of larger blocks.

> What benefits will larger blocks offer in terms of performance and data
> safety?

Assuming an OS and its file system understand that the device is using
large blocks it probably has little or no impact on performance. One of
the only reasons for having larger blocks is the theory that using larger
blocks will allow more flexible and better error correction algorithms.

> Will 4K devices be compatible with ATA6 controllers (probably not?) and
> "current" ATA7 implementations?

Traditional ATA host controllers don't have any restrictions on the size
of 'data blocks' (PIO DRQ data blocks, DMA data blocks or total command
data transfer sizes).

> Why was 4K chosen?  (my guess is that it is the same as the memory page
> size on most CPU arches)

Generally when T13 talks about large blocks (sectors) they are talking
about any size that is a power of 2 multiple (1024, 2048, 4096, etc). But
see below.

> Any pointers to additional documentation or comments would be greatly
> appreciated.

See ATA/ATAPI-7 volume 1, or ATA-8 ACS, available at www.t13.org (free),
use the Projects>LastDrafts tab to locate the documents.

Background... Since ATA/ATAPI-7, ATA hard disks can have logical sector
sizes (a logical sector is what a logical block address (LBA) represents)
of nearly any size equal to or larger than 512 bytes (but it must be an
even number, 514, 516, ...). 520 and 528 are popular sizes used by some
non-PC systems (IBM AS 400 for example). Since ATA/ATAPI-7 ATA hard disks
can have physical blocks (blocks recorded on the media) that are a power
of two multiple of the logical sector size. Except for a few non-PC
systems, all systems today use/expect a 512 byte logical and physical
block (sector).

For PC systems it is expected that the logical block (sector) size will
remain at 512 bytes for many years to come. But we may see a slow adoption
of drives with a larger physical block size (probably 4096 bytes). That
means there will be eight logical sectors in one physical block. As long
as the OS and file system(s) using the drive restrict read/write commands
to an LBA that is a multiple of 8 with a transfer length that is a
multiple of 8 there should be no performance problems. However if the OS
or file system is stupid and writes a single logical sector within a
physical block then the drive will need to internally do a
read-modify-write of the physical block and that could be a big
performance hit.

I suggest that anyone wanting to know more read the ATA-8 ACS document
sections that talk about this feature.

Hale
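A minimal sketch of the alignment rule Hale describes, i.e. when a request to a drive with 4 KiB physical blocks avoids the internal read-modify-write (the function and names are illustrative, not from any real driver API):

SECTORS_PER_PHYSICAL = 8   # eight 512-byte logical sectors per 4 KiB physical block

def needs_read_modify_write(start_lba, sector_count):
    # True if the request covers only part of a physical block at either end.
    return (start_lba % SECTORS_PER_PHYSICAL != 0
            or sector_count % SECTORS_PER_PHYSICAL != 0)

print(needs_read_modify_write(0, 8))    # False: aligned 4 KiB request, no penalty
print(needs_read_modify_write(63, 8))   # True: a partition starting at LBA 63 misaligns every 4 KiB cluster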

From: Hale Landis <xxxxx@xxxxx.xxx>
Reply-To: Hale Landis <xxxxx@xxxxx.xxx>
Sent: Thursday, May 3, 2007 9:12 PM
To: Frank Russo <xxxxx@xxxxx.xxx>
Subject: Re: New 4K block size


> We were having a discussion on the Storagereview message boards about the following article and were wondering if you had any further insight.

Additional info...

For the next few years I would expect the logical sector size to remain 512 bytes while a drive's physical sectors will most likely remain 512 bytes with a very slow migration to physical sectors of 1024, 2048 or 4096.

But note that ATA/ATAPI-7 and ATA-8 also allow the logical sector size to be equal to or larger than 512 bytes.

Microsoft claims that Vista can support: 1) 512 byte logical sectors in physical sectors of 512, 1024, 2048 or 4096; and 2) power of 2 logical sector sizes that match the physical sector size (512, 1024, 2048 or 4096 bytes). I think they are recommending that 1024 and 2048 should be skipped and logical/physical sector sizes should move from 512 directly to 4096. Of course any move like this will affect lots of software!

So for now an LBA will remain the address of a 512 byte logical sector. Some year in the future an LBA could be the address of a 4096 byte logical sector.

Hale

More efficient ECC? We are led to believe that we will either get the same amount of ECC with more usable space, or better ECC with the same amount of usable space.

Or something in between. :)

Less command overhead? Yes. "Up to" 1/8th the chatter.

I doubt it. The sector count of each request will just be divided by 8 (simply put), so the number of requests will stay equal.


Microsoft have had a long history of incompatibility and limited disk tools: changes to DblSpace, NTFS compression and defrag only working with the default 4K cluster size, disk space (API) reported incorrectly, MB/GB limits, etc. I just wish they could get it right the first time in a flexible way instead of hard-coded variables and poorly chosen default behaviours.

Microsoft have had a long history of incompatibility and limited disk tools: changes to DblSpace, NTFS compression and defrag only working with the default 4K cluster size, disk space (API) reported incorrectly, MB/GB limits, etc. I just wish they could get it right the first time in a flexible way instead of hard-coded variables and poorly chosen default behaviours.

In a way, MS has the most to gain from this change. Since the memory page size is also 4K, they have the opportunity to toss out a lot of the translation/handling of data going to the pagefile. They could actually get rid of the pagefile and go with a page "partition" (kinda, but not totally like linux) and get rid of the filesystem overhead as well.

I guess we'll have to see what innovative ways the five major OS camps take advantage of the larger blocks.

Frank

Microsoft have had a long history of incompatibility and limited disk tools: changes to DblSpace, NTFS compression and defrag only working with the default 4K cluster size, disk space (API) reported incorrectly, MB/GB limits, etc. I just wish they could get it right the first time in a flexible way instead of hard-coded variables and poorly chosen default behaviours.

In a way, MS has the most to gain from this change. Since the memory page size is also 4K, they have the opportunity to toss out a lot of the translation/handling of data going to the pagefile. They could actually get rid of the pagefile and go with a page "partition" (kinda, but not totally like linux) and get rid of the filesystem overhead as well.

I guess we'll have to see what innovative ways the five major OS camps take advantage of the larger blocks.

Frank

A page partition could already be done, but it wouldn't be faster than a properly implemented page file.

I don't think the other part is true either considering you can easily do multi-sector reads of 4 kbyte already.

A page partition could already be done, but it wouldn't be faster than a properly implemented page file.

Agreed. Currently, a page partition offers no benefit over a contiguous page file.

I don't think the other part is true either considering you can easily do multi-sector reads of 4 kbyte already.

Yes, but with 4K blocks, you can effectively cut out the file system (or greatly simplify it). For example, why do memory pages (on disk) need to be journaled? Why does NT cache (in RAM) reads and writes to the page file; isn't this redundant? Going direct to disk would remove this overhead, since you could use the disk as memory in a fairly direct and efficient manner.

Frank

Yes, but with 4K blocks, you can effectively cut out the file system (or greatly simplify it).

I don't see how the block size is relevant for that.

For example, why do memory pages (on disk) need to be journaled?

I'm not aware of them being journaled.

Why does NT cache (in RAM) reads and writes to the page file; isn't this redundant? Going direct to disk would remove this overhead, since you could use the disk as memory in a fairly direct and efficient manner.

I don't follow you. You can't write a CPU register or cache line directly to disk, you have to go through memory. And an app can't write to the page file at all, it can just write to (virtual) memory.

For example, why do memory pages (on disk) need to be journaled?

I'm not aware of them being journaled.

If it resides on an NTFS filesystem, it is journaled.

Why does NT cache (in RAM) reads and writes to the page file; isn't this redundant? Going direct to disk would remove this overhead, since you could use the disk as memory in a fairly direct and efficient manner.
I don't follow you. You can't write a CPU register or cache line directly to disk, you have to go through memory. And an app can't write to the page file at all, it can just write to (virtual) memory.

But some of that virtual memory still resides in the Windows disk cache. Why not just leave it in RAM in the first place, or not cache disk reads/writes to pagefile.sys?

My fourth try at getting the quotes correct. Must need coffee....

Frank

If it resides on an NTFS filesystem, it is journaled.

But that's not relevant. As far as I know, the journal isn't involved when the size of a file doesn't change. And the size of the page file is (semi) constant.

Why does NT cache (in RAM) reads and writes to the page file; isn't this redundant? Going direct to disk would remove this overhead, since you could use the disk as memory in a fairly direct and efficient manner.
I don't follow you. You can't write a CPU register or cache line directly to disk, you have to go through memory. And an app can't write to the page file at all, it can just write to (virtual) memory.

But some of that virtual memory still resides in the Windows disk cache. Why not just leave it in RAM in the first place, or not cache disk reads/writes to pagefile.sys?

The Windows disk cache is RAM. AFAIK, there's no difference between the disk cache and memory pages backed by the page file.

The Windows disk cache is RAM. AFAIK, there's no difference between the disk cache and memory pages backed by the page file.

This doesn't strike you as a redundant waste of memory and cycles? To cache memory... in RAM... especially when there's a possibility that it's already in RAM (memory regions paged to disk are not immediately overwritten)? With the current scheme, it is possible to have the same data in three different locations simultaneously (main memory, page file, Windows disk cache). With the new scheme, you could effectively treat the HDD as memory, with minimal address translation or concatenation of blocks by the driver.

Perhaps my logic is just screwy this week.

Frank

The Windows disk cache is RAM. AFAIK, there's no difference between the disk cache and memory pages backed by the page file.

This doesn't strike you as a redundant waste of memory and cycles?

No. I think something very fundamental in our views of this doesn't match up.

To cache memory... in RAM... especially when there's a possibility that it's already in RAM (memory regions paged to disk are not immediately overwritten)? With the current scheme, it is possible to have the same data in three different locations simultaneously (main memory, page file, Windows disk cache). With the new scheme, you could effectively treat the HDD as memory, with minimal address translation or concatenation of blocks by the driver.

Perhaps my logic is just screwy this week.

Frank

No, such duplication doesn't exist (AFAIK). Yes, the same page can be in memory and on disk, but not twice in memory. If a page is paged in from disk to memory, it's not copied a second time from the disk cache to 'normal' memory. Same for page-outs: they're written directly to disk.


4 KB page swapping will be superseded. CPUs and OSes are moving to larger virtual memory pages (e.g. 4 MB), and thus 4 KB HD sectors would still generate multi-sector write requests per page.

Small files need to be grouped together to reduce wasted space when using a larger cluster size.

