
First Review of 800MB/S ioDrive from Fusion-IO


Yes, finally a drive that looks like it will actually live up to the hype. I really want one. I guess I will have to go to Vista 64 to get it, and I want to boot from it too.

That is a $3K drive for 80GB. A waste of money for an end user, I believe, though it's good for being a guinea pig. Every SSD of the last two years has been hailed as groundbreaking... yet we still have nothing that works.

Besides, you will still need a secondary drive in your system... and the bigger file operations (movies, entertainment, etc.) will land on that disk because of its capacity. Your system will still feel slow, since the majority of high-throughput file operations will be done on the second drive.

Better to wait another year while it explodes on people's systems and matures. I would pay $1000 for 320GB. Those are stellar numbers.


Does anyone know what file system they're using in these benchmarks? It seems like this device should have maintained a high level of throughput even when the block size was very small, given its extremely low access time. Unless they were using a filesystem with a minimum cluster size much greater than their block size, and the throughput was being wasted on over-reading.


[Screenshot: Atto-IO-Drive-80GB.PNG (ATTO benchmark results for the 80GB ioDrive)]

Notice how peak throughput is reached at 32KB. I think this is because they tested a volume formatted NTFS with 32KB clusters. At each block size below 32KB the drive is over-reading and throughput nearly halves (some is reclaimed thanks to overlapped I/O and NCQ). In spite of that, the drive is doing 400MB/sec with 8KB I/Os. Insane!
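To illustrate that theory, here is a rough Python sketch (my own toy model with made-up numbers; it ignores the overlapped I/O and NCQ that claw some throughput back):

# If every request smaller than the cluster size forces a full-cluster read,
# only block/cluster of the raw bandwidth ends up as useful throughput.
def effective_throughput(raw_mb_s, block_kb, cluster_kb=32):
    transfer_kb = max(block_kb, cluster_kb)   # the device reads a whole cluster
    return raw_mb_s * block_kb / transfer_kb  # fraction that is useful data

for bs in (4, 8, 16, 32, 64):
    print(f"{bs:>2} KB requests -> ~{effective_throughput(800, bs):.0f} MB/s useful")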


I would really like some REAL benchmarks from a neutral source.

While I appreciate the link in the OP, a page that proclaims itself the sole distributor of the reviewed product doesn't exactly radiate credibility.


We finally got a Fusion IO-Drive to test and found it had two distinct modes.

When it is "Empty" it performs very well (as advertised). There are probably no Erases going on or any garbage collection.

After enough Writes to fill it up, it changes to "garbage collection" mode. Write bandwidth drops to 20 MByte/sec (5K IOPS), 90% less than when it is "Empty".

So, it seems the "Sustainable Random Write IOPS" is 5K.... similar to other SSDs.
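For reference, those two figures are consistent if the writes are 4KB each (an assumption on my part):

# 5,000 IOPS x 4 KB per write ~= the 20 MByte/sec reported above
iops = 5_000
io_kb = 4
print(iops * io_kb / 1024, "MB/s")   # ~19.5 MB/s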


90% is a big performance drop. Performance dropping 90% once the drive reaches a certain fill level, before it is even full, would not be good. For consistent performance, the required buffer space should be reserved and not exposed to the user.

How full was it when the performance dropped?

We finally got a Fusion IO-Drive to test and found it had two distinct modes.

When it is "Empty" it performs very well (as advertised). There are probably no Erases going on or any garbage collection.

After enough Writes to fill it up, it changes to "garbage collection" mode. Write bandwidth drops to 20 MByte/sec (5K IOPS), 90% less than when it is "Empty".

So, it seems the "Sustainable Random Write IOPS" is 5K.... similar to other SSDs.

Please post graphs and more data. If you are right, this is another snake-oil product (I thought so, but HP using them confused me).

On the other hand, this is your first post. You must understand that you have not established a credibility baseline, so your data might be taken with a grain of salt (no offense; without hard facts you could be a competitor, a spammer, etc.).

90% is a big performance drop. Performance dropping 90% once the drive reaches a certain fill level, before it is even full, would not be good. For consistent performance, the required buffer space should be reserved and not exposed to the user.

How full was it when the performance dropped?

The easiest test to replicate is a continuous random write test. It starts off very well for a period, then begins to degrade and changes to the lower-performance "full" mode. This transition can happen in less than 10 minutes.

The length of time before degradation seemed to correspond to all the Flash in the drive being used. In this case, it started happening shortly after 80GB of writes had been done. After that, the drive probably has to start Erasing the Flash, and performs worse, as you would expect. We've seen the same thing with other page-mapping SSDs. The IO-Drive is a page-mapping SSD; the difference is that the mapping and garbage collection are done by the CPU and memory of the host server rather than on the IO-Drive card.

Flash is generally fast, but Erases are slow. Fusion IO seems to have the same problems as any SSD. Before it gets to the "full" state, however, the PCIe interface does make it fast. It's probably because the PCIe interface is so fast that the 90% performance drop is so pronounced.
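For anyone who wants to reproduce this, here is a minimal sketch of such a continuous random-write test (hypothetical path and sizes; a proper run should use direct I/O through a tool such as Iometer or fio rather than buffered Python writes):

import os, random, time

PATH = "/mnt/iodrive/testfile"   # hypothetical mount point; file must already exist
FILE_SIZE = 70 * 1024**3         # preallocated test file size in bytes
BLOCK = 4096                     # 4 KB random writes
INTERVAL = 10                    # report throughput every 10 seconds

buf = os.urandom(BLOCK)
fd = os.open(PATH, os.O_WRONLY)
written, t0 = 0, time.time()
try:
    while True:
        os.lseek(fd, random.randrange(FILE_SIZE // BLOCK) * BLOCK, os.SEEK_SET)
        os.write(fd, buf)
        written += BLOCK
        if time.time() - t0 >= INTERVAL:
            os.fsync(fd)  # force the data out so the interval reflects the device
            print(f"{written / (1024**2) / (time.time() - t0):8.1f} MB/s this interval")
            written, t0 = 0, time.time()
finally:
    os.close(fd)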


Ah, I see: performance drops when the drive becomes full and you then keep writing to the full disk. Interesting limitation. Is the workaround "use only 70GB of space"? If so, at least the workaround is simple.

Ah, I see: performance drops when the drive becomes full and you then keep writing to the full disk. Interesting limitation. Is the workaround "use only 70GB of space"? If so, at least the workaround is simple.

The workaround is not to write more than 80 GB of data total. It's irrelevant how "full" the drive looks to be.

For example, if you write a 1GB file and delete another 1GB file, it still counts as 1GB written. The deletes don't matter. This is the same with all page-mapping SSDs; they all need to do garbage collection at some point. It's a bit like a log-structured file system.

http://en.wikipedia.org/wiki/Log-structured_file_system

Be careful with server applications that have continuous write workloads. You need to assume garbage collection will be taking place while your application is running.

The workaround is not to write more than 80 GB of data total. It's irrelevant how "full" the drive looks to be.

What are you suggesting? To buy it, not write data to it, and keep it as an ornament next to the family photos?

If what you are stating is true, this is just another snake-oil product. I haven't seen any mention of this anywhere. What about the people who bought their products? Aren't they going to come back and say "hey, what happened, this is performing at 1/10th of what you stated"?

What about you? Will you be returning it?

What about HP and the others who announced that they will use this in their servers? Didn't they have the equipment and specialized folks to test it out?

Something is not adding up.

The workaround is not to write more than 80 GB of data total. It's irrelevant how "full" the drive looks to be.

You're saying that after writing 40GB, deleting it, writing another 40GB, and deleting that too, the next 1GB write will be 90% slower than the first two writes were. That is a bold claim that Fusion-io will have to respond to :)

The workaround is not to write more than 80 GB of data total. It's irrelevant how "full" the drive looks to be.

What are you suggesting? To buy it, not write data to it, and keep it as an ornament next to the family photos?

If what you are stating is true, this is just another snake-oil product. I haven't seen any mention of this anywhere. What about the people who bought their products? Aren't they going to come back and say "hey, what happened, this is performing at 1/10th of what you stated"?

What about you? Will you be returning it?

What about HP and the others who announced that they will use this in their servers? Didn't they have the equipment and specialized folks to test it out?

Something is not adding up.

Most of these things do their garbage collection in the background, so if you expect to be writing continuously, without stopping, you will have this problem.

The Intel drive, the Fusion-io, and the older, slower ones including RiData, OCZ, Samsung... ALL the drives I have tested have "two modes" of write performance.

The Intel will do about 8K random write IOPS for a while before it slows down to "only" 1.5K random IOPS. It will stream writes at 70MB/sec, then slow to 35MB/sec. The OCZ drops to about 40% of its peak when it is garbage collecting. For the drives that don't fully remap the way the Fusion-io or Intel drives do, how quickly you hit the barrier depends more strongly on whether the writes are random or sequential.

However, for all of these that I have tested, if you wait a few minutes, it gets fast again.

So no, you don't have to write only ~the disk size and then stop; you just can't write much more than half the total disk size in one sitting without it slowing down. In the background, while no writes (well, no heavy writes, for the good ones) are happening, erase blocks are preemptively prepared for future writes.

Note: no SSD 'knows' how full it is. Every write is an overwrite, and it is always 100% full. It's just that when you start out, every overwrite can be served from contiguous space. After it has been used a while, its 'free space map' is created by overwrites in the remapping, and that space gets fragmented. The pattern of re-writes will affect this a lot. If you run a test that keeps overwriting the same 1GB file, write performance won't drop as fast. If you write randomly to the whole disk, it will drop faster. Most usage patterns in real applications are not at either extreme, and don't continuously write at max speed either.

In most cases, and definitely with desktops, there is plenty of time between the big writes for a good SSD to 'prepare' the next space for writing and be at its peak speed most of the time.

And in any event, even in the 'slow' mode, the good drives are still WAY faster than a physical disk. The bad drives (the ones that can't write random 4KB blocks at more than a few hundred IOPS in the best case) are the real issue here, since they magnify small random writes into large page erases.
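Here is a rough toy simulation of that "two modes" behaviour (my own illustration with made-up rates, not measured data): host writes consume pre-erased blocks while background garbage collection replenishes them more slowly, so once the pool of prepared blocks runs out, write speed is capped at the reclaim rate.

def simulate(seconds, host_mb_s, reclaim_mb_s, pool_mb):
    pool = pool_mb
    for t in range(seconds):
        pool = min(pool_mb, pool + reclaim_mb_s)  # background erase/prepare
        achieved = min(host_mb_s, pool)           # writes need prepared blocks
        pool -= achieved
        if t % 30 == 0:
            print(f"t={t:4d}s  write ~{achieved:5.0f} MB/s  prepared pool {pool:7.0f} MB")

# e.g. 600 MB/s of host writes vs. 60 MB/s of blocks reclaimed, 80 GB of pool
simulate(seconds=300, host_mb_s=600, reclaim_mb_s=60, pool_mb=80 * 1024)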

The workaround is not to write more than 80 GB of data total. It's irrelevant how "full" the drive looks to be.

You're saying that after writing 40GB, deleting it, writing another 40GB, and deleting that too, the next 1GB write will be 90% slower than the first two writes were. That is a bold claim that Fusion-io will have to respond to :)

Deleting it doesn't matter.

All writes are overwrites; the drive has no idea you deleted anything. The drive just sees the small write (a couple of KB) that tells the _FILE SYSTEM_ that the chunk is deleted.

What happens is that when you write the next 40GB, if it happens to overwrite a block of addresses that were written to in the past, those are now free.

Here is an example:

Write a 1GB file to what the file system calls addresses '1GB to 2GB'. The SSD puts this wherever it wants internally; it is not bound to store it in one contiguous block. Let's call the region written internally 'region A'.

Now you delete the 1GB file, which from the drive's perspective is a few KB of writes; these go somewhere else.

Now you write the file again. The SSD finds free blocks (most likely it has them ready before you write) and puts the 1GB file there. Now, the file system could have reused the same 1GB to 2GB address range, or not.

Case 1: the file system used the same range, so the addresses are overwritten. The SSD stores the 1GB in 'region B', and notes that 'region A' is now free, because the addresses that mapped to it have been reassigned to region B.

Case 2: the file system writes the file to a different range of addresses, let's call it '9GB to 10GB'. Now the SSD has region A mapped to 1GB to 2GB and region B mapped to 9GB to 10GB. From its point of view, neither region is available for new writes.

A file system optimized for remapping SSDs would intentionally reuse address ranges that have been deleted, and not worry much about fragmentation. Because SSDs have all sorts of complicated internals and restrictions, it really is best left up to their controllers and firmware to figure out where blocks go; if the OS simply does its best to overwrite deleted areas as soon as possible, the SSD will have the most 'breathing room' to keep remapping things.

One note: A database typically allocates large files and then re-writes pages, rather than allocating and deleting from the file system. An application like that will be a bit friendlier to a re-mapping SSD. Additionally, if block devices added a 'deallocate' command for an address range to notify drives that a region should be cleared, performance would go up a great deal on such devices. But OS block devices have long only had read and write, and nothing else.
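A minimal sketch of that remapping behaviour (my own illustration, not Fusion-io's actual translation layer): overwriting a logical range frees the old physical region (Case 1), writing the same data at new addresses leaves both regions mapped (Case 2), and a hypothetical 'deallocate' command would free a region without rewriting it.

class RemappingSSD:
    def __init__(self):
        self.map = {}           # logical address range -> physical region
        self.next_region = 0

    def write(self, logical_range):
        region = f"region {chr(ord('A') + self.next_region)}"
        self.next_region += 1
        old = self.map.get(logical_range)
        self.map[logical_range] = region
        if old:
            print(f"{logical_range}: {old} is now free (overwritten)")
        print(f"{logical_range} -> {region}; regions still mapped: {sorted(self.map.values())}")

    def deallocate(self, logical_range):
        old = self.map.pop(logical_range, None)
        if old:
            print(f"{logical_range} deallocated: {old} is now free")

ssd = RemappingSSD()
ssd.write("1GB-2GB")         # first write of the file
ssd.write("1GB-2GB")         # Case 1: same addresses reused -> region A freed
ssd.write("9GB-10GB")        # Case 2: new addresses -> both regions stay mapped
ssd.deallocate("9GB-10GB")   # the hypothetical deallocate/TRIM-style command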

http://forum.ssdworld.ch/viewtopic.php?f=1...6ae261981c9b366

Snake-oil. I don't understand why so much hype was made for this crap.

That's nice. I guess the 4x capacity increase on my server cluster, gained by adding a single device per server that costs significantly less than any of the servers, is a mirage?

http://forum.ssdworld.ch/viewtopic.php?f=1...6ae261981c9b366

Snake-oil. I don't understand why so much hype was made for this crap.

That's nice. I guess the 4x capacity increase on my server cluster, gained by adding a single device per server that costs significantly less than any of the servers, is a mirage?

A real-world copy-and-paste test: 615 mixed files (5.98GB) in 3 min 27.92 sec.
The result is around 500% worse, which is very, very bad; it starts at more than 200MB/s, drops off really fast, and ends at only 29.7 MB/s (a single MtronPRO750 or a single Memoright GT is faster in this same copy-and-paste test).

I guess you can achieve better results for half the price by putting any of the cheaper SSDs on the market in RAID behind an Adaptec 5805, as that site did. This is not a miracle product. No one has used it, no one has benchmarked it. It doesn't even boot!

Everyone was wowed by the first Mtrons... yet they blew up on everyone's servers. Everyone was wowed by the cheap OCZ Cores... yet they froze the system for a second after sending a simple IM message... ridiculous random write times.

God knows what other quirks are in there.

http://forum.ssdworld.ch/viewtopic.php?f=1...6ae261981c9b366

Snake-oil. I don't understand why so much hype was made for this crap.

The truth!!

Thanks for sharing.

It is true that there exists a pathological case that reduces write performance on SSDs in general, and on the ioDrive unless it is tuned correctly to compensate. Getting this poor performance requires all three of the following ingredients:

1) Actively writing across the entire capacity of the drive

2) Writing with no correlation in time

3) Allowing no "recovery" time in between

Only when all three conditions are met does performance degrade.

Generally, only synthetic benchmarks meet these requirements. It is uncommon for real-world applications to write with no correlation in time, and to actively use the full device (how many of us run our file systems full?), and to run constantly. But for those that do, there is a solution (more below...).

The SPC-1 benchmark run by IBM on Quicksilver is a random 70/30 R/W mix 4K packet test and is, therefore, pathological. That's why they achieve around 30,000 IOPS per ioDrive vs. the 100,000 we advertise.

The way to improve write performance even under this pathological case is to sacrifice some formatted capacity.

With the ioDrive, the user can low-level reformat it to a smaller "usable" capacity vs. its raw physical capacity. For example, the 80GB ioDrive actually has 100GB of physical capacity; it is simply formatted by default to 80GB usable.

With more "reserve" space (the difference between physical and formatted capacity), the garbage collector's worst case can be improved - and performance guaranteed.
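A quick sketch of that trade-off (the 100GB physical / 80GB formatted figures are from this post; the smaller formatted sizes are just hypothetical examples):

PHYSICAL_GB = 100
for formatted_gb in (80, 70, 60):
    reserve = PHYSICAL_GB - formatted_gb
    print(f"format to {formatted_gb} GB -> {reserve} GB reserve "
          f"({reserve / formatted_gb:.0%} of usable capacity)")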

Of course, we default format the drives so as to maximize usable capacity - and keep good write performance under all but the pathological case.

We've actually found it quite rare that customers see degraded write performance under their real-world applications. But for those that do, sacrificing some physical capacity is well worth it (considering what they otherwise pay for write IOPS), even though it's NAND flash, which isn't as cheap per GB.

No matter how you shake it, even in the pathological case, the cost per write IOP is 1/10th to 1/100th that of mechanical disks.

And, with the cost per GB of NAND dropping by more than 50% year over year, it's getting even cheaper.

So, to be clear, our claimed write bandwidth and IOPS are valid, steady-state, numbers inclusive of garbage collection activities. They are not, however, the pathological case.

It is because, even in the pathological case, the ioDrive achieves 30,000 70/30 R/W random 4K IOPS that IBM went with ioDrives in QuickSilver. No other SSD even came close.

Interestingly enough, it's not just because our garbage collection is more efficient than others' that we get so much better write performance. It also comes down to another dirty secret in the Flash SSD world: poor performance under mixed workloads.

Other SSDs get a small fraction of their read or write performance when doing a mix of reads and writes. You'd think that if one gets X IOPS on reads and Y IOPS on writes, one should get 50% * X + 50% * Y under a 50/50 R/W mix. In reality, they typically get less than a quarter of that.

This is fundamentally because NAND is half-duplex and writes take much longer than reads. This makes it a bit tricky to interleave reads and writes. The ioDrive, on the other hand, mixes reads and writes with great efficiency.
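To put rough numbers on that expectation (a hypothetical device, not ioDrive figures): even if reads and writes merely had to be strictly serialized, the achievable mix would be the harmonic mean of the two rates, and real drives that stall reads behind erase/program cycles do worse still.

read_iops, write_iops = 100_000, 20_000   # hypothetical example device

naive = 0.5 * read_iops + 0.5 * write_iops              # the "50% * X + 50% * Y" intuition
serialized = 1 / (0.5 / read_iops + 0.5 / write_iops)   # strict serialization (harmonic mean)

print(f"naive estimate:      {naive:,.0f} IOPS")
print(f"strictly serialized: {serialized:,.0f} IOPS")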

Bottom line: NAND flash is a quirky medium. It's not like slapping DRAM on a board. Expertise in the mechanical HDD space, chip design space, or DRAM appliance space doesn't really help... it's more of a software (well, firmware) problem than a chip or mechanical design problem.

But the beauty of it is that one can get 640GB of it in something the size of an 8GB DRAM DIMM, at a price that's much less per GB than DRAM and much less per IOPS than HDDs.

-David Flynn

CTO Fusion-io

david@fusion-io.com


David,

Which extraordinary synthetic benchmark conditions are you referring to?

"A real-world copy-and-paste test: 615 mixed files (5.98GB) in 3 min 27.92 sec."

That is the most basic operation: copying files. It takes a similar amount of time on my 750GB Seagate with a similar number of files. Or is this expected? I mean, to pay that amount of money... and then wait a second on every IM message, as with the OCZ Core, or do no better at the most basic drive operation than something that costs 100 times less?

The SPC-1 benchmark run by IBM on Quicksilver is a random 70/30 R/W mix 4K packet test and is, therefore, pathological. That's why they achieve around 30,000 IOPS per ioDrive vs. the 100,000 we advertise.

-David Flynn

Everyone knows that Flash SSDs provide great read IOPS. The question is always write IOPS and bandwidth, and how the Erases going on inside the SSD impact performance.

The "around 30,000 IOPS" number from IBM works out to only 9,000 random writes per second. The IBM architect confirmed this was with 100GB of usable capacity out of 200GB. This is good data and worthy of publication. (IBM indicated 300K write IOPS on 41 IO-Drives, about 7,500 write IOPS per drive. Barry Whyte's blog is very good.)
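The arithmetic behind those per-drive figures (using the numbers quoted above):

total_iops = 30_000              # 70/30 read/write mix per ioDrive
print(total_iops * 0.30)         # 9,000 random writes per second
print(300_000 / 41)              # ~7,300 write IOPS per drive (roughly the 7,500 quoted)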

The spec sheet on the Fusion-io website claims "101,000 (sustained random writes)"... this is over 10 times higher than what the IO-Drive really sustains (assuming "sustains" doesn't allow the system to take a 5-minute breather when it gets a little tired). IBM assumed the normal definition of sustained. If you are building systems that have to work, you want accurate data. If IBM had gone by the spec sheet, they would have used only 4 IO-Drives, not 41.

As a couple of others have pointed out, the trouble with measuring only write IOPS is that it ignores sequential write bandwidth. An HDD with only 200 IOPS can sustain well over 100 MByte/sec, while a Flash drive with less than 10K IOPS may sustain less than 40 MByte/sec, so file copies on Flash drives can be slow. Hiding performance issues doesn't help. The IO-Drive has some very good characteristics, but the marketing needs to line up with reality.

