benjamin9999

Too many years of awful 3ware performance.


Hey guys. New here, but I know this is the place to start some serious talk on this vendor's cards.

I've been using 3ware cards here and there since the first Escalade cards were available.

Mainline Linux drivers are a nice thing, and the firmware and such is mostly of satisfactory quality. Let's not get too carried away, though: they generally look good because the rest of the field is so poor.

Performance has always felt poor, on every card. Even when the 9500 was the new killer card and I had a setup with 8 disks, things were not great, despite benchmarks that showed massive speeds.

It's late, so I'll cut to the meat.

Most people want better performance, and a quick googling gets you to the blockdev --setra stuff. Great, so now you have a massive readahead. Run dd, or bonnie++, or whatever VFS-layer operation you want, and you get massive read speeds: 200, 300 MB/sec.

Now do some bonnie++ or dd write tests and see some big numbers. OK, great: you can fill your pagecache, and Linux can write it out asynchronously in the background as long as you have memory. Depending on the ratio of free pages to disk speed, you'll see some nice numbers.

But hardly any of that is useful unless you're purely in the business of shuffling around huge data sets. And if you fill your pagecache with dirty pages, you'll start to see a sluggish system, since the queue is deep and IO starts to block in other places -- now even the mp3 you were streaming at the same time is in trouble.

So specs sell. And if people see 300 MB/sec read/write in dd, you'll have the market.

OK, enough of all that. Filesystem operations mostly happen in 4K blocks. And most applications do not perform async IO -- maybe PostgreSQL, MSSQL and some smart apps like that do.

Imagine...

int c;
while ((c = fgetc(f)) != EOF) {
    /* do something with c */
}

This kind of loop boils down to a stream of small reads, so we need to be able to service each one with the lowest possible latency. A readahead of 16384 in a multi-process environment? That's a big overhead for these small reads, and certainly detrimental. But this type of IO pattern is happening all the time.

Some IO systems, like DRBD, operate only in 4K blocks with a full write sync each time, so this latency is critical.

Any performance gains from readahead or async pagecache writes are purely a function of Linux, RAM, and spindles. 3ware makes no difference there.

Now let's see just how poor 3ware's latency is, and reevaluate all those times we wondered what on earth was going on.

BTW, these tests are on two similar boxes: server-class boards with 2GHz SMP CPUs and 1GB RAM. One has a 3Ware 9650SE with 14 * 500GB in RAID5; the other has an Areca 1261-ML with 14 * 500GB in RAID5.

The issue is easy to demonstrate on earlier 3ware models as well, although I don't currently have a setup to run similar comparisons on.

Using the (great) Linux Test Project (LTP) 'disktest', we can test with pure block IO (bypassing the pagecache) at 4K, with any number of threads, and we can even tweak the range of sectors. Random and linear seek patterns are both possible.
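To give an idea of what that access pattern looks like at the syscall level, here is a rough C sketch of a 4K direct-IO read loop - more or less what the 4K block-direct linear test below boils down to (this is not disktest itself; /dev/sde and the 15-second duration are just placeholders):

#define _GNU_SOURCE                      /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sde", O_RDONLY | O_DIRECT);   /* placeholder device */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) {      /* O_DIRECT wants aligned buffers */
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    long ios = 0;
    off_t off = 0;
    time_t end = time(NULL) + 15;
    while (time(NULL) < end) {
        if (pread(fd, buf, 4096, off) != 4096) { perror("pread"); break; }
        off += 4096;                                   /* linear pattern */
        ios++;
    }
    printf("%ld IOs in ~15s (~%ld IOPS)\n", ios, ios / 15);
    close(fd);
    return 0;
}

Each of those preads is a full round trip through the driver and controller, which is exactly the latency we're interested in.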

root@storage2 ~ # disktest -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:1000000000 /dev/sde
| 2007/08/31-02:15:39 | START | 7300 | v1.2.8 | /dev/sde | Start args: -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:1000000000 (-N 1000000001) (-r) (-c) (-p u) 
| 2007/08/31-02:15:39 | INFO  | 7300 | v1.2.8 | /dev/sde | Starting pass
| 2007/08/31-02:15:53 | STAT  | 7300 | v1.2.8 | /dev/sde | 327397376 bytes read in 79931 transfers.
| 2007/08/31-02:15:53 | STAT  | 7300 | v1.2.8 | /dev/sde | Read throughput: 23385526.9B/s (22.30MB/s), IOPS 5709.4/s.
| 2007/08/31-02:15:53 | STAT  | 7300 | v1.2.8 | /dev/sde | Read Time: 14 seconds (0h0m14s)
| 2007/08/31-02:15:53 | STAT  | 7300 | v1.2.8 | /dev/sde | Total bytes read in 79931 transfers: 327397376
| 2007/08/31-02:15:53 | STAT  | 7300 | v1.2.8 | /dev/sde | Total read throughput: 23385526.9B/s (22.30MB/s), IOPS 5709.4/s.
| 2007/08/31-02:15:53 | STAT  | 7300 | v1.2.8 | /dev/sde | Total Read Time: 14 seconds (0d0h0m14s)
| 2007/08/31-02:15:53 | STAT  | 7300 | v1.2.8 | /dev/sde | Total overall runtime: 15 seconds (0d0h0m15s)
| 2007/08/31-02:15:53 | END   | 7300 | v1.2.8 | /dev/sde | Test Done (Passed)
root@storage2 ~ # 

OK, so: 4K blocksize, single thread, linear reads (the -s sector range is there because this 6TB volume is too big for disktest to handle), and we get 5709.4 IOPS!?

A single decent 7200 RPM SATA drive should get at least 6000 by itself.

OK, but without reading ahead we really can't use all these spindles effectively anyway.

Let's take the disks out of the picture completely and just test the round trip to the 3ware card.

By setting the sector limit to -s 0:8, we'll just be reading the same 4K block over and over.

root@storage2 ~ # disktest -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:8 /dev/sde
| 2007/08/31-02:19:56 | START | 7308 | v1.2.8 | /dev/sde | Start args: -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:8 (-N 9) (-r) (-c) (-p u) 
| 2007/08/31-02:19:56 | INFO  | 7308 | v1.2.8 | /dev/sde | Starting pass
| 2007/08/31-02:20:11 | STAT  | 7308 | v1.2.8 | /dev/sde | 348639232 bytes read in 85117 transfers.
| 2007/08/31-02:20:11 | STAT  | 7308 | v1.2.8 | /dev/sde | Read throughput: 24902802.3B/s (23.75MB/s), IOPS 6079.8/s.
| 2007/08/31-02:20:11 | STAT  | 7308 | v1.2.8 | /dev/sde | Read Time: 14 seconds (0h0m14s)
| 2007/08/31-02:20:11 | STAT  | 7308 | v1.2.8 | /dev/sde | Total bytes read in 85117 transfers: 348639232
| 2007/08/31-02:20:11 | STAT  | 7308 | v1.2.8 | /dev/sde | Total read throughput: 24902802.3B/s (23.75MB/s), IOPS 6079.8/s.
| 2007/08/31-02:20:11 | STAT  | 7308 | v1.2.8 | /dev/sde | Total Read Time: 14 seconds (0d0h0m14s)
| 2007/08/31-02:20:11 | STAT  | 7308 | v1.2.8 | /dev/sde | Total overall runtime: 15 seconds (0d0h0m15s)
| 2007/08/31-02:20:11 | END   | 7308 | v1.2.8 | /dev/sde | Test Done (Passed)
root@storage2 ~ # 

Great, we're up to 6079 IOPS, reading blocks that should come straight from the 3ware cache every time, or at worst from the buffers on the spindle.

Let's compare with Areca's competing card...

root@storage1 ~ # disktest -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:8 /dev/sde    
| 2007/08/31-02:22:16 | START | 10788 | v1.2.8 | /dev/sde | Start args: -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:8 (-N 9) (-r) (-c) (-p u) 
| 2007/08/31-02:22:16 | INFO  | 10788 | v1.2.8 | /dev/sde | Starting pass
| 2007/08/31-02:22:31 | STAT  | 10788 | v1.2.8 | /dev/sde | 2737975296 bytes read in 668451 transfers.
| 2007/08/31-02:22:31 | STAT  | 10788 | v1.2.8 | /dev/sde | Read throughput: 182531686.4B/s (174.08MB/s), IOPS 44563.4/s.
| 2007/08/31-02:22:31 | STAT  | 10788 | v1.2.8 | /dev/sde | Read Time: 15 seconds (0h0m15s)
| 2007/08/31-02:22:31 | STAT  | 10788 | v1.2.8 | /dev/sde | Total bytes read in 668451 transfers: 2737975296
| 2007/08/31-02:22:31 | STAT  | 10788 | v1.2.8 | /dev/sde | Total read throughput: 182531686.4B/s (174.08MB/s), IOPS 44563.4/s.
| 2007/08/31-02:22:31 | STAT  | 10788 | v1.2.8 | /dev/sde | Total Read Time: 15 seconds (0d0h0m15s)
| 2007/08/31-02:22:31 | STAT  | 10788 | v1.2.8 | /dev/sde | Total overall runtime: 15 seconds (0d0h0m15s)
| 2007/08/31-02:22:31 | END   | 10788 | v1.2.8 | /dev/sde | Test Done (Passed)
root@storage1 ~ # 

OK now we have some low latency. 44,563 IOPS reading the same 4K block.

What about the full volume (or at least as much of it as fits within disktest's limits)?

root@storage1 ~ # disktest -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:1000000000 /dev/sde 
| 2007/08/31-02:23:33 | START | 10798 | v1.2.8 | /dev/sde | Start args: -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:1000000000 (-N 1000000001) (-r) (-c) (-p u) 
| 2007/08/31-02:23:33 | INFO  | 10798 | v1.2.8 | /dev/sde | Starting pass
| 2007/08/31-02:23:48 | STAT  | 10798 | v1.2.8 | /dev/sde | 2457137152 bytes read in 599887 transfers.
| 2007/08/31-02:23:48 | STAT  | 10798 | v1.2.8 | /dev/sde | Read throughput: 175509796.6B/s (167.38MB/s), IOPS 42849.1/s.
| 2007/08/31-02:23:48 | STAT  | 10798 | v1.2.8 | /dev/sde | Read Time: 14 seconds (0h0m14s)
| 2007/08/31-02:23:48 | STAT  | 10798 | v1.2.8 | /dev/sde | Total bytes read in 599887 transfers: 2457137152
| 2007/08/31-02:23:48 | STAT  | 10798 | v1.2.8 | /dev/sde | Total read throughput: 175509796.6B/s (167.38MB/s), IOPS 42849.1/s.
| 2007/08/31-02:23:48 | STAT  | 10798 | v1.2.8 | /dev/sde | Total Read Time: 14 seconds (0d0h0m14s)
| 2007/08/31-02:23:48 | STAT  | 10798 | v1.2.8 | /dev/sde | Total overall runtime: 15 seconds (0d0h0m15s)
| 2007/08/31-02:23:48 | END   | 10798 | v1.2.8 | /dev/sde | Test Done (Passed)
root@storage1 ~ # 

So what is up, 3ware? The competing card is faster by an order of magnitude, and the advice you offer for improving performance is to increase the Linux readahead value?

A colleague of mine, who has agreed with me for years that "something was up", has always felt that 3ware on Windows does not have these problems, so that's another subject for testing.


Bumping my own topic to add some more detail...

After writing this I started to wonder whether 3ware does interrupt coalescing to minimize CPU usage - the technique is commonplace in networking chipsets like the e1000 or tg3.

So, while running the same 4K same-block test, vmstat can show the IO rate (bi, in KB/sec) against the interrupt rate (in) for these two cards.

4K same-sectors block-direct reading on 3ware:

root@storage2 ~ # disktest -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:8 /dev/sde
| 2007/08/31-18:22:48 | STAT  | 9330 | v1.2.8 | /dev/sde | Total read throughput: 24936448.0B/s (23.78MB/s), IOPS 6088.0/s.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
0  1      0 949680  16428  18992    0    0 24312     0 6337 12202  0  4 50 46
0  1      0 949680  16428  18992    0    0 24312     0 6336 12194  0  2 50 48
0  1      0 949680  16428  18992    0    0 24316     0 6332 12179  0  2 50 48

6337 - 250 (the Linux timer HZ) = 6087. So one IRQ per 4K block.

Now on the Areca, same test:

root@storage1 ~ # disktest -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:8 /dev/sde
| 2007/08/31-18:21:19 | STAT  | 13196 | v1.2.8 | /dev/sde | Total read throughput: 180495974.4B/s (172.13MB/s), IOPS 44066.4/s.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
1  1      0 314072  49372 592912    0    0 177972     0 44744 88996  0 26 50 25
1  1      0 314072  49372 592912    0    0 176900     0 44491 88469  1 25 50 24
1  0      0 314072  49372 592912    0    0 177060     0 44527 88544  2 25 50 24

44744 - 250 (HZ) = 44494. So again, one IRQ per 4K block.

By this measurement, the Areca card is 7.3 times faster than 3ware's competing offering.

Unless this is a serious design flaw in the hardware/silicon, it seems like this problem should be fixable.

And it now looks even worse for 3ware, since it's (mostly) clear that interrupt coalescing is not taking place on this card.


Very nice research. I have a 3ware and get great sustained read/write numbers (with the right buffering tweaks as you mention), but the machine always felt sluggish under heavy random I/O. It'd be great if some driver improvements come out of all this. Hopefully 3ware is listening. Perhaps you should open a support case with them and share your findings.

- Chris


Hmm. Never did any heavy random I/O testing here -- my customers' applications don't demand any heavy random I/O, and we've always been satisfied with 3ware vs. the competition. We do several hundred MB/sec sustained reads and writes across multiple controllers.

I assume you're running matched firmware/driver revs per 3ware's instructions? Just checking.

Hmm. Never did any heavy random I/O testing here-- my customers' applications don't demand any

Just to clarify, this really isn't about random I/O. Notice in the tests that the access pattern (-p) is linear (-p l). Also, this test was done with the latest BIOS/firmware image from the release series, and the mainline driver from 2.6.22.5.

This is about round-trip latency for a single request.

IMO, this is really the most important measurement. Layers built on top of a low-latency I/O path will have no problem getting high throughput with readahead, async I/O, write-back and so on, but only as long as you have free RAM. As soon as you run out, you'll be crawling.

No matter what card/interface/layer you are testing, if the seek pattern is random then the limits of 7200 RPM, 10K and 15K spindles will be revealed.

IIRC, a 7200 RPM disk with random IO across the whole disk should bring you down to about 100 IOPS.
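(Rough arithmetic: ~8.5 ms average seek plus ~4.2 ms average rotational latency at 7200 RPM comes to roughly 12-13 ms per request, i.e. about 80 IOPS with a single request outstanding; command queuing and shorter seeks push that up toward the ~100 mark.)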

Off topic, but it's useful to note that higher-density drives reduce head-travel time, so random IOPS go up. Use good partitioning schemes (LVM2) to ensure that sets of data are not spread over the entire platter.

IMO, this is really the most important measurement. Layers built on top of a low-latency I/O path will have no problem getting high throughput with readahead, async I/O, write-back and so on, but only as long as you have free RAM. As soon as you run out, you'll be crawling.

Thanks for starting this thread - I thought I was going mad. So far I've tried 4 different OSes (CentOS 4.4, 4.5, openSUSE 10.2 and finally RHEL AS 4 update 5, just to be sure) and two makes of disk (Maxtor/Western Digital), as well as applying 3ware's tuning 'tweaks' and experimenting with LVM/no LVM, SMP/no SMP, RAID 1 and single-disk modes, all in an attempt to get a handle on why this 9550SX-8LP seems to send a dual Opteron 2.4GHz with 4GB RAM into a responsiveness nosedive under intensive IO. Latest firmware and drivers are in use.

I've just finished a whole load of vmstat runs around two timed dd commands (reading 3, 4, 6 and 20GB from /dev/sda, then writing 3, 4, 6 and 20GB from /dev/zero to a file on the 200GB+ / partition, for both the SMP and non-SMP 2.6.9-55.EL kernels), and the results gave me the clue to Google "3ware vmstat" and find your notes.

Rather than detail it all here, I've uploaded 4 PDFs of the graphs to one of my own sites where I'm keeping my notes, should you want to take a look. I find the "blocks out" figures very interesting, coupled with the number of processes in uninterruptible sleep while the card works through what's been thrown at it. No wonder the machine hits a brick wall.

As to how to get around the problem, well, that's an entirely different matter - I'm all out of ideas, frankly. Chucking around a 20GB logfile from time to time is quite normal on a busy webserver; the last thing I need is for everything else to stop for a daydream while it happens.

S.

Thanks for starting this thread - I thought I was going mad and have so far tried 4 different OSes - Centos 4.4, 4.5, openSUSE 10.2 and finally RHEL AS 4 update 5 (just to be sure) and two types of disk (Maxtor/Western Digital), as well as applying 3ware's tuning 'tweaks' and experimenting with

Well, I have to back off this thread a bit. I showed this to another engineer familiar with these issues, and he pointed out that what I have "proven" with my data is not exactly what I thought. Instead of poor performance from a slow round trip to the controller, what I have actually found is that 3ware is not doing any hardware read-ahead, and doesn't appear to be doing any read caching either.

Since the read block size (4K) is smaller than the stripe, and the test is single threaded, we can't go any faster than a single spindle without read-ahead. The Areca is definitely reading ahead in hardware (per the config on that setup). There is no clear way to configure this on the 3ware - all of the cache config options relate to write-back/through and sync operation.
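Back-of-the-envelope: 4K x ~5,700 IOPS is only about 22 MB/s - right around the ~6,000 IOPS / ~24 MB/s I guessed a single drive could manage by itself earlier - which fits the "single spindle, no hardware read-ahead" explanation.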

...and no read cache either? The single-block read test is also slow.

While I think the situation is not as bad as I first thought, there still seems to be some strange stuff going on here.

Not sure what to test next, and I haven't had time to mess with it this week either.


Maybe it's your motherboard? Have you tried a different one?

Also what do you get for sequential read and write with dd?

What filesystems have you tried? Are you specifying the -R STRIDE size (with EXT2/3) or are you using a proper sunit/swidth with XFS?

I am currently writing/using my own filesystem, but even with that it's not as slow as your benchmarks!

$ dd if=/dev/zero of=10gb bs=1M count=10240

10240+0 records in

10240+0 records out

10737418240 bytes (11 GB) copied, 26.752 seconds, 401 MB/s

$ dd if=10gb of=/dev/null bs=1M count=10240

10240+0 records in

10240+0 records out

10737418240 bytes (11 GB) copied, 20.4077 seconds, 526 MB/s

Also what do you get for sequential read and write with dd?

My opinion would be that right here you have already fallen into the 3ware trap.

What filesystems have you tried? Are you specifying the -R STRIDE size (with EXT2/3) or are you using a proper sunit/swidth with XFS?

If you're working through a filesystem, then you'll be accessing the volume through the pagecache, so now you are benchmarking the OS (readahead + write buffering) + RAM (amount) + the 3ware. Re-read the first post...

I am currently writing/using my own filesystem, but even with that it's not as slow as your benchmarks!

$ dd if=/dev/zero of=10gb bs=1M count=10240

10240+0 records in

10240+0 records out

10737418240 bytes (11 GB) copied, 26.752 seconds, 401 MB/s

Again, this is the classic test we want to avoid, but you did write 10GB, which is significant. How much RAM is in this system, and what RAID config?


8GB of ram but that is irrelevant:

$ /usr/bin/time dd if=/dev/zero of=file bs=1M

dd: writing `file': No space left on device

1070704+0 records in

1070703+0 records out

1122713473024 bytes (1.1 TB) copied, 2565.89 seconds, 438 MB/s

I am using Linux Software RAID5, no hardware raid here.

8GB of ram but that is irrelevant:

1122713473024 bytes (1.1 TB) copied, 2565.89 seconds, 438 MB/s

OK, this is meaningful, and seems nice - but you're still doing bulk transfers with a huge blocksize.

How are reads, and CPU availability, during this? The poster above (SimonB) demonstrates that his system is useless during a scenario like this.

I am using Linux Software RAID5, no hardware raid here.

I think this is important to consider, because the controller is likely in a "pass-thru" state, so comparisons should be taken with a grain of salt.

I read someone's pages once where some extensive testing (with bonnie, anyway) demonstrated that Linux software RAID was faster than 3ware. Unfortunately I can't come up with the URL at the moment, but I will try to track it down - you may have already read similar info.

If your dd is new enough, try iflag=direct / oflag=direct to open the if/of with O_DIRECT. Try 4K, 64K and 1M blocksizes. LTP's disktest is a much more convenient tool for the job though - I'm not sure whether O_DIRECT works with /dev/null and /dev/zero??

8GB of ram but that is irrelevant:

1122713473024 bytes (1.1 TB) copied, 2565.89 seconds, 438 MB/s

OK, this is meaningful, and seems nice - but you're still doing bulk transfers with a huge blocksize.

How are reads, and CPU availability, during this? The poster above (SimonB) demonstrates that his system is useless during a scenario like this.

It used to be around 15-30% with an E6300 on 1 core, but I have since upgraded to a Q6600, so it's largely irrelevant with 4 cores.

I am using Linux Software RAID5, no hardware raid here.

I think this is important to consider, because the controller is likely in a "pass-thru" state, so comparisons should be taken with a grain of salt.

I read someone's pages once where some extensive testing (with bonnie, anyway) demonstrated that Linux software RAID was faster than 3ware. Unfortunately I can't come up with the URL at the moment, but I will try to track it down - you may have already read similar info.

If your dd is new enough, try iflag=direct / oflag=direct to open the if/of with O_DIRECT. Try 4K, 64K and 1M blocksizes. LTP's disktest is a much more convenient tool for the job though - I'm not sure whether O_DIRECT works with /dev/null and /dev/zero??

Hm, I understand the direct stuff etc., but I don't believe my applications run that way, so why would I want to limit the test to that?

Performance has always felt poor, on every card. Even when the 9500 was the new killer card and I had a setup with 8 disks, things were not great, despite benchmarks that showed massive speeds.

So what is up, 3ware? The competing (Areca) card is faster by an order of magnitude, and the advice you offer for improving performance is to increase the Linux readahead value?

A colleague of mine, who has agreed with me for years that "something was up", has always felt that 3ware on Windows does not have these problems, so that's another subject for testing.

I completely agree that 3Ware performance is very poor on a modern Linux system.

(I think that the author of the 3Ware tuning documents realizes this. The 3Ware tuning documents carefully define "performance" to mean "sequential read throughput", and then tell you to increase Linux readahead. That works fine for some carefully-chosen benchmarks, but it paints a deceptive picture of the card's general performance. Performance under a general-purpose workload is very, very poor.)

I've been investigating the performance problem for a while now, and the bulk of evidence points to a problem with how completed I/O requests are passed back from the card's firmware to the Linux driver.

To locate the bottleneck, I instrumented the 2.6 Linux kernel to capture data about filesystem read/write activity. I collected the data using blktrace and postprocessed it with a Python program that tracked the state of each I/O request through the kernel and 3Ware driver.

To collect my data, I ran 24 iozone processes, each writing 8K chunks in O_SYNC mode to a unique file on a RAID-10 jfs filesystem served by a single 3Ware 9550SX-8LP card. This models something that postgres does quite frequently. Since each write is synchronous, the Linux buffer cache is out of the picture, making it much easier to understand what's happening at the device level.
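(For anyone who wants to approximate this workload without iozone, each writer is doing roughly the following - a minimal sketch, not iozone's actual code, with the filename and iteration count as placeholders.)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char chunk[8192];
    memset(chunk, 'x', sizeof chunk);

    /* O_SYNC: every write() blocks until the data is on stable storage,
       so the pagecache can't absorb the writes */
    int fd = open("testfile.0", O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 100000; i++) {
        if (write(fd, chunk, sizeof chunk) != sizeof chunk) {
            perror("write");
            break;
        }
    }
    close(fd);
    return 0;
}

With 24 of these running at once, the syscall rate is governed almost entirely by how quickly the controller completes and acknowledges each write.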

The Linux kernel puts a block queue in front of the 3Ware driver. Linux queues have I/O schedulers that can implement sophisticated strategies for holding, splitting, and merging requests. For this test, I used the "noop" scheduler; it just inserts requests onto the queue as quickly as they arrive and does no other processing on them. Requests pass straight into the 3Ware driver as quickly as its queuecommand function will accept them.

By plotting the collected trace data, the problem is clearly visible. I'd expect the steady state to show an empty block queue, with ~24 active writes within the driver at all times. (It takes very little time for a Linux process to get a return code back from a write syscall and initiate another write request.) I'd expect to see a steady stream of completed write responses from the driver.

But this isn't what I see. Instead, write requests flow through the kernel's block queue and into the 3Ware driver, which returns them in batches at a rate of about 10Hz.

My gut tells me that this is closely related to the performance problem with the 3Ware controller: If it's presented with more than a small number of concurrent write requests, it appears to switch into a "batch mode" where it buffers completed write responses, returning them in batches at a glacially slow 10Hz rate. (Something similar may be true for reading; I haven't investigated that.)
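If that hypothesis is right, the arithmetic is brutal: with ~24 synchronous writers each waiting on a single outstanding request, completions that only come back ~10 times a second cap the whole array at roughly 24 x 10 = 240 writes/sec, no matter how many spindles sit behind it.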

I've been passing my findings along to 3Ware, but I'm still waiting for their response.

I completely agree that 3Ware performance is very poor on a modern Linux system.

To collect my data, I ran 24 iozone processes, each writing 8K chunks in O_SYNC mode to a unique file on a RAID-10 jfs filesystem served by a single 3Ware 9550SX-8LP card. This models something that postgres does quite frequently. Since each write is synchronous, the Linux buffer cache is out of the picture, making it much easier to understand what's happening at the device level.

Very interesting stuff. Until now I had only tested reads, because I was comparing with other hardware - and since reads are non-destructive, I could run comparisons on other boxes of mine that are in use.

Why O_SYNC and not O_DIRECT?

It would be interesting to see the instrumentation done with LTP's disktest using the -I BD bio API, so that the pagecache is bypassed entirely (along with the filesystem)... Quick testing with 24 write threads on my 9650SE RAID5 (same config listed above), and the bottleneck you describe doesn't seem to manifest itself.

root@storage2 ~ # disktest -B 8k -I BD -K 24 -p l -P A -w -T 120 -N 6695042048 /dev/sde
| 2007/09/17-10:04:33 | START | 18026 | v1.2.8 | /dev/sde | Start args: -B 8k -I BD -K 24 -p l -P A -w -T 120 -N 6695042048 (-c) (-p u)
| 2007/09/17-10:04:33 | INFO  | 18026 | v1.2.8 | /dev/sde | Starting pass
| 2007/09/17-10:06:32 | STAT  | 18026 | v1.2.8 | /dev/sde | 17459126272 bytes written in 2131241 transfers.
| 2007/09/17-10:06:32 | STAT  | 18026 | v1.2.8 | /dev/sde | Write throughput: 147958697.2B/s (141.10MB/s), IOPS 18061.4/s.
| 2007/09/17-10:06:32 | STAT  | 18026 | v1.2.8 | /dev/sde | Write Time: 118 seconds (0h1m58s)
| 2007/09/17-10:06:32 | STAT  | 18026 | v1.2.8 | /dev/sde | Total bytes written in 2131241 transfers: 17459126272
| 2007/09/17-10:06:32 | STAT  | 18026 | v1.2.8 | /dev/sde | Total write throughput: 147958697.2B/s (141.10MB/s), IOPS 18061.4/s.
| 2007/09/17-10:06:32 | STAT  | 18026 | v1.2.8 | /dev/sde | Total Write Time: 118 seconds (0d0h1m58s)
| 2007/09/17-10:06:32 | STAT  | 18026 | v1.2.8 | /dev/sde | Total overall runtime: 120 seconds (0d0h2m0s)
| 2007/09/17-10:06:32 | END   | 18026 | v1.2.8 | /dev/sde | Test Done (Passed)

Here are 8K writes, 24 threads, with BD... ~18,000 IOPS, and during this about 18,000 IRQs/sec are generated, so the completion batching you describe doesn't appear to occur here.

The noop scheduler is in effect here; same results with deadline.

But this isn't what I see. Instead, write requests flow through the kernel's block queue and into the 3Ware driver, which returns them in batches at a rate of about 10Hz.

My gut tells me that this is closely related to the performance problem with the 3Ware controller: If it's presented with more than a small number of concurrent write requests, it appears to switch into a "batch mode" where it buffers completed write responses, returning them in batches at a glacially slow 10Hz rate. (Something similar may be true for reading; I haven't investigated that.)

I've been passing my findings along to 3Ware, but I'm still waiting for their response.

This is very interesting, thanks for posting it. I don't have the knowledge to do the sort of investigation you've achieved; I'm limited to running various benchmarking programs and attempting to understand the results.

Since in other places I've been asked "Why don't you try RAID-10?", I decided to experiment with different types of RAID setup to see what difference they might make. As mentioned above, I initially saw this problem with a simple RAID 1 config. The easiest alternative to try first was to convert my two hot spares into a RAID 0 array and create a default ext3 partition upon which to experiment. (Edit: in the results below, the dd tests were done prior to making the filesystem and running the bonnie++ tests.)

Here are the results of some benchmarks.

The machine is a dual Opteron 2.4GHz and has 4GB RAM installed, with a 9550SX-8LP hosting four Seagate ST3250820SV drives.

First, with no 3ware-recommended kernel tweaks applied (CentOS 4.5 blockdev readahead and nr_requests are 256 and 8192 by default - it's a straightforward minimal install, with 9550SX firmware and driver from the same 3ware codeset (9.4.1.2), the driver being built-in to the 2.6.9-55.EL kernel in the distro).

Readahead = 256
nr_requests = 8192
Reading: dd if=/dev/sdb of=/dev/null bs=1M count=XXXX
dd read 1024 MB at 123.82 MB/s in 8.27 secs
dd read 2048 MB at 125.80 MB/s in 16.28 secs
dd read 4096 MB at 125.22 MB/s in 32.71 secs
dd read 8192 MB at 125.18 MB/s in 65.44 secs

Writing: dd if=/dev/zero of=/dev/sdb bs=1M count=XXXX
dd write 1024 MB at 97.15 MB/s in 10.54 secs
dd write 2048 MB at 93.60 MB/s in 21.88 secs
dd write 4096 MB at 90.22 MB/s in 45.40 secs
dd write 8192 MB at 91.63 MB/s in 89.40 secs

Label: RA-256_NR-8192
bonnie++ -m RA-256_NR-8192 -n 0 -u 0 -r 512 -s 20480 -f -b
Version  1.03	   ------Sequential Output------ --Sequential Input- --Random-
				-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine		Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RA-256_NR-8192  20G		   50014  18 44715  12		   126556  14 137.0   0

Then I applied the 3ware tweaks (readahead 16384, nr_requests 512) and re-tested:

Readahead = 16384
nr_requests = 512
Reading: dd if=/dev/sdb of=/dev/null bs=1M count=XXXX
dd read 1024 MB at 149.71 MB/s in 6.84 secs
dd read 2048 MB at 151.70 MB/s in 13.50 secs
dd read 4096 MB at 152.15 MB/s in 26.92 secs
dd read 8192 MB at 153.61 MB/s in 53.33 secs

Writing: dd if=/dev/zero of=/dev/sdb bs=1M count=XXXX
dd write 1024 MB at 88.28 MB/s in 11.60 secs
dd write 2048 MB at 89.90 MB/s in 22.78 secs
dd write 4096 MB at 89.51 MB/s in 45.76 secs
dd write 8192 MB at 87.89 MB/s in 93.21 secs

Label: RA-16384_NR-512
bonnie++ -m RA-16384_NR-512 -n 0 -u 0 -r 512 -s 20480 -f -b
Version  1.03	   ------Sequential Output------ --Sequential Input- --Random-
				-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine		Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RA-16384_NR-512 20G		   62753  22 55895  16		   150694  18 127.3   0

Next, I left the readahead at 16384 and returned nr_requests to its 8192 default and re-tested again:

Readahead = 16384
nr_requests = 8192
Reading: dd if=/dev/sdb of=/dev/null bs=1M count=XXXX
dd read 1024 MB at 149.71 MB/s in 6.84 secs
dd read 2048 MB at 152.72 MB/s in 13.41 secs
dd read 4096 MB at 153.35 MB/s in 26.71 secs
dd read 8192 MB at 153.15 MB/s in 53.49 secs

Writing: dd if=/dev/zero of=/dev/sdb bs=1M count=XXXX
dd write 1024 MB at 97.15 MB/s in 10.54 secs
dd write 2048 MB at 97.06 MB/s in 21.10 secs
dd write 4096 MB at 93.45 MB/s in 43.83 secs
dd write 8192 MB at 91.90 MB/s in 89.14 secs

Label: RA-16384_NR-8192
bonnie++ -m RA-16384_NR-8192 -n 0 -u 0 -r 512 -s 20480 -f -b
Version  1.03	   ------Sequential Output------ --Sequential Input- --Random-
				-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine		Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RA-16384_NR-819 20G		   57576  20 55535  16		   151212  18 126.4   0

Finally, for completeness, I set readahead back to 256 and changed nr_requests to 512:

Readahead = 256
nr_requests = 512
Reading: dd if=/dev/sdb of=/dev/null bs=1M count=XXXX
dd read 1024 MB at 123.97 MB/s in 8.26 secs
dd read 2048 MB at 125.26 MB/s in 16.35 secs
dd read 4096 MB at 125.37 MB/s in 32.67 secs
dd read 8192 MB at 125.09 MB/s in 65.49 secs

Writing: dd if=/dev/zero of=/dev/sdb bs=1M count=XXXX
dd write 1024 MB at 91.43 MB/s in 11.20 secs
dd write 2048 MB at 90.14 MB/s in 22.72 secs
dd write 4096 MB at 89.55 MB/s in 45.74 secs
dd write 8192 MB at 92.17 MB/s in 88.88 secs

Label: RA-256_NR-512
bonnie++ -m RA-256_NR-512 -n 0 -u 0 -r 512 -s 20480 -f -b
Version  1.03	   ------Sequential Output------ --Sequential Input- --Random-
				-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine		Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RA-256_NR-512   20G		   61808  22 46250  12		   126959  14 137.1   0

So, the 3ware tweak readahead = 16384 improves read throughput, but their recommended nr_requests = 512 reduces write throughput. Neither of these params appears to have any impact whatsoever on the underlying problem of sluggish system response, however. Frankly, I didn't expect them to.

I hope 3ware get back to you about your findings - so far I've not had much success in getting a response out of them. Please keep us posted.

S.


This is a great thread - I've been seeing some horrible performance on some of my boxes with 9550SXs.

One of them is currently in a JBOD configuration with software RAID, which is much faster and more responsive than the controller's own RAID.

I've been tempted on the advice of others to get an Areca to see what benefits I can get.

In my case, I couldn't get the card to complete more than 400 IO/sec aggregate (never more than 120 in a single process) - in this case seeking randomly and reading 8K, mimicking postgresql's IO patterns. For comparison, I get thousands on another box with an HP P600 controller and a pile of SAS disks. During a lot of those runs the system gets very laggy and load goes up.

So we'll see what happens.


Same here: slow random I/O (using a database).

I published an overview, and I'm in the process of refreshing all that information after completing a bunch of tests with a 9650, for both the controller-implemented RAID (mostly RAID10) and the Linux 'md' one.

If you want me to run a given test, please describe it to me, along with the reason why it is pertinent.

Same here: slow random I/O (using a database).

I published an overview, and I'm in the process of refreshing all that information after completing a bunch of tests with a 9650, for both the controller-implemented RAID (mostly RAID10) and the Linux 'md' one.

If you want me to run a given test, please describe it to me, along with the reason why it is pertinent.

Some great, thorough work and research you've done. It pretty much sums up all the info and tweaks I've ever tried. There is one thing that puzzled me a bit about your tweaking, though. If dealing with small 64k reads a majority of the time, isn't setting read-ahead counter-productive? After all, if the reads are random, virtually nothing it reads ahead would be useful, would it? To that end, perhaps setting a large read-ahead with blockdev --setra is not a good idea, and you might even try turning off read-ahead on your controller (although if it's adaptive read-ahead, it should learn fairly quickly that reading ahead isn't useful and not do it).

If dealing with small 64k reads a majority of the time, isn't setting read-ahead counter-productive?

If the read-ahead costs 'nothing', it isn't. The idea is that after a useful read there is some other data available without any head move - the platter keeps spinning - so let's read it and place it in a cache, because some access patterns request a set of blocks, then the next one... This is a classic approach, but we have to take some parameters into account:

  • if such read-ahead keeps the head busy for longer, preventing it from moving in order to serve the next request, we have a problem, because we trade latency (the next request waits) for a potential cache hit. The hit probability is roughly the ratio (total amount of useful data / amount of data read ahead), which is very low with a big database, so once latency is taken into account such read-ahead may look like a bad decision. BUT... after various tests I'm wondering what some/all disk logic can really do. I mean: some tests (random IO seems slightly better with a little bit of read-ahead) make me think that some disk actuators idle for a short period after each head move. In that case there is always some latency - the head won't move even if a request comes in. It has to settle (it seems strange to me that it can read and then settle, although "settle then read" makes sense), or the logic has to do some calculation, or something forbids doing too many moves per unit of time(?). In that case we can see why reading whatever is under the head may be useful. It's just like CPU cycles: use them or lose them. Warning: all this may be misinterpretation on my part, or peculiar to some brands or models (maybe only observed on low-end drives, maybe even put there on purpose, to soften competition between the 'desktop' and 'server' disk ranges... well, I'm in paranoid mode!).
  • filling the disk's own cache does not use central (core) RAM, but AFAIK that is not what blockdev lets us tweak, because blockdev acts at the filesystem level, not at the drive level. At the drive level this is called look-ahead, and it is tweaked for example through hdparm or some low-level tool. It is even more complicated with a RAID controller, because AFAIK the controller can tweak this, and does so. In fact I think this may be part of the explanation for 3Ware's bad performance on random I/O
  • no read-ahead should imply cross-spindle activity. I mean: a read-ahead that has to read from another spindle is often (performance-wise) catastrophic. The probability of this happening is proportional to the size of the read-ahead

Any pertinent input will be welcome.

if it's adaptive read-ahead, it should learn fairly quickly that reading ahead isn't useful and not do it

Are you aware of any adaptive read-ahead logic on Linux? There was a kernel patch approximately 3 years ago but I can't find anything any more.

Thank you

If the read-ahead costs 'nothing', it isn't. [...]

Are you aware of any adaptive read-ahead logic on Linux? There was a kernel patch approximately 3 years ago but I can't find anything any more.

Well, you appear to have far more means to test this stuff than I do; I was only speaking theoretically. If you set readahead to 0 with blockdev, is IOPS performance impacted though?

As for the adaptive read-ahead, I meant the setting on the hardware card for the array, not done by Linux.

Among my own observations: I thought that even if I set the blockdev readahead to 0, since the card I can test supposedly has adaptive read-ahead (Areca 1280), it would still read ahead and performance wouldn't be affected. As in, I thought that having read-ahead done by blockdev and by the card might make them interfere, or at least that I could use one or the other - preferably the card's, since it's intelligent enough to know when to do so and when not to. However, setting blockdev to 0 resulted in 140mb/s on hdparm, whilst ANY setting greater than 0 gave at least 700mb/s. So I definitely found that surprising. Also, that card appears to have a limit at 829mb/s - I could not go any further regardless of whether my blockdev was 512 or 16384.

If you set readahead to 0 with blockdev, is IOPS performance impacted though?

With most applications it does not matter, because they use some syscall or open(2) argument which disables read-ahead for their requests. I tried with a (deliberately) badly-written application, and read-ahead slows it down, albeit to a lesser extent than predicted, even when the (buffer cache size / size of the file being read randomly) ratio is low. Maybe because a fair part of the read-ahead happens during other latencies (data transfer, request creation, elevator periods, command queuing in the drive...?)
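(The usual mechanisms here are posix_fadvise(POSIX_FADV_RANDOM), which turns off kernel read-ahead for one file descriptor, and O_DIRECT, which bypasses the pagecache altogether - a minimal sketch, with "data.db" as a placeholder:)

#define _GNU_SOURCE              /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    /* option 1: keep the pagecache, but tell the kernel not to read ahead */
    int fd = open("data.db", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

    /* option 2: bypass the pagecache (and its read-ahead) entirely;
       reads then need sector-aligned buffers, offsets and sizes */
    int fd2 = open("data.db", O_RDONLY | O_DIRECT);
    if (fd2 < 0) perror("open O_DIRECT");

    /* ... issue reads on fd or fd2 as usual ... */
    return 0;
}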

As for the adaptive read-ahead, I meant the setting on the hardware card for the array, not done by linux.

From the system viewpoint some tend to call it "look-ahead" rather than "read-ahead". AFAIK really smart adaptive read-ahead (beyond "enable read-ahead if the last N+1 sectors read were adjacent") is only used in high-end autonomous cabinets: is there any low-end (sold for less than 5000 USD) RAID controller doing it?

I thought that even if I set the blockdev readahead to 0, since the card I can test supposedly has adaptive read-ahead (Areca 1280), it would still read ahead and performance wouldn't be affected. As in, I thought that having read-ahead done by blockdev and by the card might make them interfere

They won't, as the card AFAIK has no means of knowing whether a given request was or wasn't extended by read-ahead. I mean: I could not see any provision for this in a driver. If I'm right, the card will "read ahead" beyond the request, filling its own cache. Therefore it will work as intended.

However, setting blockdev to 0 resulted in 140mb/s on hdparm, whilst ANY setting greater than 0 gave at least 700mb/s.

Does 'mb' stand for 'megabyte'? Sorry to be picky, but for me "Mb" is megabits and "MB" is megabytes (not quite the same!). The kernel may also make weird use of the read-ahead parameter by sending requests sized according to its value (in which case I'm afraid there is a bug somewhere).

So I definitely found that surprising. Also, that card appears to have a limit at 829mb/s

If I understand correctly (that's approx. 800 MB/s), the bus or an associated chip may also be a limiting factor.

So what is up, 3ware? The competing card is faster by an order of magnitude, and the advice you offer for improving performance is to increase the Linux readahead value?

A colleague of mine, who has agreed with me for years that "something was up", has always felt that 3ware on Windows does not have these problems, so that's another subject for testing.

I can see this thread has been cold for a while, but it has been a great help to me in trying to work out what's wrong with these bloody 3ware cards!

After reading the OP's thread and following links posted by other members, I decided to do a test myself with a 9650SE 2LP on Ubuntu 8.04 Server on a Dell SC1435 (2 x dual-core Opterons with 7200 RPM SATA 2 drives) in RAID 1, with 8GB RAM.

After going through all the pain and suffering of systems just randomly rebooting (the 3ware NOAPIC problem), and also my systems 'sometimes' not detecting the 3ware BIOS on boot on Dell and IBM boxes (the "bring it in for a replacement because we messed up the BIOS settings on our card" jobbie), I was left with at least a stable system, but really bad performance.

So, thanks to this thread, I've benchmarked the two systems, one with the 3ware 9650SE and one with a Linux software RAID 1 config, and I can't believe the difference.

The 3ware machine (tweaked with the settings suggested by 3ware) was bloody awful. I'm not as technically savvy as other posters here, but all I did was a real-world operation: use the good ole 'cp' command to copy a big file from one directory to another on the same mount point.

On a 2GB file with the 3ware card, my load shot up to 3.7 with i/o wait % peaking at about 90%.

With the same file on the software raid, my load hit 1.2 with an i/o wait % peaking at about 50%.

I didn't time it (I'll do that on the next test), but my main concern was the large load that a simple cp operation was putting on the system due to waiting around for I/O. I would have thought a software-based implementation would chew up CPU, but this is really surprising.

All I can say is that I'm glad I've got more responsive servers now, and that I don't have these 'mega' systems sitting idle while the 3ware array takes its own sweet bloody time queuing requests from the OS to disk. I'm really pissed off at 3ware for selling such a crap 'Linux supported' product when all their literature and website say the contrary.

Thanks!

Nasir.

All I can say is that I'm glad I've got more responsive servers now, and that I don't have these 'mega' systems sitting idle while the 3ware array takes its own sweet bloody time queuing requests from the OS to disk. I'm really pissed off at 3ware for selling such a crap 'Linux supported' product when all their literature and website say the contrary.

Thanks!

Nasir.

I feel your pain. I've read everything Google turned up on the topic, including suggestions of kernel recompiles and group scheduler settings, all to no avail. Our dual Harpertown 2.33GHz Xeon server with 12GB is brought to its knees during large file copies using a 3ware 9550SXU with a 4-drive RAID5. Even an ls from an SSH prompt takes over a minute to return data. This behavior is not exhibited using the Intel 5000V onboard SATA in software RAID5, which I would settle for just fine if I hadn't already paid $400 for the 3ware controller.

One thing I can't find any information on is whether the same behavior shows up in the Windows 2003 x64 environment. Guess I'll go see for myself.

