Storage Forums: Too many years of awful 3ware performance. - Storage Forums


Too many years of awful 3ware performance. Some data to demonstrate.

#1 benjamin9999

  • Member
  • Group: Member
  • Posts: 7
  • Joined: 31-August 07

Posted 31 August 2007 - 01:35 AM

Hey guys. New here, but i know this is the place to start some serious talk on this vendor's cards.

I've been using 3ware cards here and there since the first escalade cards were available.

Mainline linux drivers are a nice thing, and mostly the firmware and such is of satisfactory quality. Let's not get too carried away, though: they generally look good because the rest of the field is so poor.

Performance has always "felt" poor, on any card. Even when the 9500 killer card was new, and i had a setup with 8 disks, things were not great, despite benchmarks which demonstrated massive speeds.

It's late so i'll cut to the meat.

Most people want better performance, and quick googling gets you to the blockdev --setra stuff. Great. So you have a massive readahead. Now you run some dd, or bonnie++, or whatever VFS-layer operation you want, and get massive read speeds: 200, 300 MB/sec.

Now do some bonnie++ or dd tests for writes and see some big output. Ok great, you can fill your pagecache and linux can async write stuff in the background as long as you have memory. Depending on the ratio of free pages to disk speed you'll see some nice numbers.

But not much of that is of any use, unless you're purely in the business of shuffling around huge amounts of data. And if you fill your page cache with dirty pages you'll start to see a sluggish system, since the queue is deep and IO starts to block in other places -- now even the mp3 you were streaming at the same time will be in trouble.

So specs sell. And if people see 300 MB/sec read/write in dd you'll have the market.

Ok so enough of all that. Filesystem operations occur in 4K blocks. And most applications do not perform async IO. Maybe postgresql, MSSQL and some smart apps like that do.

Imagine...

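/* f is an already-open FILE * (needs <stdio.h>) */
int c;   /* int, not char, so EOF is distinguishable */
while ( (c = fgetc( f )) != EOF ) {
    /* do something with c */
}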

stdio buffers this, so it's not literally one IO per character, but every time the buffer runs dry the loop blocks on a small synchronous read, so we need each of those reads to complete with the lowest latency possible. Read-ahead of 16384 in a multi-process environment? That's a big overhead for these small reads, certainly detrimental. But this type of IO pattern is happening all of the time.
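To put a number on that read-ahead value: blockdev --setra counts 512-byte sectors, so 16384 is 8 MiB. The sketch below is only that arithmetic, compared against a single 4K request. As far as I know the kernel ramps read-ahead up adaptively, so treat the result as a ceiling rather than the cost of every small read.

```python
SECTOR_BYTES = 512  # blockdev --setra takes its value in 512-byte sectors


def readahead_bytes(setra_sectors):
    """Convert a blockdev --setra value into bytes."""
    return setra_sectors * SECTOR_BYTES


window = readahead_bytes(16384)   # the setting mentioned in the post
request = 4096                    # one 4K read

print(f"read-ahead window: {window // (1024 * 1024)} MiB")
print(f"window is {window // request}x the size of one 4K request")
```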

Some IO systems like DRBD will perform only in 4K blocksize and with a full write sync at the same time, so this latency is critical.

Any performance gains from readahead, or async-pagecache-writes are purely a function of linux, RAM, and spindles. 3ware makes no difference here.

Now let's reveal how poor 3ware's latency is, and reevaluate all those times where we wondered -what was going on- ??

BTW these tests are on two similar boxes: server-class boards with 2 GHz SMP CPUs and 1GB RAM. One has a 3ware 9650SE with 14 * 500GB in RAID5. The other has an Areca 1261-ML with 14 * 500GB in RAID5.

This issue is easy to demonstrate on previous 3ware models as well, although i do not currently have a setup to run similar comparisons.

Using the (great) Linux Test Project (LTP) 'disktest' we can test with pure block IO (bypassing the pagecache) at 4K, with any number of threads, and even tweak the range of sectors. Random and linear seek patterns are also possible.

root@storage2 ~ # disktest -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:1000000000 /dev/sde
| 2007/08/31-02:15:39 | START | 7300 | v1.2.8 | /dev/sde | Start args: -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:1000000000 (-N 1000000001) (-r) (-c) (-p u) 
| 2007/08/31-02:15:39 | INFO  | 7300 | v1.2.8 | /dev/sde | Starting pass
| 2007/08/31-02:15:53 | STAT  | 7300 | v1.2.8 | /dev/sde | 327397376 bytes read in 79931 transfers.
| 2007/08/31-02:15:53 | STAT  | 7300 | v1.2.8 | /dev/sde | Read throughput: 23385526.9B/s (22.30MB/s), IOPS 5709.4/s.
| 2007/08/31-02:15:53 | STAT  | 7300 | v1.2.8 | /dev/sde | Read Time: 14 seconds (0h0m14s)
| 2007/08/31-02:15:53 | STAT  | 7300 | v1.2.8 | /dev/sde | Total bytes read in 79931 transfers: 327397376
| 2007/08/31-02:15:53 | STAT  | 7300 | v1.2.8 | /dev/sde | Total read throughput: 23385526.9B/s (22.30MB/s), IOPS 5709.4/s.
| 2007/08/31-02:15:53 | STAT  | 7300 | v1.2.8 | /dev/sde | Total Read Time: 14 seconds (0d0h0m14s)
| 2007/08/31-02:15:53 | STAT  | 7300 | v1.2.8 | /dev/sde | Total overall runtime: 15 seconds (0d0h0m15s)
| 2007/08/31-02:15:53 | END   | 7300 | v1.2.8 | /dev/sde | Test Done (Passed)
root@storage2 ~ # 


Ok so 4K blocksize, single thread, linear read (the -s sector range setting is because this 6TB volume is too big for disktest to handle) and we have 5709.4 IOPS !?

A single decent 7200 rpm SATA should get at least 6000 by itself.

Ok, but without reading ahead, we really can't effectively use all these spindles anyway.

Let's remove the disks completely, and just test round-trip to the 3ware card.

By setting sector limit to -s 0:8, we'll just be reading the same 4k block over and over.

root@storage2 ~ # disktest -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:8 /dev/sde
| 2007/08/31-02:19:56 | START | 7308 | v1.2.8 | /dev/sde | Start args: -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:8 (-N 9) (-r) (-c) (-p u) 
| 2007/08/31-02:19:56 | INFO  | 7308 | v1.2.8 | /dev/sde | Starting pass
| 2007/08/31-02:20:11 | STAT  | 7308 | v1.2.8 | /dev/sde | 348639232 bytes read in 85117 transfers.
| 2007/08/31-02:20:11 | STAT  | 7308 | v1.2.8 | /dev/sde | Read throughput: 24902802.3B/s (23.75MB/s), IOPS 6079.8/s.
| 2007/08/31-02:20:11 | STAT  | 7308 | v1.2.8 | /dev/sde | Read Time: 14 seconds (0h0m14s)
| 2007/08/31-02:20:11 | STAT  | 7308 | v1.2.8 | /dev/sde | Total bytes read in 85117 transfers: 348639232
| 2007/08/31-02:20:11 | STAT  | 7308 | v1.2.8 | /dev/sde | Total read throughput: 24902802.3B/s (23.75MB/s), IOPS 6079.8/s.
| 2007/08/31-02:20:11 | STAT  | 7308 | v1.2.8 | /dev/sde | Total Read Time: 14 seconds (0d0h0m14s)
| 2007/08/31-02:20:11 | STAT  | 7308 | v1.2.8 | /dev/sde | Total overall runtime: 15 seconds (0d0h0m15s)
| 2007/08/31-02:20:11 | END   | 7308 | v1.2.8 | /dev/sde | Test Done (Passed)
root@storage2 ~ # 


great, we're up to 6079 IOPS, reading blocks that should come right from the 3ware cache every time, or at worst, the buffers on the spindle.


Let's compare with Areca's competing card...

root@storage1 ~ # disktest -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:8 /dev/sde    
| 2007/08/31-02:22:16 | START | 10788 | v1.2.8 | /dev/sde | Start args: -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:8 (-N 9) (-r) (-c) (-p u) 
| 2007/08/31-02:22:16 | INFO  | 10788 | v1.2.8 | /dev/sde | Starting pass
| 2007/08/31-02:22:31 | STAT  | 10788 | v1.2.8 | /dev/sde | 2737975296 bytes read in 668451 transfers.
| 2007/08/31-02:22:31 | STAT  | 10788 | v1.2.8 | /dev/sde | Read throughput: 182531686.4B/s (174.08MB/s), IOPS 44563.4/s.
| 2007/08/31-02:22:31 | STAT  | 10788 | v1.2.8 | /dev/sde | Read Time: 15 seconds (0h0m15s)
| 2007/08/31-02:22:31 | STAT  | 10788 | v1.2.8 | /dev/sde | Total bytes read in 668451 transfers: 2737975296
| 2007/08/31-02:22:31 | STAT  | 10788 | v1.2.8 | /dev/sde | Total read throughput: 182531686.4B/s (174.08MB/s), IOPS 44563.4/s.
| 2007/08/31-02:22:31 | STAT  | 10788 | v1.2.8 | /dev/sde | Total Read Time: 15 seconds (0d0h0m15s)
| 2007/08/31-02:22:31 | STAT  | 10788 | v1.2.8 | /dev/sde | Total overall runtime: 15 seconds (0d0h0m15s)
| 2007/08/31-02:22:31 | END   | 10788 | v1.2.8 | /dev/sde | Test Done (Passed)
root@storage1 ~ # 


OK now we have some low latency. 44,563 IOPS reading the same 4K block.
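Since these runs use a single thread (-K 1), IOPS converts directly into mean time per request. A small sketch of that conversion, using the same-block figures from the two runs above; the function name is just for illustration.

```python
def per_request_latency_us(iops, threads=1):
    """Mean time per request in microseconds with `threads` requests in flight."""
    return threads * 1_000_000 / iops


# Same-4K-block results reported by disktest in this thread.
results = {
    "3ware 9650SE": 6079.8,
    "Areca 1261-ML": 44563.4,
}

for card, iops in results.items():
    print(f"{card}: {per_request_latency_us(iops):.1f} us per 4K read")
```

That works out to roughly 164 us per request on the 3ware box against roughly 22 us on the Areca box.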

what about the full volume (or at least the part which fits in disktest's limits)?

root@storage1 ~ # disktest -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:1000000000 /dev/sde 
| 2007/08/31-02:23:33 | START | 10798 | v1.2.8 | /dev/sde | Start args: -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:1000000000 (-N 1000000001) (-r) (-c) (-p u) 
| 2007/08/31-02:23:33 | INFO  | 10798 | v1.2.8 | /dev/sde | Starting pass
| 2007/08/31-02:23:48 | STAT  | 10798 | v1.2.8 | /dev/sde | 2457137152 bytes read in 599887 transfers.
| 2007/08/31-02:23:48 | STAT  | 10798 | v1.2.8 | /dev/sde | Read throughput: 175509796.6B/s (167.38MB/s), IOPS 42849.1/s.
| 2007/08/31-02:23:48 | STAT  | 10798 | v1.2.8 | /dev/sde | Read Time: 14 seconds (0h0m14s)
| 2007/08/31-02:23:48 | STAT  | 10798 | v1.2.8 | /dev/sde | Total bytes read in 599887 transfers: 2457137152
| 2007/08/31-02:23:48 | STAT  | 10798 | v1.2.8 | /dev/sde | Total read throughput: 175509796.6B/s (167.38MB/s), IOPS 42849.1/s.
| 2007/08/31-02:23:48 | STAT  | 10798 | v1.2.8 | /dev/sde | Total Read Time: 14 seconds (0d0h0m14s)
| 2007/08/31-02:23:48 | STAT  | 10798 | v1.2.8 | /dev/sde | Total overall runtime: 15 seconds (0d0h0m15s)
| 2007/08/31-02:23:48 | END   | 10798 | v1.2.8 | /dev/sde | Test Done (Passed)
root@storage1 ~ # 


So what is up, 3ware? The competing card is more than 7 times faster, and the advice you offer for improving performance is to increase the linux readahead value?

A colleague of mine who has agreed with me for years that "something was up" has always felt that 3ware on Windows does not have such problems, so that is another subject up for testing.

#2 benjamin9999

  • Member
  • Group: Member
  • Posts: 7
  • Joined: 31-August 07

Posted 31 August 2007 - 06:50 PM

benjamin9999, on Aug 31 2007, 02:35 AM, said:

Hey guys. New here, but i know this is the place to start some serious talk on this vendor's cards.


bumping my own topic, to add some more detail...

after writing this i started to wonder if 3ware does interrupt coalescence, to minimize cpu usage - this technique is commonplace in networking chipsets like the e1000, or tg3.

so while running the same 4K same-block testing, vmstat can give KB/sec vs. interrupts for these two cards.

4K same-sectors block-direct reading on 3ware:
root@storage2 ~ # disktest -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:8 /dev/sde
| 2007/08/31-18:22:48 | STAT  | 9330 | v1.2.8 | /dev/sde | Total read throughput: 24936448.0B/s (23.78MB/s), IOPS 6088.0/s.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  1      0 949680  16428  18992    0    0 24312     0 6337 12202  0  4 50 46
 0  1      0 949680  16428  18992    0    0 24312     0 6336 12194  0  2 50 48
 0  1      0 949680  16428  18992    0    0 24316     0 6332 12179  0  2 50 48
 


6337 - 250 (linux HZ) = 6087. So 1 IRQ per 4K block occurs.

now on Areca, same testing
root@storage1 ~ # disktest -B 4k -I BD -K 1 -p l -P A -T 15 -s 0:8 /dev/sde
| 2007/08/31-18:21:19 | STAT  | 13196 | v1.2.8 | /dev/sde | Total read throughput: 180495974.4B/s (172.13MB/s), IOPS 44066.4/s.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  1      0 314072  49372 592912    0    0 177972     0 44744 88996  0 26 50 25
 1  1      0 314072  49372 592912    0    0 176900     0 44491 88469  1 25 50 24
 1  0      0 314072  49372 592912    0    0 177060     0 44527 88544  2 25 50 24


44744 - 250 (linux HZ) = 44494. So also 1 IRQ per 4K block occurs.
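The same subtraction for both cards, written out as a sketch. It takes the post's assumption of 250 timer interrupts per second at face value and divides what is left by the disktest IOPS figure. The vmstat samples and the disktest averages cover slightly different intervals, so the ratios only come out near 1, not exactly 1.

```python
HZ = 250  # timer interrupts per second, as assumed in the post


def irqs_per_io(interrupts_per_sec, iops, timer_hz=HZ):
    """Interrupts left over after the timer tick, per completed IO."""
    return (interrupts_per_sec - timer_hz) / iops


three_ware = irqs_per_io(6337, 6088.0)    # vmstat 'in' column, disktest IOPS
areca = irqs_per_io(44744, 44066.4)
ratio = (44744 - HZ) / (6337 - HZ)        # Areca vs 3ware, by interrupt count

print(f"3ware: {three_ware:.3f} IRQs per IO")
print(f"Areca: {areca:.3f} IRQs per IO")
print(f"Areca / 3ware: {ratio:.2f}x")
```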


By this measurement, the Areca card is 7.3 times faster than 3ware's competing offering.

Unless this is a serious design flaw in the hardware / silicon, it seems like this problem should be fixable.


Now it looks even worse for 3ware, since it's (mostly) clear that Interrupt coalescence is not taking place in this card.

#3 chrispitude

  • Member
  • Group: Member
  • Posts: 79
  • Joined: 26-October 02

Posted 03 September 2007 - 07:27 PM

Very nice research. I have a 3ware and get great sustained read/write numbers (with the right buffering tweaks as you mention), but the machine always felt sluggish under heavy random I/O. It'd be great if some driver improvements come out of all this. Hopefully 3ware is listening. Perhaps you should open a support case with them and share your findings.

- Chris

#4 continuum

  • Mod
  • Group: Mod
  • Posts: 2,392
  • Joined: 31-December 01

Posted 04 September 2007 - 12:16 PM

Hmm. Never did any heavy random I/O testing here-- my customers' applications don't demand any heavy random I/O, and we've always been satisfied with 3ware vs. the competition. We do several hundred MB/sec sustained reads and writes across multiple controllers.

I assume you are running matched firmware/driver revs per 3ware instructions? Just checking.

#5 benjamin9999

  • Member
  • Group: Member
  • Posts: 7
  • Joined: 31-August 07

Posted 05 September 2007 - 05:32 AM

continuum, on Sep 4 2007, 01:16 PM, said:

Hmm. Never did any heavy random I/O testing here-- my customers' applications don't demand any


just to clarify, this really isn't about -random- I/O. notice in the tests that the access pattern (-p) is linear (-p l). also, this test was done with the latest bios/firmware image from the release-series, and mainline driver from 2.6.22.5.

this is about round trip latency for a single request.

IMO, this is really the most important measurement... layers built on top of a low-latency i/o path will have no problem getting high throughput with readahead, async-i/o, write-back etc. but only as long as you have free RAM. as soon as you run out, you'll be crawling.
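For anyone without disktest handy, here is a rough Python sketch of the same kind of measurement: one thread, synchronous 4K reads, page cache bypassed with O_DIRECT. It is not disktest and will not match its numbers exactly. The function name and parameters are made up for illustration, and it assumes Linux with a block size that is a multiple of the page size.

```python
import mmap
import os
import time


def measure_read_iops(path, block_size=4096, span_bytes=4096, seconds=1.0, direct=True):
    """Single-threaded, linear, synchronous reads over [0, span_bytes).

    Returns (iops, mean_latency_seconds). span_bytes should be a multiple of
    block_size; span_bytes == block_size re-reads the same block every time.
    """
    flags = os.O_RDONLY
    if direct:
        # O_DIRECT (Linux) bypasses the page cache; it needs aligned buffers.
        flags |= os.O_DIRECT
    fd = os.open(path, flags)
    # An anonymous mmap is page-aligned, which satisfies the alignment rule.
    buf = mmap.mmap(-1, block_size)
    transfers = 0
    offset = 0
    try:
        start = time.perf_counter()
        while True:
            if os.preadv(fd, [buf], offset) != block_size:
                raise OSError("short read; is span_bytes within the device?")
            transfers += 1
            offset = (offset + block_size) % span_bytes
            elapsed = time.perf_counter() - start
            if elapsed >= seconds:
                break
    finally:
        buf.close()
        os.close(fd)
    return transfers / elapsed, elapsed / transfers
```

Something like measure_read_iops("/dev/sde", seconds=15) would be the same-block case (run as root, reads only). Passing direct=False goes through the page cache instead, which is exactly the thing this thread is trying not to measure.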

no matter what card/interface/layer you are testing, if the seek pattern is Random, then the limits of 7200RPM, 10K, 15K will be revealed.

IIRC, a 7200 RPM disk with random IO across the whole disk should bring you down to about 100 IOPS.
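A back-of-envelope check on that figure: each random IO costs roughly one average seek plus half a rotation. The seek times below are assumed typical values, not measurements from this thread.

```python
def random_iops(rpm, avg_seek_ms):
    """Rough random IOPS for one spindle: average seek plus half a rotation."""
    rotational_ms = 0.5 * 60_000 / rpm   # half a revolution on average
    return 1000 / (avg_seek_ms + rotational_ms)


# Seek times here are assumptions for illustration only.
for rpm, seek_ms in [(7200, 8.5), (10000, 4.5), (15000, 3.5)]:
    print(f"{rpm} RPM, {seek_ms} ms seek: ~{random_iops(rpm, seek_ms):.0f} IOPS")
```

With an 8.5 ms full-stroke seek a 7200 RPM drive lands a bit under 80 IOPS. Shorter seeks, as when the data is confined to part of the platter, push it up past 100.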

off topic, but it's useful to note that higher density drives will reduce head-travel time, and random IOPS will go up. use good partitioning schemes (LVM2) to ensure that sets of data are not spread over the entire platter.

#6 SimonB

  • Member
  • Group: Member
  • Posts: 4
  • Joined: 09-September 07

Posted 09 September 2007 - 12:21 PM

benjamin9999, on Sep 5 2007, 10:32 AM, said:

IMO, this is really the most important measurement... layers built on top of a low-latency i/o path will have no problem getting high throughput with readahead, async-i/o, write-back etc. but only as long as you have free RAM. as soon as you run out, you'll be crawling.


Thanks for starting this thread - I thought I was going mad and have so far tried 4 different OSes - Centos 4.4, 4.5, openSUSE 10.2 and finally RHEL AS 4 update 5 (just to be sure) and two types of disk (Maxtor/Western Digital), as well as applying 3ware's tuning 'tweaks' and experimenting with LVM/noLVM, SMP/noSMP, RAID 1 and Single Disk modes all in an attempt to get a handle on why this 9550SX-8LP seems to cause a dual Opteron 2.4GHz with 4GB RAM to go into a responsiveness nosedive under intensive IO. Latest firmware and drivers in use.

I've just finished a whole load of vmstat runs of two timed dd commands (reading 3, 4, 6 and 20GB from /dev/sda, then writing 3, 4, 6 and 20GB from /dev/zero to a file on the 200GB+ / partition, for both SMP and non-SMP kernel 2.6.9-55.EL) and the results gave me the clue to Google up "3ware vmstats" and find your notes.

Rather than detail it all here, I've uploaded 4 PDFs of the graphs to one of my own sites where I'm keeping my own notes - should you want to take a look. I find the "blocks out" figures very interesting, coupled with the number of processes in uninterruptible sleep while the card processes what's been thrown at it. No wonder the machine hits a brick wall.

As to how to get around the problem, well that's an entirely different matter - I'm all out of ideas, frankly. Chucking around a 20G logfile from time to time is quite possible on a busy webserver, last thing I need is for everything else to stop for a daydream while it happens.

S.

This post has been edited by SimonB: 09 September 2007 - 12:23 PM


#7 benjamin9999

  • Member
  • Group: Member
  • Posts: 7
  • Joined: 31-August 07

Posted 10 September 2007 - 11:10 PM

SimonB, on Sep 9 2007, 01:21 PM, said:

Thanks for starting this thread - I thought I was going mad and have so far tried 4 different OSes - Centos 4.4, 4.5, openSUSE 10.2 and finally RHEL AS 4 update 5 (just to be sure) and two types of disk (Maxtor/Western Digital), as well as applying 3ware's tuning 'tweaks' and experimenting with


well, i have to back off this thread a bit. i showed this to another engineer familiar with these issues, and he pointed out that what i have "proven" with my data was not exactly as i thought. instead of "poor" performance from slow round-trip-to-controller, what i have actually found is that 3ware is not doing any hardware read-ahead, and doesn't appear to be doing any read-cache either.

since the read block size (4k) is smaller than the stripe, and single threaded, then we can't get any faster than a single spindle - without read-ahead. areca is definitely reading ahead in hardware (per the config on that setup). there is no clear way to configure this with 3ware - all of the cache config options are related to write-back/thru and sync operation.
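To illustrate the stripe argument: with sequential 4K reads, many consecutive requests land on the same disk before the next one is touched, so a single thread with no read-ahead only ever keeps one spindle busy. The sketch assumes a 64 KiB stripe unit, which the thread does not state, and ignores RAID5 parity rotation. The 13 data disks come from the 14-disk RAID5 sets described earlier.

```python
def data_disk_for(offset_bytes, stripe_unit=64 * 1024, data_disks=13):
    """Index of the data disk holding this byte (parity rotation ignored).

    stripe_unit is a hypothetical value; 13 data disks = 14-disk RAID5.
    """
    return (offset_bytes // stripe_unit) % data_disks


block = 4096
touched = [data_disk_for(i * block) for i in range(32)]
print(touched)
```

Under those assumptions the first 16 reads all hit disk 0 and the next 16 all hit disk 1, one request at a time.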

...and no read cache either? the single-block read test is also slow.

while i think the situation is not as bad as i first thought, there still seems to be some strange stuff going on here.

not sure what to get into testing yet, but i have not had time to mess with it this week either.

#8 jpiszcz

  • Member
  • Group: Member
  • Posts: 464
  • Joined: 15-January 06

Posted 11 September 2007 - 03:31 AM

Maybe it's your motherboard? Have you tried a different one?

Also what do you get for sequential read and write with dd?

What filesystems have you tried? Are you specifying the -R STRIDE size (with EXT2/3) or are you using a proper sunit/swidth with XFS?

I am currently writing/using my filesystem but even with that, it's not as slow as your benchmarks!

$ dd if=/dev/zero of=10gb bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 26.752 seconds, 401 MB/s

$ dd if=10gb of=/dev/null bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 20.4077 seconds, 526 MB/s

#9 benjamin9999

  • Member
  • Group: Member
  • Posts: 7
  • Joined: 31-August 07

Posted 11 September 2007 - 08:03 AM

jpiszcz, on Sep 11 2007, 04:31 AM, said:

Also what do you get for sequential read and write with dd?


my opinion would be that here you have already fallen into the 3ware trap

jpiszcz, on Sep 11 2007, 04:31 AM, said:

What filesystems have you tried? Are you specifying the -R STRIDE size (with EXT2/3) or are you using a proper sunit/swidth with XFS?


if you're working with a filesystem, then you'll be accessing the volume thru the pagecache, so now you are benchmarking the os(readahead+writeback)+ram(amount)+3ware. re-read first post...

jpiszcz, on Sep 11 2007, 04:31 AM, said:

I am currently writing/using my filesystem but even with that, it's not as slow as your benchmarks!

$ dd if=/dev/zero of=10gb bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 26.752 seconds, 401 MB/s


again, this is the classic test we want to avoid, but you did write 10GB which is significant. how much ram in this system and what raid config?

#10 jpiszcz

  • Member
  • Group: Member
  • Posts: 464
  • Joined: 15-January 06

Posted 11 September 2007 - 08:16 AM

benjamin9999, on Sep 11 2007, 09:03 AM, said:

again, this is the classic test we want to avoid, but you did write 10GB which is significant. how much ram in this system and what raid config?


8GB of ram but that is irrelevant:

$ /usr/bin/time dd if=/dev/zero of=file bs=1M
dd: writing `file': No space left on device
1070704+0 records in
1070703+0 records out
1122713473024 bytes (1.1 TB) copied, 2565.89 seconds, 438 MB/s

I am using Linux Software RAID5, no hardware raid here.
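A quick check of the arithmetic behind "8GB of ram but that is irrelevant", using the byte counts and times printed by dd above. dd reports decimal MB, and "8GB" is taken here to mean 8 GiB, which is an assumption.

```python
def mb_per_sec(total_bytes, seconds):
    """Throughput in decimal MB/s, the unit dd prints."""
    return total_bytes / seconds / 1_000_000


full_disk = mb_per_sec(1122713473024, 2565.89)   # the fill-the-volume run
ten_gb = mb_per_sec(10737418240, 26.752)         # the earlier 10GB write
ram_fraction = 8 * 1024**3 / 1122713473024       # assumes 8 GiB of RAM

print(f"full volume: {full_disk:.0f} MB/s, 10GB file: {ten_gb:.0f} MB/s")
print(f"RAM is {ram_fraction:.2%} of the bytes written")
```

RAM is under 1% of the data written in the full-volume run, so the page cache cannot be what is producing that sustained figure.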
