Too many years of awful 3ware performance.
Some data to demonstrate.

Group: Member, Posts: 7, Joined: 31-August 07
Posted 11 September 2007 - 09:16 AM
jpiszcz, on Sep 11 2007, 09:16 AM, said:
8GB of ram but that is irrelevant:
1122713473024 bytes (1.1 TB) copied, 2565.89 seconds, 438 MB/s
ok this is meaningful, and seems nice - but you're still doing some bulk transfers with a huge blocksize.
how are reads or cpu availability during this? poster above (SimonB) demonstrates that his system is useless during a scenario like this.
jpiszcz, on Sep 11 2007, 09:16 AM, said:
I am using Linux Software RAID5, no hardware raid here.
i think this is important to consider because the controller is likely in a "pass thru" state, so comparisons should be taken with a grain of salt.
i read someone's pages once where some extensive (with bonnie anyway) testing was done and demonstrated that linux software raid was faster than 3ware. unfortunately i can't come up with the URL at the moment, but i will try to track it down - you may have already read some similar info.
if your dd is new enough, try iflag=direct / oflag=direct to open the if/of with O_DIRECT. try 4k, 64k and 1M blocksizes. LTP's disktest is a much more convenient tool for the job though - i'm not sure if you can use O_DIRECT with /dev/null & /dev/zero ??
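A minimal sketch of the block-size sweep suggested above, assuming GNU dd, where O_DIRECT is requested with iflag=direct (reads) or oflag=direct (writes). The device path /dev/sdX and the 1 GiB total are placeholders, not values from the thread, and the script only builds the command lines; it does not run them.

```python
# Build dd read commands for several block sizes, keeping total bytes constant.
# Assumptions: GNU dd; /dev/sdX is a placeholder device, not one from the thread.

BLOCK_SIZES = [4 * 1024, 64 * 1024, 1024 * 1024]  # 4k, 64k, 1M


def dd_direct_read_cmd(device, block_size, total_bytes):
    """Return an argv list for an O_DIRECT sequential read of total_bytes."""
    if total_bytes % block_size:
        raise ValueError("total_bytes must be a multiple of block_size")
    return [
        "dd",
        f"if={device}",
        "of=/dev/null",
        f"bs={block_size}",
        f"count={total_bytes // block_size}",
        "iflag=direct",  # open the input file with O_DIRECT
    ]


if __name__ == "__main__":
    for bs in BLOCK_SIZES:
        print(" ".join(dd_direct_read_cmd("/dev/sdX", bs, 1024 ** 3)))
```

Keeping the total byte count fixed while varying bs makes the runs comparable: any difference in MB/s then comes from the request size rather than from the amount of data moved.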

Group: Member, Posts: 472, Joined: 15-January 06
Posted 11 September 2007 - 09:44 AM
benjamin9999, on Sep 11 2007, 10:16 AM, said:
jpiszcz, on Sep 11 2007, 09:16 AM, said:
8GB of ram but that is irrelevant:
1122713473024 bytes (1.1 TB) copied, 2565.89 seconds, 438 MB/s
ok this is meaningful, and seems nice - but you're still doing some bulk transfers with a huge blocksize.
how are reads or cpu availability during this? poster above (SimonB) demonstrates that his system is useless during a scenario like this.
It used to be around 15-30% with an E6300 on 1 core, but I have since upgraded to a Q6600, so it's largely irrelevant with 4 cores.
jpiszcz, on Sep 11 2007, 09:16 AM, said:
I am using Linux Software RAID5, no hardware raid here.
i think this is important to consider because the controller is likely in a "pass thru" state, so comparisons should be taken with a grain of salt.
i read someone's pages once where some extensive (with bonnie anyway) testing was done and demonstrated that linux software raid was faster than 3ware. unfortunately i can't come up with the URL at the moment, but i will try to track it down - you may have already read some similar info.
if your dd is new enough, try iflag=direct / oflag=direct to open the if/of with O_DIRECT. try 4k, 64k and 1M blocksizes. LTP's disktest is a much more convenient tool for the job though - i'm not sure if you can use O_DIRECT with /dev/null & /dev/zero ??
Hm, I understand the O_DIRECT suggestion, but I don't believe my applications do I/O that way, so why would I want to limit the test to that?

Group: Member, Posts: 1, Joined: 19-January 03
Posted 12 September 2007 - 05:00 PM
benjamin9999, on Aug 30 2007, 11:35 PM, said:
Performance has always been poor "feeling", on any card. Even when the 9500 killer card was new, and i had this setup with 8 disks, things were not great, despite benchmarks which demonstrate massive speeds.
So what is up, 3ware? The competing (Areca) card is faster by an order of magnitude, and the advice you offer for improving performance is to increase linux readahead value?
A colleague of mine who has agreed with me for years that "something was up" has always felt that 3ware on Windows does not have such problems, so that is another subject up for testing.
I completely agree that 3Ware performance is very poor on a modern Linux system.
(I think that the author of the 3Ware tuning documents realizes this. The 3Ware tuning documents carefully define "performance" to mean "sequential read throughput", and then tell you to increase Linux readahead. That works fine for some carefully-chosen benchmarks, but it paints a deceptive picture of the card's general performance. Performance under a general-purpose workload is very, very poor.)
I've been investigating the performance problem for a while now, and the bulk of evidence points to a problem with how completed I/O requests are passed back from the card's firmware to the Linux driver.
To locate the bottleneck, I instrumented the 2.6 Linux kernel to capture data about filesystem read/write activity. I collected the data using blktrace and postprocessed it with a Python program that tracked the state of each I/O request through the kernel and 3Ware driver.
To collect my data, I ran 24 iozone processes, each writing 8K chunks in O_SYNC mode to a unique file on a RAID-10 jfs filesystem served by a single 3Ware 9550SX-8LP card. This models something that postgres does quite frequently. Since each write is synchronous, the Linux buffer cache is out of the picture, making it much easier to understand what's happening at the device level.
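For readers who want to reproduce a workload of this shape without iozone, here is a rough sketch, not the poster's actual harness. It uses threads rather than separate processes, and the names sync_writer and run_writers, the file names, and the default counts are all made up for illustration. Each worker opens its own file with O_SYNC and issues 8K writes back to back.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

CHUNK = 8 * 1024  # 8K writes, as described in the post


def sync_writer(path, n_writes, chunk=CHUNK):
    """Write n_writes chunks to path with O_SYNC; return per-write latencies (s)."""
    buf = b"\0" * chunk
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    latencies = []
    try:
        for _ in range(n_writes):
            t0 = time.perf_counter()
            written = os.write(fd, buf)  # returns only after the data is on stable storage
            latencies.append(time.perf_counter() - t0)
            if written != chunk:
                raise OSError("short write")
    finally:
        os.close(fd)
    return latencies


def run_writers(directory, n_writers=24, n_writes=100):
    """Run n_writers concurrent O_SYNC writers; return (aggregate IOPS, latencies)."""
    paths = [os.path.join(directory, f"w{i}.dat") for i in range(n_writers)]
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_writers) as pool:
        results = list(pool.map(lambda p: sync_writer(p, n_writes), paths))
    elapsed = time.perf_counter() - t0
    total_writes = sum(len(r) for r in results)
    return total_writes / elapsed, results
```

Because every writer has at most one write in flight, the aggregate IOPS is bounded by the number of writers divided by the per-write latency, which is what makes a slow completion path so visible with this pattern.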
The Linux kernel puts a block queue in front of the 3Ware driver. Linux queues have I/O schedulers that can implement sophisticated strategies for holding, splitting, and merging requests. For this test, I used the "noop" scheduler; it just inserts requests onto the queue as quickly as they arrive and does no other processing on them. Requests pass straight into the 3Ware driver as quickly as its queuecommand function will accept them.
Plotting the collected trace data makes the problem clearly visible. I'd expect the steady state to show an empty block queue, with ~24 active writes within the driver at all times. (It takes very little time for a Linux process to get a return code back from a write syscall and initiate another write request.) I'd expect to see a steady stream of completed write responses from the driver.
But this isn't what I see. Instead, write requests flow through the kernel's block queue and into the 3Ware driver, which returns them in batches at a rate of about 10Hz.
My gut tells me that this is closely related to the performance problem with the 3Ware controller: If it's presented with more than a small number of concurrent write requests, it appears to switch into a "batch mode" where it buffers completed write responses, returning them in batches at a glacially slow 10Hz rate. (Something similar may be true for reading; I haven't investigated that.)
I've been passing my findings along to 3Ware, but I'm still waiting for their response.
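The batching described above can be checked numerically once completion timestamps have been extracted from a trace. The sketch below is not the poster's postprocessing program; it assumes the timestamps are already available as a list of seconds, and the 20 ms idle-gap threshold and the function names are arbitrary choices for illustration. The demo data is synthetic.

```python
def split_batches(completion_times, gap=0.02):
    """Group timestamps into batches separated by idle gaps longer than gap seconds."""
    batches = []
    for t in sorted(completion_times):
        if batches and t - batches[-1][-1] <= gap:
            batches[-1].append(t)
        else:
            batches.append([t])
    return batches


def batch_rate_hz(batches):
    """Average number of batches per second, measured between batch start times."""
    if len(batches) < 2:
        return None
    span = batches[-1][0] - batches[0][0]
    return (len(batches) - 1) / span


if __name__ == "__main__":
    # Synthetic trace: 10 bursts 100 ms apart, 24 completions per burst.
    times = [k * 0.1 + j * 0.00004 for k in range(10) for j in range(24)]
    batches = split_batches(times)
    print(len(batches), "batches, about", round(batch_rate_hz(batches), 2), "Hz")
```

A healthy completion stream would show up as one long batch (or many tiny ones at a high rate), whereas the behaviour described in the post would show a few large batches at roughly 10 per second.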

Group: Member, Posts: 7, Joined: 31-August 07
Posted 17 September 2007 - 09:10 AM
jeffstearns, on Sep 12 2007, 06:00 PM, said:
I completely agree that 3Ware performance is very poor on a modern Linux system.
To collect my data, I ran 24 iozone processes, each writing 8K chunks in O_SYNC mode to a unique file on a RAID-10 jfs filesystem served by a single 3Ware 9550SX-8LP card. This models something that postgres does quite frequently. Since each write is synchronous, the Linux buffer cache is out of the picture, making it much easier to understand what's happening at the device level.
very interesting stuff. until now i had only tested reads because i was comparing with other hardware - and since reads are non-destructive i could do comparisons on other boxes i have which are in-use.
why O_SYNC and not O_DIRECT?
it would be interesting to see the instrumentation done with ltp's disktest using the -I BD bio api so that the pagecache is bypassed entirely (along with the filesystem)... quick testing with 24 threads of writing on my 9650se raid5 (same config listed above) and this bottleneck you describe doesn't seem to manifest itself.
root@storage2 ~ # disktest -B 8k -I BD -K 24 -p l -P A -w -T 120 -N 6695042048 /dev/sde
| 2007/09/17-10:04:33 | START | 18026 | v1.2.8 | /dev/sde | Start args: -B 8k -I BD -K 24 -p l -P A -w -T 120 -N 6695042048 (-c) (-p u)
| 2007/09/17-10:04:33 | INFO | 18026 | v1.2.8 | /dev/sde | Starting pass
| 2007/09/17-10:06:32 | STAT | 18026 | v1.2.8 | /dev/sde | 17459126272 bytes written in 2131241 transfers.
| 2007/09/17-10:06:32 | STAT | 18026 | v1.2.8 | /dev/sde | Write throughput: 147958697.2B/s (141.10MB/s), IOPS 18061.4/s.
| 2007/09/17-10:06:32 | STAT | 18026 | v1.2.8 | /dev/sde | Write Time: 118 seconds (0h1m58s)
| 2007/09/17-10:06:32 | STAT | 18026 | v1.2.8 | /dev/sde | Total bytes written in 2131241 transfers: 17459126272
| 2007/09/17-10:06:32 | STAT | 18026 | v1.2.8 | /dev/sde | Total write throughput: 147958697.2B/s (141.10MB/s), IOPS 18061.4/s.
| 2007/09/17-10:06:32 | STAT | 18026 | v1.2.8 | /dev/sde | Total Write Time: 118 seconds (0d0h1m58s)
| 2007/09/17-10:06:32 | STAT | 18026 | v1.2.8 | /dev/sde | Total overall runtime: 120 seconds (0d0h2m0s)
| 2007/09/17-10:06:32 | END | 18026 | v1.2.8 | /dev/sde | Test Done (Passed)
here is 8k writes, 24 threads, w/BD... ~18,000 IOPS, and during this, about 18,000 IRQ are generated, so the results-coalescence you describe doesn't appear to occur here.
noop scheduler is in effect here, same results for deadline.
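As a sanity check on the figures above, the sketch below recomputes the disktest numbers and contrasts them with what a 10 Hz completion batch would allow. Two things here are assumptions rather than statements from the thread: the ~18,000 IRQ figure is read as a per-second rate, and the 240 IOPS ceiling is an inference from "24 synchronous writers, one completion batch every 100 ms".

```python
TRANSFERS = 2131241   # from the disktest STAT line
SECONDS = 118         # reported write time
BLOCK = 8 * 1024      # -B 8k


def completions_per_interrupt(iops, irqs_per_sec):
    """Average number of completed I/Os delivered per interrupt."""
    return iops / irqs_per_sec


def batched_iops_ceiling(outstanding, batch_hz):
    """Upper bound on IOPS if each outstanding request completes once per batch."""
    return outstanding * batch_hz


if __name__ == "__main__":
    iops = TRANSFERS / SECONDS
    mib_per_s = TRANSFERS * BLOCK / SECONDS / 2 ** 20
    print(f"IOPS: {iops:.1f}, throughput: {mib_per_s:.2f} MiB/s")
    # Assumed: ~18,000 interrupts per second during the run.
    print("completions per IRQ:", round(completions_per_interrupt(iops, 18000), 2))
    print("ceiling with 24 writers at 10 Hz:", batched_iops_ceiling(24, 10), "IOPS")
```

Roughly one completion per interrupt is what an uncoalesced completion path looks like, and it is about 75 times the rate that 24 writers could reach if their completions were only delivered ten times per second.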

Group: Member, Posts: 4, Joined: 09-September 07
Posted 19 September 2007 - 03:23 AM
jeffstearns, on Sep 12 2007, 10:00 PM, said:
But this isn't what I see. Instead, write requests flow through the kernel's block queue and into the 3Ware driver, which returns them in batches at a rate of about 10Hz.
My gut tells me that this is closely related to the performance problem with the 3Ware controller: If it's presented with more than a small number of concurrent write requests, it appears to switch into a "batch mode" where it buffers completed write responses, returning them in batches at a glacially slow 10Hz rate. (Something similar may be true for reading; I haven't investigated that.)
I've been passing my findings along to 3Ware, but I'm still waiting for their response.
This is very interesting, thanks for posting it. I don't have the knowledge to do the sort of investigation you've done; I'm limited to running various benchmarking programs and attempting to understand the results.
Since in other places I've been asked "Why don't you try RAID-10?", I decided to experiment with different types of RAID setup to see what difference it might make. As mentioned above, I initially saw this problem with a simple RAID 1 config. The easiest alternative to try first was to convert my two hot spares into a RAID 0 array and create a default ext3 partition upon which to experiment. (Edit: in the results below, the dd tests were done prior to making the filesystem and running the bonnie++ tests.)
Here are the results of some benchmarks.
The machine is a dual Opteron 2.4GHz and has 4GB RAM installed, with a 9550SX-8LP hosting four Seagate ST3250820SV drives.
First, with no 3ware-recommended kernel tweaks applied. (On CentOS 4.5, blockdev readahead and nr_requests default to 256 and 8192. It's a straightforward minimal install, with 9550SX firmware and driver from the same 3ware codeset (9.4.1.2); the driver is built into the 2.6.9-55.EL kernel in the distro.)
Readahead = 256
nr_requests = 8192
Reading: dd if=/dev/sdb of=/dev/null bs=1M count=XXXX
dd read 1024 MB at 123.82 MB/s in 8.27 secs
dd read 2048 MB at 125.80 MB/s in 16.28 secs
dd read 4096 MB at 125.22 MB/s in 32.71 secs
dd read 8192 MB at 125.18 MB/s in 65.44 secs
Writing: dd if=/dev/zero of=/dev/sdb bs=1M count=XXXX
dd write 1024 MB at 97.15 MB/s in 10.54 secs
dd write 2048 MB at 93.60 MB/s in 21.88 secs
dd write 4096 MB at 90.22 MB/s in 45.40 secs
dd write 8192 MB at 91.63 MB/s in 89.40 secs
Label: RA-256_NR-8192
bonnie++ -m RA-256_NR-8192 -n 0 -u 0 -r 512 -s 20480 -f -b
Version 1.03        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RA-256_NR-8192  20G           50014  18 44715  12           126556 14 137.0   0
Then I applied the 3ware tweaks (readahead 16384, nr_requests 512) and re-tested:
Readahead = 16384
nr_requests = 512
Reading: dd if=/dev/sdb of=/dev/null bs=1M count=XXXX
dd read 1024 MB at 149.71 MB/s in 6.84 secs
dd read 2048 MB at 151.70 MB/s in 13.50 secs
dd read 4096 MB at 152.15 MB/s in 26.92 secs
dd read 8192 MB at 153.61 MB/s in 53.33 secs
Writing: dd if=/dev/zero of=/dev/sdb bs=1M count=XXXX
dd write 1024 MB at 88.28 MB/s in 11.60 secs
dd write 2048 MB at 89.90 MB/s in 22.78 secs
dd write 4096 MB at 89.51 MB/s in 45.76 secs
dd write 8192 MB at 87.89 MB/s in 93.21 secs
Label: RA-16384_NR-512
bonnie++ -m RA-16384_NR-512 -n 0 -u 0 -r 512 -s 20480 -f -b
Version 1.03        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RA-16384_NR-512 20G           62753  22 55895  16           150694 18 127.3   0
Next, I left the readahead at 16384 and returned nr_requests to its 8192 default and re-tested again:
Readahead = 16384
nr_requests = 8192
Reading: dd if=/dev/sdb of=/dev/null bs=1M count=XXXX
dd read 1024 MB at 149.71 MB/s in 6.84 secs
dd read 2048 MB at 152.72 MB/s in 13.41 secs
dd read 4096 MB at 153.35 MB/s in 26.71 secs
dd read 8192 MB at 153.15 MB/s in 53.49 secs
Writing: dd if=/dev/zero of=/dev/sdb bs=1M count=XXXX
dd write 1024 MB at 97.15 MB/s in 10.54 secs
dd write 2048 MB at 97.06 MB/s in 21.10 secs
dd write 4096 MB at 93.45 MB/s in 43.83 secs
dd write 8192 MB at 91.90 MB/s in 89.14 secs
Label: RA-16384_NR-8192
bonnie++ -m RA-16384_NR-8192 -n 0 -u 0 -r 512 -s 20480 -f -b
Version 1.03        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RA-16384_NR-819 20G           57576  20 55535  16           151212 18 126.4   0
Finally, for completeness, I set readahead back to 256 and changed nr_requests to 512:
Readahead = 256
nr_requests = 512
Reading: dd if=/dev/sdb of=/dev/null bs=1M count=XXXX
dd read 1024 MB at 123.97 MB/s in 8.26 secs
dd read 2048 MB at 125.26 MB/s in 16.35 secs
dd read 4096 MB at 125.37 MB/s in 32.67 secs
dd read 8192 MB at 125.09 MB/s in 65.49 secs
Writing: dd if=/dev/zero of=/dev/sdb bs=1M count=XXXX
dd write 1024 MB at 91.43 MB/s in 11.20 secs
dd write 2048 MB at 90.14 MB/s in 22.72 secs
dd write 4096 MB at 89.55 MB/s in 45.74 secs
dd write 8192 MB at 92.17 MB/s in 88.88 secs
Label: RA-256_NR-512
bonnie++ -m RA-256_NR-512 -n 0 -u 0 -r 512 -s 20480 -f -b
Version 1.03        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RA-256_NR-512   20G           61808  22 46250  12           126959 14 137.1   0
So, the 3ware tweak readahead = 16384 improves read throughput, but their recommended nr_requests = 512 reduces write throughput. Neither of these params appears to have any impact whatsoever on the underlying problem of sluggish system response, however. Frankly, I didn't expect them to.
I hope 3ware get back to you about your findings - so far I've not had much success in getting a response out of them. Please keep us posted.
S.
This post has been edited by SimonB: 19 September 2007 - 03:26 AM
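To make the comparison in the post above easier to read, here is a small sketch that parses the "dd read ... / dd write ..." summary lines and averages the rates per configuration. Only two of the four configurations are pasted in to keep it short; the regular expression and the function name mean_rates are illustrative choices, and the line format is assumed to be exactly as shown in the post.

```python
import re
from statistics import mean

LINE = re.compile(r"dd (read|write) (\d+) MB at ([\d.]+) MB/s in ([\d.]+) secs")

# Summary lines copied from the post, keyed by the poster's labels.
RESULTS = {
    "RA-256_NR-8192": """
dd read 1024 MB at 123.82 MB/s in 8.27 secs
dd read 2048 MB at 125.80 MB/s in 16.28 secs
dd read 4096 MB at 125.22 MB/s in 32.71 secs
dd read 8192 MB at 125.18 MB/s in 65.44 secs
dd write 1024 MB at 97.15 MB/s in 10.54 secs
dd write 2048 MB at 93.60 MB/s in 21.88 secs
dd write 4096 MB at 90.22 MB/s in 45.40 secs
dd write 8192 MB at 91.63 MB/s in 89.40 secs
""",
    "RA-16384_NR-512": """
dd read 1024 MB at 149.71 MB/s in 6.84 secs
dd read 2048 MB at 151.70 MB/s in 13.50 secs
dd read 4096 MB at 152.15 MB/s in 26.92 secs
dd read 8192 MB at 153.61 MB/s in 53.33 secs
dd write 1024 MB at 88.28 MB/s in 11.60 secs
dd write 2048 MB at 89.90 MB/s in 22.78 secs
dd write 4096 MB at 89.51 MB/s in 45.76 secs
dd write 8192 MB at 87.89 MB/s in 93.21 secs
""",
}


def mean_rates(text):
    """Return the mean MB/s for reads and writes found in text."""
    rates = {"read": [], "write": []}
    for match in LINE.finditer(text):
        rates[match.group(1)].append(float(match.group(3)))
    return {op: mean(values) for op, values in rates.items() if values}


if __name__ == "__main__":
    for label, text in RESULTS.items():
        averages = mean_rates(text)
        print(f"{label}: read {averages['read']:.1f} MB/s, write {averages['write']:.1f} MB/s")
```

Averaged this way, the two pasted configurations differ by roughly 27 MB/s on reads and a few MB/s on writes, in line with the poster's summary; neither number says anything about responsiveness under load, which is the actual complaint.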

Group: Member, Posts: 79, Joined: 26-October 02
Posted 24 January 2008 - 07:10 AM
Anything more on this front?
- Chris

Group: Member, Posts: 1, Joined: 05-February 08
Posted 05 February 2008 - 04:09 PM
This is a great thread - I've been seeing some horrible performance on some of my boxes with 9550sx's.
One of them is currently in a jbod configuration with software raid, which is much faster and more responsive than the controller.
I've been tempted on the advice of others to get an Areca to see what benefits I can get.
In my case, I couldn't get the card to complete more than 400 I/Os per second aggregate (never more than 120 in a single process). The test seeks randomly and reads 8k, mimicking PostgreSQL's I/O pattern. For comparison, I get thousands on another box with an HP P600 controller and a pile of SAS disks. During a lot of those runs the system gets very laggy and load goes up.
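A rough sketch of the "seek randomly, read 8k" pattern described above, for anyone who wants a comparable number from their own box. This is not the poster's test program: the function name, the default read count and the fixed seed are illustrative. It uses ordinary buffered reads, so on a small or recently read file the page cache will inflate the result; a target much larger than RAM is needed for a meaningful figure.

```python
import os
import random
import time


def random_read_iops(path, n_reads=1000, block=8192, seed=0):
    """Issue n_reads block-aligned random reads against path; return reads per second."""
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # also works for block devices
        n_blocks = size // block
        if n_blocks == 0:
            raise ValueError("target is smaller than one block")
        t0 = time.perf_counter()
        for _ in range(n_reads):
            offset = rng.randrange(n_blocks) * block
            data = os.pread(fd, block, offset)
            if len(data) != block:
                raise OSError("short read")
        elapsed = time.perf_counter() - t0
    finally:
        os.close(fd)
    return n_reads / elapsed
```

Running one copy per process and summing the results gives the aggregate figure; a single copy corresponds to the per-process number quoted in the post.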
So we'll see what happens.

Group: Member, Posts: 1, Joined: 22-September 03
Posted 10 February 2008 - 02:57 PM
Is it just me or does it appear the 3ware 9.4.2 firmware release has resolved these problems?

Group: Member, Posts: 3, Joined: 06-December 07
Posted 19 February 2008 - 03:17 AM
Same here: slow random I/O (using a database).
I published an overview and I'm in the process of refreshing all that information after completing a bunch of tests with a 9650, for both the controller-implemented RAID (mostly RAID10) and the Linux 'md' one.
If you want me to run a given test please describe it to me, along with the reason why it is pertinent.

Group: Member, Posts: 425, Joined: 18-March 07
Posted 19 February 2008 - 06:10 AM
natmaka, on Feb 19 2008, 08:17 AM, said:
Same here: slow random I/O (using a database).
I published an overview and I'm in the process of refreshing all that information after completing a bunch of tests with a 9650, for both the controller-implemented RAID (mostly RAID10) and the Linux 'md' one.
If you want me to run a given test please describe it to me, along with the reason why it is pertinent.
Some great, thorough work and research you've done. It pretty much sums up all the info and tweaks I've ever tried. There is one thing that puzzled me a bit about your tweaking. If you are dealing with small 64k reads a majority of the time, isn't setting read-ahead counter-productive? After all, if the reads are random, virtually nothing it reads ahead would be useful, would it? To that end, perhaps setting a large read-ahead with blockdev --setra is not a good idea, and you might even try turning off read-ahead on your controller (although if it's adaptive read-ahead, it should learn fairly quickly that reading ahead isn't useful and not do it).
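To put numbers on the read-ahead concern, recall that blockdev --setra and --getra count 512-byte sectors. The sketch below converts the two settings mentioned earlier in the thread (256 and 16384) to bytes and compares them with a 64k request. The ratio is only an upper bound on wasted I/O, assuming a full read-ahead window is fetched for every random read; the kernel's read-ahead logic is adaptive and may fetch far less.

```python
SECTOR = 512  # blockdev --setra / --getra are expressed in 512-byte sectors


def readahead_bytes(sectors):
    """Convert a blockdev read-ahead value to bytes."""
    return sectors * SECTOR


def readahead_to_request_ratio(sectors, request_bytes):
    """How many times larger the read-ahead window is than a single request."""
    return readahead_bytes(sectors) / request_bytes


if __name__ == "__main__":
    for ra in (256, 16384):
        window = readahead_bytes(ra)
        ratio = readahead_to_request_ratio(ra, 64 * 1024)
        print(f"setra {ra}: {window // 1024} KiB window, {ratio:g}x a 64k read")
```

With the 16384 setting the window is 8 MiB, so a workload dominated by random 64k reads gains nothing from it and may pay for it if the window is actually fetched.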