Storage Forums: Too many years of awful 3ware performance. - Storage Forums


Too many years of awful 3ware performance. Some data to demonstrate.

#11 benjamin9999

  • Member
  • Group: Member
  • Posts: 7
  • Joined: 31-August 07

Posted 11 September 2007 - 09:16 AM

jpiszcz, on Sep 11 2007, 09:16 AM, said:

8GB of ram but that is irrelevant:

1122713473024 bytes (1.1 TB) copied, 2565.89 seconds, 438 MB/s


ok this is meaningful, and seems nice - but you're still doing some bulk transfers with a huge blocksize.

how are reads or cpu availability during this? poster above (SimonB) demonstrates that his system is useless during a scenario like this.


jpiszcz, on Sep 11 2007, 09:16 AM, said:

I am using Linux Software RAID5, no hardware raid here.


i think this is important to consider because the controller is likely in a "pass thru" state, so comparisons should be taken with a grain of salt.

i read someone's pages once where some extensive (with bonnie anyway) testing was done and demonstrated that linux software raid was faster than 3ware. unfortunately i can't come up with the URL at the moment, but i will try to track it down - you may have already read some similar info.

if your dd is new enough, try iflag=direct / oflag=direct (dd has no conv=direct) to open the if/of with O_DIRECT. try 4k, 64k and 1M blocksizes. LTP's disktest is a much more convenient tool for the job though - i'm not sure if you can use O_DIRECT with /dev/null & /dev/zero ??
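
For anyone who would rather script the blocksize comparison than retype dd lines, here is a rough Python sketch of the same idea: time sequential writes at a given blocksize, optionally opening the target with O_DIRECT. The name `timed_writes` is made up for illustration, and this is not a replacement for dd or disktest. It assumes Linux; O_DIRECT needs an aligned buffer (an anonymous mmap is page-aligned), and some filesystems such as tmpfs reject O_DIRECT outright.

```python
import mmap
import os
import time


def timed_writes(path, block_size, count, direct=False):
    """Write `count` blocks of `block_size` bytes to `path`; return MB/s (10^6 bytes)."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if direct:
        # O_DIRECT bypasses the page cache (Linux only).
        flags |= os.O_DIRECT
    # An anonymous mmap is page-aligned, which O_DIRECT requires.
    buf = mmap.mmap(-1, block_size)
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, buf)
        os.fsync(fd)  # make the buffered (non-direct) case comparable
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        buf.close()
    return block_size * count / elapsed / 1e6
```

Calling it in a loop over 4096, 65536 and 1048576 with `direct=True` against a scratch file on the array gives the three data points suggested above; with `direct=False` the page cache is back in the picture, which is the difference being argued about.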





#12 jpiszcz

  • Member
  • Group: Member
  • Posts: 472
  • Joined: 15-January 06

Posted 11 September 2007 - 09:44 AM

benjamin9999, on Sep 11 2007, 10:16 AM, said:

jpiszcz, on Sep 11 2007, 09:16 AM, said:

8GB of ram but that is irrelevant:

1122713473024 bytes (1.1 TB) copied, 2565.89 seconds, 438 MB/s


ok this is meaningful, and seems nice - but you're still doing some bulk transfers with a huge blocksize.

how are reads or cpu availability during this? poster above (SimonB) demonstrates that his system is useless during a scenario like this.

It used to be around 15-30% with an E6300 on 1 core, but I have since upgraded to a Q6600, so it's largely irrelevant with 4 cores.

jpiszcz, on Sep 11 2007, 09:16 AM, said:

I am using Linux Software RAID5, no hardware raid here.


i think this is important to consider because the controller is likely in a "pass thru" state, so comparisons should be taken with a grain of salt.

i read someone's pages once where some extensive (with bonnie anyway) testing was done and demonstrated that linux software raid was faster than 3ware. unfortunately i can't come up with the URL at the moment, but i will try to track it down - you may have already read some similar info.

if your dd is new enough, try iflag=direct / oflag=direct (dd has no conv=direct) to open the if/of with O_DIRECT. try 4k, 64k and 1M blocksizes. LTP's disktest is a much more convenient tool for the job though - i'm not sure if you can use O_DIRECT with /dev/null & /dev/zero ??


Hm I understand the direct etc but I don't believe I run my applications in that way so why would I want to limit the test to that?





#13 jeffstearns

  • Member
  • Group: Member
  • Posts: 1
  • Joined: 19-January 03

Posted 12 September 2007 - 05:00 PM

benjamin9999, on Aug 30 2007, 11:35 PM, said:

Performance has always been poor "feeling", on any card. Even when the 9500 killer card was new, and i had this setup with 8 disks, things were not great, despite benchmarks which demonstrate massive speeds.

So what is up, 3ware? The competing (Areca) card is faster by an order of magnitude, and the advice you offer for improving performance is to increase the linux readahead value?

A colleague of mine who has agreed with me for years that "something was up" has always felt that 3ware on Windows does not have such problems, so that is another subject up for testing.


I completely agree that 3Ware performance is very poor on a modern Linux system.

(I think that the author of the 3Ware tuning documents realizes this. The 3Ware tuning documents carefully define "performance" to mean "sequential read throughput", and then tell you to increase Linux readahead. That works fine for some carefully-chosen benchmarks, but it paints a deceptive picture of the card's general performance. Performance under a general-purpose workload is very, very poor.)

I've been investigating the performance problem for a while now, and the bulk of evidence points to a problem with how completed I/O requests are passed back from the card's firmware to the Linux driver.

To locate the bottleneck, I instrumented the 2.6 Linux kernel to capture data about filesystem read/write activity. I collected the data using blktrace and postprocessed it with a Python program that tracked the state of each I/O request through the kernel and 3Ware driver.

To collect my data, I ran 24 iozone processes, each writing 8K chunks in O_SYNC mode to a unique file on a RAID-10 jfs filesystem served by a single 3Ware 9550SX-8LP card. This models something that postgres does quite frequently. Since each write is synchronous, the Linux buffer cache is out of the picture, making it much easier to understand what's happening at the device level.
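
For readers who want to reproduce a similar load without iozone, here is a minimal Python sketch of the same write pattern: N concurrent writers, each appending 8K chunks to its own file opened with O_SYNC. This is not the iozone invocation used for the measurements; the function names are hypothetical, and it uses threads rather than separate processes (CPython releases the GIL during the write call, so the writes still overlap).

```python
import os
import threading
import time


def sync_writer(path, chunk, n_writes, latencies):
    """Append `n_writes` chunks to `path` with O_SYNC, recording each write's latency."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    try:
        for _ in range(n_writes):
            t0 = time.perf_counter()
            os.write(fd, chunk)  # returns only after the data has been pushed to storage
            latencies.append(time.perf_counter() - t0)
    finally:
        os.close(fd)


def run_writers(directory, n_writers=24, chunk_size=8192, n_writes=100):
    """Run `n_writers` concurrent synchronous writers, one file each; return all latencies."""
    chunk = b"\0" * chunk_size
    latencies = []  # list.append is thread-safe in CPython
    threads = [
        threading.Thread(
            target=sync_writer,
            args=(os.path.join(directory, f"writer{i}.dat"), chunk, n_writes, latencies),
        )
        for i in range(n_writers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return latencies
```

Pointing `directory` at a filesystem on the controller and looking at the spread of the returned latencies is a crude way to see whether individual synchronous writes stall.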

The Linux kernel puts a block queue in front of the 3Ware driver. Linux queues have I/O schedulers that can implement sophisticated strategies for holding, splitting, and merging requests. For this test, I used the "noop" scheduler; it just inserts requests onto the queue as quickly as they arrive and does no other processing on them. Requests pass straight into the 3Ware driver as quickly as its queuecommand function will accept them.
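
As a side note on checking which scheduler is in effect: on 2.6-era kernels the file /sys/block/<dev>/queue/scheduler lists the available schedulers with the active one in square brackets. A small Python sketch for reading it follows; the helper names are made up, and the device name is whatever your array appears as.

```python
def active_scheduler(sysfs_text):
    """Return the scheduler name shown in [brackets], or None if none is marked."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_scheduler(device):
    """Read the active scheduler for a block device name such as 'sdb'."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return active_scheduler(f.read())
```

Writing a scheduler name (for example `noop`) into the same file as root selects it for that device.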

By plotting the collected trace data, the problem is clearly visible. I'd expect the steady state to show an empty block queue, with ~24 active writes within the driver at all times. (It takes very little time for a Linux process to get a return code back from a write syscall and initiate another write request.) I'd expect to see a steady stream of completed write responses from the driver.

But this isn't what I see. Instead, write requests flow through the kernel's block queue and into the 3Ware driver, which returns them in batches at a rate of about 10Hz.

My gut tells me that this is closely related to the performance problem with the 3Ware controller: If it's presented with more than a small number of concurrent write requests, it appears to switch into a "batch mode" where it buffers completed write responses, returning them in batches at a glacially slow 10Hz rate. (Something similar may be true for reading; I haven't investigated that.)
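
To make the "batches at about 10Hz" observation concrete, here is a hedged Python sketch of one way such a pattern could be pulled out of completion timestamps. It is not the post-processing program described above; the function names and the 20 ms gap threshold are assumptions chosen for illustration. It groups completions into bursts and counts bursts per second.

```python
def split_into_batches(completion_times, gap=0.02):
    """Group completion timestamps (seconds) into bursts separated by more than `gap`."""
    batches = []
    for t in sorted(completion_times):
        if batches and t - batches[-1][-1] <= gap:
            batches[-1].append(t)
        else:
            batches.append([t])
    return batches


def batch_rate_hz(completion_times, gap=0.02):
    """Estimate how many bursts per second the completions arrive in."""
    batches = split_into_batches(completion_times, gap)
    if len(batches) < 2:
        return 0.0  # a single continuous stream has no burst rate
    span = batches[-1][0] - batches[0][0]
    return (len(batches) - 1) / span
```

A healthy steady stream of completions collapses into one long "batch" and reports 0.0, while 24 completions arriving together every 100 ms would report roughly 10.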

I've been passing my findings along to 3Ware, but I'm still waiting for their response.





#14 benjamin9999

  • Member
  • Group: Member
  • Posts: 7
  • Joined: 31-August 07

Posted 17 September 2007 - 09:10 AM

jeffstearns, on Sep 12 2007, 06:00 PM, said:

I completely agree that 3Ware performance is very poor on a modern Linux system.

To collect my data, I ran 24 iozone processes, each writing 8K chunks in O_SYNC mode to a unique file on a RAID-10 jfs filesystem served by a single 3Ware 9550SX-8LP card. This models something that postgres does quite frequently. Since each write is synchronous, the Linux buffer cache is out of the picture, making it much easier to understand what's happening at the device level.


very interesting stuff. until now i had only tested reads because i was comparing with other hardware - and since reads are non-destructive i could do comparisons on other boxes i have which are in-use.

why O_SYNC and not O_DIRECT?

it would be interesting to see the instrumentation done with ltp's disktest using the -I BD bio api so that the pagecache is bypassed entirely (along with the filesystem)... quick testing with 24 threads of writing on my 9650se raid5 (same config listed above) and this bottleneck you describe doesn't seem to manifest itself.

root@storage2 ~ # disktest -B 8k -I BD -K 24 -p l -P A -w -T 120 -N 6695042048 /dev/sde
| 2007/09/17-10:04:33 | START | 18026 | v1.2.8 | /dev/sde | Start args: -B 8k -I BD -K 24 -p l -P A -w -T 120 -N 6695042048 (-c) (-p u)
| 2007/09/17-10:04:33 | INFO  | 18026 | v1.2.8 | /dev/sde | Starting pass
| 2007/09/17-10:06:32 | STAT  | 18026 | v1.2.8 | /dev/sde | 17459126272 bytes written in 2131241 transfers.
| 2007/09/17-10:06:32 | STAT  | 18026 | v1.2.8 | /dev/sde | Write throughput: 147958697.2B/s (141.10MB/s), IOPS 18061.4/s.
| 2007/09/17-10:06:32 | STAT  | 18026 | v1.2.8 | /dev/sde | Write Time: 118 seconds (0h1m58s)
| 2007/09/17-10:06:32 | STAT  | 18026 | v1.2.8 | /dev/sde | Total bytes written in 2131241 transfers: 17459126272
| 2007/09/17-10:06:32 | STAT  | 18026 | v1.2.8 | /dev/sde | Total write throughput: 147958697.2B/s (141.10MB/s), IOPS 18061.4/s.
| 2007/09/17-10:06:32 | STAT  | 18026 | v1.2.8 | /dev/sde | Total Write Time: 118 seconds (0d0h1m58s)
| 2007/09/17-10:06:32 | STAT  | 18026 | v1.2.8 | /dev/sde | Total overall runtime: 120 seconds (0d0h2m0s)
| 2007/09/17-10:06:32 | END   | 18026 | v1.2.8 | /dev/sde | Test Done (Passed)


here is 8k writes, 24 threads, w/BD... ~18,000 IOPS, and during this, about 18,000 IRQ are generated, so the results-coalescence you describe doesn't appear to occur here.

noop scheduler is in effect here, same results for deadline.
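
A quick sanity check on the disktest figures above, as a small Python sketch (the function name is made up): the totals work out to exactly 8 KiB per transfer, and the reported "MB/s" is bytes per second divided by 1024*1024.

```python
def disktest_summary(total_bytes, transfers, seconds):
    """Derive per-transfer size, IOPS and throughput from disktest totals."""
    return {
        "bytes_per_transfer": total_bytes / transfers,
        "iops": transfers / seconds,
        "mib_per_sec": total_bytes / seconds / (1024 * 1024),
    }


# Totals copied from the STAT lines of the run above (118 s write time).
summary = disktest_summary(17459126272, 2131241, 118)
print(summary)
```

With one completion interrupt per request, the IOPS figure and the IRQ rate should be about the same, which is the comparison being made here.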





#15 SimonB

  • Member
  • Group: Member
  • Posts: 4
  • Joined: 09-September 07

Posted 19 September 2007 - 03:23 AM

jeffstearns, on Sep 12 2007, 10:00 PM, said:

But this isn't what I see. Instead, write requests flow through the kernel's block queue and into the 3Ware driver, which returns them in batches at a rate of about 10Hz.

My gut tells me that this is closely related to the performance problem with the 3Ware controller: If it's presented with more than a small number of concurrent write requests, it appears to switch into a "batch mode" where it buffers completed write responses, returning them in batches at a glacially slow 10Hz rate. (Something similar may be true for reading; I haven't investigated that.)

I've been passing my findings along to 3Ware, but I'm still waiting for their response.


This is very interesting, thanks for posting it. I don't have the knowledge to do the sort of investigation you've done; I'm limited to running various benchmarking programs and attempting to understand the results.

Since in other places I've been asked "Why don't you try RAID-10?", I decided to experiment with different types of RAID setup to see what difference it might make. As mentioned above, I initially saw this problem with a simple RAID 1 config. The easiest alternative to try first was to convert my two hot spares into a RAID 0 array and create a default ext3 partition upon which to experiment. (Edit: in the results below, the dd tests were done prior to making the filesystem and running the bonnie++ tests)

Here are the results of some benchmarks.

The machine is a dual Opteron 2.4GHz and has 4GB RAM installed, with a 9550SX-8LP hosting four Seagate ST3250820SV drives.

First, with no 3ware-recommended kernel tweaks applied (CentOS 4.5 blockdev readahead and nr_requests are 256 and 8192 by default - it's a straightforward minimal install, with 9550SX firmware and driver from the same 3ware codeset (9.4.1.2), the driver being built-in to the 2.6.9-55.EL kernel in the distro).

Readahead = 256
nr_requests = 8192
	Reading: dd if=/dev/sdb of=/dev/null bs=1M count=XXXX
	dd read 1024 MB at 123.82 MB/s in 8.27 secs
	dd read 2048 MB at 125.80 MB/s in 16.28 secs
	dd read 4096 MB at 125.22 MB/s in 32.71 secs
	dd read 8192 MB at 125.18 MB/s in 65.44 secs
	
	Writing: dd if=/dev/zero of=/dev/sdb bs=1M count=XXXX
	dd write 1024 MB at 97.15 MB/s in 10.54 secs
	dd write 2048 MB at 93.60 MB/s in 21.88 secs
	dd write 4096 MB at 90.22 MB/s in 45.40 secs
	dd write 8192 MB at 91.63 MB/s in 89.40 secs

Label: RA-256_NR-8192
bonnie++ -m RA-256_NR-8192 -n 0 -u 0 -r 512 -s 20480 -f -b
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RA-256_NR-8192  20G           50014  18 44715  12           126556  14 137.0   0


Then I applied the 3ware tweaks (readahead 16384, nr_requests 512) and re-tested:

Readahead = 16384
nr_requests = 512
	Reading: dd if=/dev/sdb of=/dev/null bs=1M count=XXXX
	dd read 1024 MB at 149.71 MB/s in 6.84 secs
	dd read 2048 MB at 151.70 MB/s in 13.50 secs
	dd read 4096 MB at 152.15 MB/s in 26.92 secs
	dd read 8192 MB at 153.61 MB/s in 53.33 secs
	
	Writing: dd if=/dev/zero of=/dev/sdb bs=1M count=XXXX
	dd write 1024 MB at 88.28 MB/s in 11.60 secs
	dd write 2048 MB at 89.90 MB/s in 22.78 secs
	dd write 4096 MB at 89.51 MB/s in 45.76 secs
	dd write 8192 MB at 87.89 MB/s in 93.21 secs

Label: RA-16384_NR-512
bonnie++ -m RA-16384_NR-512 -n 0 -u 0 -r 512 -s 20480 -f -b
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RA-16384_NR-512 20G           62753  22 55895  16           150694  18 127.3   0


Next, I left the readahead at 16384 and returned nr_requests to its 8192 default and re-tested again:

Readahead = 16384
nr_requests = 8192
	Reading: dd if=/dev/sdb of=/dev/null bs=1M count=XXXX
	dd read 1024 MB at 149.71 MB/s in 6.84 secs
	dd read 2048 MB at 152.72 MB/s in 13.41 secs
	dd read 4096 MB at 153.35 MB/s in 26.71 secs
	dd read 8192 MB at 153.15 MB/s in 53.49 secs
	
	Writing: dd if=/dev/zero of=/dev/sdb bs=1M count=XXXX
	dd write 1024 MB at 97.15 MB/s in 10.54 secs
	dd write 2048 MB at 97.06 MB/s in 21.10 secs
	dd write 4096 MB at 93.45 MB/s in 43.83 secs
	dd write 8192 MB at 91.90 MB/s in 89.14 secs

Label: RA-16384_NR-8192
bonnie++ -m RA-16384_NR-8192 -n 0 -u 0 -r 512 -s 20480 -f -b
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RA-16384_NR-819 20G           57576  20 55535  16           151212  18 126.4   0


Finally, for completeness, I set readahead back to 256 and changed nr_requests to 512:

Readahead = 256
nr_requests = 512
	Reading: dd if=/dev/sdb of=/dev/null bs=1M count=XXXX
	dd read 1024 MB at 123.97 MB/s in 8.26 secs
	dd read 2048 MB at 125.26 MB/s in 16.35 secs
	dd read 4096 MB at 125.37 MB/s in 32.67 secs
	dd read 8192 MB at 125.09 MB/s in 65.49 secs
	
	Writing: dd if=/dev/zero of=/dev/sdb bs=1M count=XXXX
	dd write 1024 MB at 91.43 MB/s in 11.20 secs
	dd write 2048 MB at 90.14 MB/s in 22.72 secs
	dd write 4096 MB at 89.55 MB/s in 45.74 secs
	dd write 8192 MB at 92.17 MB/s in 88.88 secs
	
Label: RA-256_NR-512
bonnie++ -m RA-256_NR-512 -n 0 -u 0 -r 512 -s 20480 -f -b
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
RA-256_NR-512   20G           61808  22 46250  12           126959  14 137.1   0


So, the 3ware tweak readahead = 16384 improves read throughput, but their recommended nr_requests = 512 reduces write throughput. Neither of these params appears to have any impact whatsoever on the underlying problem of sluggish system response, however. Frankly, I didn't expect them to.
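
To put numbers on that summary, here is a small Python sketch that averages the dd figures from the four runs above and prints each configuration's change against the untweaked baseline. The data is copied from the tables; the dictionary layout and function names are just for illustration, and four samples per cell is a thin basis for anything beyond a rough comparison.

```python
from statistics import mean

# dd throughput in MB/s for the 1024/2048/4096/8192 MB runs,
# keyed by (readahead, nr_requests), copied from the results above.
DD_RESULTS = {
    (256, 8192): {"read": [123.82, 125.80, 125.22, 125.18],
                  "write": [97.15, 93.60, 90.22, 91.63]},
    (16384, 512): {"read": [149.71, 151.70, 152.15, 153.61],
                   "write": [88.28, 89.90, 89.51, 87.89]},
    (16384, 8192): {"read": [149.71, 152.72, 153.35, 153.15],
                    "write": [97.15, 97.06, 93.45, 91.90]},
    (256, 512): {"read": [123.97, 125.26, 125.37, 125.09],
                 "write": [91.43, 90.14, 89.55, 92.17]},
}


def mean_throughput(config, direction):
    """Average MB/s for one configuration and direction ('read' or 'write')."""
    return mean(DD_RESULTS[config][direction])


def percent_change(baseline, value):
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


BASELINE = (256, 8192)
for config in DD_RESULTS:
    changes = [
        f"{d} {percent_change(mean_throughput(BASELINE, d), mean_throughput(config, d)):+.1f}%"
        for d in ("read", "write")
    ]
    print(config, *changes)
```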

I hope 3ware get back to you about your findings - so far I've not had much success in getting a response out of them. Please keep us posted.

S.

This post has been edited by SimonB: 19 September 2007 - 03:26 AM






#16 chrispitude

  • Member
  • Group: Member
  • Posts: 79
  • Joined: 26-October 02

Posted 24 January 2008 - 07:10 AM

Anything more on this front?

- Chris





#17 threshar

  • Member
  • Group: Member
  • Posts: 1
  • Joined: 05-February 08

Posted 05 February 2008 - 04:09 PM

This is a great thread - I've been seeing some horrible performance on some of my boxes with 9550sx's.

One of them is currently in a jbod configuration with software raid, which is much faster and more responsive than the controller.

I've been tempted on the advice of others to get an Areca to see what benefits I can get.

In my case, I couldn't get the card to complete more than 400 IO/sec aggregate (never more than 120 in a single process). The test was: seek randomly, read 8k, mimicking postgresql's IO patterns. For comparison, I get thousands on another box with an hp p600 controller and a pile of sas disks. During a lot of those times the system gets very laggy and load goes up.
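
For anyone wanting to try the same kind of test, here is a minimal Python sketch of a "seek randomly, read 8k" loop. It is not the program used for the numbers above; `random_read_iops` is a made-up name. It works on a regular file, and unless that file is much larger than RAM (or the page cache is dropped first) the result mostly measures the cache rather than the controller.

```python
import os
import random
import time


def random_read_iops(path, block_size=8192, n_reads=1000, seed=0):
    """Issue `n_reads` block-aligned random reads against `path`; return reads/sec."""
    rng = random.Random(seed)
    n_blocks = os.path.getsize(path) // block_size
    if n_blocks == 0:
        raise ValueError("file is smaller than one block")
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(n_reads):
            offset = rng.randrange(n_blocks) * block_size
            os.pread(fd, block_size, offset)  # positioned read: seek + read in one call
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return n_reads / elapsed
```

Running several copies at once against different files gives the aggregate figure; a single copy gives the per-process one.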

So we'll see what happens.





#18 ccaputo

  • Member
  • Group: Member
  • Posts: 1
  • Joined: 22-September 03

Posted 10 February 2008 - 02:57 PM

Is it just me or does it appear the 3ware 9.4.2 firmware release has resolved these problems?





#19 natmaka

  • Member
  • Group: Member
  • Posts: 3
  • Joined: 06-December 07

Posted 19 February 2008 - 03:17 AM

Same here: slow random I/O (using a database).

I published an overview and I'm in the process of refreshing all that information after completing a bunch of tests with a 9650, for both the controller-implemented RAID (mostly RAID10) and the Linux 'md' one.

If you want me to run a given test please describe it to me, along with the reason why it is pertinent.





#20 Fedor

  • Member
  • Group: Member
  • Posts: 425
  • Joined: 18-March 07

Posted 19 February 2008 - 06:10 AM

natmaka, on Feb 19 2008, 08:17 AM, said:

Same here: slow random I/O (using a database).

I published an overview and I'm in the process of refreshing all that information after completing a bunch of tests with a 9650, for both the controller-implemented RAID (mostly RAID10) and the Linux 'md' one.

If you want me to run a given test please describe it to me, along with the reason why it is pertinent.


Some great thorough work and research you've done. Pretty much sums up all the info and tweaks I've ever tried. There is one thing that puzzled me a bit about your tweaking. If dealing with small 64k reads a majority of the time, isn't setting read-ahead counter-productive? After all, if the reads are random, virtually nothing it reads ahead would be useful, would it? To that end, perhaps setting a large read-ahead with blockdev --setra is not a good idea, and you might even try turning off read-ahead on your controller (although if it's adaptive read-ahead, it should learn fairly quickly that reading ahead isn't useful and not do it).
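
A rough way to see the scale of the concern, as a Python sketch: blockdev read-ahead values count 512-byte sectors, so 16384 is an 8 MiB window. The `worst_case_waste` function below is a deliberately pessimistic assumption (every random read drags in a full window); the kernel's read-ahead is adaptive and normally backs off on random access, so treat the output as an upper bound, not a measurement.

```python
SECTOR_BYTES = 512  # blockdev --setra / --getra count 512-byte sectors


def readahead_kib(sectors):
    """Convert a blockdev read-ahead value (in sectors) to KiB."""
    return sectors * SECTOR_BYTES // 1024


def worst_case_waste(request_kib, sectors):
    """Fraction of fetched bytes left unused if every random read of
    `request_kib` pulled in a full read-ahead window (pessimistic assumption)."""
    window = max(readahead_kib(sectors), request_kib)
    return 1.0 - request_kib / window


for ra in (256, 16384):
    print(ra, "sectors =", readahead_kib(ra), "KiB,",
          f"worst-case waste for 64 KiB reads: {worst_case_waste(64, ra):.1%}")
```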




