gfody

It's that time again


Every couple of years I spec out new DB servers for our particularly demanding OLAP+OLTP application.

I've always gotten great feedback from you guys (DB spec 2005, DB spec 2007), so here goes..

DB spec 2009:

4x ($825) Areca ARC-1680ix-12

36x ($88) 4GB DDR2-533 ECC Registered

48x ($349) Transcend SATA 128GB SSD 2.5"

4x ($2355) Opteron 8384 Quad Core 2.7GHz

1x ($885) Tyan N6550EX

1x ($2239) RM4048 4U 48-Bay 2.5" HDD Chassis

Total ~$35,764

I'm not confident about the drives. I started out thinking Intel X25, but they're $865 ea. for the 160GB. Considering I'll be striping them 12x with 4GB of cache on each controller, the performance difference wouldn't matter - but what about reliability? Is anyone using MLC drives in servers? Do you think it's okay to use RAID 0? Not striping seems like a waste; do you really have to worry about "losing" an SSD? Maybe RAID 5 with a hot spare for peace of mind.
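
To put some very rough numbers on the RAID question, here's a back-of-the-envelope sketch (Python, using a completely assumed 3% annual failure rate, since nobody has real MLC-in-server numbers yet) comparing the odds of losing a 12-wide RAID 0 vs. RAID 5 in a year:

```python
# Back-of-the-envelope array-loss math. The 3% annual failure rate is a
# pure assumption, not a measured figure for these MLC drives.
afr = 0.03            # assumed annual failure rate of a single SSD
n = 12                # drives striped per controller

p_ok = 1 - afr

# RAID 0: the array is lost if ANY drive dies during the year.
p_raid0_loss = 1 - p_ok ** n

# RAID 5 (ignoring the rebuild window): the array survives 0 or 1 failures.
p_raid5_loss = 1 - (p_ok ** n + n * afr * p_ok ** (n - 1))

print(f"RAID 0 annual loss probability: {p_raid0_loss:.1%}")   # ~30.6%
print(f"RAID 5 annual loss probability: {p_raid5_loss:.1%}")   # ~4.9%
```

Whatever per-drive failure rate you believe in, a 12-wide stripe roughly multiplies it by 12, so RAID 0 only makes sense if losing an array is merely an inconvenience.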

Last time we ended up going w/Dell, lots and lots of MD1000s and SAS drives... and pretty much ran out of rack space. The servers were great for the money and very reliable, but we definitely have to improve on density this time. I don't really see anything like this from the OEMs.

Discuss! :P


"Do you really have to worry about losing an SSD?"

I think the answer to that is: well, do you *want* to? They may not have moving parts, but neither do motherboards or CPUs or RAID controller cards or... many other things that are known to fail. If you don't have a DB mirror/replica, then I'd say RAID 0 is insanity no matter what the drives are.

About the SSD performance... have you seen the reports of bad write performance from SSDs, including the Intel E-series SLC and "enterprise" products like the Fusion-io ioDrive? They're not necessarily any faster than 15k SAS drives, may be slower, and are certainly much less predictable, currently.


You should really take a look at MFT for your SSDs:

http://www.easyco.com/mft/index.htm

We are currently testing SSDs for our low-budget SAN, and MFT is a must if you are using SSDs, because random writes at 4K blocks are a killer for all types of SSDs, and MFT gets that performance to the top!

Sorry, I can't post any "before" data, but here are some numbers for 2x OCZ Core 120GB drives (V1) in RAID 0:

2x MLC OCZ Core Series 128GB in RAID 0
Write tests, 10 threads:

Block size    IOPS     Bandwidth
512 B         5262     3.6 MB/s
1 KB          4201     4.1 MB/s
2 KB          3525     7.9 MB/s
4 KB          40878    160.7 MB/s
8 KB          24317    190.0 MB/s
16 KB         12582    197.6 MB/s
32 KB         5627     176.9 MB/s
64 KB         2922     183.7 MB/s
128 KB        1426     178.3 MB/s
256 KB        735      184.8 MB/s
512 KB        366      183.5 MB/s
1 MB          183      184.9 MB/s
2 MB          92       184.4 MB/s
4 MB          48       192.0 MB/s

We were getting something like 100 IOPS at 4K through our RAID controller (RAID 0) without MFT, and about 7 IOPS with a direct SATA connection (1 SSD, no RAID controller) without MFT... so the conclusion is: MFT is SICK!!!

One thing though... the price is (I think) about $125 per 32GB.

I'll try to get some "before" numbers. But the great thing is... you can buy "low performance" SSDs and get great performance. I mean, with that many drives your sequential read/write will be maxed out even with "low performance" SSDs. So buy low-performance SSDs, throw MFT on them, and your random writes will KICK ASS!!!

Oh, one thing... when using MFT some of your storage will be used for "MFT stuff"; I think about 10% of each drive becomes unavailable. =)

Regards

Jan Chu



I just had a look at that MFT software. It looks like a basic software layer that employs your CPU and RAM to transform random writes into sequential writes. It seems to use some sort of deferred-format method that uses idle time to optimize the data; they explain the drawbacks here. Honestly, it smells like snake oil to me, especially with statements like this:

Flash Solid State Drives (SSDs) random read up to fifty times faster than Hard Disks but random write a hundred times slower than they random read. Thus, bare SSDs normally perform no better than Hard Disks.
So SSDs are many times faster than HDDs, but because they don't random-write as fast as they random-read, they're no better than HDDs? That's a total non sequitur. Random writes are a killer for any drive, not just SSDs. Things like command queuing, write cache, and big-buffer HBAs work with SSDs too.
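
For what it's worth, the "random writes become sequential" trick reads like a log-structured remapping layer. Here's a toy sketch of that general idea (my guess at the concept, not EasyCo's actual implementation, which isn't public):

```python
# Toy log-structured remapping layer: logical writes land sequentially on
# the media, and a map remembers where each logical block currently lives.
# This only illustrates the concept, not MFT's real design.

class LogStructuredLayer:
    def __init__(self, num_blocks):
        self.mapping = {}          # logical block -> physical block
        self.next_physical = 0     # next free slot in the sequential log
        self.num_blocks = num_blocks

    def write(self, logical_block, data, device):
        # Scattered logical writes always go to the next sequential slot,
        # which is the access pattern cheap MLC flash handles well.
        physical = self.next_physical % self.num_blocks
        device[physical] = data
        self.mapping[logical_block] = physical
        self.next_physical += 1

    def read(self, logical_block, device):
        return device[self.mapping[logical_block]]

# Random logical writes (7, 3, 99) become physical writes 0, 1, 2.
device = {}
layer = LogStructuredLayer(num_blocks=1 << 20)
for lb in (7, 3, 99):
    layer.write(lb, f"payload-{lb}", device)
print(sorted(device.keys()))   # [0, 1, 2] -- sequential on the media
```

The catch with any scheme like this is that stale copies of rewritten blocks have to be garbage-collected eventually, which presumably is what the idle-time optimization and the ~10% reserved space are for.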

I have seen some posts here on the garbage-collection-mode performance issues. It seems that with lots of drives and lots of controller cache it would be hard to run into that worst-case scenario for very long. Even if the SSDs were permanently limited to ~20MB/sec of writing, the overall system would be clearing its write cache at 960MB/sec. I'm definitely interested if anyone can share real-world numbers from large arrays of SSDs.

My biggest concern is still MLC vs. SLC. Do reliability concerns hold water?

4 KB: 40878 IOPS, 160.7 MB/s

As mentioned before, we got about 100 IOPS without MFT, and with MFT we got 40k IOPS... so if you think it sounds too good to be true, I would advise you to get a trial and see for yourself, because it really is that amazing. What impact it has on MTBF (lifespan) I don't know... but with low-budget SSDs and MFT you get amazing performance for less money!

//jan Chu


For performance I would recommend the upcoming Vertex drives, released next week. But a slower SSD could be better, since 4 Vertex discs will probably overload the controller card. Even Intel's 1.2GHz IOP has a hard time keeping up with lots of SSDs.

I recommend visiting http://www.ocztechnologyforum.com/forum/fo...splay.php?f=186 and talking to the people with experience running 4-8 SSDs on different RAID controllers.

I would also recommend the i7 (Nehalem) based Xeons, as they are currently better than the Opterons out there.


gfody,

I took a look over the last two threads.

Do you have a DBA and the schema you're working with available? From the looks of things, it appears that you're going for the "one huge hammer" approach. This is usually not fault tolerant (a requirement for OLTP under PCI and SAS 70) and is usually not performant. There is a possibility that setting up a 10g RAC cluster and partitioning the DB may be a much more elegant option.

Frank

What database package are you using?

Do you have a target number for I/Os per second off the data array?

It's MSSQL, and yes, we know all the tricks :D raw volumes, binary collation, etc.

We don't generally target a specific performance number. Application performance is the top priority, and if the server is overbuilt it just means we can host more instances. There will be many servers like this, each with many instances of our application.

I recommend visiting http://www.ocztechnologyforum.com/forum/fo...splay.php?f=186 and talking to the people with experience running 4-8 SSDs on different RAID controllers.

This is a great link, thank you!
Do you have a DBA and the schema you're working with available? From the looks of things, it appears that you're going for the "one huge hammer" approach.
It's a real-time analytics system using MSSQL internally. I only say OLTP+OLAP because that's what the actual SQL activity resembles - doing both at the same time. RAC is an interesting solution but at this point the application is really married to MSSQL. The app scales sideways with many instances but we also try to improve the per instance performance every couple of years with new hardware.

I hadn't kept up on Intel vs. AMD. I'll try to find an Intel board with 32 memory slots and 4 PCIe slots.


Given the per-socket (and not per-core) pricing on MSSQL, it might be cost-effective for you to use the 6-core Xeon 7400 series.

I don't think Supermicro makes a mobo with 32 DIMM slots and 4 PCIe slots. They do make two that have all the RAM slots but only 3 PCIe slots, however. They are quad-socket Xeon 7x series.

The HP ProLiant servers can be had with 32 DIMM sockets. You could use the on-board 2.5" slots for txn log drives and an external array for the DB. We managed to get a few of those servers (with a single CPU but 128GB of RAM and full of SAS drives) for around $16k each, after significant haggling.

Re: MFT. Personally, I'm pretty wary of "Easy Computing Company". Also, you can get more than 100 IOPS from a good SSD without it, for sure. Given the incredible variability of SSD performance over time / workload, the posts in this thread of "I only got 100 IOPS before, now with MFT I'm getting 40k!!" are, pretty much, worthless. Meanwhile, it's that variability and "black-box" behavior that makes the current crop of SSDs iffy for important jobs.

I still don't know how much actual disk volume space is required. SSD performance is much more likely to stay blazingly fast if you only ever use a small subset of the LBAs available on them. (This is true with HDDs too, but they fall off linearly as you use more, because the seeks get longer. SSD write performance will go down much more drastically).
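
Quick illustration of that last point, assuming the controller really does get to treat never-written LBAs as spare area (which depends entirely on the firmware):

```python
# Hypothetical example: partition only part of a 128GB SSD and leave the
# rest untouched, so the controller has extra room for wear levelling and
# garbage collection. This is just the arithmetic, not a guarantee.
drive_gb = 128
partitioned_gb = 96   # assumed: only 96GB is ever written

extra_spare = (drive_gb - partitioned_gb) / drive_gb
print(f"Extra effective over-provisioning: {extra_spare:.0%}")  # 25%
```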

Given the per-socket (and not per-core) pricing on MSSQL, it might be cost-effective for you to use the 6-core Xeon 7400 series.
The 6-core Xeons look really cool, but I'm not really liking the motherboard options. The most RAM capacity I see so far is 24 DIMMs, and that's a quad-socket board. I do like the idea of a 24-core server but don't like skimping on RAM. Nothing with 4 PCIe slots either.

With external arrays the size of these things really starts to add up. We're currently almost completely out of rack space because each server takes up 12-16U of MD1000s, and that's only 15 drives per MD unit. I really love the density of that 4U 48-drive chassis; why don't HP or Dell offer something like this?

If SSD is just too much of a risk, the other option is the Savvio 15k 2.5" SAS at $288 ea. for 73GB, but that's cutting capacity almost in half from the 128GB SSDs. I would probably consider the 150GB VelociRaptors instead to gain some capacity and save money. It would be a shame to miss out on the possibly incredible performance boost in our app from SSDs' random read performance.
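
Rough cost/capacity math on the two options with prices quoted in this thread (leaving out the 150GB VelociRaptor, since I don't have a price for it here):

```python
# Raw capacity and cost for a 48-bay box, using the drive prices quoted in
# this thread. RAID overhead and the VelociRaptor option are left out.
options = {
    "Transcend 128GB SSD ($349)": (349, 128),
    "Savvio 15k 73GB SAS ($288)": (288, 73),
}
bays = 48
for name, (price, gb) in options.items():
    print(f"{name}: {bays * gb} GB raw, ${bays * price:,} total, "
          f"${price / gb:.2f}/GB")
```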

It would be a shame to miss out on the possibly incredible performance boost in our app from SSDs' random read performance.

If it's any consolation:

If this is an OLTP DB (or a copy of one), then you likely have a really big atomic "purchase" table with events of types GRANT, REVOKE, CONCLUDE, RESTORE. Keep that table alone on its own volume and reads from it are no longer random.

Frank

The most RAM capacity I see so far is 24 DIMMs, and that's a quad-socket board. I do like the idea of a 24-core server but don't like skimping on RAM. Nothing with 4 PCIe slots either.

If SSD is just too much of a risk, the other option is the Savvio 15k 2.5" SAS at $288 ea. for 73GB, but that's cutting capacity almost in half from the 128GB SSDs.

Some points:

  • Your 32-DIMM-slot motherboard is too large for your 4U chassis (48x hot-swap 2.5" in 4U :wub: ), so you should stop dreaming of a 128GB board with 4GB DIMMs (quad 8-core with 128GB of 8GB DIMMs will be available for DB spec 2011)
  • Using a stuttering JMicron 602-based SSD is definitely dangerous... and I think any SSD will enter this slow mode when getting close to its capacity limit... you may look at a "copy-on-write" feature (Win7 and some Linux filesystems) to delay this problem, and plan the use of an "erase" tool
  • You can rely on SATA drives in RAID 1x for most classic operations: the OS, APPS, TEMP, BACKUP and LOGS directories. Staying at a 2.5" form factor means using the $320 300GB VelociRaptor for those drives... and you can get by with fewer hw RAID cards because RAID 1x is not a big IOP consumer
  • Go with some SSDs to allow for some carefully chosen intensive direct-IO access

Your 32-DIMM-slot motherboard is too large for your 4U chassis (48x hot-swap 2.5" in 4U :wub: ), so you should stop dreaming of a 128GB board with 4GB DIMMs (quad 8-core with 128GB of 8GB DIMMs will be available for DB spec 2011)
They both say Extended ATX, but you're right: the MB spec says 13" x 16.2" and the chassis spec says 12" x 13". That's a major bummer... better to catch it now, I guess. I really liked this MB with all the DIMM sockets and 4x PCIe slots. I'll be searching MBs and chassis again. I hear the new Xeons are the way to go now anyway.

I'm feeling a bit discouraged about SSD. It seems like it's almost, but just not quite, ready for prime time. I think it's definitely time to go 2.5" though. I see Dell has MD units supporting 2.5" drives now (24x in 2U). This could be a way for us to recover some rack space without upgrading the servers.

I think it's definitely time to go 2.5" though.

I found two 4U 48-bay chassis for 3.5" form factor drives... but that does not solve the MB size problem... and 8GB DIMMs are still really expensive compared to 4GB.

Regarding SSDs, I'd encourage you to buy 8 (of the 48) so that the most IOPS-intensive tables can be located on those SSDs.


That XJ2000 chassis confuses me. The description says 48-bay but the picture shows 24 bays. :blink:

Regarding SSDs, I'd encourage you to buy 8 (of the 48) so that the most IOPS-intensive tables can be located on those SSDs.
I will probably try this out as a test... if SSD is viable at all, then I would be inclined to avoid spinning drives completely. If there are serious performance problems, then I don't think I would be able to use them even for special tables. I'm just going to have to do a real-world test, measure the best- and worst-case scenarios, and make a decision.


Regarding SSDs, you have two options today: the OCZ Vertex and Intel's X25, the Vertex being the newer product with better write performance and generally cheaper.

That XJ2000 chassis confuses me. The description says 48-bay but the picture shows 24 bays. :blink:
I had the same confusion until I saw that they provide a special hot-swap tray allowing TWO 3.5" drives per bay... anyway, it looks like an expansion rack, not capable of handling a MB.

Regarding the SSD drives, I also think you could buy some of the brand-new OCZ Vertex (pre-sale $470 for 120GB) or the still-expensive Intel X25-M ($360 for 80GB).

Regarding the SSD drives, I also think you could buy some of the brand-new OCZ Vertex (pre-sale $470 for 120GB) or the still-expensive Intel X25-M ($360 for 80GB).

I just realized that PCIe x4 (2.5Gb/s × 4 lanes with 8b/10b encoding) is limited to 1000MB/s, which limits each PCIe x4 port to 4 SSDs (consistently reading at 210MB/s) and each PCIe x8 to 9 SSDs.
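
The arithmetic behind that, in case anyone wants to check the assumptions (PCIe 1.x lanes at 2.5Gb/s, 8b/10b encoding, ~210MB/s sustained read per SSD):

```python
# Usable PCIe 1.x bandwidth and how many ~210MB/s SSDs fit under it.
lane_gbps = 2.5        # raw bit rate per lane
payload = 8 / 10       # 8b/10b encoding leaves 80% as payload
ssd_mb_s = 210         # assumed sustained read per SSD

def slot_mb_s(lanes):
    return lanes * lane_gbps * payload * 1000 / 8   # Gb/s -> MB/s

for lanes in (4, 8):
    bw = slot_mb_s(lanes)
    print(f"x{lanes}: {bw:.0f} MB/s -> {int(bw // ssd_mb_s)} SSDs")
```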

I just realized that PCIe x4 (2.5Gb/s × 4 lanes with 8b/10b encoding) is limited to 1000MB/s, which limits each PCIe x4 port to 4 SSDs (consistently reading at 210MB/s) and each PCIe x8 to 9 SSDs.
Yeah, that's why I chose that Opteron board with 4 PCIe x8 slots and went with four 12-port controllers rather than two 24-port.

that Opteron board with 4 PCIe x8 slots and went with four 12-port controllers rather than two 24-port.

Did you look closely at this MB's block diagram?

It shows "16x16@2GT/s" for both the NFP 3050 (3x PCIe x8 + 1 PCI-X x4) and the NFP 3600 (2x GbE + 1x PCIe x16 + 8x SAS + 6x SATA II). Any idea how many MB/s this "16x16@2GT/s" works out to?

Apart from that, did you find a hw RAID card that uses PCIe x16? They look limited to PCIe x8...

Did you look closely at this MB's block diagram?

It shows "16x16@2GT/s" for both the NFP 3050 (3x PCIe x8 + 1 PCI-X x4) and the NFP 3600 (2x GbE + 1x PCIe x16 + 8x SAS + 6x SATA II). Any idea how many MB/s this "16x16@2GT/s" works out to?

Apart from that, did you find a hw RAID card that uses PCIe x16? They look limited to PCIe x8...

HyperTransport is a packet-based protocol, so they measure it in transfers per second. 1GT/s is effectively 2GB/s (2 bytes per transfer).

So, going by the block diagram, 3 of the PCIe slots will be limited to 4GB/s total, and the 4th is on a separate 4GB/s channel shared with the onboard SAS controller and NIC.

Usually it's the controller cards that bottleneck you before the bus/HT, but I don't know the max throughput on these 1.2GHz IOP cards (they're PCIe x8 and will work in the x16 slot btw). The 800MHz cards topped out at about 800MB/s.
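
For reference, the conversion (assuming "16x16@2GT/s" means 16-bit HT links each way at 2GT/s):

```python
# HyperTransport link bandwidth per direction:
# (link width in bytes) x (transfers per second).
width_bits = 16
gt_per_s = 2.0   # the "2GT/s" from the block diagram

gb_per_s = (width_bits / 8) * gt_per_s
print(f"{gb_per_s:.0f} GB/s per direction")   # 4 GB/s
```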


I don't know the max throughput on these 1.2GHz IOP cards (they're PCIe x8 and will work in the x16 slot btw). The 800MHz cards topped out at about 800MB/s.

There are some benchmarks showing the IOP348 @ 1.2GHz delivering about 1200MB/s in RAID 5/6.

I don't know if this throughput is limited by the XOR operations or not.

If not, it means you may see higher throughput in RAID 0/1... anyway, 1650MB/s looks like another limit (seen in some "all-in-cache" benchmarks).

==> Please let us know if you see 1600MB/s of throughput in a 10x SSD RAID 10 array :lol:

