AndreSchmitz

Need help metering a Software defined SAN


Good morning,

this is my first post here so thank you for reading! :-)

We are using a software-defined SAN (SanSymphony) and have some serious performance problems running our processes. Now it's on me to tell whether the SAN is the bottleneck. Unfortunately I'm a database guy and not a SAN dude.

Can someone tell me some best practices for telling whether a SAN is running hot?

The SAN:

Some 15k disk RAIDs on the lower tiers, 2 SSD RAID 5 sets for the first tier (max about 15k random-write IOPS at 8k blocks each), connected via 8 Gbps Fibre Channel to the hosts. All put together with SanSymphony.

All servers and services are running virtualized on VMware.

What I can tell:

Disk latencies (shown in Perfmon on the servers) are pretty high! 2 to 20 ms average under low load and up to 100-400 ms under heavy load are typical for our MS SQL Server. IMHO these numbers are horrific for physical database servers, but some consultants told us that such high latencies are normal for virtualization + SAN. So we tried to ignore the latencies.
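One way to relate latency figures like these to IOPS is Little's Law: the average number of I/Os in flight equals the I/O rate multiplied by the average latency. The sketch below is only an illustration; the IOPS values and function names are made up for the example and are not measurements from this SAN.

```python
def outstanding_ios(iops, latency_ms):
    """Little's Law: mean I/Os in flight = I/O rate * mean latency."""
    return iops * (latency_ms / 1000.0)


def expected_latency_ms(iops, outstanding):
    """Same relation solved for latency: queue depth divided by I/O rate."""
    return 1000.0 * outstanding / iops


if __name__ == "__main__":
    # Hypothetical values: 5000 IOPS at a calm 20 ms and at a painful 400 ms.
    for latency in (20, 400):
        queued = outstanding_ios(5000, latency)
        print(f"5000 IOPS at {latency} ms -> about {queued:.0f} I/Os in flight")
```

If latency climbs while IOPS stay flat, the number of I/Os waiting somewhere in the stack is growing, which is a hint that some layer (host, fabric or storage) is saturated. It does not say which layer.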

I grabbed some logs from our SAN, figured out the most frequent workload profiles (% read, % write, average read block size, average write block size) and set up an IOMeter scenario reflecting these workloads. I ran this IOMeter setup against our SAN during off-hours, measured the maximum IOPS per workload profile and compared it to the IOPS occurring in the real world for that workload scenario.

I put all these numbers together in some Excel sheets and now ... I don't know how to go on.

IOPS_rel.jpg shows 2 days in our company's life. Each data point represents about 30 minutes. I set the maximum benchmarked IOPS to 100% and compared the real-world IOPS against it.
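The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the profile names, the IOPS values and the 80% threshold are invented for the example and are not taken from the Excel sheets.

```python
# Hypothetical numbers for illustration only; replace with your own measurements.
# Maximum IOPS measured with IOMeter per workload profile (off-hours benchmark).
benchmark_max_iops = {
    "oltp_8k_70r": 15000,
    "batch_64k_30r": 4000,
}

# Real-world samples: (interval index, workload profile, observed IOPS).
observed = [
    (1, "oltp_8k_70r", 4500),
    (2, "oltp_8k_70r", 15300),
    (3, "batch_64k_30r", 3100),
]


def relative_load(profile, iops):
    """Observed IOPS as a percentage of the benchmarked maximum for that profile."""
    return 100.0 * iops / benchmark_max_iops[profile]


def hot_intervals(samples, threshold_pct=80.0):
    """Return (interval, load %) for every sample at or above the threshold."""
    result = []
    for interval, profile, iops in samples:
        load = relative_load(profile, iops)
        if load >= threshold_pct:
            result.append((interval, round(load, 1)))
    return result


if __name__ == "__main__":
    for interval, profile, iops in observed:
        print(interval, profile, f"{relative_load(profile, iops):.0f}%")
    print("hot intervals:", hot_intervals(observed))
```

Values slightly above 100% can show up here because the benchmark maximum is itself only a measurement.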

What I can see:

- our SAN is continuously running at about 30% of max load.

- the two peaks (1 to 13 and 51 to 59) shown in the diagram are the processes causing trouble. The first spike hits the 100% mark (the 120% I would put down to benchmark tolerance...), the second one doesn't even touch the 80% line.

So ... shall we upgrade our SAN or not? I know that in the end this decision comes down to money.

But what would you say from a technical point of view?

Thank you very much for reading!

Andre


Have you spoken to the SanSymphony crew to see if you can monitor SAN-level activity from the appliance itself (disk stats, networking, etc) to sort out that as a first step? If the devices are getting hammered, the SAN is most likely the cause.

Since you are using flash and HDD in that system, are you seeing hot data finally start spilling outside of flash and onto the HDD tier?


Good morning!

Thank you for your reply, and excuse my late one. ATM a Datacore crew is inspecting our SAN. For now it looks like our SSD tier (about 10% of the overall SAN volume) is

- too small for our hot data

- too slow for our demands.

So we are thinking of compensating for our bottlenecks with a new top tier of Intel DC-P3700 SSDs as a first-step "emergency solution".

Thank you for your patience!


The sizing issue is a very common one; despite analytical tools, many people struggle with sizing a flash tier. Hopefully more/better flash helps you. What drives are in there now?


What model SSD are you using currently? Are the SSDs old and slow, or is it RAID overhead or CPU constraints of the host platform?


Good Morning!

We are using two RAID sets for the top tier.

#1 Hitachi HUSSL4040ASS600 (~13k random-write IOPS at 8 KB each, 4 active + 1 hot-spare drives in RAID 5)

#2 Toshiba PX02SMF080 (~15k random-write IOPS at 8 KB each, 4 active + 1 hot-spare drives in RAID 5)

Both RAID sets together host about 3 TB of space.

As the budget is limited, we suggested either

a) adding a new top tier with an 800 GB DC-P3700

b) or expanding each RAID 5 by "a few" (1-2) SSDs to increase the RAID 5 write throughput. Street prices of the P3700 and the PX02SMF080 seem to be equal.

c) Our Datacore/server hardware supplier suggested doubling the number of existing SSDs and using them as RAID 10.

Of course option b) looks less promising, but it avoids technology that is unknown in our company, like PCIe storage. For now it looks like we can test option c) for free and then decide whether to buy or not.

Talking in numbers:

a) Using 1 x DC-P3700 (~2000€ / 45k IOPS) as new Tier0, resulting in Tier0 = 800 GB, 45k IOPS + Tier1 = 3 TB, ~23k IOPS

b) Using 1 additional SSD per RAID 5 (~2 x 2000€ / 15k IOPS), increasing Tier0 from 3 TB / 23k IOPS to 4.2 TB / 28.8k IOPS

c) Switching to RAID 10 using 10 additional SSDs (~10 x 2000€ / 15k IOPS), increasing Tier0 from 3 TB / 23k IOPS to 3 TB / 60k IOPS

If you ask me, I would prefer a).
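To put the three options side by side, here is a rough sketch that computes the cost per 1000 added IOPS from the figures quoted above. It assumes that the IOPS of a new tier simply add to the existing ones and that the quoted prices and IOPS are accurate, which are both simplifications; the option labels and the function name are made up for the example.

```python
# Prices and IOPS are the rough estimates quoted in the post, not measurements.
# Assumption: "added IOPS" is the difference between the new and the old total.
options = {
    "a) 1 x DC-P3700 as new top tier": {"cost_eur": 1 * 2000, "added_iops": 45_000},
    "b) 1 extra SSD per RAID 5": {"cost_eur": 2 * 2000, "added_iops": 28_800 - 23_000},
    "c) RAID 10 with 10 extra SSDs": {"cost_eur": 10 * 2000, "added_iops": 60_000 - 23_000},
}


def eur_per_1000_added_iops(cost_eur, added_iops):
    """Cost in euros for each additional 1000 IOPS."""
    return cost_eur / (added_iops / 1000)


def rank_options(opts):
    """Return (label, EUR per 1000 added IOPS) sorted from cheapest to most expensive."""
    ranked = [
        (label, eur_per_1000_added_iops(o["cost_eur"], o["added_iops"]))
        for label, o in opts.items()
    ]
    return sorted(ranked, key=lambda item: item[1])


if __name__ == "__main__":
    for label, cost in rank_options(options):
        print(f"{label}: about {cost:.0f} EUR per 1000 added IOPS")
```

This ignores capacity, redundancy and the risk of introducing a new technology, so it is only one input to the decision.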

The host CPU is way bigger than Datacore recommended and doesn't show significant load.

The RAID controllers used are some kind of MegaRAID SAS controller, but I don't know the exact model :-(

Have a nice day,

Andre


Today our Datacore consultant told us that in his opinion the problem lies in Tier2, a RAID 10 set of SAS HDDs.

We configured our SAN to use 100% of the Tier2 capacity, so using the last x% results in a performance degradation for Tier2 accesses due to the nature of HDDs.

His solution is now: expand the Tier2 RAID 10 with additional drives to increase the overall volume by 20%, and then leave the last 20% untouched to force the SAN not to use the slowest 20% of the HDDs.

I know the problem with using the last part of an HDD, but is this a common approach? This would ONLY make sense if the last 20% of Tier2 is slower than the average of Tier3, right? And even if that is the case, our performance bottlenecks are in IOPS regions far above the Tier2 performance, so I would still point to Tier1.


Yesterday I recalculated the tiers our reseller configured for our SAN.

This is what it looks like:

Tier1: 2 x RAID 5 (4 x SSD, 35000 IOPS per RAID), 3 TB in total

Tier2: 1 x RAID 10 (4 x HDD 10k, 520 IOPS per RAID), 1.2 TB in total

Tier3: 1 x RAID 5 (6 x HDD 10k, 390 IOPS per RAID), 3 TB in total

       2 x RAID 5 (5 x HDD 7.2k, 130 IOPS per RAID), 8 TB in total

Tier4: 1 x RAID 5 (5 x HDD 10k, 325 IOPS per RAID), 2.4 TB in total

Tier5: 3 x RAID 5 (5 x HDD 7.2k, 130 IOPS per RAID), 12 TB in total

Tier6: 1 x RAID 5 (3 x HDD 7.2k, 80 IOPS per RAID), 2 TB in total

We will rearrange all the drives like this:

Tier1: 2 x RAID 5 (4 x SSD, 35000 IOPS per RAID), 3 TB in total

Tier2: 1 x RAID 10 (16 x HDD 10k, 2080 IOPS per RAID), 4.8 TB in total

Tier3: 4 x RAID 5 (7 x HDD 7.2k, 157 IOPS per RAID), 24 TB in total

Both configurations offer roughly 32 TB of capacity, but in the second one the average IOPS/TB is doubled.

Thank you for your help!


We haven't logged any time with Datacore yet unfortunately but it sounds like you're making progress. If you're so inclined, please keep us updated on your progress. I'm sure other Datacore users would benefit from your sharing/insights.


Very well!

But all our server/SAN hardware AND Datacore licensing/support lies with a small reseller who is not very ... helpful.

Look at the config above:

The original config was done by the reseller. We worked out our new config and got comments like "at your own risk!", "good luck", "in our opinion you need a future concept" etc.

As I said, I'm not a SAN guy and so easy to intimidate ;-)

What we can say for now:

This weekend we updated our SanSymphony installations to the current SPS, including the "Parallel IO" feature heavily promoted by Datacore.

At first glance we saw an improvement of up to about 10% for long-running processes.


Yeah, sizing is a huge challenge and resellers without tools and experience will struggle with it. This is true for hybrids as well as software defined. Anyway, glad you're on your way again!
