stevecs

Experiences with large san/nas home projects (>48 drives)


Just wanted to throw a net out here to gather any gotchas, ideas, or other people's build notes to learn from.

Currently my home array (about 3 years old) is running 80 ST31000340NS (1TB Seagate) drives installed in 5 AIC RSC-3EG-80R-SA1S-0CR chassis connected to 3 Areca 1680 cards. The drives are RAIDed by the Arecas (6D+2P RAID6), the LUNs are presented to LVM, and the file systems run on top of the LVs.
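For scale, a quick sanity check of that layout (my own rough numbers, assuming the 80 drives divide evenly into 6D+2P groups):

DRIVES = 80
DATA, PARITY = 6, 2

groups = DRIVES // (DATA + PARITY)          # 10 RAID6 groups
usable_tb = groups * DATA * 1.0             # 1TB drives
print(f"{groups} groups, ~{usable_tb:.0f} TB usable before LVM/fs overhead")
# -> 10 groups, ~60 TB usable

Gotchas from living with that setup: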

- AIC RSC-3EG chassis are only 3Gb/s (12Gb/s across the x4 wide port), which is becoming a constraint with 16 drives per chassis (see the quick math after this list).

- AIC RSC-3EG is single-path (single expander) only, so you can't do multi-pathing even if you have SAS drives.

- AIC RSC-3EG chassis uses just three 120x38mm fans, which are easy to replace with quieter models to cut down on noise if the chassis is used for drives only.

- AIC RSC-3EG chassis requires a PSU modification to power up without a motherboard.

- Areca 1680 does not allow SMART data collection from the OS through to the drives.

- Areca 1680 does not appear to support >2.18TB drives

- Areca 1680 does not support drives with sector sizes >512 bytes (520/528 bytes for T10 DIF).

- Areca 1680 does not do any IOECC on reads, only on writes.

- Areca 1680 IOC caps out at ~800MB/s.

- power draw is large (~1100W+) when using non-low-power drives.

- using hardware controllers makes it hard to do good global sparing without greatly over-subscribing channel or controller (IOC) bandwidth.

- SATA drives hit a knee at about 40% utilization; in heavy I/O environments SAS seems to sustain workloads up to about 80% before hitting the same issues.
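Some back-of-the-envelope numbers on the bandwidth items above (my assumptions, not measurements: a x4 wide port at ~300MB/s usable per 3Gb/s lane after 8b/10b, ~100MB/s sequential per 1TB drive, and the 80 drives spread across the 3 Areca cards):

LANES = 4                   # x4 wide SAS port into each chassis
MBPS_PER_LANE = 300         # ~3Gb/s lane minus 8b/10b encoding overhead
DRIVES_PER_CHASSIS = 16
DRIVE_SEQ_MBPS = 100        # rough sequential rate for a 1TB 7200rpm drive

link = LANES * MBPS_PER_LANE                    # ~1200 MB/s into the chassis
print(f"{link} MB/s link -> {link / DRIVES_PER_CHASSIS:.0f} MB/s per drive "
      f"if all 16 stream (vs ~{DRIVE_SEQ_MBPS} MB/s native)")

IOC_CAP_MBPS = 800                              # observed Areca 1680 ceiling
drives_per_card = 80 // 3                       # ~27 drives per controller
print(f"{IOC_CAP_MBPS} MB/s IOC cap -> ~{IOC_CAP_MBPS / drives_per_card:.0f} MB/s "
      f"per drive across {drives_per_card} drives")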

I'm looking at replacing at least the drives shortly with 2TB Seagate SAS ones, as well as ditching the Arecas in favor of LSI 9200-8e HBAs and software RAID (ZFS or mdadm).

- I would like to use the ST2000DL003's for their power and cost savings, but I haven't found many real-world notes on them (TLER, vibration tolerance, whether they do IOECC on both reads and writes, native/logical sector sizes), and I'd only go that route if it doesn't put availability/reliability at risk. Failing that, the SAS drives are rated for high vibration, have TLER, and do IOECC on reads.

- the LSI 9200 cards support T10 DIF extensions: with SAS drives you can format them to 520-byte sectors, and the LSI card will read the extra 8 bytes per sector, which carry a checksum plus the LBA number and are verified on reads (flagging torn/wild reads and writes). This matters more if you're not using ZFS.

- the LSI card, not being a RAID card, will also allow multi-pathing (with RAID done in the OS) to avoid an availability failure if a controller card itself dies (assuming the chassis are also upgraded).

- re-design to have 1 controller per external chassis as the primary path, and one drive per chassis per RAID group (i.e. a 5-way RAID5 (4D+1P) would be 5 chassis with one drive each, so if a chassis fails, every array loses at most 1 drive but keeps functioning; see the sketch after this list).

- attach the second port on each LSI to a different chassis for multi-path failover (port 0 to chassis 0, port 1 to chassis 1, and so on).

- longer-term goal is replacing the AIC chassis with Supermicro ones; they are much louder but are dual-expander and multi-path capable, plus they allow more drives (28 in 3U as opposed to just 16).

- with UCE rates so high on very large drives, plus the above design of one drive per chassis per RAID group, I'm leaning toward 4D+2P arrays (6-drive RAID6/raidz2). This also comes into play for ZFS, as there are apparently optimization issues when the data-drive count isn't a power of two (2, 4, or 8 data drives).

- ZFS is looking interesting under Linux (the ZPL was just released GA by KQ but still has bugs, so it's probably not viable in the short term unless you're running FreeBSD or Solaris).
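To make the one-drive-per-chassis-per-group idea concrete, here's a toy sketch (hypothetical slot numbering, my own illustration) showing that losing an entire chassis costs each array at most one member:

# Toy layout: 5 chassis, 16 slots each, one 4D+1P group per slot index,
# so every group spans all 5 chassis with exactly one drive in each.
CHASSIS = 5
SLOTS_PER_CHASSIS = 16

groups = [[(chassis, slot) for chassis in range(CHASSIS)]
          for slot in range(SLOTS_PER_CHASSIS)]

# simulate losing an entire chassis and count members lost per group
failed = 2
worst = max(sum(1 for c, _ in g if c == failed) for g in groups)
print(f"{len(groups)} groups of {CHASSIS}; worst-case drives lost per group: {worst}")
# -> 16 groups of 5; worst-case drives lost per group: 1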


- Areca 1680 does not appear to support >2.18TB drives
I thought Areca said it did... does it not?

Not that controller cost is a huge thing these days...


It didn't when I tested (v1.48), and the answer I got back from Areca was that it was not supported on the 1680 cards. I don't know if it's something they can fix; it may be, since these cards are really SAS cards, so I would think it should just be a firmware change.

However, >2TB drives have another problem: the UBR rates and silent errors, which far outweigh other aspects in MTTF_DL calculations.
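To put rough numbers on that (a simplified model that treats the spec-sheet UBR as independent per bit read; real drives fail in clumpier ways):

# Chance of hitting at least one unrecoverable read while pulling data off
# the 5 surviving members of a 6-drive group of 2TB drives during a rebuild.
# Illustrative only: spec-sheet error rates, independence assumed.
def p_ure(drives, tb_per_drive, bit_error_rate):
    bits_read = drives * tb_per_drive * 1e12 * 8
    return 1 - (1 - bit_error_rate) ** bits_read

for label, ber in (("SATA, 1 in 1e14", 1e-14), ("SAS, 1 in 1e15", 1e-15)):
    print(f"{label}: {p_ure(5, 2, ber):.1%}")
# roughly 55% vs 8%, the gap that pushes me toward raidz2 and SAS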

I'm still working on getting a spec together for the next refresh here, which for financial reasons will probably just be 96 2TB drives in 6 chassis (4D+2P configuration) with ZFS. I'm testing ZFS on Linux (0.6.0-rc2), though I may do the initial build with zfs-fuse, especially if a stable release of LLNL's port isn't out in the next month. KQTech's port does work well, though it's an older file system version; since their ZPL layer is being merged back into the LLNL port, I want to stick with the main trunk on the assumption that more developers will be working on that than on the KQ one.
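Rough shape of that build, plus the base-2 data-drive point from my first post (assuming the default 128KiB ZFS recordsize; the sector math is illustrative):

# 96 x 2TB drives carved into 6-wide raidz2 (4D+2P) vdevs, and why a
# power-of-two data-drive count divides a record cleanly.
DRIVES, VDEV_WIDTH, DATA_DRIVES = 96, 6, 4
vdevs = DRIVES // VDEV_WIDTH                    # 16 vdevs
print(f"{vdevs} raidz2 vdevs, ~{vdevs * DATA_DRIVES * 2} TB of data capacity")

RECORDSIZE = 128 * 1024
for data in (4, 5):
    share = RECORDSIZE / data
    aligned = share % 4096 == 0
    print(f"{data} data drives: {share:.1f} bytes/drive per record, "
          f"{'a clean' if aligned else 'not a clean'} multiple of 4KiB")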

However, >2TB drives have another problem: the UBR rates and silent errors, which far outweigh other aspects in MTTF_DL calculations.
Agreed...

I've been debating upgrading from my ARC-1680i myself, and would want either 2TB or 3TB drives to go with the next upgrade, but I'm not sure what to do at the moment. Still waiting for hardware vendors (ESPECIALLY Areca) to catch up with their Hardware Compatibility Lists. Adaptec and 3ware/LSI are pretty good about testing 3TB drives for their HCLs... Areca, not so much.


For data integrity reasons there are pretty much only two options, and both are incompatible with consumer hardware RAID. The first is to acquire drives that support T10 DIF (fat sectors, i.e. 520/528-byte or 4160-byte) and an HBA card that also supports it (mainly the LSI 9200 series, though the support is very primitive). This is expensive, as the only drives that support it are SAS or FC drives (albeit you get some other benefits with them: the SAS/FC interface and queuing, better UBR (a factor of 10 at least), longer warranty period, et al.).
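For the curious, here's a toy sketch of what those 8 extra bytes per 520-byte sector carry under Type 1 protection as I understand it (a CRC-16 guard over the 512 data bytes, an application tag, and the low 32 bits of the LBA as a reference tag); the exact field usage here is my assumption, not vendor documentation:

import struct

def t10_crc16(data: bytes, poly: int = 0x8BB7) -> int:
    # bitwise CRC-16 with the T10 DIF polynomial, init 0, no reflection
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def protection_info(sector_data: bytes, lba: int) -> bytes:
    # the 8 bytes appended to a 512-byte sector to make a 520-byte sector:
    # guard tag (CRC of the data), application tag, reference tag (LBA)
    assert len(sector_data) == 512
    guard = t10_crc16(sector_data)       # catches corrupted/torn data on read-back
    app_tag = 0                          # unused in this sketch
    ref_tag = lba & 0xFFFFFFFF           # catches wild/misdirected reads and writes
    return struct.pack(">HHI", guard, app_tag, ref_tag)

print(protection_info(bytes(512), lba=123456).hex())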

Or you can go with a file system that checksums all data on both read and write requests (mainly ZFS and BTRFS, though BTRFS is pretty much not in play and probably won't be for several years). For ZFS the options are Solaris, zfsonlinux (LLNL/KQTech), ZFS on FreeBSD, and ZFS-fuse; all except the Solaris ones have various issues (performance, assurance, et al.).

With that being said, the LLNL/zfsonlinux port seems to have the most promise, mainly due to the DOE funding, though it still has a ways to go to prove its commitment.

This is pretty much where I'm heading right now; I'm mainly waiting to see when 0.6.0 stable comes out to fix some of the bugs. This is also a system where I have multiple tape backup sets (you should always have at least one). It will take a while for the port to perform the same as native Solaris, if it ever does (memory handling is completely different, for example). Another item is file fragmentation: ZFS would not be a good pick for, say, old-fashioned torrents or files that are edited heavily; that would need a defrag tool.

