poodel

RAID5 fileserver recommendations

My old file server is reaching its limits, so I'm building a new one. This time I'm going to do it properly. So far I've got a 19" Chieftec case big enough to fit 18 HDDs, plus some hot-swap cassettes, and now I'm looking at the storage. Having worked with storage in Sun environments for ~10 years, I'm pretty sure of what I want, but not as sure of how consumer-level hardware will perform.

What I'm aiming for is:

* RAID5 redundancy

* A single file system on the array. The 2TB limit shouldn't apply if I run a 64-bit Debian distro, right?

* Online capacity expansion... add a disk whenever I need to, and have the file system grow.

* Debian compatibility. Which filesystem do I use to make it growable on the fly?

* RAID1 snapshot of the root disk (in case some APT update screws something important up)

I'm not going to boot from the RAID array, so I'll need some smaller SATA disks as root disks.
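
For the RAID1 part I'm thinking of a plain md mirror over the two root disks, one I can split before a risky update and re-add afterwards. Roughly like this (device names are just examples):

Create the mirror:

# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

Before a risky APT run, detach one half as a fallback copy:

# mdadm /dev/md0 --fail /dev/sdb1
# mdadm /dev/md0 --remove /dev/sdb1

If the update went fine, re-add it and let it resync:

# mdadm /dev/md0 --add /dev/sdb1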

Disks

Bigger is better, so I'm thinking 750GB Seagates. Any reason not to?

RAID controller

I've narrowed it down to Promise EX8350 or Areca 1220, but here's where it gets tricky. Does anyone know how they work in Debian? I'm a bit worried about what's possible to do and what's convenient to do.

Get the 750GB ES drives (they're built for RAID) and the Areca, no question. Promise cards are 'fakeraid'; Areca makes the fastest SATA RAID cards out there.


Have you looked at the Infrant RAID NAS units? Here's a link to their Wiki site. That is, of course, if your interest is solely RAID 0-5 storage including hot swapping.

Phil.


If cost is a concern rather than pure capacity, I'd suggest getting something other than 750GB drives. They have the worst capacity-per-price ratio of any 7200RPM drive out there.

Also remember that if you choose RAID5 with X-RAID (Infrant is one company with this feature; no doubt others have it as well), you can upgrade one disk at a time without having to back up and restore. The sweet spot right now is the Seagate 320GB SATA II ST3320620AS at about US$75 each. But realise that until you have all four disks upgraded (to, say, 750GB once those become decent value per GB), you will be limited by the size of your smallest disk, at least if you want to retain RAID5. As you know, the useful data storage is the total capacity of all drives less one, i.e. 960GB for 4 x 320GB drives.


The ES drives are about $80 more apiece than the regular 750s, so what's the upside?


Depends on how reliable you want your system to be. The consumer/regular drives (I currently use 7200.8/9/10 400GB models) will often die or cause more problems when they encounter a few bad sectors, whereas the RAID-edition (ES) or WD RE/RE2 drives will just keep ticking along with thousands of bad sectors accumulated over time, because they are built to run in a RAID setting.


Yes, the cheaper disks win on price per GB, but then there's the RAID controller, the disk enclosures, the computer, the power usage... assuming I start with 4 x 750GB, the initial cost of the disks is going to be about 40% of the total server cost. Considering the whole system, GB per dollar would actually be worse if I went for the cheaper disks.


Also don't forget that the 32-bit LBA (2TiB) limit is a function of the RAID controller, not of the OS. If you're buying one new, though, it shouldn't be an issue at all.


True.

But the LBA limitation is not the only (nor even the most important) limitation. The filesystem may have limitations too, though anything reasonably modern won't. I'm a fan of XFS (wikipedia), but even ext2 (with 8k blocks) will scale to 32TB filesystems and 2TB files.

The biggest limitation in terms of using not-cutting-edge tech will be the partition table format. The standard partition formats limit you to 2TB as well, so you'll have to use something like GPT (wikipedia) or simply put a filesystem on unpartitioned space, which works just fine.
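
To make that concrete, a rough sketch (here /dev/sdX stands in for whatever device your array shows up as):

Label the array GPT, create one big partition, and put XFS on it:

# parted /dev/sdX mklabel gpt
# parted /dev/sdX mkpart primary xfs 0% 100%
# mkfs.xfs /dev/sdX1

Or skip partitioning entirely and put the filesystem straight on the device:

# mkfs.xfs /dev/sdX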


I second the use of XFS :) -- I use it for a 3.3TB FS.


If you've had experience with Sun environments, have you considered Solaris 10 x86 instead?

For serving up bits over the network, Debian isn't going to get you anything extra, and they both cost the same. But instead of agonizing over the best bang-per-buck hardware RAID card for Linux... you may get better data consistency, flexibility, and performance by just buying cheap PCI/PCIe/PCI-X cards and feeding the disks to ZFS:

http://en.wikipedia.org/wiki/Zfs
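
To give you an idea of how little there is to it, a raidz pool plus a filesystem is basically two commands (pool and disk names below are only examples):

# zpool create tank raidz c0t0d0 c0t1d0 c0t2d0
# zfs create tank/data

Growing later is done by adding another raidz set of disks to the pool:

# zpool add tank raidz c0t3d0 c0t4d0 c0t5d0

Note that you grow a pool by adding whole vdevs like that, not by adding single disks to an existing raidz.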

Yes, Linux supports a wider variety of IDE/SATA cards, but the Sol10 HCL gets longer every day, and there are many forums for Sol10/OpenSolaris/SolarisExpress full of people who can help you make the correct hardware choice.

I use Debian and Gentoo at home myself, but use Solaris at work, and my next fileserver will use ZFS.

Something to think about.

Happy New Year!


If you do decide to use ZFS, let us all know what hardware it's running on, how fast it is, how easy it is to use, etc. It's a fairly new filesystem, so there isn't much collective wisdom out there about it yet.


Hm... interesting idea. I'll have to read up on how ZFS performs (and I'd have to keep my Debian box on the side). Thanks.


With all of this talk, I am also building another fileserver; not because I have outgrown my current one, but because I have become sick of the slow speeds of the PCI bus. ZFS, in my opinion, is probably one of the best filesystems currently in existence; however, it is new and not yet proven over time. I currently use XFS and am satisfied with it.

Instead of purchasing a $1500 RAID controller, I am going to use the onboard SATA ports plus multiple PCIe x1 cards with dual SATA ports. I'm not sure whether I want to use RAID5 or RAID10 yet; either way, this will give me the speed the drives can push. My current configuration is as follows:

/dev/md3:
	Version : 00.90.03
 Creation Time : Fri Jul  7 18:52:29 2006
 Raid Level : raid5
 Array Size : 3516378624 (3353.48 GiB 3600.77 GB)
Device Size : 390708736 (372.61 GiB 400.09 GB)
  Raid Devices : 10
 Total Devices : 10
Preferred Minor : 3
Persistence : Superblock is persistent

Update Time : Wed Jan  3 05:32:03 2007
	  State : active
Active Devices : 10
Working Devices : 10
Failed Devices : 0
 Spare Devices : 0

	 Layout : left-symmetric
 Chunk Size : 512K

	   UUID : 6b8f95e6:23e17793:9107a4ba:c2732883
	 Events : 0.6224664

Number   Major   Minor   RaidDevice State
   0	   3		1		0	  active sync   /dev/hda1 *seagate/400
   1	  57		1		1	  active sync   /dev/hdk1 *seagate/400
   2	  34		1		2	  active sync   /dev/hdg1 *seagate/400
   3	  33		1		3	  active sync   /dev/hde1 *seagate/400
   4	  56		1		4	  active sync   /dev/hdi1 *seagate/400
   5	   8	   81		5	  active sync   /dev/sdf1 *seagate/400
   6	   8	   97		6	  active sync   /dev/sdg1 *seagate/400
   7	   8	   33		7	  active sync   /dev/sdc1 * wd/400
   8	   8	   49		8	  active sync   /dev/sdd1 * wd/400
   9	   8	   65		9	  active sync   /dev/sde1 * seagate/400

As you can see, it's a mishmash of IDE and SATA, WD and Seagate. For the new case I am contemplating whether I should get all the exact same model of drive or do something else.

So far, though, Linux software RAID has been nothing but awesome. I started out with 1.8TB, 'grew' the RAID5 from there, and then used xfs_growfs to grow the filesystem.

Like this (I kept the logs when I did this):

Step #1: Growing the RAID

First, you add a spare to the RAID5 pool.
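
(The add command itself didn't make it into the log below; it's just a one-liner, /dev/hdc1 being the new disk here:)

# mdadm /dev/md3 --add /dev/hdc1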

box:~# df -h | grep /raid5
/dev/md3			  746G   80M  746G   1% /raid5
box:~# umount /dev/md3
box:~# mdadm -D /dev/md3
/dev/md3:
	Version : 00.90.03
 Creation Time : Fri Jul  7 15:44:24 2006
 Raid Level : raid5
 Array Size : 781417472 (745.22 GiB 800.17 GB)
Device Size : 390708736 (372.61 GiB 400.09 GB)
  Raid Devices : 3
 Total Devices : 4
Preferred Minor : 3
Persistence : Superblock is persistent

Update Time : Fri Jul  7 18:25:29 2006
	  State : clean
Active Devices : 3
Working Devices : 4
Failed Devices : 0
 Spare Devices : 1

	 Layout : left-symmetric
 Chunk Size : 64K

	   UUID : cf7a7488:64c04921:b8dfe47c:6c785fa1
	 Events : 0.26

Number   Major   Minor   RaidDevice State
   0	   3		1		0	  active sync   /dev/hda1
   1	  33		1		1	  active sync   /dev/hde1
   2	   8	   33		2	  active sync   /dev/sdc1

   3	  22		1		-	  spare   /dev/hdc1


Then you "grow" the RAID5.

box:~# mdadm /dev/md3 --grow --raid-disks=4
mdadm: Need to backup 384K of critical section..
mdadm: ... critical section passed.

Then you check the status:

box:~# cat /proc/mdstat 
Personalities : [raid1] [raid5] [raid4]
md1 : active raid1 sdb2[1] sda2[0]
  136448 blocks [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[0]
  70268224 blocks [2/2] [UU]

md3 : active raid5 hdc1[3] sdc1[2] hde1[1] hda1[0]
  781417472 blocks super 0.91 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
  [>....................]  reshape =  0.0% (85120/390708736) finish=840.5min speed=7738K/sec

md0 : active raid1 sdb1[1] sda1[0]
  2200768 blocks [2/2] [UU]

Then wait a while; when it's done, you can grow the filesystem...

Step #2: Growing the filesystem

Growing the XFS filesystem is a breeze:

# xfs_growfs /raid5

box:~# df -h | egrep '(^Filesystem|/dev/md3)'
Filesystem			Size  Used Avail Use% Mounted on
/dev/md3			  2.6T  932G  1.7T  36% /raid5
box:~# xfs_growfs /raid5
meta-data=/dev/md3			   isize=256	agcount=38, agsize=18314368 blks
	 =					   sectsz=4096  attr=0
data	 =					   bsize=4096   blocks=683740288, imaxpct=25
	 =					   sunit=128	swidth=768 blks, unwritten=1
naming   =version 2			  bsize=4096  
log	  =internal			   bsize=4096   blocks=32768, version=2
	 =					   sectsz=4096  sunit=1 blks
realtime =none				   extsz=3145728 blocks=0, rtextents=0
data blocks changed from 683740288 to 781417472
box:~# df -h | egrep '(^Filesystem|/dev/md3)'
Filesystem			Size  Used Avail Use% Mounted on
/dev/md3			  3.0T  932G  2.1T  32% /raid5
box:~#

PROS:

1) RAID5 (don't need to worry about a drive dying)

2) Only 5-15% CPU utilization under heavy I/O. Here the dd is doing 40-120MB/s and the RAID5 process is only using 12% of the CPU (an old 3.4GHz Pentium 4 Prescott):

  PID USER	  PR  NI  VIRT  RES  SHR S %CPU %MEM	TIME+  COMMAND		   
20105 bob	   18   0  2008  540  440 D   36  0.1   0:05.68 dd				 
 381 root	  10  -5	 0	0	0 S   12  0.0  99:08.57 md3_raid5

3) I can monitor all drives via smartctl (smartmontools). Yes, 3ware allows a pass-through to get to the drives, but many other RAID cards do not. This also means I can monitor temperatures very easily as well:

$ ctemp
/dev/hda: ST3400832A: 35°C
/dev/hde: ST3400832A: 34°C
/dev/hdg: ST3400832A: 34°C
/dev/hdi: ST3400832A: 33°C
/dev/hdk: ST3400633A: 36°C
/dev/sda: WDC WD740GD-00FLC0: 30°C
/dev/sdb: WDC WD740GD-00FLC0: 31°C
/dev/sdc: ST3400633AS: 35°C
/dev/sdd: ST3400620AS: 37°C
/dev/sde: ST3400633AS: 36°C
/dev/sdf: WDC WD4000KD-00NAB0: 33°C
/dev/sdg: WDC WD4000KD-00NAB0: 30°C
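
(ctemp is nothing fancy, just a little wrapper of my own around smartctl; a sketch of the idea, with the drive globs as examples:)

#!/bin/sh
# print model and temperature for each IDE/SATA drive using smartmontools
for d in /dev/hd? /dev/sd?; do
	[ -e "$d" ] || continue
	model=$(smartctl -i "$d" | awk -F': *' '/^Device Model/ {print $2}')
	temp=$(smartctl -A "$d" | awk '/Temperature_Celsius/ {print $10}')
	echo "$d: $model: ${temp}°C"
done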

4) I get between 100-133MB/s read from the array, which is nice.

CONS:

1) PCI bus is limited to 133MB/s.

2) Even though I use SATA drives on the motherboard, I believe they are also on the PCI bus as PCI-express was not out when my motherboard was created.

3) Write speed is 38-40MB/s sustained. Again, I believe this is because of the PCI bus, since it has to calculate and write the parity and then the data.

4) The current case setup is a nightmare, which is why I ordered the Cooler Master Stacker. The entire case had to be modded to put fans where they did not belong, and the cables are everywhere. Part of the problem is that some drives are IDE and some are SATA (IDE cables, even the round ones, take up a lot of room). The Antec TruePower 550W handles the drives with no issues: at bootup it hits 500-520 watts, and after the drives have spun up it uses 220-280 watts.

Pictures of setup (below):

The two raptors are at the very top, followed by the two WD 400s and below that the rest are Seagate IDE+SATA.

Amazingly, with about 10-12 fans in the box, everything stays very cool.

Front of the case. I disconnected the temperature control in the front because it added another 3-6 power and fan-control cables to the case, and as you can see, I have enough of those!

case1-small.jpg

The side of the case; yes, it's a mess.

case2-small.jpg

Plan:

Build new machine.

New drives (possibly).

Use the Cooler Master Stacker.

Hopefully have a lot less mess!

Justin.


My 2 cents:

1: Linux DMraid is great for archival fileservers where performance is not critical.

2: I'd like the original poster to check out 3ware's offerings before deciding on a controller. I swear by them.

Thank you for your time,

Frank Russo


jpiszcz, excellent HOWTO. Thanks. I didn't know that Linux RAID 5 arrays were growable yet! Is there a specific kernel version that you need to support it? Do you know if RAID 6 arrays are growable?

Thanks!


I /think/ 2.6.17 introduced support for growable RAID5 arrays. RAID6 is not growable AFAIK, just RAID5.


2.6.16 cleaned up a bunch of RAID code, but you're correct that 2.6.17 introduced the code and interfaces for growing RAID5 arrays. Later kernels include various fixes to the RAID5 code, both specifically related to growing and not, so I wouldn't recommend running the bare minimum 2.6.17. 2.6.18 merges the RAID4/5/6 code, though it's not apparent from the changelog whether that means RAID6 arrays are growable or not. RAID5->RAID6 migration is not currently possible, but it's a near-term feature and the code is actively moving in that direction. I tried to find the changelog entry where someone (Andrew Morton?) commented that enough successful reports of RAID5 growing had come in that he's comfortable it's pretty stable, but so far I haven't found it.

2.6.19 includes yet more md fixes; 2.6.19.1 does not, but it's the latest stable kernel, so that's probably your best bet. 2.6.20-rc[1-3] don't include many changes to md code.


OK, I've now spent the better part of the day playing around with ZFS on a 4-disk SunFire V240. It really is an incredibly flexible filesystem. I'm not yet convinced that I will get it running on some decent x86 hardware, though: it installed properly on my old AMD 3000XP box, but my more recent Core 2 Duo system was less convincing.

I did some basic write comparisons between a 3-disk RAID-0 and a 3-disk RAID-Z with some interesting results:

# dd if=/dev/zero of=/kalle/testfs/file.out bs=1024 count=1000000

raidz write 26MBps

raid0 write 36MBps

# dd if=/dev/zero of=/kalle/testfs/file.out bs=2048 count=1000000

raidz write 59MBps

raid0 write 51MBps

# dd if=/dev/zero of=/kalle/testfs/file.out bs=4096 count=1000000

raidz write 82MBps

raid0 write 56MBps
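
(I only measured writes so far; the read side is just dd in the other direction, something like

# dd if=/kalle/testfs/file.out of=/dev/null bs=4096

with a test file bigger than RAM so the cache doesn't flatter the numbers.)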

Write performance in RAID-Z doesn't seem to be a problem at all.


What problems did you have with the C2D system? My new storage server will most likely be C2D-based.


Nothing definitive yet, but my P5W DH had some issues at install time... it didn't detect the onboard network cards and some of the SATA controllers, etc. There's nothing recent in Sun's official HW compatibility list either, so it's off to the forums for me.


Are you using Solaris 10 or OpenSolaris?


Regarding the RAID-Z write numbers: one of the Sun blogs has some figures that you'll probably find informative. Basically, RAID-Z trades small random reads for everything else: small random read performance will not scale with the number of drives in a RAID-Z. The trade-off is that it's faster at everything else; in particular, it writes faster than a mirror or RAID-10 while having RAID5-like read performance.

