superlgn

3ware 9650SE Controller Resets Under Load on Linux


I've been pretty happy with my 3ware 9650se at home, but every once in a while the Linux kernel ends up resetting the controller, often when it's under significant load, which causes my system to hang until it reinitializes:

Oct 31 09:15:16 jam kernel: [598376.988513] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x28) timed out, resetting card.

Oct 31 09:15:16 jam kernel: [598415.260507] 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.

Nov 7 10:33:00 jam kernel: [1211490.000024] sd 0:0:1:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.

Aug 6 07:55:05 jam kernel: [42445.988058] sd 0:0:0:1: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.

Aug 6 07:55:46 jam kernel: [42486.928048] 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.

Aug 20 17:23:04 jam kernel: [11827.976071] sd 0:0:0:1: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting card.

Aug 20 17:23:44 jam kernel: [11867.180041] 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.

Aug 20 19:52:40 jam kernel: [20803.964079] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.

Aug 20 19:53:02 jam kernel: [20825.620069] 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.

Aug 23 10:14:44 jam kernel: [222339.988070] sd 0:0:0:1: WARNING: (0x06:0x002C): Command (0x88) timed out, resetting card.

Aug 23 10:14:44 jam kernel: [222366.132048] 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.

Aug 23 10:21:59 jam kernel: [222811.976077] sd 0:0:0:1: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting card.

Aug 23 10:22:21 jam kernel: [222833.296054] 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.

I think this is most of 'em, but I may have neglected to record one or two. October & November are from 2009 and August is 2010. It's been happening pretty much since the beginning and usually occurs following a change in the raid and a large amount of disk activity (rebuilding the array in a different configuration and/or recreating filesystems, then restoring a few TB of data). Most of these SCSI commands (0x2a, 0x8a, etc.) are write commands, and there's a read or two mixed in as well. Usually there's both reads and writes happening at the same time. I'm pretty sure the very first time happened while I was unraring a large file (8GB+), and one of the more recent ones happened during a backup to an older disk after a rebuild/restore. The last instance happened while the system was under very little load (just some assorted disk activity from slower downloads and other system services), actually while I was documenting the previous reset. :P Another one (Aug 6?) occurred while I was asleep and the system was doing next to nothing. Mostly it's under high load though.

The only time I didn't see this was when I reconfigured my array from a 3 disk raid 5 to a 4 disk raid 10 (December-July). I know raid 10 is a bit simpler, so I can't exclude the possibility that there's just too much going on for the controller to handle, but with 16 and 24 port versions of the 9650se (plus configurations using expanders) it seems a bit unlikely. A raid10 at home didn't sit particularly well with me since I have limited means and I could make use of the additional storage, so I went to a 4 disk raid 5 in July. Now I'm using an 8 disk raid6.

The instances from October and November 2009 were with a 9650se-4lpml and everything in the last week has been with a 9650se-12ml so I know it's not a faulty controller. Every time this happens I usually go through the same rigmarole... Google for the error and read the same articles I've no doubt read the last time. :P

I no longer remember if the first instance occurred before or after my initial scheduler/queue depth/readahead tuning, so I can't discount that as a possibility. 3ware's default queue_depth on Linux is 254, which is fairly large: roughly double the default nr_requests of 128, which from what I've read is not a good combination. 3ware suggests raising nr_requests to 512, but I've found a lower queue_depth makes the system more responsive under load (and I'm not alone in that regard).

After the recent spate of controller resets I disabled smartd in case there was some strange firmware passthrough issue that could have occurred while smartd was polling during high load, but it happened again, so I'm glad to say that's not the issue; I wouldn't be too keen on running without smartd.

Until the other day I was using something a bit closer to 3ware's suggested tuning: deadline for the scheduler, a readahead of 2048 (they suggest 16384), and a queue depth of 16. Now I've gone with defaults except for a queue depth of 31 (cfq for the scheduler, default readahead of 256). I've also pulled my logbufs=8,logbsize=256k XFS optimizations.
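For anyone wanting to poke at the same knobs, this is roughly how I set and check them; just a sketch, assuming the 3ware unit shows up as /dev/sda (adjust the device name and values to taste):

# block layer tuning for the 3ware unit (assuming it's /dev/sda)
echo cfq > /sys/block/sda/queue/scheduler        # or deadline, per 3ware's tuning guide
blockdev --setra 256 /dev/sda                    # readahead, in 512-byte sectors
echo 31 > /sys/block/sda/device/queue_depth      # driver default is 254

# current values, for reference
cat /sys/block/sda/queue/scheduler
blockdev --getra /dev/sda
cat /sys/block/sda/device/queue_depth
cat /sys/block/sda/queue/nr_requests             # default 128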

If this doesn't work, the next items on my list are storsave=protect (currently using balance), storsave=perform, perhaps playing with the block scheduler and other related tunables again, then some individual drive settings on the controller like write cache, ncq, and phy link speed. storsave=perform is a bit of a concern since there's no write journal happening with that setting and it renders my $100 BBU completely useless, and as you can see from the error messages it is synchronizing the cache following the reset. Who knows what data could be lost. storsave=protect will reduce performance a bit compared to balance, but not significantly enough that it'd bother me; bonnie++ and dd still show a solid 300MB/sec of writes with protect. I'm more concerned with having solid storage at the moment. But hey, if storsave=perform fixed it, I've got a UPS, so all I'd have to worry about are lockups and faster writes. :P
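For reference, storsave is just a per-unit tw_cli setting, so switching between the policies is quick; a sketch, assuming the array is /c0/u0 like mine (double-check the exact syntax against tw_cli's help):

tw_cli /c0/u0 show storsave                # current policy
tw_cli /c0/u0 set storsave=protect         # or balance / perform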

I've got a handful of servers at work running 9650se and one with a 9550sxu and none of them have had controller resets. The loads on the machines are different though and I know it sounds weird, but they're not even close to what I'm putting my home machine through. The servers at work are using supported motherboards (Supermicro) and are running CentOS 5 with an older kernel (2.6.18). My motherboard is another possibility, but so far the problems have happened on both my MSI P35 Platinum and now my Gigabyte GA-MA785G-UD3H.

I'm using the latest firmware for the 9650se (4.10.00.007 from the 9.5.3 codeset) and a fairly current driver (v2.26.02.012 included with Linux 2.6.32 on Debian Squeeze). All 9650se servers at work are running the latest firmware as well, and the tuning generally mirrors my home machine.

This would be a lot easier if I had some way to reliably reproduce the problem. I could sit here with bonnie++ all day and go without issue, then I'll load up Firefox while I'm benchmarking/stress testing and all of a sudden it hits. At the rate I'm going it'll take me a good 24 months to get through my checklist.

Anyone have suggestions? It's driving me nuts.



Hi,

See my post:

LSI just got back to me on a second case, their latest beta firmware:

http://www.lsi.com/DistributionSystem/AssetDocument/readme.txt

SCR 2196: Unexpected controller soft resets

Fixed an issue with regards to deferral of write and read commands to help eliminate unexpected soft resets.

Well well!

Plan of action:

1. See if the error recurs without 'nobarrier' - only because there are some threads about problems when this is enabled on 3ware controllers.

2. LSI recommended I disconnect/replace my BBU module. I already replaced it once before (~$120), so instead the next step is to apply their latest Beta firmware.

3. If it STILL does it, I can increase the timeout. HOWEVER, when I see this, all I/O on the system locks up for 60/120/180/360 seconds and then it comes back after it resets the controller. The weird thing is this controller is only for a /data volume and not the root filesystem. The root filesystem is on a separate RAID-1 4 port card (also w/BBU).

If you google that exact error, there is very little information. If I find something that works, I'll update the list so others have something to point to when they see this problem.

Justin.



Hi jpiszcz. I think I've read many of your posts on XFS in the past.

LSI just got back to me on a second case, their latest beta firmware:

http://www.lsi.com/DistributionSystem/AssetDocument/readme.txt

SCR 2196: Unexpected controller soft resets

Fixed an issue with regards to deferral of write and read commands to help eliminate unexpected soft resets.

Ah, great. I guess I didn't notice the 'Beta' radio button on their site or I never figured anything would be there. It's nice to know they're still working on 9650se firmware. I was afraid they'd stop now that the 9750 is out.

1. See if the error recurs without 'nobarrier' - only because there are some threads about problems when this is enabled on 3ware controllers.

I do run with nobarrier on my XFS filesystems. One is built on top of the device mapper, which can't have barriers enabled, and the other is my main filesystem, which has most of the activity. My smaller ext3 root uses the defaults (so barrier=0). There aren't a whole lot of writes happening there, mainly just some logging to /var. I'll add barriers to my checklist and see which filesystems I can enable them on if/when I get there.
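In case it helps anyone else following along, the toggle is just a mount option; a rough fstab sketch with placeholder devices (barriers are the XFS default, so it's really a matter of dropping nobarrier):

# what I've been running (barriers off, plus the old log tuning)
/dev/sda3   /data   xfs    nobarrier,logbufs=8,logbsize=256k   0  2
# with barriers back on, just drop nobarrier (it's the default):
/dev/sda3   /data   xfs    defaults                            0  2
# ext3 root with barriers explicitly enabled:
/dev/sda1   /       ext3   defaults,barrier=1                  0  1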

2. LSI recommended I disconnect/replace my BBU module. I already replaced it once before (~$120), so instead the next step is to apply their latest Beta firmware.

Hm. If the battery has something to do with it, I think I'd rather just pull it temporarily or /c0/bbu disable vs buying a new one, at least for now anyway.

3. If it STILL does it, I can increase the timeout. HOWEVER, when I see this, all I/O on the system locks up for 60/120/180/360 seconds and then it comes back after it resets the controller. The weird thing is this controller is only for a /data volume and not the root filesystem. The root filesystem is on a separate RAID-1 4 port card (also w/BBU).

Hm, so you lock up even with the hang happening on the other controller? I wonder if it's because of the mounted filesystem. Occasional NFS hangs at work don't lock up the machine, just the apps needing access to the fs. I guess the timeout value you're talking about is in the 3ware driver (3w-9xxx.h). 60 seconds seems pretty long.. You'd think that if it's not able to complete the command by that time it won't be able to at all.

If you google that exact error, there is very little information. If I find something that works, I'll update the list so others have something to point to when they see this problem.

Some of the pages I come across with these errors look like they have other problems going on (disk, driver, or filesystem errors), but not a lot show just the hang.

Have you ever noticed your performance drop after this happens? Some of my last resets happened while I was doing some additional tuning and my performance with dd and bonnie++ dropped from 300MB/sec to about 100. As far as I can tell, all of the kernel tunables are still there and the 3ware controller's settings don't change.

Thanks for the info. I probably wouldn't have come across your post from the 27th until the next time around. I'll run a backup and give the beta firmware a try.


Hi,

I believe I just found the root cause of the problem I was having; note our problems are similar, but I am not sure they are the same:

Hi,

Current theory:

Mobo: Intel DP55KG

I had this:

#append="3w-9xxx.use_msi=1 reboot=a snd_hda_intel.enable_msi=1"

Now I am no longer using MSI for the 3ware driver, and so far the 'lag' / controller reset problem has not occurred:

append="reboot=a snd_hda_intel.enable_msi=1"

I wish both cards did not use the same IRQ; however, if this solves the problem then I'll be happy.

From /proc/interrupts:

 16:      44899          0          0          0          0          0      12418          0   IO-APIC-fasteoi   3w-9xxx, 3w-9xxx, ehci_hcd:usb1

So far I'm typing this e-mail on the client and I do not see any lag at this time, so it appears this has fixed my problem. I'll give it a few more hours/days before I consider it 'fixed', but after changing all of the settings in 3dm2 and replacing/removing the BBUs and noticing no change, it appears to be an interrupt problem when the 3ware cards are used with MSI on my motherboard.

Are there cases in which MSI should/should not be used? Generally I have it on for all of my devices but perhaps the 3ware cards don't play well with it on the Intel motherboard?

I am guessing the 'lag' issue causes the controller to reset when there is too much I/O launched via cron (backups and such). We'll see if it happens again. I did briefly try the newest Beta firmware as LSI recommended; it made no difference in terms of the lag issue.

I am still running with 'nobarrier' removed and have not gotten any controller resets yet, but I had the lag issue as mentioned above until I removed the MSI option from the 3w-9xxx driver.

Before removal of MSI option:

Drive Performance Monitor Configuration for /c1 ...

Performance Monitor: ON

Version: 1

Max commands for averaging: 100

Max latency commands to save: 10

Requested data: Instantaneous Drive Statistics

                           Queue            Xfer        Resp
Port   Status        Unit  Depth   IOPs     Rate(MB/s)  Time(ms)
------------------------------------------------------------------------
p0     OK            u0    1       123      1.202       730148
p1     OK            u0    1       123      1.203       644249
p2     OK            u1    1       7        0.000       132
p3     NOT-PRESENT   -     -       -        -           -

After removal:

Drive Performance Monitor Configuration for /c1 ...

Performance Monitor: ON

Version: 1

Max commands for averaging: 100

Max latency commands to save: 10

Requested data: Instantaneous Drive Statistics

                           Queue            Xfer        Resp
Port   Status        Unit  Depth   IOPs     Rate(MB/s)  Time(ms)
------------------------------------------------------------------------
p0     OK            u0    1       53       3.124       85900
p1     OK            u0    1       53       3.124       85900
p2     OK            u1    0       0        0.000       0
p3     NOT-PRESENT   -     -       -        -           -

..

This is what the problem looks like (sdb=root) as it happened twice (before removal of the MSI option; I've not been able to reproduce it with the MSI option disabled).

$ (output from dstat)

-dsk/total----dsk/sda-----dsk/sdb--

read writ: read writ: read writ

0 6884k: 0 0 : 0 6884k

0 4580k: 0 0 : 0 4580k

0 4728k: 0 0 : 0 4728k

0 3936k: 0 0 : 0 3936k

0 5764k: 0 0 : 0 5764k

0 1292k: 0 0 : 0 1292k

0 7760k: 0 0 : 0 7760k

0 5480k: 0 0 : 0 5480k

0 7408k: 0 0 : 0 7408k

0 5040k: 0 0 : 0 5040k

0 6236k: 0 0 : 0 6236k

0 788k: 0 0 : 0 788k

0 0 : 0 0 : 0 0

0 2900k: 0 0 : 0 2900k

0 8160k: 0 0 : 0 8160k

4096B 7916k: 0 0 :4096B 7916k

0 7812k: 0 0 : 0 7812k

0 4804k: 0 0 : 0 4804k

148k 7268k: 0 0 : 148k 7268k

0 888k: 0 0 : 0 888k

0 7688k: 0 0 : 0 7688k

0 7644k: 0 0 : 0 7644k

0 6352k: 0 0 : 0 6352k

-dsk/total----dsk/sda-----dsk/sdb--

read writ: read writ: read writ

0 5324k: 0 0 : 0 5324k

0 1432k: 0 0 : 0 1432k

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

--dsk/total----dsk/sda-----dsk/sdb--

read writ: read writ: read writ

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0 dddddddddd

0 8700k: 0 0 : 0 8700k

0 2076k: 0 0 : 0 2076k

0 5428k: 0 0 : 0 5428k

0 7984k: 0 0 : 0 7984k

0 8236k: 0 0 : 0 8236k

0 2536k: 0 0 : 0 2536k

0 232k: 0 0 : 0 232k

0 268k: 0 0 : 0 268k

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 508k: 0 0 : 0 508k

0 72k: 0 0 : 0 72k

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

4096B 232k: 0 0 :4096B 232k

344k 2380k: 0 0 : 344k 2380k

0 1172k: 0 0 : 0 1172k

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0

0 0 : 0 0 : 0 0 ^C^C^C

$


Hi jpiszcz. I think I've read many of your posts on XFS in the past.

..

Hi,

I have just tried everything:

1. disable bbus

2. remove bbus

3. drop temps by 10c to ~25c w/large fan

4. try beta firmware

5. try messing around with various settings in /proc

6. I disabled my ohwraid optimization script (for now) and the problem persisted until I stopped using MSI

What motherboard are you using?

Are you using the 'enable_msi' option?

Other things to look at if you search around are acpi=off and noapic, which sometimes fix problems with some of the server motherboards.

What are your system details?


Also,

Is this your post?

http://lkml.org/lkml/2006/8/28/331

Adam also replied to my problems on linux-scsi, so you may want to ping him and linux-scsi if none of the settings (APIC/MSI/etc) solve the problem.

I'll re-read your thread, but a quick search does not bring back too many results for the errors you are getting from your card.

Justin.


One other thing I found:

http://forums.debian.net/viewtopic.php?f=7&t=36800

More similar to your problem:

sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x28) timed out, resetting card.

3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.

What he did:

Since this, I've lowered the queue depth to 16 (echo 16 > /sys/block/$i/device/queue_depth). The lower queue length made the system a bit more responsive during heavy I/O and these problems have not yet recurred.

What is your queue_depth set to?


What motherboard are you using?

Gigabyte GA-MA785G-UD3H. Before that I was using the controller on an MSI P35 Platinum. The problems have happened on both.

Are you using the 'enable_msi' option?

No, and I've seen 3w-9xxx.enable_msi in use when other people had similar hangs, so that's one thing I've avoided.
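For reference (and in case I ever do test it), the parameter in Justin's append line is 3w-9xxx.use_msi; a sketch of how I'd toggle and check it on Debian:

# turn MSI on for the 3ware driver (module parameter; off is the default)
echo "options 3w-9xxx use_msi=1" > /etc/modprobe.d/3w-9xxx.conf
# (as root; may need update-initramfs -u if the module loads from the initramfs)
# or on the kernel command line if the driver is built in: 3w-9xxx.use_msi=1

# after a reboot, see whether the card is actually on MSI
grep 3w-9xxx /proc/interrupts    # MSI lines show as PCI-MSI-edge instead of IO-APIC-fasteoi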

Other things to look at if you search around are acpi=off and noapic, which sometimes fix problems with some of the server motherboards.

I'll add that to the list.

What are your system details?

My previous system (turned into a gaming system) is an Intel e8400 + MSI P35 Platinum + Corsair HX520W PSU. My current system is an AMD Phenom II X2 550BE + GA-MA785G-UD3H + Corsair HX620W PSU. I'm using Debian Squeeze with 2.6.32-5 (bigmem) and no special kernel tweaks or params. I have two 3ware 9650se controllers, one a 4lpml (no longer in use) and the other a 12ml. The 12ml is driving 8 hard drives in a raid6: 4 are WD1002FBYS and the other 4 are WD1001FALS (with TLER enabled). I started with the 4*WD1001FALS drives and later migrated to the 4*WD1002FBYS. I found the 12ml on eBay for $300 and decided to make use of all of the drives. The hangs happened on both and later when combined. The drives connect to two 5-disk iStarUSA hot-swap backplanes, 4 drives in each for the 3ware array and the extra tray for backup drives. My gaming system has a 4-bay version, which was originally in use when I had my 3 disk mdadm raid5. That configuration was slow, but solid. No dropoffs or other apparent problems in either configuration.

Here's my 3ware config:

/c0 Driver Version = 2.26.02.012
/c0 Model = 9650SE-12ML
/c0 Available Memory = 224MB
/c0 Firmware Version = FE9X 4.10.00.016
/c0 Bios Version = BE9X 4.08.00.002
/c0 Boot Loader Version = BL9X 3.05.00.002
/c0 PCB Version = Rev 032
/c0 PCHIP Version = 2.00
/c0 ACHIP Version = 1.90
/c0 Number of Ports = 12
/c0 Number of Drives = 8
/c0 Number of Units = 1
/c0 Total Optimal Units = 1
/c0 Not Optimal Units = 0 
/c0 JBOD Export Policy = off
/c0 Disk Spinup Policy = 1
/c0 Spinup Stagger Time Policy (sec) = 1
/c0 Auto-Carving Policy = off
/c0 Auto-Carving Size = 2048 GB
/c0 Auto-Rebuild Policy = on
/c0 Rebuild Mode = Adaptive
/c0 Rebuild Rate = 3
/c0 Verify Mode = Adaptive
/c0 Verify Rate = 3
/c0 Controller Bus Type = PCIe
/c0 Controller Bus Width = 4 lanes
/c0 Controller Bus Speed = 2.5 Gbps/lane

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    OK             -       -       256K    5587.88   RiW    OFF    


VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   931.51 GB SATA  0   -            WDC WD1002FBYS-01A6 
p1    OK             u0   931.51 GB SATA  1   -            WDC WD1002FBYS-02A6 
p2    OK             u0   931.51 GB SATA  2   -            WDC WD1002FBYS-02A6 
p3    OK             u0   931.51 GB SATA  3   -            WDC WD1002FBYS-02A6 
p4    OK             u0   931.51 GB SATA  4   -            WDC WD1001FALS-00J7 
p5    OK             u0   931.51 GB SATA  5   -            WDC WD1001FALS-00J7 
p6    OK             u0   931.51 GB SATA  6   -            WDC WD1001FALS-00J7 
p7    OK             u0   931.51 GB SATA  7   -            WDC WD1001FALS-00J7 

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       255    07-Aug-2010  

/c0/u0 status = OK
/c0/u0 is not rebuilding, its current state is OK
/c0/u0 is not verifying, its current state is OK
/c0/u0 is initialized.
/c0/u0 Write Cache = on
/c0/u0 Read Cache = Intelligent
/c0/u0 volume(s) = 2
/c0/u0 name = system               
/c0/u0 Ignore ECC policy = off       
/c0/u0 Auto Verify Policy = off       
/c0/u0 Storsave Policy = balance     
/c0/u0 Command Queuing Policy = on        
/c0/u0 Rapid RAID Recovery setting = all
/c0/u0 Parity Number = 2         

Unit     UnitType  Status         %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u0       RAID-6    OK             -       -       -     256K    5587.88   
u0-0     DISK      OK             -       -       p0    -       931.312   
u0-1     DISK      OK             -       -       p1    -       931.312   
u0-2     DISK      OK             -       -       p2    -       931.312   
u0-3     DISK      OK             -       -       p3    -       931.312   
u0-4     DISK      OK             -       -       p4    -       931.312   
u0-5     DISK      OK             -       -       p5    -       931.312   
u0-6     DISK      OK             -       -       p6    -       931.312   
u0-7     DISK      OK             -       -       p7    -       931.312   
u0/v0    Volume    -              -       -       -     -       25        
u0/v1    Volume    -              -       -       -     -       5562.88   

No, that's not me.

Adam also replied to my problems on linux-scsi, so you may want to ping him and linux-scsi if none of the settings (APIC/MSI/etc) solve the problem.

I'm not sure that these are all Linux issues though.. I've read about similar problems on FreeBSD.

http://forums.debian.net/viewtopic.php?f=7&t=36800

More similar to your problem:

sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x28) timed out, resetting card.

3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.

That's me and those are some of the first hangs I mentioned in my post here.

What is your queue_depth set to?

Currently 31, with cfq as the scheduler. I've read about nr_requests vs queue_depth before and everything I've seen (1, 2) says that queue_depth should be half of nr_requests. 3ware defaults to TW_Q_LENGTH(256)-2, but the default nr_requests is only 128, so nr_requests should be doubled or queue_depth lowered, and lowering the queue depth makes the system a bit more responsive for me than a higher nr_requests.
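Put another way, the idea is to keep nr_requests at roughly double the device queue, from either direction; a sketch, again assuming the unit is sda:

# option 1 (3ware's suggestion): raise the block layer queue to match the big device queue
echo 512 > /sys/block/sda/queue/nr_requests      # default 128; 3ware's queue_depth default is 254

# option 2 (what I do): shrink the device queue instead
echo 31 > /sys/block/sda/device/queue_depth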


Hi,

If I had the same issue I'd do the following.

1. Open a case with LSI/3ware and see what they say:

http://www.lsi.com/support/resources/

=> eService Tech Support

2. There was a bug with 256KiB chunk sizes in RAID-6 in earlier firmwares that caused the card to lock up, so I went back to a 64KiB chunk size.

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    OK             -       -       64K     12107.1   RiW    ON
u1    SPARE     OK             -       -       -       931.505   -      ON

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   931.51 GB SATA  0   -            WDC WD1002FBYS-01A6
p1    OK             u0   931.51 GB SATA  1   -            WDC WD1002FBYS-01A6
p2    OK             u0   931.51 GB SATA  2   -            WDC WD1002FBYS-01A6
p3    OK             u0   931.51 GB SATA  3   -            WDC WD1002FBYS-01A6
p4    OK             u0   931.51 GB SATA  4   -            WDC WD1002FBYS-01A6
p5    OK             u0   931.51 GB SATA  5   -            WDC WD1002FBYS-01A6
p6    OK             u0   931.51 GB SATA  6   -            WDC WD1002FBYS-01A6
p7    OK             u0   931.51 GB SATA  7   -            WDC WD1002FBYS-01A6
p8    OK             u0   931.51 GB SATA  8   -            WDC WD1002FBYS-01A6
p9    OK             u0   931.51 GB SATA  9   -            WDC WD1002FBYS-01A6
p10   OK             u0   931.51 GB SATA  10  -            WDC WD1002FBYS-01A6
p11   OK             u0   931.51 GB SATA  11  -            WDC WD1002FBYS-01A6
p12   OK             u0   931.51 GB SATA  12  -            WDC WD1002FBYS-01A6
p13   OK             u0   931.51 GB SATA  13  -            WDC WD1002FBYS-01A6
p14   OK             u0   931.51 GB SATA  14  -            WDC WD1002FBYS-01A6
p15   OK             u1   931.51 GB SATA  15  -            WDC WD1002FBYS-01A6

3. It would be interesting to know if the MSI option made anything worse/better.

4. Try the newer Beta firmware. I already tested it on a 4-port/16-port; it did not solve my issue, so I went back to the 9.5.3 release version, but then I knew it was some other problem.

5. Try to find something to reproduce it. For me it was running rss2email with >20-40 RSS feeds; mailing those results generates a lot of lines in syslog and thousands of emails (filtered by procmail), and that is what I found to be a good way to test my issue.

6. When your problem happens, can you please show us:

tw_cli /c0 show diag?

This /may/ show us what is going wrong.

Thanks.


If I had the same issue I'd do the following.

1. Open a case with LSI/3ware and see what they say:

http://www.lsi.com/support/resources/

=> eService Tech Support

Yeah. I originally thought this was just some Linux tuning, but I'm running out of things to do so I think there's more to it than that. Maybe when I have some diag info to go along with the reset.

2. There was a bug with 256KiB chunk sizes in RAID-6 in earlier firmwares that caused the card to lock up, so I went back to a 64KiB chunk size.

All of my arrays have been configured with 256k (it became the default in fw 4 as I recall): raid5, raid10 (?), and now raid6, and like I said in my original post, the only one that didn't show any problems was raid10. The 9.5.2 manual (the latest available for the 9.5.3 codeset) says "For RAID 6, only stripe size of 64KB is supported." I don't know if that means 64KB was the only tested stripe size or if this documentation is old and means that 64KB is the only stripe size you can select. My system is definitely more general purpose, so maybe 64KB would be optimal for me. It's something I've never benchmarked.

When you make your xfs filesystems, do you use any special arguments? I recently tested (and had reset problems) with mkfs.xfs /dev/sdXY and mkfs.xfs -d su=256k,sw=6 /dev/sdXY. FYI I didn't see any real differences between the two with bonnie++. I wonder if specifying su/sw would have an adverse effect if I ever grew this array beyond the 8 disks...
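For the curious, the aligned version for this 8 disk raid6 with the controller's 256K stripe works out to 6 data disks; a sketch, the device name is a placeholder:

# su = controller stripe size, sw = data-bearing disks (8 disk raid6 -> 8 - 2 parity = 6)
mkfs.xfs -d su=256k,sw=6 /dev/sdXY
xfs_info /dev/sdXY    # sunit/swidth are in 4k filesystem blocks, so this should come out to roughly sunit=64, swidth=384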

3. It would be interesting to know if the MSI option made anything worse/better.

I'm a bit exhausted with all of this at the moment... Maybe I'll test it sometime in the future.

4. Try the newer Beta firmware. I already tested it on a 4-port/16-port; it did not solve my issue, so I went back to the 9.5.3 release version, but then I knew it was some other problem.

I've already got it on, but since I've no way to reliably reproduce this, who knows when the next one could hit...

5. Try to find something to reproduce it. For me it was running rss2email with >20-40 RSS feeds; mailing those results generates a lot of lines in syslog and thousands of emails (filtered by procmail), and that is what I found to be a good way to test my issue.

This has been the hard part for me. Like I said in my first post, it's mostly around the time of raid rebuild/restoration.

6. When your problem happens, can you please show us:

tw_cli /c0 show diag?

I'll definitely record the diagnostic info right after it happens.


All of my arrays have been configured with 256k (it became the default in fw 4 as I recall): raid5, raid10 (?), and now raid6, and like I said in my original post, the only one that didn't show any problems was raid10. The 9.5.2 manual (the latest available for the 9.5.3 codeset) says "For RAID 6, only stripe size of 64KB is supported." I don't know if that means 64KB was the only tested stripe size or if this documentation is old and means that 64KB is the only stripe size you can select. My system is definitely more general purpose, so maybe 64KB would be optimal for me. It's something I've never benchmarked.

> You can migrate back to 64KiB without rebuilding the array.
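> If I remember the syntax right, the stripe change is just a migrate on the unit; a sketch, assuming the array is u0 on c0 (check tw_cli's help for the exact stripe argument):
>
> tw_cli /c0/u0 show                          # confirm the current type/stripe first
> tw_cli /c0/u0 migrate type=raid6 stripe=64  # same RAID level, new stripe size
> tw_cli /c0/u0 show                          # the unit status should show the migration progress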

When you make your xfs filesystems, do you use any special arguments? I recently tested (and had reset problems) with mkfs.xfs /dev/sdXY and mkfs.xfs -d su=256k,sw=6 /dev/sdXY. FYI I didn't see any real differences between the two with bonnie++. I wonder if specifying su/sw would have an adverse effect if I ever grew this array beyond the 8 disks...

> I benchmarked special arguments vs. not and there was little to no difference in speed when I used to use XFS (on EXT4 now). With MD raid, mkfs.xfs optimizes automatically, but for HW RAID, I did not notice any improvement whether the optimized parameters were used or not.

This has been the hard part for me. Like I said in my first post, it's mostly around the time of raid rebuild/restoration.

> What about verify, surely you run these once a week or month, yes?

One other thing (so it's not missed) is the hot-swap bays. I actually bought 3 Enlights (5-in-3) and one had a bad port. I've also seen others start to flake out.

So it'd be helpful if you could show:

1. /c0 show diag (does it have the old data still there?)

2. smartctl -d 3ware,0 -a /dev/twa0 # for each drive, ,1,2,3 etc

For all of the drives, do you have any HW_ECC errors?
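A quick loop makes that less tedious; a sketch, assuming 8 drives on /dev/twa0 at ports 0-7:

# dump SMART data for every drive behind the controller
for i in $(seq 0 7); do
    echo "=== port $i ==="
    smartctl -d 3ware,$i -a /dev/twa0
done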


> You can migrate back to 64KiB without rebuilding the array.

Hm, I'll have to look at that.

> I benchmarked special arguments vs. not and there was little to no difference in speed when I used to use XFS (on EXT4 now). With MD raid, mkfs.xfs optimizes automatically, but for HW RAID, I did not notice any improvement whether the optimized parameters were used or not.

In my experiences with mdadm and 3ware, the filesystem has made more of a difference than anything else. I guess it's good to know I'm not the only one seeing little or no difference with the sw,su stuff. It's still a recommendation on the XFS FAQ, so I'll continue to use it I guess.

> What about verify, surely you run these once a week or month, yes?

Actually I ran it monthly for a long while, along with daily short and weekly long SMART self-tests, but recently switched to weekly verifications after seeing that recommendation in the 3ware docs (3Ware Verify Vs Long Smart Self-Tests). The auto-verify has been behaving a bit strangely since I enabled it. verify=basic is set to 12am on Saturday, but for unknown reasons a verify ran Friday evening as well, shortly after booting up my system after installing a shorter breakout cable. To my knowledge the reset problem has never occurred while any of my arrays have been initializing, rebuilding, or verifying. I spent about an hour and a half with bonnie++ the morning after I built the 8 disk raid6, and in all that time the problem only happened once and it hasn't happened since with infrequent dd/bonnie++/unrar activity.

Speaking of verify, how long does it take you to verify your 16 disk raid6? My 3 and 4 disk raid5 arrays would initialize and verify in 3-3.5 hours so when I saw my 8 disk raid6 initialize in approximately 3.5 hours, I figured the verify would go at the same rate, but it took approximately 12 hours.

One other thing (so it's not missed) is the hot-swap bays. I actually bought 3 Enlights (5-in-3) and one had a bad port. I've also seen others start to flake out.

My backplanes have been alright so far. The only thing is the 5 drives are really close together and they're pretty warm, usually 45-47c under load. I'd prefer to take 'em down a few degrees to help reduce the ambient temperature in my case, but everything is still within tolerance.

1. /c0 show diag (does it have the old data still there?)

Yeah, there's still some data in there. I figured it was like the alarms/events and lost after a reboot, but I see a record of a repaired sector, and that happened before my flash and reboot this afternoon. The log is gigantic (65716 bytes), but right at the top it says:

Event Trigger and Log Information:
Triggered Event(s) =
   ctlreset (controller soft reset)
   fwassert (firmware assert)
   driveerr (drive error)
Diagnostic log save mode = cont (continuous/last trigger)
Diagnostic event trigger counter = 1
Trigger event counter for ctlrreset = 0
Trigger event counter for fwassert = 0
Trigger event counter for driveerr = 0

If this is correct, there have been no ctlrreset x days/bytes/events back, so there wouldn't be anything in the diag to show this. It's been about a week...

2. smartctl -d 3ware,0 -a /dev/twa0 # for each drive, ,1,2,3 etc

One of my WD1001FALS drives has been acting up after I built my raid6, but it was solid, as were the 4*WD1002FBYS drives, when they were each in their own respective raid5s.

For all of the drives, do you have any HW_ECC errors?

Do you mean UDMA_CRC_Error_Count? They're all:

199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0

Everything else looks good as well, except for the Raw_Read_Error_Rate (58 and climbing, was 49 a week ago) and the Multi_Zone_Error_Rate (8) on that one WD1001FALS drive.


Speaking of verify, how long does it take you to verify your 16 disk raid6?

> About 8-9 hours, depending on load.

--

> Also, the migration from 64KiB to 256KiB (and then the reverse, due to the stability issue in a particular firmware) each took about 7 days (on a 15 disk raid-6 + 1 spare).


Ouch. I'm only using a fraction of my 6TB, so I'd probably be better off doing a full backup (with no excludes) and then rebuilding and restoring. I could probably do that in about 8 or 9 hours. I've done it so many times now that I could almost script it... :P:( This would be a lot easier if I didn't have my root on the array...

Also (in regard to what I was saying earlier about stripe size), the addendum for 9.5.2 says support for a 256KB stripe size on raid6 was added and that it was made the default. Do you remember what firmware you tried 256KB with?


Ouch. I'm only using a fraction of my 6TB, so I'd probably be better off doing a full backup (with no excludes) and then rebuilding and restoring. I could probably do that in about 8 or 9 hours. I've done it so many times now that I could almost script it... :P:( This would be a lot easier if I didn't have my root on the array...

Also (in regard to what I was saying earlier about stripe size), the addendum for 9.5.2 says support for a 256KB stripe size on raid6 was added and that it was made the default. Do you remember what firmware you tried 256KB with?

Yes, that problem should be fixed now, but note I did not go back to 256KiB once I ran into it; it was a pretty nasty problem: controller resets, etc.

2009-WARNING - Verify fixed data/parity mismatch: unit=0

2009-About 10 of these e-mails, is there a serious bug in this firmware/configuration? Should I go back to 256kb->64kb chunk size and then back to the older firmware?

2009- Thank you for the call. Please revert back to controller firmware 4.06.00.004. We saw some issues in testing the latest firmware of 4.08.00.006 using these drives. Once you revert I suspect that the controller resets will go away.

2009- The older firmware fixed the problem and I have not had a crash yet, but I was wondering if I have to stay on the old code set (9.5.1.1 I believe) or has this issue been fixed in 9.5.3? Is 9.5.3 safe or should I stay on 9.5.1.1?


2009-WARNING - Verify fixed data/parity mismatch: unit=0

2009-About 10 of these e-mails, is there a serious bug in this firmware/configuration? Should I go back to 256kb->64kb chunk size and then back to the older firmware?

2009- Thank you for the call. Please revert back to controller firmware 4.06.00.004. We saw some issues in testing the latest firmware of 4.08.00.006 using these drives. Once you revert I suspect that the controller resets will go away.

Recent drive compatibility lists say that firmware 4.06.00.006 is required for WD1002FBYS drives, so I guess they just introduced a bug in the newer 4.08 firmware. By the time I got my WD1002FBYS drives I was flashed to 4.10.00.007. It's a bit troubling to see that so recently... I'll add 64KB stripe sizes as the next item in my checklist, but I'm going to hold off for now because I don't want to do too many things at the same time or I won't know what fixed it (if anything fixes it). Right now I've reverted to the defaults for my XFS mount options, closer to the defaults for the Linux block scheduler (cfq+queue_depth=31), and flashed the 4.10.00.016 firmware. The firmware / stripe size seems more likely to me than the other two.


Recent drive compatibility lists say that firmware 4.06.00.006 is required for WD1002FBYS drives, so I guess they just introduced a bug in the newer 4.08 firmware. By the time I got my WD1002FBYS drives I was flashed to 4.10.00.007. It's a bit troubling to see that so recently... I'll add 64KB stripe sizes as the next item in my checklist, but I'm going to hold off for now because I don't want to do too many things at the same time or I won't know what fixed it (if anything fixes it). Right now I've reverted to the defaults for my XFS mount options, closer to the defaults for the Linux block scheduler (cfq+queue_depth=31), and flashed the 4.10.00.016 firmware. The firmware / stripe size seems more likely to me than the other two.

Well, look at this!

[152904.636062] 3w-9xxx: scsi1: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.

[152917.721865] 3w-9xxx: scsi1: AEN: INFO (0x04:0x0001): Controller reset occurred:resets=2.

So the problem is my optimizations..

I do the same thing you do:

# Set to 8 so large read/writes does not freeze I/O to the system.

echo 8 > /sys/block/$i/device/queue_depth # 254 is default

I am going to put all my settings back to normal and see what happens..


Hi,

What kernel are you running?

I just found something quite interesting:

Problem persists during heavy I/O, different error this time:

[152904.636062] 3w-9xxx: scsi1: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.

[152917.721865] 3w-9xxx: scsi1: AEN: INFO (0x04:0x0001): Controller reset occurred:resets=2.

--

syslog:Aug 6 05:58:19 p34 kernel: [ 1.763002] 3w-9xxx: scsi1: AEN: INFO (0x04:0x0001): Controller reset occurred:resets=1.

syslog:Aug 20 07:30:41 p34 kernel: [1171646.371290] 3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.

syslog:Aug 24 09:04:11 p34 kernel: [246703.526388] 3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.

syslog:Aug 24 09:14:04 p34 kernel: [247295.662461] 3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.

syslog:Aug 26 21:05:22 p34 kernel: [462727.945313] 3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.

syslog:Aug 27 00:39:43 p34 kernel: [475586.575030] 3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.

syslog:Aug 31 05:53:03 p34 kernel: [152904.636062] 3w-9xxx: scsi1: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.

syslog:Aug 31 05:53:03 p34 kernel: [152917.721865] 3w-9xxx: scsi1: AEN: INFO (0x04:0x0001): Controller reset occurred:resets=2.

--

It started happening on August 6th, which is around the same time I upgraded to the 2.6.35 kernel; according to kernelnewbies, 2.6.35 was released on August 1, 2010:

http://kernelnewbies.org/LinuxChanges

Linux 2.6.35 has been released on 1 Aug, 2010.

http://marc.info/?l=linux-kernel&m=127419418304483&w=2

Adam Radford (2):

3ware maintainers update

3w-xxxx, 3w-9xxx: force 60 second timeout

I will go back to 2.6.34 and see if the problem persists.

Justin.


Per my earlier message, both cards were resetting (scsi0 and scsi1) under kernel 2.6.35(.x); with 2.6.34, the lag problem is gone and my machine is back to normal again. It is unlikely that both cards would be at fault, and in addition I spent several hours trying different things, removing the BBU modules for example, but always staying on the 2.6.35 kernel.

I am now able to run rss2email (which ALWAYS caused the machine to lock up and freeze until one of the controllers reset) without any problems using 2.6.34.

2.6.34.1 = good

2.6.35.x = has 3ware bug

There appears to be a bug in the commit for the 3w-9xxx updates.

On 2.6.34 now and no problems so far, I am continuing to test.

http://marc.info/?l=linux-kernel&m=127419418304483&w=2

Adam Radford (2):

3ware maintainers update

3w-xxxx, 3w-9xxx: force 60 second timeout

Justin.


I use whatever Debian Squeeze gives me, currently 2.6.35. This isn't a recent problem for me, though... I've had occasional read/write resets going all the way back to October 2009, when I was probably on Debian Unstable with 2.6.31 or .32. I think I've had the 2.6.35 kernel for a month or two and I don't have any resets logged in that time except for the last 4, which all happened after I rebuilt my system and was restoring data / benchmarking.

I've never seen the "Response queue (large)" or "Controller reset occurred" messages before.

Did all of your resets begin with 2.6.35 and then the only other problem you had was with the 256kb stripe size on a raid6?

--edit--

Oops, Debian Squeeze is 2.6.32-5. When I switched from Unstable to Squeeze, they were in sync on most packages, so I probably haven't run anything newer than 2.6.32 on my home machine ever. I thought for a minute that might explain why I've had so many problems this month, but nope. :P



Did all of your resets begin with 2.6.35 and then the only other problem you had was with the 256kb stripe size on a raid6?

The most recent ones (Aug) - yes.

The spurious ones before that, I am not sure about.

When I had the 256kb stripe issue, the controller actually crashed once or twice until I went back to 64KiB chunk size.


When I had the 256kb stripe issue, the controller actually crashed once or twice until I went back to 64KiB chunk size.

So moving from the 256kb stripe to a 64KiB stripe on the controller solved some of the controller reset problems you were having?

I find myself with the same setup: 8x WD2002FYPS with a 9650SE 16-port on Ubuntu x64 8.04 (2.6.24-24) and xfs. xfs_info gives me:

root@srv-bak1:/etc# xfs_info /dev/sdb1
meta-data=/dev/sdb1              isize=256    agcount=32, agsize=106483247 blks
        =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=3407463904, imaxpct=25
        =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=1
        =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

Do you think rebuilding the array with a 64k stripe would help?


So moving from the 256kb stripe to a 64KiB stripe on the controller solved some of the controller reset problems you were having?

I find myself with the same setup: 8x WD2002FYPS with a 9650SE 16-port on Ubuntu x64 8.04 (2.6.24-24) and xfs. xfs_info gives me:

root@srv-bak1:/etc# xfs_info /dev/sdb1
meta-data=/dev/sdb1              isize=256    agcount=32, agsize=106483247 blks
        =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=3407463904, imaxpct=25
        =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=1
        =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

Do you think rebuilding the array with a 64k stripe would help?

Hi,

I cannot tell for sure because I never used 256K again after my experience. But you do not have to rebuild the array; you can migrate it, it just takes a very long time.

Justin.


So moving from the 256kb stripe to a 64KiB stripe on the controller solved some of the controller reset problems you were having?

I find myself with the same setup: 8x WD2002FYPS with a 9650SE 16-port on Ubuntu x64 8.04 (2.6.24-24) and xfs.

Are you getting the "Character ioctl (0x108)" timeouts or the scsi commands like "Command (0x28)" or 0x2a? Are you able to reproduce yours more easily or are they pretty random? What firmware are you using (tw_cli /c0 show firmware) and have you tried the beta firmware (tw_cli /c0 update fw=prom0006.img, then reboot)? I put the beta firmware on my 9650se-12ml the other day and I ran 3 simultaneous instances of bonnie++ for about an hour. No problems, but then I could never reliably reproduce the resets anyway...

xfs_info gives me: ...

sunit and swidth are both 0, so it looks like you made the filesystem without any special tuning for a specific stripe size or number of data drives. I recently did some testing with and without mkfs.xfs -d su=[stripe_size],sw=[num_data_drives] and I didn't see any differences in performance. I don't think xfs can migrate these values anyway, so it would all just be a 3ware thing (initially). I guess I probably wouldn't bother recreating the filesystem with su,sw unless you were putting down a new filesystem after rebuilding the array.

Do you think rebuilding the array with a 64k stripe would help?

jpiszcz said it took about a week to migrate his 15 disk raid6 to 256kb and back. I imagine the time it takes is related to how large the array is, and he's only got 1TB drives, so it's possible your array could take upwards of 2 weeks to migrate. I'm not sure what would happen if you lost power during a migration, but it's probably a good idea to make sure it's on a decent UPS and then hope for no long power outages. :P


I'm not sure what would happen if you lost power during a migration, but it's probably a good idea to make sure it's on a decent UPS and then hope for no long power outages. :P

A power outage is questionable; however, during the migration (I forget whether 64->256 or 256->64) the raid card crashed, I rebooted, and it picked up where it left off (if I recall correctly, I did not lose any data), so that was good.

Justin.

