superlgn

3Ware 9650Se Controller Resets Under Load On Linux

What motherboard/CPU?

Also what firmware of the card are you running?

tw_cli /c0 show all

Hi,

Motherboard is Supermicro: X8DTN

CPU is Intel® Xeon® CPU E5520 @ 2.27GHz

Firmware of card:

Firmware Version = FH9X 4.10.00.007
Bios Version = BE9X 4.08.00.002
Boot Loader Version = BL9X 3.08.00.001

Thanks,

Amar

Hi,

I recommend two things:

1. Upgrade to the latest beta firmware for your card:

http://www.lsi.com/channel/products/raid_controllers/3ware_9690sa4i/index.html

Name: 9650SE_9690SA_firmware_beta_fw_4.10.00.019.zip

Changelog:

SCR FIRM03219 Pchip reset interrupt not handled

SCR FIRM03220 Put parity segment to free state completely when deallocate unused parity segment

2. Turn off 'cpufreq': do NOT load this module if you are currently running it, and disable Turbo Boost if it is in use.

--

Can you also show /c0 show diag?

Can you also show /c0 show diag?

It's attached: 3w_diag.txt

Is it safe to put a beta version on production systems?

Hi,

I see a lot of these:

ErrorCode: 0x291, ErrorDisp: 0x4
Target type = 7, devId = 0x6
cdb ==> b7 08 00 00 00 00 00 00 00 08
E=0291 T=09:18:44 P=6h: Recovered Error, no retries
Scsi Cdb ==> b7 08 00 00 00 00 00 00 00 08
Scsi Sense Data : 70 00 01 00 00 00 00 0a 00
00 00 00 1C 00 01 01 00 ca

ErrorCode: 0x291, ErrorDisp: 0x4
Target type = 7, devId = 0x7
cdb ==> b7 08 00 00 00 00 00 00 00 08
E=0291 T=09:18:44 P=7h: Recovered Error, no retries
Scsi Cdb ==> b7 08 00 00 00 00 00 00 00 08
Scsi Sense Data : 70 00 01 00 00 00 00 0a 00
00 00 00 1C 00 01 01 00 ca

I don't recall seeing these on my controllers; you may want to open a case with 3ware/LSI to check on this and on your problem in general. I ran, and continue to run, that beta firmware. I think I was only able to reproduce the reset once, and then it never happened again (I had also turned off CPUFREQ, etc.). Do you use CPU frequency scaling? Can you post your lsmod output?

Edited by jpiszcz

For the SCSI B7h commands (Read Defect Data), there is nothing to worry about (Recovered Error, no retries): the controller is checking the defect data on the drives, and the format it requests is not supported. The drive therefore returns the data in a different format and notifies the controller that the format returned is not the one requested. It is up to the controller to decode the alternative formats (there are only a few), so most controllers should have this implemented.

As for the other error, it occurs on a Log Sense command where the controller asks for a log page (page 1Fh) that the drives do not support. The drives fail the command in the way the SCSI standard specifies:

PAGE CODE Field

The PAGE CODE field specifies which log page of data is being requested. If the log page code is reserved or not implemented, the command shall be terminated with CHECK CONDITION status, with the sense key set to ILLEGAL REQUEST, and the additional sense code set to INVALID FIELD IN CDB.
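For anyone who wants to check the B7h entries by hand, here is a minimal Python sketch that decodes fixed-format SCSI sense data. It assumes the two lines under "Scsi Sense Data" in the diag excerpt form one 18-byte buffer; the function name and the small lookup tables are illustrative only and cover just the codes discussed in this thread.

```python
# Sketch: decode fixed-format SCSI sense data (response code 70h/71h).
# Byte 2 (low nibble) is the sense key, byte 12 the additional sense code
# (ASC) and byte 13 the qualifier (ASCQ).

# Only the values mentioned in this thread; not a complete table.
SENSE_KEYS = {0x1: "RECOVERED ERROR", 0x5: "ILLEGAL REQUEST"}
ASC_TEXT = {
    (0x1C, 0x00): "DEFECT LIST NOT FOUND",
    (0x24, 0x00): "INVALID FIELD IN CDB",
}


def decode_fixed_sense(hex_bytes):
    """Return sense key, ASC and ASCQ from a hex dump such as '70 00 01 ...'."""
    data = bytes.fromhex(hex_bytes)
    if len(data) < 14:
        raise ValueError("need at least 14 bytes of sense data")
    if data[0] & 0x7F not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key, asc, ascq = data[2] & 0x0F, data[12], data[13]
    return {
        "sense_key": key,
        "sense_key_text": SENSE_KEYS.get(key, "unknown"),
        "asc": asc,
        "ascq": ascq,
        "asc_text": ASC_TEXT.get((asc, ascq), "unknown"),
    }


if __name__ == "__main__":
    # Assumption: the continuation line belongs to the same sense buffer.
    dump = "70 00 01 00 00 00 00 0a 00 00 00 00 1C 00 01 01 00 ca"
    print(decode_fixed_sense(dump))
```

Under that assumption the buffer decodes to sense key 1h with ASC/ASCQ 1Ch/00h, which matches the "Recovered Error" text in the diag output.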

Regards,

MEJV

Hello All...

Just wanted to throw my hat in here to say that I've had similar issues with my 3ware 9650SE-8LPML. jpiszcz was kind enough to point me here, and the information has been a big help. Especially knowing that I'm not the only one with these problems.

I've recently applied a bunch of the insights and suggestions from this forum to my problematic system. Hopefully it'll yield some good results. I'll be sure to post them here.

I've recently downgraded my kernel from 2.6.38 to 2.6.31. Since the machine I'm troubleshooting is running SuSE, the actual kernel is this:

Linux 2.6.31.14-0.6-default #1 SMP 2010-12-10 11:18:32 +0100 x86_64 x86_64 x86_64 GNU/Linux

When I had moved to 2.6.38, my resets were occurring almost every 2 or 3 days... only under heavy I/O load. The unit runs BackupPC, a backup program that archives remote workstations/servers via rsync over the network. Whenever the machine got busy enough, it would trigger a reset like this:

Jan 23 19:41:33 consus kernel: [3806479.043461] 3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.
Jan 23 19:42:34 consus kernel: [3806540.057685] 3w-9xxx: scsi0: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence.
Jan 23 19:43:09 consus kernel: [3806575.437989] 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.
Jan 23 19:44:18 consus kernel: [3806644.429212] 3w-9xxx: scsi0: ERROR: (0x06:0x0010): Microcontroller Error: clearing.
Jan 23 19:45:18 consus kernel: [3806704.321014] 3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.
Jan 23 19:45:48 consus kernel: [3806734.268052] 3w-9xxx: scsi0: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.
Jan 23 19:46:18 consus kernel: [3806764.215117] 3w-9xxx: scsi0: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.
Jan 23 19:46:48 consus kernel: [3806794.162058] 3w-9xxx: scsi0: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.
Jan 23 19:47:18 consus kernel: [3806824.109040] 3w-9xxx: scsi0: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.
Jan 23 19:47:18 consus kernel: [3806824.109046] 3w-9xxx: scsi0: ERROR: (0x06:0x002B): Controller reset failed during scsi host reset.
Jan 23 19:47:55 consus kernel: [3806853.942897] 3w-9xxx: scsi0: ERROR: (0x06:0x0010): Microcontroller Error: clearing.
Jan 23 19:47:55 consus kernel: [3806853.952467] Modules linked in: usb_storage edd af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf xfs exportfs loop dm_mod coretemp snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd sr_mod pcspkr iTCO_wdt soundcore e1000e cdrom serio_raw i2c_i801 iTCO_vendor_support snd_page_alloc sg button ahci sd_mod fan processor 3w_9xxx ata_generic libata scsi_mod thermal thermal_sys
Jan 23 19:48:48 consus kernel: [3806913.835057] 3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.
Jan 23 19:49:18 consus kernel: [3806943.781997] 3w-9xxx: scsi0: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.
Jan 23 19:49:48 consus kernel: [3806973.772849] 3w-9xxx: scsi0: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.
Jan 23 19:50:03 consus kernel: [3806988.792999] 3w-9xxx: scsi0: ERROR: (0x06:0x0010): Microcontroller Error: clearing.
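To compare reset frequency before and after a change, a log excerpt like the one above can be tallied by message code. This is a rough Python sketch; the regular expression is inferred only from the lines quoted here (including the slightly different AEN layout), so other driver messages may not match, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the quoted lines, e.g.
#   3w-9xxx: scsi0: ERROR: (0x06:0x0036): Response queue ...
#   3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization ...
LINE_RE = re.compile(
    r"3w-9xxx: (?P<host>scsi\d+): "
    r"(?:AEN: )?(?P<severity>[A-Z]+):? "
    r"\((?P<code>0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+)\): "
    r"(?P<message>.*)$"
)


def parse_events(log_text):
    """Return one dict per 3w-9xxx driver message found in the text."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            events.append(match.groupdict())
    return events


def count_by_code(events):
    """Count events per (severity, code) pair."""
    return Counter((e["severity"], e["code"]) for e in events)


SAMPLE_LOG = """\
Jan 23 19:41:33 consus kernel: [3806479.043461] 3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.
Jan 23 19:43:09 consus kernel: [3806575.437989] 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.
Jan 23 19:47:18 consus kernel: [3806824.109046] 3w-9xxx: scsi0: ERROR: (0x06:0x002B): Controller reset failed during scsi host reset.
"""

if __name__ == "__main__":
    for (severity, code), n in count_by_code(parse_events(SAMPLE_LOG)).items():
        print(severity, code, n)
```

Feeding it the full contents of the system log instead of SAMPLE_LOG gives a quick count of how often the 0x06:0x0037 "resetting card" warning appears.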

I have since removed all of the ACPI_PROCESSOR and CPUFREQ modules (not to mention the various thermal and soundcard modules). The system is now pretty bare-bones for kernel modules:

# lsmod
Module                  Size  Used by
xfs                   674720  1
exportfs                6504  1 xfs
edd                    13168  0
af_packet              28936  0
loop                   22260  0
dm_mod                101096  0
button                  8232  0
sr_mod                 20580  0
i2c_i801               15624  0
sg                     39904  0
serio_raw               7692  0
pcspkr                  3720  0
e1000e                167096  0
cdrom                  47688  1 sr_mod
ide_pci_generic         5484  0
ide_core              148416  1 ide_pci_generic
3w_9xxx                42404  1
ata_generic             6508  0
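A quick way to confirm that no frequency-scaling modules are left is to filter the module list for "cpufreq". The Python sketch below does that for pasted lsmod output; matching on the substring "cpufreq" is my own heuristic based on the module names in the "Modules linked in" line above, and the function names are illustrative.

```python
def loaded_module_names(lsmod_text):
    """Extract module names from lsmod output, skipping the header row."""
    names = []
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Data rows look like: name, size (digits), use count, [users]
        if len(fields) >= 3 and fields[1].isdigit():
            names.append(fields[0])
    return names


def cpufreq_modules(module_names):
    """Return the names that look cpufreq-related (substring heuristic)."""
    return [name for name in module_names if "cpufreq" in name]


if __name__ == "__main__":
    # Names taken from the "Modules linked in" line of the reset log.
    before = ("usb_storage edd af_packet cpufreq_conservative "
              "cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf xfs").split()
    print("before:", cpufreq_modules(before))

    # A few rows from the current lsmod listing.
    after = """\
Module                  Size  Used by
xfs                   674720  1
exportfs                6504  1 xfs
3w_9xxx                42404  1
"""
    print("after:", cpufreq_modules(loaded_module_names(after)))
```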

I have also upgraded the card to the latest beta firmware mentioned in this thread, and activated the standard /sys/block/ changes that are recommended. My server uses the following:

Motherboard: ASUS P7F-E
CPU: Intel(R) Xeon X3460 @ 2.80GHz
RAM: 4GB of KVR1333D3D8R9S 

Relevant output from tw_cli /c6 show diag:

### Time Stamp:        20:51:05 30-Jan-2011

### Host Architecture: x86_64 (64 bit)
### OS Version:        Linux 2.6.31.14-0.6-default
### Model:             9650SE-8LPML
### Serial #:          L326027A0120167
### Controller ID:     6
### CLI Version:       2.00.11.016
### API Version:       2.08.00.017
### Driver Version:    2.26.02.012
### Firmware Version:  FE9X 4.10.00.020
### BIOS Version:      BE9X 4.08.00.003
### Available Memory:  224MB

The RAID unit is in RAID6, with 8 drives... all Seagates. The stripe size is currently 256K, which (according to this thread) is a bad idea. If I continue to experience resets, I'll rebuild the array with a 64K stripe.

Thanks again jpiszcz, for pointing me here. I'll be sure to keep the thread updated with my experiences.

--

Joe Ripley

Well, it's been a month now since installing the beta firmware, and the server is still running:

11:06:38 up 31 days, 15:00,  1 user,  load average: 0.31, 0.67, 0.67

So far, so good. :)

.21 came out a few days ago. I think 2 of the last 4 beta releases have had fixes for controller resets... I'm still running .16 without issue.

Do you have a link to .21?

Found it (not on LSI's site though):

Following post stolen from:

http://www.tek-tips.com/viewthread.cfm?qid=1640298&page=1

Link to the above mentioned firmware: http://www.lsi.com/DistributionSystem/User/AssetMgr.aspx?asset=56123

Readme.txt for Beta firmware for LSI 3ware 9650SE and 9690SA RAID controllers v1.2

February 2011

Firmware version 4.10.00.021

Includes changes from previous beta version 4.10.00.016, 4.10.00.019, and 4.10.00.020.

prom0006.img is for the 9650SE

prom0008.img is for the 9690SA

See LSI 3ware KB article 10058 http://kb.lsi.com/KnowledgebaseArticle10058.aspx for instructions on how to flash upgrade the controller.

Summary of changes from 4.10.00.016 release (which are also included in this 4.10.00.021 release):

SCR 2196: Unexpected controller soft resets

Fixed an issue with regards to deferral of write and read commands to help eliminate unexpected soft resets.
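Since the readme ties the soft-reset fix (SCR 2196) to 4.10.00.016, it can be handy to check a controller's reported firmware against that version. The Python sketch below parses the "Firmware Version" line in the two layouts quoted in this thread; the function names are illustrative, and whether that firmware level actually cures resets on a given system is not something the script can tell you.

```python
import re

# First beta release that the readme lists as containing SCR 2196.
RESET_FIX_VERSION = (4, 10, 0, 16)

# Handles both layouts seen in this thread:
#   Firmware Version = FH9X 4.10.00.007
#   ### Firmware Version:  FE9X 4.10.00.020
FIRMWARE_RE = re.compile(r"Firmware Version\s*[=:]\s*\S+\s+(\d+(?:\.\d+)+)")


def firmware_version(tw_cli_output):
    """Return the firmware version as a tuple of ints, or None if absent."""
    match = FIRMWARE_RE.search(tw_cli_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def has_reset_fix(tw_cli_output):
    """True if the reported firmware is at or above 4.10.00.016."""
    version = firmware_version(tw_cli_output)
    return version is not None and version >= RESET_FIX_VERSION


if __name__ == "__main__":
    print(has_reset_fix("Firmware Version = FH9X 4.10.00.007"))
    print(has_reset_fix("### Firmware Version:  FE9X 4.10.00.020"))
```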
