3ware 9690: problems building RAID6 (but RAID0 is ok)


4 replies to this topic

#1 mcenno

  • Member
  • 6 posts

Posted 28 March 2013 - 04:22 AM

Hello,


I've got a strange problem building a RAID6. This is a somewhat long-winded story, but I'll try to be concise.

The hardware is a Supermicro X7SB4/E with a 3ware 9690SA-4I controller (firmware FH9X 4.10.00.027, driver 2.26.02.011). An LSILOGIC SASX36 A.0 backplane with 24 bays is connected to the controller via a single cable. 12 bays are filled with WD 2TB drives in a RAID6, and this RAID has been running for ~2 years with no problems. Now more space was needed, so I purchased 12 Hitachi Deskstar 7K3000 3TB drives to fill the remaining 12 bays in a RAID6 config, and that is where the trouble started.

When I put in the drives, all were correctly identified by the controller. A few moments after initialising a RAID6, however, several drives failed, throwing a "degraded" error for the drive, and the initialise aborted. So I deleted the RAID6 and started again, and this time, one or two _other_ drives failed after a few moments. I can swap drives from bay to bay, to no avail. Every time I build a RAID5/6, some drives will fail after a few moments and the initialise is aborted.


So then I tried to build a RAID0, and lo and behold, it worked. The server is running Openfiler, and the RAID0 is recognised as a new block device in the system, and I can export it via iscsi and format and use it.


I contacted LSI support and they instructed me how to run a diagnostic tool and to send them the output. They came back to me with the following issues:

- The old 2TB drives are connected at 3.0 Gbps, the new ones at 1.5 Gbps. I don't understand why. I tried to force 3.0 Gbps, without any effect. When I force 1.5 Gbps, the WD drives are no longer visible to the controller, and since they hold the OS, the machine won't boot, so I've set the connection speed back to Auto. The new Hitachi drives can do 6 Gbps, so I don't know why they connect at just 1.5 Gbps.

- The controller logs report "Cable CRC errors". Of course that's what LSI's support rep homed in on. So I opened the case, removed the cable from the controller and backplane (the connectors sat really snugly), cleaned the connectors and re-attached the cable. This didn't change anything. Also, if there were serious issues with the cable, why would a RAID0 work when a RAID6 doesn't? I have to note, however, that there are 4 so-called "phys" on the controller (a phy seems to be some sort of transceiver for the signals between controller and backplane). Apparently, each of them controls 6 bays, and I can see that the cable contains four sub-cables. So in principle it is possible that phys 0-1 deal with the old drives and phys 2-3 with the new ones. But that still doesn't explain why I can make a RAID0 with the new drives.


So that's where I'm stuck. Does anyone have ideas what I could try? The obvious thing, of course, is to order a new cable (and I will probably do that), but I have my doubts that it will change anything.


Any input is much appreciated.


Cheers,

Enno

#2 mejv

  • Member
  • 259 posts

Posted 03 April 2013 - 02:57 PM

I believe you could be seeing an issue caused by the rapid read-modify-write cycles that RAID 5/6 use when the controller generates and updates the parity.
Some drives have issues with that.

The 9690SA does not work with anything 6 Gbps. The controller will fail if direct-attached to 6 Gbps devices.

If the slots were not occupied for a long time, the connectors might be oxidized and have contact issues (CRC errors?).

Best of luck!

MEJV

#3 sub.mesa

  • Member
  • 25 posts

Posted 03 April 2013 - 05:18 PM

Can you post the SMART data of each hard drive? In particular, bad sectors and cabling errors are interesting.

You have used OCE, or Online Capacity Expansion. This procedure is not without risk; in particular, bad sectors are a high risk during the rebuild. You should always do a rebuild before expanding, to minimise the risk of making your array inaccessible due to an aborted expansion attempt.

#4 mcenno

  • Member
  • 6 posts

Posted 09 April 2013 - 06:34 AM

Hello,

(apologies for the delayed reply, I was on vacation).

sub.mesa wrote: "Can you post the SMART data of each hard drive? In particular, bad sectors and cabling errors are interesting."


I have included the smartctl output for two drives below - one is an older WD 2TB drive of the RAID6 which has been working well for years now, the other one from one of the 3TB drives I have recently added.

sub.mesa wrote: "You have used OCE, or Online Capacity Expansion. This procedure is not without risk; in particular, bad sectors are a high risk during the rebuild. You should always do a rebuild before expanding, to minimise the risk of making your array inaccessible due to an aborted expansion attempt."


I'm not sure I understand what you mean. There was an existing 18TB RAID6 consisting of 12 WD 2TB drives, and I haven't modified that system. I have simply filled 12 empty bays with new 3TB drives to create a new, separate RAID6 with ~28TB capacity, and this doesn't work. From the controller's point of view there are two separate units which don't have anything to do with one another.
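
As a rough cross-check of the capacities quoted above, here is a minimal Python sketch of the usual RAID6 arithmetic (two drives' worth of space go to parity). The helper name `raid6_usable_bytes` is made up for illustration, and the drive sizes are the nominal decimal ones, not the exact byte counts reported by the drives. The figures quoted above look closer to binary units (TiB) than to decimal TB.

```python
def raid6_usable_bytes(n_drives: int, drive_bytes: int) -> int:
    """Usable space of a RAID6 unit: n - 2 drives hold data, 2 hold parity."""
    # Generic lower bound; a specific controller may require more drives.
    if n_drives < 4:
        raise ValueError("RAID6 needs at least 4 drives")
    return (n_drives - 2) * drive_bytes


TB = 10**12   # decimal terabyte, as used on drive labels
TiB = 2**40   # binary tebibyte, as many tools report

for label, size in (("12 x 2 TB", 2 * TB), ("12 x 3 TB", 3 * TB)):
    usable = raid6_usable_bytes(12, size)
    print(f"{label}: {usable / TB:.1f} TB = {usable / TiB:.1f} TiB")
```

With nominal sizes this gives 20.0 TB (18.2 TiB) for the existing unit and 30.0 TB (27.3 TiB) for the new one.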


Any comments are much appreciated.


Regards,

Enno




Here is the smartctl output of one of the new 3TB Hitachi drives. The number of errors varies from drive to drive, but is on the order of 50-100:

smartctl -a -d 3ware,28 /dev/twa0
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     Hitachi HUA723030ALA640
Serial Number:    MK0371YVHNBWLA
Firmware Version: MKAOAA10
User Capacity:    3,000,592,982,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Apr  9 13:31:15 2013 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 (  28) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       83
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       601
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   125   125   020    Pre-fail  Offline      -       30
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       354
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       24
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       24
194 Temperature_Celsius     0x0002   166   166   000    Old_age   Always       -       36 (Lifetime Min/Max 22/39)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       21

SMART Error Log Version: 1
ATA Error Count: 21 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 21 occurred at disk power-on lifetime: 42 hours (1 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 61 9f 03 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 80 08 80 03 00 40 00      00:03:12.482  READ FPDMA QUEUED
  60 80 00 00 03 00 40 00      00:03:12.482  READ FPDMA QUEUED
  60 80 00 80 02 00 40 00      00:03:12.481  READ FPDMA QUEUED
  60 80 00 00 02 00 40 00      00:03:12.480  READ FPDMA QUEUED
  60 80 10 80 03 00 40 00      00:03:08.461  READ FPDMA QUEUED

Error 20 occurred at disk power-on lifetime: 42 hours (1 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 71 8f 02 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 80 00 80 02 00 40 00      00:02:16.885  READ FPDMA QUEUED
  60 80 00 00 02 00 40 00      00:02:16.884  READ FPDMA QUEUED
  60 80 18 80 03 00 40 00      00:02:12.859  READ FPDMA QUEUED
  60 80 10 00 03 00 40 00      00:02:12.859  READ FPDMA QUEUED
  60 80 08 80 02 00 40 00      00:02:12.859  READ FPDMA QUEUED

Error 19 occurred at disk power-on lifetime: 42 hours (1 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 21 df 01 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 80 10 80 01 00 40 00      00:01:11.067  READ FPDMA QUEUED
  60 80 08 00 01 00 40 00      00:01:11.067  READ FPDMA QUEUED
  60 80 00 80 00 00 40 00      00:01:11.067  READ FPDMA QUEUED
  60 80 00 00 00 00 40 00      00:01:11.031  READ FPDMA QUEUED
  61 03 00 b9 7b 50 c0 00      00:01:10.257  WRITE FPDMA QUEUED

Error 18 occurred at disk power-on lifetime: 18 hours (0 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 41 bf 29 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 80 00 80 29 00 40 00      18:51:01.996  READ FPDMA QUEUED
  60 80 08 00 29 00 40 00      18:51:01.995  READ FPDMA QUEUED
  60 80 00 80 28 00 40 00      18:51:01.995  READ FPDMA QUEUED
  60 80 00 00 28 00 40 00      18:51:01.994  READ FPDMA QUEUED
  60 80 00 80 29 00 40 00      18:50:57.971  READ FPDMA QUEUED

Error 17 occurred at disk power-on lifetime: 18 hours (0 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 71 8f 20 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 80 00 80 20 00 40 00      18:50:21.705  READ FPDMA QUEUED
  60 80 00 00 20 00 40 00      18:50:21.704  READ FPDMA QUEUED
  60 80 18 80 21 00 40 00      18:50:17.691  READ FPDMA QUEUED
  60 80 10 00 21 00 40 00      18:50:17.691  READ FPDMA QUEUED
  60 80 08 80 20 00 40 00      18:50:17.691  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
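
Every error in the log above shows the same register values, ER = 84 and ST = 51. As a hedged illustration, the sketch below decodes them using the standard ATA bit assignments for the Error and Status registers; the dictionaries and the `decode` helper are written for this example and are not part of smartctl. Obsolete or legacy bits (such as Status bit 4) are deliberately left out.

```python
# Commonly documented ATA register bits (obsolete/legacy bits omitted).
ERROR_BITS = {7: "ICRC", 6: "UNC", 4: "IDNF", 2: "ABRT"}
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 3: "DRQ", 0: "ERR"}


def decode(value, names):
    """Return the names of all known bits set in value, highest bit first."""
    return [name
            for bit, name in sorted(names.items(), reverse=True)
            if value & (1 << bit)]


print("ER 0x84:", decode(0x84, ERROR_BITS))   # ICRC and ABRT
print("ST 0x51:", decode(0x51, STATUS_BITS))  # DRDY and ERR
```

ICRC is the interface CRC error flag, i.e. the data was corrupted on the link between drive and controller rather than on the platters. That matches the non-zero UDMA_CRC_Error_Count in the attribute table and the clean sector counters.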


And here is the smartctl output of one of the 2TB WD drives which have now been working well for ~2 years:

smartctl -a -d 3ware,8 /dev/twa0
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD2003FYYS-02W0B0
Serial Number:    WD-WMAY03035831
Firmware Version: 01.01D01
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Apr  9 13:32:54 2013 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 (30360) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       8658
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       37
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       11806
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       35
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       34
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2
194 Temperature_Celsius     0x0022   116   095   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
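
The most visible difference between the two outputs above is attribute 199 (UDMA_CRC_Error_Count): 21 on the Hitachi drive, 0 on the WD drive. To compare that value across all drives without reading each dump by hand, something like the following Python sketch could be used. It parses the attribute table format shown above; `parse_attributes` and `crc_errors_for_port` are illustrative names, and the `smartctl -a -d 3ware,N /dev/twa0` call is simply the command used in this post. Which port numbers correspond to which bays is not known here and would have to be checked on the controller.

```python
import re
import subprocess

# One row of the "Vendor Specific SMART Attributes" table:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ATTR_LINE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(text):
    """Map attribute name -> leading integer of RAW_VALUE."""
    attrs = {}
    for line in text.splitlines():
        match = ATTR_LINE.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(3))
    return attrs


def crc_errors_for_port(port, device="/dev/twa0"):
    """Run smartctl for one 3ware port; returns None if the attribute is absent."""
    result = subprocess.run(
        ["smartctl", "-a", "-d", f"3ware,{port}", device],
        capture_output=True, text=True,
    )
    return parse_attributes(result.stdout).get("UDMA_CRC_Error_Count")


# Two rows copied from the Hitachi output above, so the parser can be
# tried without the hardware.
SAMPLE = """\
194 Temperature_Celsius     0x0002   166   166   000    Old_age   Always       -       36 (Lifetime Min/Max 22/39)
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       21
"""

print(parse_attributes(SAMPLE))
```

On the real system, calling `crc_errors_for_port(28)` and `crc_errors_for_port(8)` should reproduce the two values shown above, assuming the counters have not changed in the meantime.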

#5 mcenno

  • Member
  • 6 posts

Posted 12 April 2013 - 02:59 AM

Hello,


It appears that I've found the culprit: it's the backplane (SAS846EL1). It's only specified for SATA-2, so attaching SATA-3 drives to it is not expected to work flawlessly (which the manufacturer confirms).

So I'll have to work my way around that. Anyway, many thanks for your replies; they are much appreciated.


Cheers,

Enno


