Event ID 129: Reset to device, \Device\RaidPort1, MegaSAS (R


1 reply to this topic

#1 Donald Fountain

  • Member
  • 1 posts

Posted 27 October 2013 - 05:32 PM

For a couple of weeks now, I've been chasing PCIe RAID port resets issued to my LSI 84016E RAID controller, and I cannot seem to solve them. The SAN performed great for several weeks, and then began throwing frequent but erratic Event 129 "Reset to device" errors.
 
Any time the RAID is under load of any kind, drive activity to the RAID volume will lock up for 60 seconds and then resume, seemingly at random. Its only current function is to serve media files to the XBMC HTPCs around the house via Windows File Sharing; prior to this, it served as an iSCSI mount for VMware ESXi nodes. About 40% of the total space is free as of now.
 
The event log always shows an Event ID 129 with the message "Reset to device, \Device\RaidPort1, was issued" with the provider as megasas. The RAID card logs show NO ERRORS when this happens. The last troubleshooting I did was to unhook the drives from the RAID card and do a surface scan of each one with Hitachi's tool (around 6 hours per drive) and all came back clean.
 

So at this point, I'm out of ideas. The only things I haven't replaced are the CPU, the RAM (which comes back clean on a RAM check), the drives, and the power supply. I'm loath to keep replacing parts indiscriminately. Google isn't much help either.

 
Note: I did break down and order a new power supply this weekend: a 1000W Gold Certified. It'll be here Tuesday.
 
Any ideas that I may have missed?
 
Troubleshooting, Specs, and History
=============================
 
This is a custom built SAN with the following specs:
  • CPU: FX-6100
  • Motherboard: ASUS M5A97 (current), MSI 970A-G43 (prior)
  • RAM: 32GB DDR3-1600
  • RAID Card: LSI 84016E in PCIe x16 slot
  • Power Supply: Corsair Professional Series HX 750
  • OS Drive: 128GB Crucial M4 SSD
  • RAID Drives: 16 x 2TB Hitachi 7200RPM (3Gbps/6Gbps mixed w/14 drives in RAID6, 2 drives in RAID1)
  • OS: Win7 Ultimate (current), Server2008R2 (prior)
Drive Models for the Drives:
  • HDS5C302 Deskstar 6Gb/s 32MB = x4
  • HDS72302 Deskstar 6Gb/s 64MB = x4
  • HDS72202 Deskstar 3Gb/s 32MB = x2
  • HUA72202 Ultrastar 3Gb/s 32MB = x6
History and Troubleshooting:
  • RAM Tests come back clean
  • Drives unhooked from RAID and connected directly to motherboard and all SMART tests come back clean
  • Cables swapped on RAID card with new cables
  • Motherboard replaced
  • RAID card replaced with identical model
  • RAID card Firmware updated (both cards)
  • Fan attached to heatsink on RAID card for better temperature regulation
  • OS Changed from Server2008R2 to Win7 Ultimate
  • Power supply tested with a PSU tester and a multimeter. All rails hold steady voltage, even under drive load
  • Can replicate error/reset by using CrystalDiskMark3. Lockup/reset SEEMS to happen on the write cycle
  • Cannot replicate error/reset using HDTunePro or IOMeter, even allowing them to run 1 hour+
  • IOMeter does not cause the error even on write cycles (see the CrystalDiskMark3 entry above)
  • Have tried DirectIO and Cached IO on the RAID card
  • Have tried NCQ on and off
  • Errors happen to both the RAID1 and RAID6 virtual drives, suggesting it's not limited to a single virtual drive or set of physical drives
  • RAID card consistency check comes back clean
  • RAID card Read Patrol comes back clean
  • Chkdsk on both virtual drives comes back clean
  • sfc /scannow comes back clean (See above: OS replaced)
  • Virus checks come back clean (See above: OS replaced)
  • No errors in RAID card log
  • RAID card log shows no correctable errors, or other errors or alarms
  • MegaCLI shows no errors or SMART errors
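Since CrystalDiskMark3 triggers the lockup on writes but HDTunePro and IOMeter do not, a small scripted write test may help pin down when the stalls begin. The following is only a minimal sketch in Python, not something from the original troubleshooting: the function name `find_write_stalls`, the block size, and the 5-second threshold are all assumptions chosen for illustration. It writes fixed-size blocks, forces each one to disk, and records any write that takes longer than the threshold.

```python
import os
import tempfile
import time


def find_write_stalls(path, block_size=1 << 20, count=256, threshold_s=5.0):
    """Write `count` blocks to `path`, syncing each one, and return a list of
    (block_index, seconds) for every write slower than `threshold_s`.

    All defaults are illustrative assumptions, not tuned values.
    """
    block = os.urandom(block_size)  # random data so the payload is not trivially compressible
    stalls = []
    with open(path, "wb") as f:
        for i in range(count):
            start = time.perf_counter()
            f.write(block)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the block to the device
            elapsed = time.perf_counter() - start
            if elapsed > threshold_s:
                stalls.append((i, elapsed))
    return stalls


if __name__ == "__main__":
    # Demo against a temporary file; for a real test, point `path` at a file
    # on the RAID volume instead (the location is up to you).
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "stalltest.bin")
        for index, seconds in find_write_stalls(demo_path, block_size=4096, count=8):
            print(f"block {index} took {seconds:.1f} s")
```

Timestamps of any reported stalls could then be compared against the Event 129 entries in the System log to see whether each stall lines up with a reset.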
Full error text, including the details tab from the Windows Event Viewer:
Reset to device, \Device\RaidPort1, was issued.


- <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
- <System>
  <Provider Name="megasas" /> 
  <EventID Qualifiers="32772">129</EventID> 
  <Level>3</Level> 
  <Task>0</Task> 
  <Keywords>0x80000000000000</Keywords> 
  <TimeCreated SystemTime="2013-10-22T17:32:26.936828400Z" /> 
  <EventRecordID>21077</EventRecordID> 
  <Channel>System</Channel> 
  <Computer>SAN.xxxxxxxx.local</Computer> 
  <Security /> 
  </System>
- <EventData>
  <Data>\Device\RaidPort1</Data> 
  <Binary>0F001800010000000000000081000480040000000000000000000000000000000000000000000000000000000000000001000000810004800000000000000000</Binary> 
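For anyone comparing several of these events, the XML can be read programmatically. The sketch below uses Python's standard `xml.etree.ElementTree`; the `summarize` function and the `SAMPLE` string are illustrative assumptions. `SAMPLE` is a trimmed copy of the event above with closing tags added and the `<Binary>` value cut to its first 16 bytes so that it parses. It also combines the `Qualifiers` attribute (high 16 bits) with the event ID (low 16 bits) into a 32-bit code and checks whether that code appears, in little-endian byte order, inside the binary payload.

```python
import struct
import xml.etree.ElementTree as ET

NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the pasted event. The closing tags and the shortened
# <Binary> value (first 16 bytes only) are assumptions so the sample parses.
SAMPLE = r"""<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
  <Provider Name="megasas" />
  <EventID Qualifiers="32772">129</EventID>
  <Level>3</Level>
  <TimeCreated SystemTime="2013-10-22T17:32:26.936828400Z" />
  <Channel>System</Channel>
</System>
<EventData>
  <Data>\Device\RaidPort1</Data>
  <Binary>0F001800010000000000000081000480</Binary>
</EventData>
</Event>"""


def summarize(event_xml):
    """Pull the key fields out of one event and derive its 32-bit code."""
    root = ET.fromstring(event_xml)
    system = root.find("e:System", NS)
    event_id_el = system.find("e:EventID", NS)
    event_id = int(event_id_el.text)
    qualifiers = int(event_id_el.get("Qualifiers", "0"))
    full_code = (qualifiers << 16) | event_id  # qualifiers are the high word

    data = root.find("e:EventData", NS)
    binary_hex = data.findtext("e:Binary", default="", namespaces=NS).strip().upper()
    code_le = struct.pack("<I", full_code).hex().upper()  # little-endian bytes as hex

    return {
        "provider": system.find("e:Provider", NS).get("Name"),
        "event_id": event_id,
        "full_code": full_code,
        "time": system.find("e:TimeCreated", NS).get("SystemTime"),
        "device": data.findtext("e:Data", default="", namespaces=NS),
        "code_in_binary": code_le in binary_hex,
    }


if __name__ == "__main__":
    info = summarize(SAMPLE)
    print(info["provider"], info["event_id"], hex(info["full_code"]), info["device"])
```

This only confirms that the binary payload repeats the event code; it makes no attempt to interpret the remaining bytes.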
 

#2 dietrc70

  • Member
  • 106 posts

Posted 29 October 2013 - 04:39 AM

Is anything else associated with the reset error in the event log?

 

I have received those errors during TDR (Timeout Detection and Recovery) events caused by Nvidia's glitchy video drivers. If something unrelated temporarily hangs the system, drive reset errors seem to be one possible result.
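One way to check for associated events is to list everything logged within a short window around each reset. The following Python sketch assumes the events have already been exported into a list of dictionaries; the `neighbours` function, the 90-second window, and the two non-megasas sample records are made up for illustration and are not taken from the original poster's log.

```python
from datetime import datetime, timezone


def parse_system_time(value):
    """Parse an Event Viewer SystemTime string into an aware datetime.

    Event Viewer prints 100 ns precision; datetime holds microseconds,
    so only the first six fractional digits are kept.
    """
    stamp = value.rstrip("Z")
    if "." in stamp:
        whole, frac = stamp.split(".", 1)
        frac = (frac + "000000")[:6]
    else:
        whole, frac = stamp, "000000"
    parsed = datetime.strptime(f"{whole}.{frac}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def neighbours(events, provider="megasas", event_id=129, window_s=90):
    """Map each matching reset's timestamp to the other events logged
    within `window_s` seconds before or after it."""
    result = {}
    for reset in events:
        if reset["provider"] != provider or reset["id"] != event_id:
            continue
        reset_time = parse_system_time(reset["time"])
        result[reset["time"]] = [
            other
            for other in events
            if other is not reset
            and abs((parse_system_time(other["time"]) - reset_time).total_seconds()) <= window_s
        ]
    return result


# Hypothetical export: only the megasas record comes from the thread.
EVENTS = [
    {"time": "2013-10-22T16:00:00.000000000Z", "provider": "Service Control Manager", "id": 7036},
    {"time": "2013-10-22T17:31:30.000000000Z", "provider": "Display", "id": 4101},
    {"time": "2013-10-22T17:32:26.936828400Z", "provider": "megasas", "id": 129},
]

if __name__ == "__main__":
    for reset_time, nearby in neighbours(EVENTS).items():
        print(reset_time, [(e["provider"], e["id"]) for e in nearby])
```

If the same unrelated provider keeps showing up next to the resets, that would support the idea that something outside the RAID stack is hanging the system.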

 

You've troubleshot this problem so thoroughly that I can only think of two things:

 

1.)  Windows is having some issue unrelated to the RAID array that is triggering the resets.

 

2.)  You are using an AMD desktop CPU and chipset for what is really a serious server application.  I think RAID cards are tested primarily on Intel Xeon and AMD Opteron platforms.  I switched to an Intel Xeon E3-1245 (C206 chipset) for my last build because I was sick of experiencing weird incompatibilities with unusual workstation/server hardware.  I'd consider switching to an E3 Xeon platform or maybe an Opteron.  I know LSI cards can be finicky, so checking the motherboard/chipset hardware compatibility lists from LSI might be a good idea.  For a 24/7 server, I think ECC RAM is a good thing to have.

 

Having a premium PSU is always a good idea, so I think that was a sensible upgrade.  My gut feeling is that you may be encountering an odd incompatibility between your RAID application and a CPU/chipset that is primarily intended for gamers on a budget, and not for server hardware.


Edited by dietrc70, 29 October 2013 - 04:44 AM.



