alpha754293

problems with dual Opteron system

Okay, first off, system specs (per original build):

dual AMD Opteron 246 (2.0 GHz)

Tyan S2875ANRF

2x Crucial 1GB PC3200 ECC Registered DDR SDRAM

Adaptec 19160 SCSI HBA

2x Hitachi 73 GB 10krpm U160

3DLabs Realizm 200 512 MB

Realtek 8139 10/100 NIC

Sparkle 460W

LG DVD burner

"spare parts"

no-name brand 550W

Sapphire ATi Radeon 9250 64 MB

6x new Hitachi 73 GB 10krpm U160

Here's the issue:

Loss of video signal/system response. Can't reboot (with CTRL+ALT+DEL). Must turn off the system, wait anywhere from 1-5 minutes before power on, and doesn't always power on right away.

Stuff I've tried:

Originally, it happened intermittently, under varying usage, CPU load, and times.

Recently, it's been happening a lot more frequently (usually 30 seconds-2 minutes after the system boots into Windows XP Pro).

I thought that it was a power supply issue, so I switched it to the 550W.

I thought that it might be because the video card is pulling too much power, so I switched it to the Sapphire.

I thought that it might be an OS/virus issue with the install, so I tried installing Windows onto a fresh drive, and it would lose the signal after about 13% of a full NTFS format.

I thought that it might be an issue with the 10/100 NIC, so I pulled that out, leaving just the video card and SCSI HBA.

I thought that it might have been a RAM issue, so I ran through the permutations of the RAM modules (all the while with ECC disabled.) Tried it with ECC enabled, and 40 ns for all other options, no cigar.

Tried it again the next day and was able to successfully install Windows. Reformatted the drive and was able to install Red Hat Enterprise Linux 4 WS (after about 4 tries). Was able to have the system running for about an hour before it did the same thing.

And now, half the time, the system can't even get through the SCSI HBA BIOS (hard drive detection). If it manages to get through that and boot the Windows CD (using another new hard drive), it "craps out" after anywhere from 1-5% of the format.

If I enable ECC, it stops after POST. And the stop point varies quite a bit.

I haven't been able to test the memory on another system because I don't have anything else that can take PC3200.

Right now, a lot is pointing to the motherboard being the cause of the problem.

Any suggestions? Ideas?

Cuz I'm out. :(

P.S. I've had this system for a little less than a year and it's used primarily for engineering/scientific computation.


Well, Sparkle often uses TEAPO capacitors in their power supplies, and generally they are OK for use in a switch-mode PSU (never OK in the VRM for a CPU).

This should tell you something about their quality. What I am saying is that I think your capacitors have failed, probably without bulging or leaking electrolyte (that is common for TEAPO caps).

And that "no-name" 550W PSU could, for all we know, be a Deer-based design and thus will never be able to deliver anything close to what it says on the label.

Take the PSU apart to verify which capacitors are in it and whether they visually look OK, taking care to heed the hazardous-voltage warning: the primary caps hold quite a punch and can kill you...

Fortunately, it is pretty easy to replace the caps in a power supply with high-quality ones, since it is a single-sided PCB. If you want to do so, please read up first at www.badcaps.net/forum


To me it looks like either a RAM problem or something like a chipset overheat. Take a look at the temperatures of the core system components, and also double check to make sure the CPU temps are OK, and the heatsinks are fitted properly - a slight (0.5') tilt can easily render a heatsink useless and cause problems similar to this.

To me it looks like either a RAM problem or something like a chipset overheat. Take a look at the temperatures of the core system components, and also double check to make sure the CPU temps are OK, and the heatsinks are fitted properly - a slight (0.5') tilt can easily render a heatsink useless and cause problems similar to this.

CPU temps are normal at ~44 C with a 20 C ambient.

I don't think that I would be someone who should be rebuilding power supplies.


Replace the power supply (no-names are crap; get one like SilverStone or such) and replace the NIC with a decent brand like 3Com etc.

Cheers, Wizard

Replace power supply (no names is crap, get one like silverstone or such) and replace the NIC with decent brand like 3COM etc.

Cheers, Wizard

I've only had the Sparkle for a couple of months though. Prior to that, I was using an Antec True430 with an ATX 20-pin to 24-pin adapter (I don't remember the names of the standards at the moment).

So, I find it "odd" that it would do that.

My theory is that it's either the motherboard or the RAM, especially given the erratic, unpredictable, and unexpected behavior when ECC is enabled.


Have you tried with a single CPU? If the memory controller on either CPU is bad, this may be the cause. That or the mainboard - you've swapped most parts already. And another SCSI controller?

Have you tried with a single CPU? If the memory controller on either CPU is bad, this may be the cause. That or the mainboard - you've swapped most parts already. And another SCSI controller?

Haven't tried permuting through the CPUs cuz I'm worried that I would eventually mess up one of them.

I placed an order for a replacement board to see if it is a board problem because, for the time being, even if I did manage to run through all of that and it is a board problem, I would probably still end up with an inconclusive result.

If the same stuff is still happening with the new board, then I know that it is not on account of the board, and thus I can try the other stuff.

Other than that, I don't know what else to do/try.

I've only had the Sparkle for a couple of months though. Prior to that, I was using an Antec True430 with a ATX 20 pin to 24-pin adapter (I don't remember the names of the standards at the moment).

So, I find it "odd" that it would do that.

Antec has always used crap capacitors from Fuhjyyu; they generally do not last longer than one year before starting to bulge and cause exactly the problems you describe

See this thread on the "quality" of Antec power supplies and then come back and say you still like them ;)

http://www.badcaps.net/forum/showthread.php?t=1165


I've only had the Sparkle for a couple of months though. Prior to that, I was using an Antec True430 with a ATX 20 pin to 24-pin adapter (I don't remember the names of the standards at the moment).

So, I find it "odd" that it would do that.

Antec has always used crap capacitors from Fuhjyyu; they generally do not last longer than one year before starting to bulge and cause exactly the problems you describe

See this thread on the "quality" of Antec powersupplies and then come back and say you still like them ;)

http://www.badcaps.net/forum/showthread.php?t=1165

I have like 3 systems using Antec True430s.

One is 2.5 years old, and two are probably like a year and a half.

They've all run fine without any problems.

One system has 3 SCSI drives, dual MP2600, and a PATA drive.

The other has 4 SCSI drives, two PATAs, dual MP1800, and a Wildcat3 6110 (which requires an AGP Pro50 slot).

The problems on the dual Opteron are with a Sparkle 460W (switched it over after about 4 months to help simplify some of the wiring).

Edited by alpha754293


Antec only resells power supplies made by others.

I prefer to buy the PSU separately from the case. I like SilverStone and some others, especially ones with a 120mm fan for quietness.

Cheers, Wizard

Antec only resells power supplies made by others.

I prefer to buy PSU directly seperate from case. I like silverstone and one others. Especially one with 120mm fan for quietness.

Cheers, Wizard

The power supplies were purchased separate from the case.

I still find it highly unlikely that it is a power supply problem because I think that the chances of both power supplies going bad at the same time are slim, especially when one of them was not in use for a while.

Would a system be necessarily able to power on (but fail at POST) on account of a power supply? I would think not because if it is a bad cap, I would have expected that the system wouldn't even power on; and/or that it wouldn't even make it past the video card BIOS.


This really sounds like either a power supply issue (you just might need more power than your power supply is able to give) or a motherboard issue (as you pointed out).

Since you can't do anything about trying the new motherboard until it arrives, you might also consider the following suggestions:

Try to calculate the power draw on the individual power rails (12V, 5V, and 3.3V) to see how much juice you need with all that hardware, and then compare that to what the power supply is able to provide.

Hook up an oscilloscope to your power supply to see how "clean" the power is (i.e., how close to a perfect sine wave).

Thinking along those lines, you might also try hooking up a quality UPS to this system if it doesn't already have one to eliminate power spikes/sags from your local power grid being the root cause. You might be sharing your source circuit with other "noisy" equipment (laser printers are notorious for this).

Keep us posted on your progress!

Hook up an oscilloscope to your power supply to see how "clean" the power is (i.e., how close to a perfect sine wave).

If you get a sine wave output from the PS, you know there is a major problem! Perhaps Trinary meant power supply as in the AC wall outlet. Output from the PS should be essentially flatline (DC) with minor ripples.
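
For anyone who does get scope readings off a DC rail, here is a rough sketch of what "flatline with minor ripples" could mean in numbers. The sample voltages are synthetic, the function name `check_rail` is made up for illustration, and the limits are the commonly cited ATX figures (120 mV peak-to-peak on +12V, 50 mV on +5V and +3.3V, with 5% regulation tolerance); treat them as assumptions and check the spec for your own PSU.

```python
# Ripple check on DC rail samples. The sample values are SYNTHETIC, and the
# limits are commonly cited ATX figures (verify against the spec for your PSU).

RIPPLE_LIMIT_MV = {"+12V": 120, "+5V": 50, "+3.3V": 50}  # peak-to-peak, in mV
TOLERANCE = 0.05  # +/-5% DC regulation


def check_rail(name, nominal_volts, samples):
    """Return (mean_volts, ripple_mv, within_spec) for a list of voltage samples."""
    mean = sum(samples) / len(samples)
    ripple_mv = (max(samples) - min(samples)) * 1000
    in_tolerance = abs(mean - nominal_volts) <= nominal_volts * TOLERANCE
    return mean, ripple_mv, in_tolerance and ripple_mv <= RIPPLE_LIMIT_MV[name]


healthy = [12.02, 12.05, 11.99, 12.03, 12.00]  # synthetic: small ripple
sagging = [11.20, 11.60, 11.05, 11.45, 11.30]  # synthetic: low and noisy

for label, samples in (("healthy", healthy), ("sagging", sagging)):
    mean, ripple_mv, ok = check_rail("+12V", 12.0, samples)
    print(f"{label}: mean {mean:.2f} V, ripple {ripple_mv:.0f} mV, within spec: {ok}")
```

A handful of samples like this only illustrates the idea; a real measurement would be taken under load with the scope set up for ripple measurement.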

This really sounds like either a power supply issue (you just might need more power than your power supply is able to give) or a motherboard issue (as you pointed out).

Since you can't do anything about trying the new motherboard until it arrives, you might also consider the following suggestions:

Try to calculate the power draw on the individual power rails (12V, 5V, and 3.3V) to see how much juice you need with all that hardware and then compare to what the power supply is able to provide?

Hook up an oscilloscope to your power supply to see how "clean" the power is (i.e., how close to a perfect sine wave).

Thinking along those lines, you might also try hooking up a quality UPS to this system if it doesn't already have one to eliminate power spikes/sags from your local power grid being the root cause. You might be sharing your source circuit with other "noisy" equipment (laser printers are notorious for this).

Keep us posted on your progress!

I don't have a scope handy.

And I don't know how much power the system draws. I think that the Realizm 200 is 85W, 2 SCSI drives 73 GB each; (no idea), dual Opterons (max. thermal I think it's 89 W?).

That still doesn't explain how switching it to one SCSI drive; and the Radeon 9250 (which has a passive heatsink on it) results in the same problem.

Haven't tested the power supplies out yet; but I think that it would have no problem supplying enough juice to the six 73 GB drives for a RAID50 array. (Just a guess, not verified/validated yet).


I have about the same configuration, minus the Wildcat. It eats 600 watts using a Tagan PSU. So I don't think a 550-watt "generic" one will cut it.

Anyway, have you tried memtest? It is extremely handy for recognizing RAM problems. Just write it to a diskette and boot from it. It will test your RAM in 4-5 steps. To test your board and CPU stability, use OCCT or Prime95.

here is memtest homepage - http://www.memtest.org/


As the others say, I would also test with a good PSU. Crappy/defective PSUs give you strange problems... and this piece of equipment you have needs its juice :)


Also, you may have some joy running the memory at 333 (PC2700) instead of 400 (PC3200), or in single channel mode, or both.

I'm naturally suspicious of the memory in this instance! That's not to say the power supply won't become an issue later on. Bitter experience has led me to spec high-quality PSUs on all my recent builds.

I still find it highly unlikely that it is a power supply problem because I think that the chances of both power supplies going bad at the same time, especially when one of them was not in use for a while.

Capacitors from the bad companies even go bad when not in use. I have an old Epox 8KHA+ mobo here on which the caps were fine when it was retired; now, a few years later, some caps have started bulging... GSC crap...

Would a system be necessarily able to power on (but fail at POST) on account of a power supply? I would think not because if it is a bad cap, I would have expected that the system wouldn't even power on; and/or that it wouldn't even make it past the video card BIOS.

Unless the capacitor is shorted (which is extremely uncommon), it will power on just fine. Many systems will let you boot into Windows, but when you do something intensive like loading the CPU to 100%, they will crash.

Why do you not take a look at that link I posted, also do a search for "Antec" on the same site, their quality really is not all it's cracked up to be (to say it in a kind way)

Also keep in mind that many power supply makers post the specs their power supply can deliver at room temp inside the PSU, and in the usual setup the PSU will be drawing hot air from the processors, in many cases air that is already at 40°C... And the efficiency of some of the less-than-stellar power supplies decreases tremendously with increased temp...

That is why you see the real nice brands like Zippy for example state 400w@50°C (just an example)


There is no way that it can be 600W, even with the Wildcat.

I don't know what the power draw is on the processors, but presuming that it's 90W a piece (for Opteron 246), that's 180W.

According to the datasheet, and if I have done my calculations correctly (mech, not EE/CE/CS), the hard drive's power requirement at startup is around 33W. Given that the system failed even when I tried to install Windows XP on just the one drive, the drive total is taken to be 33W. (The hard drive idles at 8 W.)

Realizm 200 (from what I've been told) is 85 W.

That totals up to 300W.

I don't understand where the other 300W would come from?

Why do you not take a look at that link I posted, also do a search for "Antec" on the same site, their quality really is not all it's cracked up to be (to say it in a kind way)

The dual Opteron hasn't been using the Antec for quite some time. And the systems that are using the Antec are still running without a hitch.

It's not that I don't believe you, I think that you probably have a very valid point; but I cannot afford to just go and replace all of the power supplies on account of that; especially when they're powering dual MPs (which are less power efficient than the dual Opterons) and has more drives.

why don't try memtest

no floppy drive attached.

I also figure that if I can't get the system to stay running Windows (idle), then I don't see much purpose in running any of the stability checks, cuz when the OS (Windows and/or RHEL) dies, so does the program.

If Windows is able to stay idle for like an hour; then absolutely, I would agree with you to run those tests.

BIOS

a) no floppy drive attached.

b) I cannot have the system "shutdown" on me partway through flashing the BIOS. BAD wouldn't even begin to describe it.

If I recall correctly, Tyan does not warrant against faulty or improperly done BIOS flashes.

- * - * - * -

How would I hook up a scope to the power supply to begin with? Would the system have to be connected, plugged in, and on in order for me to get readings from it?


Hi,

I have been running dual processor systems for several years now, the most common problem whenever I have to upgrade the system to a new architecture has been finding the appropriate power supply for it.

My suggestion is to make sure that the PSU you are using can deliver the required power for the motherboard and peripherals. Modern PSUs can have several rails, each with a lower amperage rating, to meet the high power demand of newer systems. Check that your system is properly catered for on each rail instead of just looking at the overall power consumption.
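
As a rough illustration of a per-rail check, the sketch below compares what each rail can supply (amps from the PSU label times the rail voltage) against an estimated load plus headroom. Every number in it is a hypothetical placeholder, not a figure for any PSU or system in this thread, and `rail_report` is just an illustrative name.

```python
# Per-rail headroom check. Every number below is a HYPOTHETICAL placeholder;
# substitute the amp ratings from your PSU label and your own load estimates.

RAIL_VOLTS = {"+12V": 12.0, "+5V": 5.0, "+3.3V": 3.3}

psu_rated_amps = {"+12V": 30.0, "+5V": 30.0, "+3.3V": 28.0}  # hypothetical label values
estimated_load_watts = {"+12V": 250.0, "+5V": 90.0, "+3.3V": 40.0}  # hypothetical loads


def rail_report(rated_amps, load_watts, margin=0.25):
    """For each rail, return (capacity_watts, load_with_margin_watts, ok)."""
    report = {}
    for rail, volts in RAIL_VOLTS.items():
        capacity = rated_amps[rail] * volts
        needed = load_watts[rail] * (1 + margin)
        report[rail] = (capacity, needed, needed <= capacity)
    return report


for rail, (capacity, needed, ok) in rail_report(psu_rated_amps, estimated_load_watts).items():
    status = "OK" if ok else "OVERLOADED"
    print(f"{rail}: capacity {capacity:.0f} W, need {needed:.0f} W -> {status}")
```

PSU labels often also give a combined limit for the +3.3V and +5V rails, which this sketch ignores, so a rail that passes here could still be short in practice.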

Forget most run-of-the-mill PSUs and go for a server-grade unit. A good starting point would be an Enermax 851; check that the version you buy has the correct wiring for your board.

Other than PSU issues, check that the system has sufficient cooling as dual boards tend to run rather hot.

Hope this helps.

T.


Now I remember the comment on memory choices. Yes, multi-CPU boards are very picky about memory.

Don't go by a PSU's MAX power rating. That is the surge power the PSU can provide for a short time. For steady use, take the typical total wattage and add a 25% fudge factor, and you have it.

For example, you figured 300W, which is really a bit too low. RAM, chipsets, video, network, and other cards can consume around 100-130W, and dual CPUs around 200W at the most because of voltage regulator losses. Hard drives typically draw 11-13W each when writing/seeking, so if you have three hard drives, for example, that's 39W.

Adding this all up gives nearly 370W, and the 25% fudge factor brings that to roughly 465W. A decent 500W PSU is sufficient.
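
The arithmetic above can be sketched as follows. It uses only the rough estimates quoted in this post (upper end of the 100-130W range, 200W for the CPUs, three drives at 13W each), not measurements, and the function name `required_psu_watts` is made up for illustration.

```python
# Steady-state power budget sketch using the rough figures quoted in this post.
# All wattages are estimates from the thread, not measurements.

FUDGE_FACTOR = 0.25  # 25% headroom, as suggested above

loads_watts = {
    "RAM, chipsets, video, network, etc.": 130,  # upper end of the 100-130W guess
    "dual CPUs incl. regulator losses": 200,
    "3 hard drives at 13W each": 3 * 13,
}


def required_psu_watts(loads, fudge=FUDGE_FACTOR):
    """Return (typical total, total with headroom) in watts."""
    total = sum(loads.values())
    return total, total * (1 + fudge)


total, with_headroom = required_psu_watts(loads_watts)
print(f"typical total: {total} W")
print(f"with {FUDGE_FACTOR:.0%} headroom: {with_headroom:.0f} W")
```

With these inputs the total comes to 369 W and about 461 W with headroom, which is in line with the rounded figures in the post. Fans and extra drives would be added as further entries in `loads_watts`.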

Oh yeah, fans are not power misers; many can eat up 3W to 5W per fan. High-performance fans can even eat up 10W, and that includes the 6800rpm 60mm whiners.

You don't have to buy stuff; borrow or test-swap certain parts one at a time to isolate the problem. Quality RAM is a MUST, especially for a multi-CPU system. OCZ, Crucial, etc. are the ones you should have.

Cheers, Wizard

Quality ram is a MUST especially for multi-CPU system. OCZ, curical, etc are ones should have.

Quality RAM is always needed for this type of setup... But both OCZ and Crucial are known for NOT playing nice with SMP boxes (at least when I was buying my setup).

You're better off sticking with what the board manufacturer has certified to work, eg Kingston (which is what I run with my K8W), Corsair, Transcend, etc.

Back on-topic. I had very similar issues with my K8W, and it was the PSU. I had all sorts of errors - SCSI drive errors, ECC checksum errors, hard locking while running OpenGL apps, and after replacing the PSU, it was all good again.

why don't try memtest

no floppy drive attached.

There is an ISO version of Memtest86+ for burning onto a CD; use that...

And I agree that you should not flash your BIOS if the computer is unstable. Take a flashlight and shine it through the holes in the power supply, looking at the capacitor tops to make sure they are 100% flat (i.e. not bulging or leaking electrolyte).

