alpha754293

problems with dual Opteron system

Thanks for the correction on RAM choices for multi-CPU setups.

But for single-socket applications (run-of-the-mill stuff, no SMP boards), are these memory brands OK?

I repair electronics (TVs mostly) as a full-time job, so I have experience. :)

Cheers, Wizard

But for single-socket applications (run-of-the-mill stuff, no SMP boards), are these memory brands OK?

Sure, no problem using those brands in single-CPU systems.

The problem is that companies like OCZ and Crucial run out-of-spec timings on their chips to allow more aggressive settings (what I mean is that the timings fall outside the range the CPU/chipset manufacturer defines as allowable). While their sticks may be good for overclockers, the aggressive timings cause funny things to happen with SMP systems.

OCZ also tends to buy from the cheapest supplier at the time, so you may never get what you expect. (E.g. two sticks of the same model may in fact use different brands of chips.)

There's a lot of back and forth about what can and can't be done.

Get the memtest ISO, burn it, and test your RAM.

Swap the power supply with another unit.

Swap the RAM and/or motherboard.

It's one of those three, and the sooner you start testing them properly, the sooner you start correcting the problem.

Also, you may have some joy running the memory at 333 (PC2700) instead of 400 (PC3200), or in single channel mode, or both.

alpha, you didn't come up with a good reason not to try the above so I assume you are doing it! Any luck?

Now I remember the comment about memory choices. Yes, multi-CPU boards are very picky.

A PSU should not be sized by its MAX power rating. That figure is the surge power the PSU can provide for a short time. For steady use, take a typical total wattage, add a 25% fudge factor, and you have it.

For example, you figured 300W; this is a bit too low really. RAM, chipsets, video, network and other cards can consume around 100-130W, and dual CPUs around 200W at the most because of voltage regulator losses. A hard drive typically draws 11-13W when writing/seeking. With 3 hard drives, for example, that's 39W.

Adding this all up gives nearly 370W; plus the 25% fudge factor, that comes to about 465W. A decent 500W PSU is sufficient.

Oh yeah, fans are not power misers; many can eat up 3W to 5W per fan. High-performance fans can even eat up 10W, and that's even for the 6800rpm 60mm whiners.

You don't have to buy stuff; borrow or test-swap certain parts one at a time to isolate the problem. Quality RAM is a MUST, especially for a multi-CPU system. OCZ, Crucial, etc. are the ones you should have.

Cheers, Wizard

The 300W that I mentioned includes the 85W for video.

I have reduced the hard drive count to one because the system has problems even with that; so, to simplify things, that's what I have done.

That drops the drive load down to the 11-13W that you mentioned. However, I also did you one better: I used the spin-up power consumption instead, even though I know the drive doesn't consistently draw that much once it has spun up. That, too, is included in the 300W.

The Sparkle that I've been using is a 460W, which puts me 5W under, or a little over 1%.

Quality RAM is always needed for this type of setup... But both OCZ and Crucial are known for NOT playing nice with SMP boxes (at least when I was buying my setup).

You're better off sticking with what the board manufacturer has certified to work, eg Kingston (which is what I run with my K8W), Corsair, Transcend, etc.

Back on-topic. I had very similar issues with my K8W, and it was the PSU. I had all sorts of errors - SCSI drive errors, ECC checksum errors, hard locking while running OpenGL apps, and after replacing the PSU, it was all good again.

Pretty much all of my 2P systems use Crucial. Heck, even my 1P systems use Crucial, with the exception of my laptop, my Sun Ultra 60, and my old server.

And while I can't say for certain; I believe that all of them are either Samsung memory chips, or Hynix. (Haven't checked all of the modules).

In any case, I just checked the tracking on the shipment for the board, and it does have an ETA of 7/07/06, so I will definitely report back once I have the system up and running and stabilized.

alpha, you didn't come up with a good reason not to try the above so I assume you are doing it! Any luck?

Nope, haven't tested it yet. Waiting for the new board to arrive before I start the next sequence/series of tests.

There's a lot of back and forth about what can and can't be done.

Get the memtest ISO, burn it, and test your RAM.

Swap the power supply with another unit.

Swap the RAM and/or motherboard.

It's one of those three, and the sooner you start testing them properly, the sooner you start correcting the problem.

The CD burner resides on the unstable system.

The power supply has already been swapped. The no-name unit was the only spare I had that wasn't in use.

I don't have any other RAM modules to swap in (those are the only two PC3200 ECC Registered modules I have). The motherboard's on its way.

- * - * - * -

SLIGHTLY off-topic question:

How "bad" would it be to run PC3200 RAM on a PC2100 board?

I remember hearing that, as a rule of thumb, you should only drop the speed by one grade, and the RAM should tolerate that. Drop it by two, and you might cause more harm than good.
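To put numbers on "one grade" versus "two grades", here is a minimal Python sketch. It assumes the standard DDR module naming (PC1600/PC2100/PC2700/PC3200 for DDR200/266/333/400) and a 64-bit module, so the PC number is roughly the peak bandwidth in MB/s. The function names are made up for illustration, and the sketch only counts grades; it says nothing about whether a particular module will actually run stable at the lower speed.

```python
# Standard DDR (DDR1) speed grades, slowest to fastest:
# module name -> memory bus clock in MHz (data moves twice per clock).
DDR_GRADES = [
    ("PC1600", 100.0),      # DDR200
    ("PC2100", 400.0 / 3),  # DDR266
    ("PC2700", 500.0 / 3),  # DDR333
    ("PC3200", 200.0),      # DDR400
]
NAMES = [name for name, _ in DDR_GRADES]


def grades_dropped(module, board):
    """How many speed grades a module is slowed down by on a given board."""
    return NAMES.index(module) - NAMES.index(board)


def peak_bandwidth_mb_s(module):
    """Peak bandwidth: clock x 2 transfers per clock x 8 bytes (64-bit bus)."""
    clock_mhz = dict(DDR_GRADES)[module]
    return round(clock_mhz * 2 * 8)


if __name__ == "__main__":
    for board in ("PC2700", "PC2100"):
        print("PC3200 on a", board, "board:",
              grades_dropped("PC3200", board), "grade(s) down,",
              peak_bandwidth_mb_s(board), "MB/s peak")
```

By this count, PC3200 on a PC2100 board is two grades down, which is the case the rule of thumb above warns about.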

(I also don't have any other system that can even TAKE the PC3200 as it is).

Otherwise, I'd run the memtest on that instead.

Chewy509,

"OCZ also tends to buy from the cheapest supplier at the time, so you may never get what you expect. (E.g. two sticks of the same model may in fact use different brands of chips.)"

OUCH! That's no fun. That said, I bought dual-stick kits on two separate occasions, once a 256MB pair and once a 512MB pair, and they've been good so far.

Now, what is a good memory brand that doesn't pull a joke on me with mismatched chips?

Dropping below rated speed is PERFECTLY fine; I have been known to run DDR400 at 333, even 200, with no problems.

Cheers, Wizard

Windows, even when idle, is still running from RAM, is it not?

The purpose of running OCCT and memtest is to pinpoint the culprit: memtest for the RAM and OCCT for the processor. That's standard procedure when you overclock, to pinpoint the problem when the system won't go any higher and to find which component caused the instability. Granted, these tests are usually run at 250MHz or 300MHz or even more. But in this case, it is not stable at 200MHz. Correct?

Why don't we check, one by one, which component is good and which one is not?

For anyone who needs to know the max watts of specific processors, google for: erols processor specification and click on the first link.

There are two kinds of Opteron 246: the regular 246 at 89W each and the 246 HE at approx. 55W.

I was thinking and revised the wattage estimate, so bear with my thoughts:

Figuring 89W each, a typical 200W for two CPUs (don't forget the regulator losses). There is also a lot of other stuff to account for. Say you have more than 2 HDs, let's say 3, at 12W each (seeking/reading/writing). Video card, if it is mid-range, say 50W. Memory, say 5W or so per stick; it can't be much more, since 10W would require heatspreaders.

Add 3W for each fan (typical machines have 3 to 4 fans). A CD-ROM pulls hard on 12V while actively in use (reading or burning), so give it 10W. Give another 15 to 20W for assorted cards.

Then you have approx. 340W; with the 25% fudge factor added, that is 425W, a bit too close to capacity for most 500W PSUs. You should go with a quality PSU in the 550-600W range.
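The estimate above can be tallied as a minimal Python sketch. The per-component figures are the ones given in this post, taking the upper end of each range; the stick count (2) matches the two PC3200 modules mentioned earlier in the thread, and the fan count (4) is an assumption. The function name `psu_budget` is made up for illustration.

```python
def psu_budget(loads_w, fudge=0.25):
    """Return (steady-state total, total plus fudge factor) in watts."""
    total = sum(loads_w.values())
    return total, total * (1 + fudge)


# Figures from the post above; counts marked "assumed" are not stated exactly.
loads = {
    "CPUs (2 x 89W plus regulator losses)": 200,
    "hard drives (3 x 12W)": 36,
    "video card (mid-range)": 50,
    "memory (2 sticks x 5W)": 10,
    "fans (4 x 3W, assumed count)": 12,
    "CD-ROM while reading or burning": 10,
    "assorted cards (upper end of 15-20W)": 20,
}

total, with_margin = psu_budget(loads)
print("steady load:", total, "W")           # 338 W, i.e. "approx. 340W"
print("with 25% margin:", with_margin, "W")  # 422.5 W, i.e. roughly 425W
```

Swapping in measured or datasheet values for your own parts is the whole point; the tally is only as good as the per-component guesses.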

Several years ago Tom's Hardware ran stress tests on several PSUs; many shut down or drooped excessively, and in one case one blew up!

MOST of them shut down or fall out of regulation well below their rated specs. Just to keep in mind.

Keep us updated!

Cheers, Wizard

Hold your horses there, you just made the jump from a system that peaks at 340W (assuming you can get everything working flat out at the same time, which is unlikely), to a 600W PSU! As long as there's enough 12V wattage on the relevant lines to support CPU(s) and graphics card(s), you don't need to overspecify quite that much on the PSU. A quality 430W PSU ought to serve just fine in that system (though to get the EPS 12V connectors most dual socket boards use, you'd probably end up using a higher end model anyway).

Get something to read out how many watts the system is drawing from the wall. Load your system as much as you can. You may be surprised at just how little your system draws, especially given that typically only 70-80% of the wall draw gets converted into DC to feed the components; the rest is lost as heat generated by the PSU.
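As a rough illustration of the wall-draw point, here is a minimal Python sketch that converts a wall-meter reading into the DC load the components actually see, using the 70-80% efficiency range quoted above. The 300W reading is a made-up example, not a measurement from this system, and the function name is hypothetical.

```python
def dc_load_from_wall(wall_w, efficiency):
    """DC power delivered to the components for a given wall draw."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be a fraction between 0 and 1")
    return wall_w * efficiency


# Hypothetical reading from a wall-socket power meter under full load.
wall_reading_w = 300

for eff in (0.70, 0.80):
    dc_w = dc_load_from_wall(wall_reading_w, eff)
    print("at", int(eff * 100), "% efficiency:", round(dc_w), "W DC load,",
          round(wall_reading_w - dc_w), "W lost as heat in the PSU")
```

The PSU's rating applies to the DC side, so it is the converted figure, not the wall reading, that should be compared against it.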

Moving on, memtest is useful because it's independent of any software or hard drive problems you may be having. If memtest fails, it's likely that your memory is faulty, or your combination of memory, motherboard, voltages and timings isn't working.

OUCH! That's no fun. That said, I bought dual-stick kits on two separate occasions, once a 256MB pair and once a 512MB pair, and they've been good so far.

Buying them in the dual pack tends to be fine (OCZ ensure that they are matched), you really need to watch it when buying single sticks at a time. Some of the cheaper no-name brands also tend to have the same problem.

I'm stepping into this late but I have to say it sounds like a heat problem.

A couple other things that can overheat and cause problems are:

  • North/Southbridge chips. Sometimes they come with a lousy fan that dies and causes instability.
  • Video cards. Same problem: lousy fan.

Other weird problems I've encountered that caused system instability:

  • Memory that was oversold: sometimes it is rated too high so it is unstable at its default settings and you have to slow it down manually to make your system stable.
  • A motherboard short: A loose screw was causing screwy things to happen on one system I built.
  • PowerNow! Enabled: I turned this on with a pair of E4-stepping Opterons and the system became completely unstable (which shouldn't happen).
  • Bad SCSI cable: I had an 80-wire straight cable that caused me a lot of problems. I replaced it with a higher quality cable and all my problems were solved.
  • One bad ECC memory module: You would think the error-checking would help you avoid problems.

Good luck.

Can an improperly mounted motherboard cause the kind of problems that I have been experiencing?

I.e. where you think that the board is mounted securely and properly, but something isn't quite right with it (off by a little bit)?

Is there a Linux version for OCCT? (I think that I do remember seeing a Linux version for memtest).

Also, does anybody know of a tool that would be able to read the temperatures off the S2875 in RHEL 4?

Can an improperly mounted motherboard cause the kind of problems that I have been experiencing?

I.e. where you think that the board is mounted securely and properly, but something isn't quite right with it (off by a little bit)?

It can, but it is rare if, as you say, the board is mounted securely and properly. I once experienced an issue where a poor-quality case would flex and momentarily short-circuit contacts on the back of the motherboard, causing a system crash.

I suppose you could rule this out by simply pulling the board and all associated devices out of the case and sticking it on your table or so - though that may or may not be practical in your situation.

Good points on these.

Mostly common sense stuff.

As for Tyan, I had trouble with two of their boards before (Socket 7 days), so you may have to get a non-Tyan dual-CPU board instead.

1. Are the CPU heatsinks mounted PROPERLY?

2. If the chipsets have heatsinks or fans on them, it is best to change them over to a larger chipset heatsink with some Arctic thermal paste.

3. What kind of case is it, and does it follow good airflow design practice?

Cheers, Wizard

Well, that's part of the reason why I am asking.

My "replacement" board arrived yesterday.

So, in doing a gut/go-around check, I pulled the board out of the case, set it on top of the anti-stat bag, on top of the box, and the system had no problems booting up.

And then, I had another case nearby that had different mounts for the motherboard, put it in that; and since then, the system has been running ok.*

* - minus my Adaptec 19160 - I think that I might have accidentally killed it when I didn't seat it properly in the slot. My fault. Stupid error.

I tried to install the OS on an AcceleRAID 250, but for some strange reason, once the install is finished, it has a problem finding the correct drive to boot off of, even if I make the other drives go offline.

Then I remembered that I had an old Adaptec 2940U2W sitting around somewhere; I dug it up, installed it, and I'm now running RHEL 4 off of it on one of the drives.

Were those successes with the old motherboard, or the new one? If it was with the old one, that's very interesting. If it was with the new one, it's less conclusive.

Do you plan to leave it as-is now, or do you want to troubleshoot your SCSI card issues?

Old. The new one is still sitting in the foam, in the anti-stat bag.

To be perfectly honest, I don't know if I am going to return the new board now that I seem to have the system working again. I like knowing that the board is there and that I don't have to wait for a replacement to arrive, just in case.

But at the same time, if I return it; then at least I can get my money back, possibly minus any restocking fees.

And yes, troubleshooting the SCSI card issue would be greatly appreciated.

Could be some sort of electrical short in the old case; could also be down to the change in airflow pattern, maybe some critical part of the motherboard was overheating in the old case. Or perhaps there's a tiny fracture in the motherboard, that used to come apart as the system heated up, but with the different pattern of mounting points that's no longer an issue.

Bottom line - if it continues to work, leave the motherboard alone. If you encounter the problems again, the motherboard should be your first port of call.

As for booting off the SCSI card, perhaps there's something between the motherboard and RAID BIOSes that's not quite working. It may help to disable any unused devices on the motherboard, especially controllers, but anything that might have its own BIOS elements that load during startup. There's only so much BIOS stuff a standard PC can load, so if you have too many devices with BIOSes, it can cause problems. Just a possibility.

There is no change in the airflow pattern. I just transferred all contents from one case to another.

System (always) runs "open case" (i.e. mounted, but with no side panels).

I have one 80 mm fan in the front cooling the SCSI drives, and that's it.

I don't have a means to report the CPU temps right now as it is running RHEL 4 (and I don't know of any tool or application that can pick up the temperature readouts off the S2875 that runs in Linux.)

Other than that, the system sits right in front of the air conditioning vent, and the thermostat is set for 68 F.

The SCSI card was working before. I think I might have killed it when I took the motherboard out of the case and ran it on top of the anti-stat bag: I assumed the card was seated properly before startup (without checking), then the system couldn't find the card, and that's when I realized "uh oh".

Well, that's part of the reason why I am asking.

My "replacement" board arrived yesterday.

So, in doing a gut/go-around check, I pulled the board out of the case, set it on top of the anti-stat bag, on top of the box, and the system had no problems booting up.

And then, I had another case nearby that had different mounts for the motherboard, put it in that; and since then, the system has been running ok.*

* - minus my Adaptec 19160 - I think that I might have accidentally killed it when I didn't seat it properly in the slot. My fault. Stupid error.

I tried to install the OS on an AcceleRAID 250, but for some strange reason, once the install is finished, it has a problem finding the correct drive to boot off of, even if I make the other drives go offline.

Then I remembered that I had an old Adaptec 2940U2W sitting around somewhere; I dug it up, installed it, and I'm now running RHEL 4 off of it on one of the drives.

Well, just keep in mind that Windows MUST have the boot volume created in the first 8 GB of the array. If, for example, you have other OSes/data installed in the first 8 GB, then I seem to remember that will cause an issue with Windows failing to boot. Do you receive any kind of error message to go along with this failure to boot?

Also, were you previously using nylon screws and washers (except for one) to mount the motherboard? If you used more than one metal screw/washer, I think that might have caused some grounding issues, but I'm not really sure.

All mounts were with straight metal screws.

I think that it might have been related to the standoffs actually.

There is no other OS and/or data within the first 8 GB of the drive. It BSODs very quickly, then reboots. I can barely even catch that there was a BSOD.

If I understand correctly, you have had this system working for over 4 months. If you are having trouble now, then something broke. You also have it sitting with an open case next to the A/C vent. Do you have any condensation in the computer?

I agree with a previous poster that it sounds like a heat-related issue. That would fit if you turn it on and it locks up shortly after, then you let it sit until the next day and it stays on longer on the first try and shorter on each consecutive try.

Now the computer is running well in a new case, with the same power supply and a different SCSI card. I would like to see you put it back into the old case now and see if it works. I suspect the Adaptec 19160 was the culprit all along.

The Adaptec 19160 was working for a little bit after I had migrated it to the case. And then when I pulled it out to permutate through the slots, that's when I started noticing that it wasn't picking up my drives.

Checked all connections; still nothing.

There is no condensation (as far as I can see/tell) because it's central air, set for 20 C.

Since then, it's been up for 2 days, 8 hours.

I've been able to do some of my CFD runs on it with no problems (including ones that were giving me trouble before), and now I am running two instances of F@H.

Which leads me to another question - is running two instances of F@H enough of a stress test for the system?

Prior to the case migration, I couldn't even get the OS installed and if I did, it couldn't stay running long enough to do that.

And another thing: I have also gone back to the Sparkle 460W power supply, partly because it has the EPS12V connector (freeing a molex) and partly because the 550W that I had used has a total of 8 molexes and an ATX12V connector. I didn't want to put the hard drives, the Wildcat Realizm, and the motherboard molex (plus an LG DVD burner) on the two 4-connector chains, so moving back to the Sparkle was my only option.

Again, as I've said, since then it has had an uptime of 2 days 8 hours running at full load continuously.

Another reason why I think that it was due to a mounting (short) problem is because the system would not always power on right away (takes a couple of tries) regardless of which power supply I used.

Now, it has no problems.
