alpha754293

What's the fastest processor available?

Recommended Posts

alpha - since you said that you don't know much about how the code is written, my suggestion is that you contact that person and ask them to post here. "FLOP" metrics are incredibly generic. In the real world, CPUs execute floating point instructions differently. You can look up clock cycle execution times for different types of operations for the various CPUs being discussed here. All any "FLOP" benchmark will do is try out a sampling of different FP operations, assign an arbitrary weight to each of them, then dump a weighted average to your screen. If you're looking to optimize the platform to a piece of code, you either need to tell us what the code is doing (at a CPU instruction level, try profiling) or you need to do a hands-on evaluation. Period. Asking what's the "fastest processor available" for your task on the basis of a benchmarked FLOPS metric is like me asking what's the fastest PC for Photoshop USM based on SR's "Office DriveMark."
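
To make that concrete, here's a toy Fortran sketch (purely illustrative, my own) that times an equal number of FP adds and FP divides. On most CPUs the two rates come out wildly different, which is exactly why a single weighted "MFLOPS" number depends entirely on the op mix the benchmark happens to pick:

  ! Toy illustration, not a real benchmark: equal counts of FP adds and
  ! FP divides rarely execute at the same rate, so any single "MFLOPS"
  ! figure depends on which operations the benchmark chose to weight.
  program op_mix
    implicit none
    integer, parameter :: n = 50000000
    integer :: i
    real(8) :: x, t0, t1, add_rate, div_rate

    x = 1.0d0
    call cpu_time(t0)
    do i = 1, n
       x = x + 1.0d-9                  ! one FP add per iteration
    end do
    call cpu_time(t1)
    add_rate = n / (t1 - t0) / 1.0d6
    if (x < 0.0d0) print *, x          ! keeps the compiler from deleting the loop

    x = 1.0d9
    call cpu_time(t0)
    do i = 1, n
       x = x / 1.0000001d0             ! one FP divide per iteration
    end do
    call cpu_time(t1)
    div_rate = n / (t1 - t0) / 1.0d6
    if (x < 0.0d0) print *, x

    print '(a,f9.1,a,f9.1,a)', 'adds: ', add_rate, ' Mop/s   divides: ', div_rate, ' Mop/s'
  end program op_mix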

Alpha, could you please post the run times for the various CPUs/systems that you've tested this program with so far? It could make an interesting benchmark.

Unfortunately, my dual Opteron rig has recently been having major system stability problems, and I haven't been able to spend much time trying to find out what's causing it.

So I won't be able to run the test on that.

I am currently running it on my laptop and once it is done, I will post the time.

alpha - since you said that you don't know much about how the code is written, my suggestion is that you contact that person and ask them to post here. "FLOP" metrics are incredibly generic. In the real world, CPUs execute floating point instructions differently. You can look up clock cycle execution times for different types of operations for the various CPUs being discussed here. All any "FLOP" benchmark will do is try out a sampling of different FP operations, assign an arbitrary weight to each of them, then dump a weighted average to your screen. If you're looking to optimize the platform to a piece of code, you either need to tell us what the code is doing (at a CPU instruction level, try profiling) or you need to do a hands-on evaluation. Period. Asking what's the "fastest processor available" for your task on the basis of a benchmarked FLOPS metric is like me asking what's the fastest PC for Photoshop USM based on SR's "Office DriveMark."

I cannot optimize a platform to ONE code. Therefore, the FLOP benchmark gives me a pretty good indication of how it will perform on various codes.

That would be analogous to saying "use this system ONLY if you intend to do this task, in this order, in Photoshop."

Even then, it's just one part of the code (out of three).


Reviewing the thread, there are so many unknowns in this project that I couldn't even begin to make a suggestion. I couldn't make sense of the software requirements, hardware requirements, or why this program can't be multi-threaded/multi-instance. From what I can gather, if it were my project, I'd be looking into setting up a cluster.

If you want to do yourself a favor, write a test harness to simulate your load and memory requirements. Other than that, your options are the fastest clocked A64/P4/Conroe (?) processors available, although as to which one would operate your specific program best, I can't say. The best solution may not be a general purpose processor, either.
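
A minimal sketch of such a harness, in Fortran since that's what the code in question is written in (the grid size and stencil are made up, not the real program): run it once with a grid that fits in cache and once with one that doesn't, and if the big grid is much slower per element, the machine is memory-bound and the "fastest FPU" question is moot.

  ! Crude load harness (a sketch, not the real code): sweeps a 3D grid
  ! doing a stencil update with only a few flops per element.  Compare
  ! time per element at n = 16 (cache-resident) vs n = 128 (32 MB+).
  program harness
    implicit none
    integer, parameter :: n = 128        ! hypothetical grid edge; vary this
    integer :: i, j, k, sweep
    real(8), allocatable :: a(:,:,:), b(:,:,:)
    real(8) :: t0, t1

    allocate(a(n,n,n), b(n,n,n))
    a = 1.0d0
    b = 0.0d0

    call cpu_time(t0)
    do sweep = 1, 10
       do k = 2, n-1
          do j = 2, n-1
             do i = 2, n-1              ! first index innermost: contiguous access
                b(i,j,k) = (a(i-1,j,k) + a(i+1,j,k) + a(i,j-1,k) &
                          + a(i,j+1,k) + a(i,j,k-1) + a(i,j,k+1)) / 6.0d0
             end do
          end do
       end do
       a = b
    end do
    call cpu_time(t1)

    print '(a,es10.3,a)', 'time per element: ', &
          (t1 - t0) / (10.0d0 * real(n-2,8)**3), ' s'
  end program harness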

The joys of FORTRAN programming. Some of these programs run so fast that on my dual Opteron, they only use 25% CPU.

(i.e. more memory ops than FLOPS)

...if this is true, then this entire conversation amounts to lots of Internet bandwidth wasted. You have some other long pole that doesn't involve the CPU, or your program is flawed.

The joys of FORTRAN programming. Some of these programs run so fast that on my dual Opteron, they only use 25% CPU.

(i.e. more memory ops than FLOPS)

...if this is true, then this entire conversation amounts to lots of Internet bandwidth wasted. You have some other long pole that doesn't involve the CPU, or your program is flawed.

A part of the program runs so fast that it can't even utilize all of the CPU time/power available.

But that's just the old one (as an example of what one of the codes/modules is like by itself).

There is no flaw with the program itself.

It burns through 11 million iterations in 20 minutes.

Why would you say that the program is flawed just because it's fast?

There is no flaw with the program itself.

It burns through 11 million iterations in 20 minutes.

Why would you say that the program is flawed just because it's fast?

I think you need to run the program through a profiler and see where it is spending its time. If you're only using 25% CPU at full chat on a dual Opteron system, there's scope to make it twice as quick before you even consider making it multithreaded.

Trust me on this one, profilers rock!
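
Even before you get hold of a proper profiler, you can take a crude first cut by hand with the standard Fortran 95 cpu_time intrinsic; a minimal sketch (the phase names are invented stand-ins, not your actual routines):

  ! Poor man's profiling: bracket suspect phases with cpu_time and
  ! compare.  A real profiler automates this per function, but even
  ! this much will usually expose which phase is eating the run.
  program phase_timing
    implicit none
    real(8) :: t0, t1, t2

    call cpu_time(t0)
    call setup_grid()                  ! hypothetical phase 1
    call cpu_time(t1)
    call iterate_solver()              ! hypothetical phase 2
    call cpu_time(t2)

    print '(a,f8.2,a)', 'setup:  ', t1 - t0, ' s'
    print '(a,f8.2,a)', 'solver: ', t2 - t1, ' s'

  contains
    subroutine setup_grid()
      ! stand-in for the real grid/boundary-condition setup
    end subroutine
    subroutine iterate_solver()
      ! stand-in for the real iteration loop
    end subroutine
  end program phase_timing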


There is no flaw with the program itself.

It burns through 11 million iterations in 20 minutes.

Why would you say that the program is flawed just because it's fast?

I think you need to run the program through a profiler and see where it is spending its time. If you're only using 25% CPU at full chat on a dual Opteron system, there's scope to make it twice as quick before you even consider making it multithreaded.

Trust me on this one, profilers rock!

Remember that that's just one example.

The current program doesn't do that; plus I have no way of predicting what the chemical reaction modules are going to do.

Hence I don't want to base the processor choice on ONE profile of the program, especially when there is large uncertainty about the behavior of future modules.

See what I mean?

A part of the program runs so fast that it can't even utilize all of the CPU time/power available.

That's one way of looking at it--another, much less delusional, way of looking at it is that some part of the program runs so slowly that it can't utilize all of the CPU time.

You're asking people for a faster CPU, but according to what you've just told me it's quite clear that your application isn't limited by CPU. You have some other long pole in your system, be it memory bandwidth or disk IO, and it has nothing to do with your CPU.

There is no flaw with the program itself.

It burns through 11 million iterations in 20 minutes.

Why would you say that the program is flawed just because it's fast?

I beg to differ. And 11 million "iterations" of what? The statement means almost nothing, since processors can easily do 11 million iterations of things in well under a second.

From the sound of it, there's a lot that's wrong with it. This is precisely the reason software engineering shouldn't be left in the hands of people who have no appreciation for proper engineering practice. My rather inflammatory $.02: get a real software engineer, or consider buying soldiers' body armor with the money you're about to waste.


Even if there is a problem with the software, a faster processor will help the situation, and a new system may also address the other shortcomings. A total understanding of all the processes involved would be ideal, but I doubt that is possible on a message board.

In a few weeks there should be benchmarks for the new processors and I doubt that a new system has to be purchased before we get a good look at the results.

OK, let's get back to the subject of fast processors.

It's been fun so far.


A part of the program runs so fast that it can't even utilize all of the CPU time/power available.

That's one way of looking at it--another, much less delusional, way of looking at it is that some part of the program runs so slowly that it can't utilize all of the CPU time.

You're asking people for a faster CPU, but according to what you've just told me it's quite clear that your application isn't limited by CPU. You have some other long pole in your system, be it memory bandwidth or disk IO, and it has nothing to do with your CPU.

There is no flaw with the program itself.

It burns through 11 million iterations in 20 minutes.

Why would you say that the program is flawed just because it's fast?

I beg to differ. And 11 million "iterations" of what? The statement means almost nothing, since processors can easily do 11 million iterations of things in well under a second.

From the sound of it, there's a lot that's wrong with it. This is precisely the reason software engineering shouldn't be left in the hands of people who have no appreciation for proper engineering practice. My rather inflammatory $.02: get a real software engineer, or consider buying soldiers' body armor with the money you're about to waste.

I am going to repeat myself once more - that example is ONE aspect of the program.

That isn't the issue. The issue is to speed up a 97-minute 3D porous substrate run that is eventually going to include evaporation (primary and secondary) for microdroplets, adsorption, and chemical reaction between the drop and the substrate itself.

I cannot have the chip catered to the simplest of the 3 or 4 modules/aspects, because I don't know how future modules are going to behave or what they would do/use.

Some have said that this is a loaded question. I personally can't think of making it any simpler than I already have, other than me not knowing what the FLOP counts are like for the latest Intel offerings with regard to FP performance using Sandra 2005 or earlier.


I've never programmed in FORTRAN before, but I've seen what it can do and how fast and efficient it can be for numerical computation.

I think what happens is that a lot of people nowadays are so used to programming in a higher-level language such as C/C++/Java that they forget what it's like, or have no experience with programming in Pascal or FORTRAN, and don't realize how fast those programs run compared to their higher-level counterparts.

Why do you think programs such as Fluent, STAR-CD, and Ansys are written in FORTRAN? And I know for a fact that Ansys, at least, is compiled using the PGI compiler.

I think what happens is that a lot of people nowadays are so used to programming in a higher-level language such as C/C++/Java that they forget what it's like, or have no experience with programming in Pascal or FORTRAN, and don't realize how fast those programs run compared to their higher-level counterparts.

The choice of programming language, once the code is optimised, isn't going to make a vast difference on a single thread. As it happens, I've programmed in FORTRAN and the others you mention. I could take issue with your assertion that FORTRAN is a lower-level language than C, but really, it's not important.

As far as I can make out, the problem you wish to solve is:

  • I have a maths-centric program that I want to run quicker
  • This program has multiple elements
  • Further elements are to be added of an as yet undetermined form

Your solution is to get the fastest processor available, hence the title of this topic. That is indeed one solution.

However, how much faster on a single thread can you get from where you are, 25%, 50% faster? And a year from now, how much faster will processors be on a single thread? You will be getting incremental changes when what you want is a step change in performance.

The above is also assuming that the code is CPU bound, not IO bound. It would be pretty gutting to spend $1000 on the best CPU available, only to find that it makes 0% difference and you could have tripled the speed by getting a $500 hard drive.

My suggestion is to profile existing code and see where the bottlenecks are. Then work towards removing those bottlenecks with a combination of tuned code, multi-threading and upgraded hardware.

I reckon we'd all be keen to hear how you solve this one!


I think what happens is that a lot of people nowadays are so used to programming in a higher-level language such as C/C++/Java that they forget what it's like, or have no experience with programming in Pascal or FORTRAN, and don't realize how fast those programs run compared to their higher-level counterparts.

The choice of programming language, once the code is optimised, isn't going to make a vast difference on a single thread. As it happens, I've programmed in FORTRAN and the others you mention. I could take issue with your assertion that FORTRAN is a lower-level language than C, but really, it's not important.

As far as I can make out, the problem you wish to solve is:

  • I have a maths-centric program that I want to run quicker
  • This program has multiple elements
  • Further elements are to be added of an as yet undetermined form

Your solution is to get the fastest processor available, hence the title of this topic. That is indeed one solution.

However, how much faster on a single thread can you get from where you are, 25%, 50% faster? And a year from now, how much faster will processors be on a single thread? You will be getting incremental changes when what you want is a step change in performance.

The above is also assuming that the code is CPU bound, not IO bound. It would be pretty gutting to spend $1000 on the best CPU available, only to find that it makes 0% difference and you could have tripled the speed by getting a $500 hard drive.

My suggestion is to profile existing code and see where the bottlenecks are. Then work towards removing those bottlenecks with a combination of tuned code, multi-threading and upgraded hardware.

I reckon we'd all be keen to hear how you solve this one!

Thank you for your reply.

Right now, I think that if the short program is I/O restricted, I would have to (at minimum) quadruple the system bandwidth. I don't think that standards for that are even available, at least not in the way of commodity hardware.

For the long program, I think that it averages about 1.2 seconds per iteration.

As far as hard disk I/O is concerned: on my laptop, it's a 7,200 rpm 100 GB Hitachi PATA drive. I haven't been able to run that code on my dual Opteron system because I've been having major issues with it.

I don't think that my hard drive is exactly "freaking" out when I am running the long program.

Okay, I don't think that it's hard drive I/O limited either.

Just ran it (first 200 iterations out of 4800). Before the first iteration starts, it processes the 3D grid and boundary conditions, reading in about 30 MB and writing 96 MB. After that, it writes anywhere between 48 and 64 bytes per iteration.

I haven't run it long enough to be able to check the hard drive I/O patterns. (I wish there were a data logger or something like that so I could just read the chart/trace.)

If I remember correctly, by the time the run is completed, it results in a 1.something GB file (I forget the precise size).

I think that at one point I calculated that an average write speed of 1.35 MB/s is required, which pretty much any modern mass storage device should be able to handle.

Parallelization is not an option at the moment because we're still working on getting the core functionality to work properly.

My boss just told me yesterday that he broke his code trying to add something to it and he's been working on fixing it so that I can do the next step in the sequence of validation and verification runs.

Once all of the modules are in and the program is fully up and running, that's when the parallelization will take place.

Until then, it's pretty much going to be a single-threaded program; hence I would like to speed up the development runs more than the production runs.

Thank you for your reply.

Right now, I think that if the short program is I/O restricted, I would have to (at minimum) quadruple the system bandwidth. I don't think that standards for that are even available, at least not in the way of commodity hardware.

Dual Opteron systems and the new AM2 chips have significantly more memory bandwidth than older chips. The soon-to-be-released Conroe may be to your liking too, with more cache and possibly more memory b/w.
As far as hard disk I/O is concerned: on my laptop, it's a 7,200 rpm 100 GB Hitachi PATA drive. I haven't been able to run that code on my dual Opteron system because I've been having major issues with it.
That is surprising, considering that in most instances a dual-Opteron platform is considered superior to a notebook.
I don't think that my hard drive is exactly "freaking" out when I am running the long program.

Okay, I don't think that it's hard drive I/O limited either.

Just ran it (first 200 iterations out of 4800). Before the first iteration starts, it processes the 3D grid and boundary conditions, reading in about 30 MB and writing 96 MB. After that, it writes anywhere between 48 and 64 bytes per iteration.

From what I know about HDDs, they are much more efficient if you can commit larger amounts of data at once (sequential writes) rather than sub-512 B chunks.
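
If the writes ever do become the bottleneck, the usual fix is to buffer the tiny per-iteration records in memory and flush them in big chunks; a sketch of the idea (record layout and file name are made up, and it assumes Fortran 2003 stream I/O):

  ! Sketch: accumulate small per-iteration records and flush in ~1 MB
  ! chunks, so the disk sees a few large sequential writes instead of
  ! thousands of sub-512-byte ones.
  program buffered_log
    implicit none
    integer, parameter :: recs_per_flush = 16384   ! 16384 x 64 B = 1 MiB
    real(8) :: record(8)                           ! one 64-byte record
    real(8), allocatable :: buffer(:,:)
    integer :: iter, filled

    allocate(buffer(8, recs_per_flush))
    open(20, file='run_log.bin', access='stream', form='unformatted', &
         status='replace')

    filled = 0
    do iter = 1, 4800
       record = real(iter, 8)            ! stand-in for real per-iteration data
       filled = filled + 1
       buffer(:, filled) = record
       if (filled == recs_per_flush) then
          write(20) buffer               ! one large sequential write
          filled = 0
       end if
    end do
    if (filled > 0) write(20) buffer(:, 1:filled)  ! flush the remainder
    close(20)
  end program buffered_log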

I haven't run it long enough to be able to check the hard drive I/O patterns. (I wish there were a data logger or something like that so I could just read the chart/trace.)
There is: it's called perfmon. Assuming that you are on Windows AND using system calls for HDD access, it should be able to help you. Alternatively, you could run the program in a virtual machine just to analyze the process better.
If I remember correctly, by the time the run is completed, it results in a 1.something GB file (I forget the precise size).

I think that at one point I calculated that an average write speed of 1.35 MB/s is required, which pretty much any modern mass storage device should be able to handle.

Indeed, that should not be a problem, but depending on the access pattern, the "average" speed may bear little relation to the speed the HDD needs to support at certain instants.

Until then, it's pretty much going to be a single-threaded program; hence I would like to speed up the development runs more than the production runs.
Well, if that is your goal, then you should take that advice and profile the program's execution. While the profiling run will take longer due to the overhead, it will tell you where your program is spending its time, which will not only help pin down the bottleneck but also provide a starting point for future improvements.

I've never programmed in FORTRAN before, but I've seen what it can do and how fast and efficient it can be for numerical computation.

I think what happens is that a lot of people nowadays are so used to programming in a higher-level language such as C/C++/Java that they forget what it's like, or have no experience with programming in Pascal or FORTRAN, and don't realize how fast those programs run compared to their higher-level counterparts.

Why do you think programs such as Fluent, STAR-CD, and Ansys are written in FORTRAN? And I know for a fact that Ansys, at least, is compiled using the PGI compiler.

Hmm, it seems a bit presumptuous to make comments about others when you yourself haven't programmed in some of these languages. I have programmed in Fortran; hell, I've written thermodynamics models in x86 assembler. I think what really happens is that people rely on a programming language as a crutch to compensate for comparatively poor design and engineering skills that result in subpar performance. In this day and age, nicely optimized C code can run spectacularly fast--as fast as any compiled Fortran code--in the hands of the right programmer. "11 million cycles of what" indeed.

As everyone is saying, posting an analysis of either code or execution profiling would be a helpful first step.

Going back and designing parallelism into the codebase *prior* to completion would be another one. It would require biting the bullet, a short-term setback, but there would be benefits in the long term. The question is whether your development tools can provide for it. This is where modern Fortran compilers start to lose their relevance in this haughty age of parallelism.

If you want to get creative, there are numerous efforts out there to expose GPU horsepower to application software. The typical nVidia or ATI gaming card has an incredible amount of raw FPU horsepower that is slowly being brought to bear on non-gaming applications. This would be worth a few minutes of searching to check out.


Why do people talk about optimized code so much?

(I don't understand it.)

If a code is well written, regardless of what language it is, shouldn't the unoptimized version be a baseline comparison for its performance, and as such, the most significant?

While I do agree with you that I have absolutely no experience whatsoever to speak of with regard to programming, I just think that it would/should be easier in FORTRAN to write a program to solve, for example, the Navier-Stokes equations for a porous media substrate on a 3D grid.

I would also presume that because FORTRAN already contains a lot of the necessary "commands" to do a lot of the numeric computations, C/C++ would have to make calls to other libraries in order to have that.

I have yet to see a CFD solver (be it commercial or proprietary) that is written in C/C++. I know that there are many GUIs and interactive front-ends that are written in C/C++ and then use FORTRAN for back-end processing. I've even seen a few that use Java for the front-end *shudders* and FORTRAN for the back-end.

The reason why we are not parallelizing the code right now is that the core functionality isn't even there yet without the parallelization. According to my boss, parallelization schemes such as MPI or OpenMP are gizmos that help speed up execution. It's a gadget. And I have to agree with him, in that if the code itself does not run in the first place, parallelization via OpenMP isn't going to help.

Believe me, I used to think that building in the parallelization while developing it was the way to go too. But I gotta admit, my boss does have a point.

*edit*

How do I profile a code? And can someone use that information to like "reverse engineer" what the program is or what it's doing?

(Hence why I am hesitant about doing it).

Why do people talk about optimized code so much?

(I don't understand it.)

If a code is well written, regardless of what language it is, shouldn't the unoptimized version be a baseline comparison for its performance, and as such, the most significant?

Well-written code can mean different things to different people. It could be easy to understand at first read, or maintainable and easily extensible (or all of these things). It could occupy as few KB as possible. While all these attributes may be good, they ain't necessarily fast.

I design my code to perform well in the situations I think it will have the most work to do. I try to keep it strong in the first three attributes when I write it. Then, if performance is an issue I profile it to see where the hold-up is.

I would also presume that because FORTRAN already contains a lot of the necessary "commands" to do a lot of the numeric computations, C/C++ would have to make calls to other libraries in order to have that.

In your case, whatever you program in, it's eventually going to be translated into x86 machine code. FORTRAN is predominantly used in scientific programming, and hence I would expect its maths routines to be good and quick. However, there's no reason a good C++ library couldn't be just as quick, or quicker.
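
For instance (a trivial illustration of my own, nothing to do with your code): Fortran has matrix multiplication built in as the MATMUL intrinsic, while a C program would typically call a library routine such as BLAS's dgemm for the same job - yet both end up as much the same multiply-add machine code, so the language itself isn't the deciding factor.

  ! Fortran's intrinsic matrix multiply; a C program would call out to
  ! a library (e.g. BLAS dgemm) for the same thing, and either way the
  ! CPU ends up executing comparable machine code.
  program intrinsic_demo
    implicit none
    real(8) :: a(3,3), b(3,3), c(3,3)
    a = 2.0d0
    b = 3.0d0
    c = matmul(a, b)     ! no external library needed
    print *, c(1,1)      ! 2*3 summed over 3 terms = 18
  end program intrinsic_demo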

I have yet to see a CFD solver (be it commercial or proprietary) that is written in C/C++. I know that there are many GUIs and interactive front-ends that are written in C/C++ and then use FORTRAN for back-end processing. I've even seen a few that use Java for the front-end *shudders* and FORTRAN for the back-end.

Maybe that's because the people who want CFD solver routines are mostly using FORTRAN. C++ solvers do exist, however; see here.

The reason why we are not parallelizing the code right now is that the core functionality isn't even there yet without the parallelization. According to my boss, parallelization schemes such as MPI or OpenMP are gizmos that help speed up execution. It's a gadget. And I have to agree with him, in that if the code itself does not run in the first place, parallelization via OpenMP isn't going to help.

Parallelisation would not compromise the drive to improve program efficiency on a single thread. It may have been easier to implement it earlier, but what the hey. To paraphrase you, if a code itself does not run inherently, throwing money at a new processor isn't going to help (much). If your code is rubbish, making it multithreaded could make it n times less rubbish on an n-core system, which might be the most efficient way to spend your time and money.
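
To be fair to your boss, OpenMP in particular can be bolted onto working serial Fortran quite late, because it's expressed as compiler directives; a toy sketch (the loop body is a made-up stand-in for your per-cell work):

  ! OpenMP is added as directives ("!$omp" lines) on top of working
  ! serial code.  Compiled without OpenMP support, those lines are
  ! plain comments and the very same source runs serially, unchanged.
  program omp_sketch
    implicit none
    integer, parameter :: n = 1000000
    integer :: i
    real(8) :: x(n), total

    do i = 1, n
       x(i) = real(i, 8)
    end do

    total = 0.0d0
    !$omp parallel do reduction(+:total)
    do i = 1, n
       total = total + sqrt(x(i))      ! stand-in for the real per-cell work
    end do
    !$omp end parallel do

    print *, total
  end program omp_sketch

Compilers that support OpenMP enable the directives with a switch (e.g. ifort -openmp, pgf90 -mp, or gfortran -fopenmp); leave the switch off and the same source builds as the serial program your boss wants working first.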

How do I profile a code? And can someone use that information to like "reverse engineer" what the program is or what it's doing?

You will need to work in a FORTRAN environment that has an integrated profiler application. Once you've run your code through it, it will show you the relative time spent in each function. The way it does this is implementation-specific, but modern profilers will have a graphical interface and allow you to view the timing information in a number of different ways. Maybe you can get some eval versions - I have no recent experience of FORTRAN development environments.

As an example, I wrote a business analysis system for Vodafone in Java. Although it was accurate, it wasn't fast enough under the anticipated user load. By using a profiler, my efforts were focussed on tuning the VM, optimising how and when I created Java objects, and spending $200 on a library to improve the performance of one specific maths function. The system was ten times quicker, everybody happy.


If, for example, you were profiling function invocations, the profiler would overwrite the function calls with its own, which would be wrappers around your functions that keep track of the number of invocations.

A similar technique would be used to compute the time that your program spends in each function.

I am assuming that FORTRAN has some form of function/procedure call.

By looking at the time that the executing program spends in each function, you'd be able to consider where you could look for improvements.
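
A hand-rolled illustration of that wrapping idea (all names invented), which is essentially what an instrumenting profiler injects for you automatically:

  ! Sketch of what an instrumenting profiler effectively does: each
  ! wrapped call is counted and timed, yielding a per-routine
  ! breakdown at the end of the run.
  module prof_stub
    implicit none
    integer :: call_count = 0
    real(8) :: total_time = 0.0d0
  contains
    subroutine solve_step_wrapped()
      real(8) :: t0, t1
      call cpu_time(t0)
      call solve_step()                ! the real routine being measured
      call cpu_time(t1)
      call_count = call_count + 1
      total_time = total_time + (t1 - t0)
    end subroutine

    subroutine solve_step()            ! stand-in for a real solver routine
      integer :: i
      real(8) :: s
      s = 0.0d0
      do i = 1, 1000000
         s = s + 1.0d0 / real(i, 8)
      end do
      if (s < 0.0d0) print *, s        ! keeps the loop from being optimized away
    end subroutine
  end module prof_stub

  program prof_demo
    use prof_stub
    implicit none
    integer :: i
    do i = 1, 50
       call solve_step_wrapped()
    end do
    print '(i0,a,f8.3,a)', call_count, ' calls, ', total_time, ' s total'
  end program prof_demo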


Why do people talk about optimized code so much?

(I don't understand it.)

If a code is well written, regardless of what language it is, shouldn't the unoptimized version be a baseline comparison for its performance, and as such, the most significant?

Well-written code can mean different things to different people. It could be easy to understand at first read, or maintainable and easily extensible (or all of these things). It could occupy as few KB as possible. While all these attributes may be good, they ain't necessarily fast.

I design my code to perform well in the situations I think it will have the most work to do. I try to keep it strong in the first three attributes when I write it. Then, if performance is an issue I profile it to see where the hold-up is.

I would also presume that because FORTRAN already contains a lot of the necessary "commands" to do a lot of the numeric computations, C/C++ would have to make calls to other libraries in order to have that.

In your case, whatever you program in, it's eventually going to be translated into x86 machine code. FORTRAN is predominantly used in scientific programming, and hence I would expect its maths routines to be good and quick. However, there's no reason a good C++ library couldn't be just as quick, or quicker.

I have yet to see a CFD solver (be it commercial or proprietary) that is written in C/C++. I know that there are many GUIs and interactive front-ends that are written in C/C++ and then use FORTRAN for back-end processing. I've even seen a few that use Java for the front-end *shudders* and FORTRAN for the back-end.

Maybe that's because the people who want CFD solver routines are mostly using FORTRAN. C++ solvers do exist, however; see here.

The reason why we are not parallelizing the code right now is that the core functionality isn't even there yet without the parallelization. According to my boss, parallelization schemes such as MPI or OpenMP are gizmos that help speed up execution. It's a gadget. And I have to agree with him, in that if the code itself does not run in the first place, parallelization via OpenMP isn't going to help.

Parallelisation would not compromise the drive to improve program efficiency on a single thread. It may have been easier to implement it earlier, but what the hey. To paraphrase you, if a code itself does not run inherently, throwing money at a new processor isn't going to help (much). If your code is rubbish, making it multithreaded could make it n times less rubbish on an n-core system, which might be the most efficient way to spend your time and money.

How do I profile a code? And can someone use that information to like "reverse engineer" what the program is or what it's doing?

You will need to work in a FORTRAN environment that has an integrated profiler application. Once you've run your code through it, it will show you the relative time spent in each function. The way it does this is implementation-specific, but modern profilers will have a graphical interface and allow you to view the timing information in a number of different ways. Maybe you can get some eval versions - I have no recent experience of FORTRAN development environments.

As an example, I wrote a business analysis system for Vodafone in Java. Although it was accurate, it wasn't fast enough under the anticipated user load. By using a profiler, my efforts were focussed on tuning the VM, optimising how and when I created Java objects, and spending $200 on a library to improve the performance of one specific maths function. The system was ten times quicker, everybody happy.

Interesting.

Without getting into specifics, please see the link in my signature for a brief overview of what we/I do. If you have already done so, then I guess that you shouldn't be surprised when I tell you that I cannot view the source code, so I have no idea what it looks like.

Given that disclaimer, I won't have a way to do any kind of profiling on the program like you guys suggested, but I do appreciate the suggestions.

Having said that, that is also the reason why I cannot publish the executable, or even excerpts from it.

(Otherwise, I probably would have by now if it weren't protected.)

From what my boss has told me, he developed a lot of the modules over the years through various other projects. Originally written for Cray and SGI machines (parallelized using MPI), some of the modules, he said, put quite a strain on those systems computationally.

He also mentioned that he kind of "grew up" with the guys who developed the original versions of what feeds most of today's commercial CFD codes; he ended up writing his own because he didn't like the results that were coming out of them. (Modelling is only as good as the modeller.)

He also says that he doesn't use any of the tools that come with the compiler, because he believes that if there is a problem, his mind stores about a million lines of code and he can go through it to figure out where the problem is.

He also says that a lot of people who program for him don't understand the difference between a really small number and +/- Inf or NaN, because they lack the background to understand what it means physically.

Our front-end programmer right now is actually picking up a fluid mechanics book to fill those gaps that aren't covered in a typical CE/CS/EE curriculum (which is fine).

You know how, in the early development phases (be it porting or writing from scratch), it typically takes a couple of tries to get the code to work the way you want it to? Well, we're kind of at that phase.

From what I've seen thus far in the literature reviews, no one has really analyzed porous media the way we are doing it, and the few facilities that have done it in the past either simplify the model to one or a series of 2D regions, or use a relatively small 3D region.

Eventually we would like to build a representative "chunk" of a city block and run the code on that.

The "run" of the program that I have been referring to is a 10 microlitre drop of a substance onto sand; and it models how the drop sinks in and spreads. (i.e. no evaporation, no adsorption, no chemical reaction). I think that the substrate grid was about like....32 cm^3 (I don't remember exactly cuz we deal with internal units) and for 4800 iterations, it takes 97 minutes to run on my laptop.

Hopefully that should provide some more background information on the program itself. And yes, I am purposely leaving out a lot of details. Again, see link below for reasoning.

The question that was originally presented was (in my opinion) a rather simple one.

I need the highest core FPU FLOP count (as reported by SiSoft Sandra 2005 SR3).

My unfamiliarity with Intel's offerings brought me here to ask around to see if others know.

While I do acknowledge that a FLOP isn't identical between processors, taking that view would pretty much invalidate any synthetic testing.

But I am using it as a baseline, looking at it from a component perspective as well as from a systems perspective.

I've heard a lot about the latest Intel Conroe and Woodcrest processors and I know very little about them, or their computational abilities.

Therefore, anybody who can help clarify that would be greatly appreciated.

Numbers would be nice.

The fastest system that I have right now is my dual Opteron system at 6190 MFLOPS core FPU (or about 3095 MFLOPS per processor), and I am trying to find something that will "beat" that.

There aren't any plans for acquisition yet (pending what the result/consensus is), and I don't know enough about the program itself (how it functions on all levels) to be able to say whether dual core would be a benefit or a hindrance.

(hence the "no dual core" initially.)

I suppose that it MAY be possible for me to submit a request to see if they would authorize me to do two system builds, one based on dual core, and one just a straightforward 2P system.

AM2 may also be a possibility. I don't know yet.

Thanks.

I cannot have the chip catered to the simplest of the 3 or 4 modules/aspects, because I don't know how future modules are going to behave or what they would do/use.

Essentially, you're asking people if a Ferrari is "faster" than a Hummer, but stating that you don't know what the road surface is going to be like, or the metric for success--which makes the question almost impossible to answer in anything but a generic sense.

If you don't know this very basic information, then nobody can give you a precise answer. You'll have to guess. The general answer is the fastest general purpose processor you can find, which is "generally" going to be the fastest clocked Opteron available at this time. Since we don't know if you're driving off-road or on, the answer is a Subaru Outback. :P

Pointless Chest Thumping From Here On Out, Feel Free To Ignore:

Some have said that this is a loaded question. I personally can't think of making it any simpler than I already have

That's because it isn't a simple question! You think you're asking a simple question, but it's anything but.

I think what happens is that a lot of people nowadays are so used to programming in a higher-level language such as C/C++/Java that they forget what it's like, or have no experience with programming in Pascal or FORTRAN, and don't realize how fast those programs run compared to their higher-level counterparts.

Why do you think programs such as Fluent, STAR-CD, and Ansys are written in FORTRAN? And I know for a fact that Ansys, at least, is compiled using the PGI compiler.

I doubt it has anything to do with performance. See:

http://shootout.alioth.debian.org/

http://www.amath.washington.edu/~lf/softwa..._F90SciOOP.html

...for a few examples. In modern software engineering, most optimization now occurs in the compiler itself. What language is used isn't a useful predictor of final performance.

For something that'll further discredit this antiquated notion of performance, see this paper:

http://www-cs.canisius.edu/~hertzm/gcmalloc-oopsla-2005.pdf

...this paper found that garbage-collected languages can actually meet/exceed the performance of native languages when given enough memory to work with, because the collector can make global decisions about memory allocation at runtime that programmers are not capable of making within their own applications prior to compilation. The caveat was that performance was substantially worse when the garbage collector didn't have enough memory to work with, but the point stands: assuming a "language" dictates program execution speed is fallacious.

Bottom line: it's absurd to assume that FORTRAN is faster than C, which is faster than Java. Anyway, this subject doesn't really have anything to do with the topic at hand.

I haven't run it long enough to be able to check the hard drive I/O patterns. (I wish there were a data logger or something like that so I could just read the chart/trace.)

If you're on Win32, see sysinternals.com for a lengthy listing of free tools.

I recommend Process Explorer as a starting point; you can right-click on a given process and get a readout of that process's individual CPU, memory, and I/O consumption. There are also some other metrics you can watch, and you can also view individual threads, as well as get a snapshot of a thread's call stack. They also have some other tools for monitoring that may be of interest.

This is hardly a substitute for profiling, but it is better than the crude metrics you'll get from task manager.

I would also presume that because FORTRAN already contains a lot of the necessary "commands" to do a lot of the numeric computations, C/C++ would have to make calls to other libraries in order to have that.

You presume incorrectly. This has very little to do with the performance of the final product.

I have yet to see a CFD solver (be it commercial or proprietary) that is written in C/C++.

You have source code for proprietary programs...???

How do I profile a code? And can someone use that information to like "reverse engineer" what the program is or what it's doing?

No, it doesn't make reverse engineering any easier--generally you'll compile performance objects into the actual code that are not included for release builds.

He also says that he doesn't use any of the tools that come with the compiler, because he believes that if there is a problem, his mind stores about a million lines of code and he can go through it to figure out where the problem is.

...right... :rolleyes:


Does that mean that the Opterons still have a higher core FPU FLOP count (per Sandra 2005) than Intel's latest offering?


The Athlon surely has the faster FPU. It always has, by a huge margin, because it inherits its FPU from the Alpha. But that is only about RAW FPU power. Besides, the Athlon puts the FPU on a different path than its ALUs, which means the FPU can scale differently than the other parts of the processor.

Also, dual core on the Athlon X2 doesn't hamper single-thread performance at all. Moreover, as AMD has already noted, they can fuse their dual cores into a single-core system, making the data path twice as wide: two cores working together on one thread. They said they will show it off this July. Search Google for "inverse hyperthreading". Well, they invented Intel's Hyper-Threading, and now they've modified it and built it into their CPUs. It makes sense.

This differs a lot from Intel's case, as they are limited by their FSB design. It is so narrow that they will suffer when running a single-threaded application over a single 200 MHz FSB link, as demonstrated by my dual SPi32M test on Core Duo notebooks. AMD has at least a full 1.8 GHz link to memory, while Intel has at most a 266 MHz link to memory. That's a sheer bandwidth difference.

Moreover, AMD's multi-socket Opterons can have quadruple-channel memory if you populate a dual-processor Opteron board with 4 sticks of RAM. That solves a LOT of your bandwidth limitation. If you are still limited by RAM bandwidth, you can go to a 4P system, which has 8-channel RAM.

Also, I wouldn't trust Sandra too much. It is only 7-10 lines of code being executed repeatedly, and it doesn't show real-world differences at all. The best benchmark is a real-world benchmark, that is, a real-world application - meaning YOUR application.

So either way, you are asking for an Opteron right now, unless you can get your hands on Woodcrest. But judging from the fact that your program only runs at 25% CPU, I don't think that you will need Woodcrest.

I'm looking for the single fastest processor available.

Performance metric is measured in MFLOPS.

Here are also additional conditions that it must meet:

- It cannot be any type of multi-core solution.

- It can have SMT/HT, but will be disabled as it will be of no use/help to me.

- It must be an x86 solution, capable of running a single-process Win32 application.

What are my options?

That would probably be the Pentium 4, as its frequency is the highest of any processor available on the market. If you wanted the highest-performing processor available, you should go for the Conroe L, which is single-core.


I think I've now got a better understanding of your predicament. As you have no influence over the code, perhaps the best you can do is create some batch scripts that will do multiple runs in parallel and then opt for multicore Opteron/Woodcrest systems. As mentioned above, the only way to establish the fastest platform will be to test on each. Maybe a bit of begging/borrowing/stealing of hardware is required.

Has your boss suggested a solution to your problem?


Since I have no way of acquiring each piece of hardware for testing, I was kind of hoping that I would be able to use Sandra 2005 FLOP counts as a synthetic metric to gauge the performance of processors relative to one another.

If I could do the testing myself, then I wouldn't have (needed) to ask.

