Well-written code can mean different things to different people. It could be easy to understand at first read, or maintainable and easily extensible (or all of these things). It could occupy as few KB as possible. While all these attributes may be good, they don't necessarily make the code fast.
I design my code to perform well in the situations where I think it will have the most work to do. I try to keep it strong in the first three attributes (readable, maintainable, extensible) when I write it. Then, if performance is an issue, I profile it to see where the hold-up is.
In your case, whatever you program in, it's eventually going to be translated into x86 machine code. FORTRAN is predominantly used in scientific programming, so I would expect its maths routines to be good and quick. However, there's no reason a good C++ library couldn't be just as quick, or quicker.
Maybe that's because the people who want CFD solver routines are mostly using FORTRAN. C++ solvers do exist, however; see here.
Parallelisation would not compromise the drive to improve program efficiency on a single thread. It may have been easier to implement it earlier, but what the hey. To paraphrase you, if the code itself does not run efficiently, throwing money at a new processor isn't going to help (much). If your code is rubbish, making it multithreaded could make it up to n times less rubbish on an n-core system, which might be the most efficient way to spend your time and money.
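To put a rough number on "n times less rubbish": the full factor of n only turns up if all of the run time can be spread across the cores. Amdahl's law gives the usual estimate when only part of it can. Here is a minimal sketch in Python, purely as illustration; the function name and the example fractions are my own, not anything measured from your solver.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Estimate the speedup on `cores` cores using Amdahl's law.

    parallel_fraction is the share of single-threaded run time
    (between 0.0 and 1.0) that can be spread across the cores.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part still runs on one core; only the rest is divided up.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical fractions, chosen only to show the trend.
    for fraction in (0.5, 0.9, 1.0):
        print(fraction, amdahl_speedup(fraction, 4))
```

With half the run time parallelisable, four cores give about 1.6 times the speed rather than 4, which is why it still pays to profile the single-threaded code first.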
You will need to work in a FORTRAN environment that has an integrated profiler. Once you've run your code through it, it will show you the relative time spent in each function. The way it does this is implementation-specific, but modern profilers have a graphical interface and let you view the timing information in a number of different ways. Maybe you can get some evaluation versions; I have no recent experience of FORTRAN development environments.
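I can't show you a FORTRAN profiler session, but the idea is the same in any language. As a rough sketch, here is what a per-function timing report looks like using Python's standard `cProfile` and `pstats` modules. The three functions are hypothetical stand-ins for solver routines, and a FORTRAN tool will lay its report out differently.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    # Deliberately naive loop: this is the intended hot spot.
    total = 0
    for i in range(n):
        total += i * i
    return total


def cheap_setup():
    # Trivial work, included only for contrast in the report.
    return list(range(10))


def run_solver():
    # Hypothetical stand-in for a solver's main routine.
    cheap_setup()
    return slow_sum_of_squares(200_000)


def profile_report(func):
    """Run func under cProfile and return the timing report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report(run_solver))
```

The report lists each function with its call count and the time spent in it, so you can see at a glance which routine deserves the tuning effort.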
As an example, I wrote a business analysis system for Vodafone in Java. Although it was accurate, it wasn't fast enough under the anticipated user load. By using a profiler, I focussed my efforts on tuning the VM, optimising how and when I created Java objects, and spending $200 on a library to improve the performance of one specific maths function. The system ended up ten times quicker, and everybody was happy.