
Salford FTN95 run time performance
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Thu Nov 04, 2010 12:26 am    Post subject: Salford FTN95 run time performance

For many years I have investigated the run time performance of the Fortran code I write, especially for solution of large sets of linear equations, as is found in finite element calculations.
In these calculations there are two main vector operations:
dot product : constant = Vector_A . Vector_B, and
vector subtraction : Vector_A = Vector_A - constant x Vector_B
In recent years with FTN95, the vector subtraction has shown very poor performance, with run times of up to 4 times those of other compilers. I have a sample problem where the total time of these calculations is 50 seconds with LF95 and 206 seconds with FTN95. It puzzles me what is happening for the other 156 seconds, as clearly it is not the floating point calculations, which can be done in 50 seconds.
Published benchmark run time performance for Salford FTN95 on polyhedron.com has also consistently shown FTN95 to be lagging behind most others.

All this has now changed !!

I have recently obtained a new desktop ( HP-Z400 ) which has dual Xeon processors. While the vector subtraction time has changed from 50 to 47 seconds for LF95, it has changed from 206 to 33 seconds for FTN95.
By comparison, dot_product has changed from 51/97 seconds (LF95/FTN95) to 31/31 on the Xeon. The old times are from my Centrino notebook; my old desktop was a Core 2.
From an operation count analysis, the dot_product and vector subtraction run times should be similar. I explain the differences by how the processor optimises (or hampers) the calculation.
My estimate of what has happened is that there has been a significant shift in the optimisation approach within the Xeon processor in comparison to the other processors I have tested. The change in vector subtraction performance for LF95 shows that the change in Xeon optimisation does not suit it.
It would appear that what I think of as the “forward calculation optimisation” in the Intel processors has changed in the Xeon, to the benefit of FTN95.
If others are aware of similar changes or can offer a more accurate explanation to what I have observed, I would appreciate your comments, as I don’t think I fully understand this change.
It would appear that FTN95's poor run-time performance may have a reprieve. It would be good if we knew why!

John
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Thu Nov 04, 2010 9:54 am

John,

That's an excellent result.

I downloaded the Polyhedron benchmarks, and noticed that they had all been mangled by being run through their code prettifier program (no criticism intended, as I'm sure the resulting code suits some people's programming style). On the basis of nothing more than a hunch, I concluded that this introduced some constructs that don't work well with FTN95. As I've always been comfortable with FTN95 performance, but program in a style much closer to classical Fortran 77, I felt that my hunch was perhaps justified by my own experience.

What you are showing now is that the FTN95 EXE may suit some breeds of processor far better than others.

However, I mainly use FTN95 for Clearwin, and there comes a point - when dealing with a human user - where further increases in speed on the target computer (for the sort of things I'm doing) are of little or no benefit. Indeed, when I try the optimisation settings, my programs don't run, so I know that I'm operating suboptimally anyway.

What operating system(s) are you using, and do you think that, or memory size, has an effect?

Eddie
DanRRight



Joined: 10 Mar 2008
Posts: 2816
Location: South Pole, Antarctica

Posted: Sun Nov 07, 2010 2:10 pm

John, you need to investigate whether parallelisation will work for you. Optimisation has a potential speedup of about 4 times; parallelisation with a 12-core AMD times 2-4 processors gives 24-48. It can use DLLs made with many other compilers, so you would get up to your factor of 4 additionally from day one. Also, to my surprise 2-3 years ago, AMD was faster than Intel in parallel tasks by a factor of ~1.5 (JohnHorspool was doing the AMD tests for me, while I was doing Intel, on some parallel libraries from the site equation.com).

Last edited by DanRRight on Sun Nov 07, 2010 2:15 pm; edited 1 time in total
EKruck



Joined: 09 Jan 2010
Posts: 224
Location: Aalen, Germany

Posted: Thu Nov 25, 2010 4:12 pm

Dan and John,

Solution of large linear equation systems and inversion are very important parts of our software package, so your points are of high interest to me.

I would like to know more about your Xeon processor and the workstation specification. I will test this as well; perhaps I should then recommend a Xeon workstation to our customers.

Please remember as well our discussion "64-Bit FTN95 Compiler" earlier this year.
In the meantime I have ported our big FTN95 program - which includes the equation solution - to Intel Fortran, where I can use my 8 GB of RAM in full. It became ready 10 days ago.

My first experience with run-times: The Intel-compiled program is considerably faster. Run-time is reduced from 3 h 15 min. to 1 h 58 min. without special optimisation or multicore usage. Watching the Windows task manager, I see possibilities of further time reduction.

Furthermore it would be great to use parallel processing. Does anyone have an idea how to use multiple cores?
Equation.com is no help either, because my matrices still do not fit into memory:
1 900 000 unknowns
1 420 000 000 matrix elements in the profile
760 average bandwidth
540 000 000 000 multiplications (estimated)
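(As a rough scale check, assuming real*8 coefficients: 1 420 000 000 elements x 8 bytes is about 11 GB, far beyond a 32-bit address space and my 8 GB of RAM, and 1 900 000 x 760 x 760 / 2 gives roughly the 5.4 x 10^11 multiplications above.)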

Future jobs will become even bigger. This is definitely not a job for 32-bit Fortran anymore.

Paul, Salford will be left behind if you do not move to 64 bits. Toshiba has announced that it already has DRAM modules of 256 GByte! The next Windows version will probably not have a 32-bit edition at all.

Microsoft will also be very happy if we can process faster: they have several licenses of our software running 24 hours a day in their computer centre to produce Bing Maps :-)

Erwin
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Fri Nov 26, 2010 1:17 am

Erwin,

The main problem I have is that the performance testing I have done to identify run-time problems is very empirical.
Certainly in the recent tests, where I took the .exe files produced by two compilers and ran them on both Core 2 and Xeon processors, the relative results were significantly different. This has led me to conclude that the optimisation in the processors must be different, with the Xeon more suited to FTN95.

For optimisation with FTN95 I have found a number of strategies for time-critical code:
I only use /opt for the small amount of code that requires optimised performance. For some reason, if I use /OPT for all the code, I get different answers, especially in my other non-FEA programs; I have yet to track this down.
Never use array syntax in FTN95: it is just too slow. I typically compile with /debug so that I can get the source code address if the program crashes, but this makes array syntax very slow.
I write the basic vector calculations as simple functions and use /p6/opt for these few routines. These simple routines include:
Vec_Sum : const = Vec_A . Vec_B (dot_product)
Vec_Sub : Vec_A = Vec_A - const * Vec_B
Vec_Mlt : Vec_A = Vec_B * Vec_C
Vec_Mov : Vec_A = Vec_B
The first two account for 99.9% of the run time in my direct equation solver.
These are all simple 3 line routines with a single DO loop. They do respond to /p6 /OPT.
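For illustration, a minimal Vec_Sum in this style (not my exact code, but the same shape) would be:
Code:
      REAL*8 FUNCTION VEC_SUM (A, B, N)
!
!     Returns the dot product  A . B  of two real*8 vectors
!
      integer*4 n, i
      real*8    a(*), b(*), s
!
      s = 0.0d0
      do i = 1,n
         s = s + a(i) * b(i)
      end do
      vec_sum = s
      END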

Vec_Sub is the most surprising one: going from a Centrino (notebook) to a Xeon, FTN95 went from 200 seconds to 33 seconds, while Lahey 95 went from 50 to 47. Based on an operation count approach, and comparing with dot_product, the times it should take are about 50 and 33 seconds on the two processors. I have never been able to explain why this routine runs so slowly, and now the two .exe files exhibit unexplained delays on one processor or the other.

Your example of 2 million equations with a bandwidth of only 760 has a relatively small bandwidth, although 760 is a significant vector length for the inner loops. Do you use a direct or iterative solver? I use a direct skyline/Crout solver (from the 1980s).
To me, your comment about the number of multiplications reflects an approach that is no longer relevant to modern processors, as it does not explain the relative run times of some blocks of code, especially the Vec_Sub example above. Even changing simple vector syntax:
TEMP(1:neq) = Eigen_Vectors(1:neq,K)
T2(1:NEQ) = TMASS(1:NEQ) * TEMP(1:NEQ)
to:
call vec_move (Eigen_Vectors(1,K), TEMP, NEQ)
call vecmlt (T2, TMASS, TEMP, NEQ)
had a significant effect on run time (say a 30% reduction) in one part of the code.
The number of real*8 calculations is identical in both cases, but the other operations (word moves etc.) can take longer than the real*8 calculations.

I am about to try 64-bit compiler and hopefully multi-processor implementation for these few vector routines. I'm interested to see what I learn from this next stage.

The processor type is a new component of the optimiser mix.

Another problem I anticipate with 64-bit is if the processes start to share memory and you have to page out, say, an 8 GB process. This could slow things down! Something new to test. You need lots of memory to avoid this.

John
Sebastian



Joined: 20 Feb 2008
Posts: 177

Posted: Fri Nov 26, 2010 11:08 am

Quote:
Another problem I anticipate with 64-bit is if the processes start to share memory and you have to page out, say, an 8 GB process.

Paging works on a page basis (4096 bytes); it is not a per-process action. But this doesn't seem to imply any difference from a 32-bit application.


If you can pinpoint one specific code part (such as a subroutine that is timed to use 90% of the total time spent) that is reasonably small and differs in speed by an order of magnitude, you could try to analyse the object files of these routines. It would be interesting to see whether ftn95 uses some CPU technology that works very well on Xeon processors, or whether the other compilers can be forced to generate similar code.
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Fri Nov 26, 2010 12:02 pm

Sebastian,

Thanks for your thoughts.
In relation to paging: I'd expect the programs I run on Win-64 to access very large arrays, frequently touching many pages rather than using localised memory. I think the answer is to not run multiple large processes and to install a lot of memory.
I indicated that there are two routines that do most of the calculating. For the routine that causes most of the unpredictable performance, VECSUB, I don't think there are many alternatives, but there is a difference between FTN95 and LF95 for:
Code:
      SUBROUTINE VECSUB (A, B, FAC, N)
!
!     Performs the vector operation  [A] = [A] - [B] * FAC
!
      integer*4 n, i
      real*8    a(*), b(*), fac, c
!
      c = -fac
      do i = n,1,-1   ! -ve step is faster
         a(i) = a(i) + b(i) * c
      end do
      return
!
      END

How the processor uses the resulting assembler code is a mystery to me.
I could try mixing .obj files and see what happens.

John
Sebastian



Joined: 20 Feb 2008
Posts: 177

Posted: Sat Nov 27, 2010 2:03 pm

Quote:
frequently accessing across many pages, rather than localised memory use.

This should not be a problem if the main part of the arrays you are accessing is smaller than the amount of physical memory (the operating system will keep this working set in memory, so paging rarely kicks in). If you are accessing your whole arrays linearly most of the time, then you are right: this will cause a lot of disk access if physical memory is not sufficient.
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Sun Nov 28, 2010 7:20 pm

I follow the threads about faster processing, of which the need for a 64-bit FTN95 is a sub-theme, with great interest, as for four decades it has been a holy grail quest of my own. I use the term “holy grail quest”, because it is pursued with quasi-religious fervour, but suspecting that the object of desire can never truly be obtained, and indeed, that the closer one approaches the goal, the more likely it is to elude one’s grasp!

One of the recourses one has is to move to a faster compiler. When the FTN compiler was the fastest of the bunch, it was the ultimate. Of course, if one wants Clearwin, then one sticks to FTN95. To be honest, most of the things done in response to a user “click” are so close to instant, and a modern PC is so fast, that this type of program gains no practical benefit from being faster, and quite probably would function perfectly adequately if running significantly slower. Most of what I am doing these days falls into this category, and I have to declare no interest in switching compilers, and would only look to 64-bit FTN95 if 32-bit codes were no longer supported by the current Windows.

To other classes of user, the quest for speed is still an urge. I was taken by the idea that some runs might take one or two hours: I well remember running a PC for just short of 24 hours and before that a mainframe overnight. These types of codes don’t have much in the way of user interactivity, generally read their data and write their results to disk files, and could well be written in such standards conforming code that the choice of compiler is immaterial. Clearly, the Bing Maps response is not measured in hours, and one is talking there of shaving off perhaps fractions of a second. Most of what I do takes me a long time to think about, but rarely any measurable run times.

With the assumption that one could actually get 64-bit FTN95, I am sure that it would be an extra-cost upgrade. Failing that, one would need another compiler. The Intel compiler has been mentioned. It looked like more than a grand (£1000+) to me – so I stopped looking! Then one needs a PC that has enough RAM to make 64-bit Windows sensible. I found 12 and 16 GB memory kits at £300 to £400. Much more RAM won't fit on a standard mainboard, Windows 64 doesn't always use unlimited RAM, and much less than 12 GB (say 6 or 8) is not much of an upgrade. Presumably one also needs to cost in Windows 64 and other pieces of hardware. 64-bit Linux is a further option, but still needs the appropriate hardware. I therefore looked for a way to get better performance for a lot less than £1000.

I’m going to turn my attention to a SoHo user (like myself) who wants to run software he has developed for his own purposes (which might be commercial) on his own hardware. I’ll make the assumption that the software runs for some hours, and could run faster in a 64-bit environment, but because of RAM limitations, the software needs to keep backing up data. My mental picture is a big finite element application, where the global stiffness matrix is too big to store and operate on in RAM. (I think that this is what John Campbell is doing).

Method 1: An immediate improvement is obtained if a dedicated fast hard drive is used for storage. I’ve used the WD Velociraptor 10k rpm drives, and they are much faster than normal hard drives. A single drive costs less than £100 now, is plug and play, and is likely to produce a bigger speed up than a compiler change costing 10x as much. SSD drives are smaller and more expensive, but they are faster too. The ultimate limit on speed up is the interface, so speeds might also be increased by using SCSI drives.

Method 2: Speedups can also be gained by installing multiple drives in a RAID0 array (I've done that too). Four standard drives could be cheaper than a Velociraptor or SSD, and of course one can also build RAID arrays from Velociraptors or SSD drives.

(continued)


Last edited by LitusSaxonicum on Sun Nov 28, 2010 7:23 pm; edited 1 time in total
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Sun Nov 28, 2010 7:22 pm

Method 3: Assuming that one sticks with 32-bit Windows, large amounts of RAM can still be installed. The problem is that it isn't addressed by Windows. Back in the days of MS-DOS, extra RAM could be turned into the equivalent of a solid state disk using “ram disk” software. This type of utility is still produced, and provided that the RAM unused by Windows is enough for the “disk image”, it could speed up a program significantly, as the “read and write to disk” is done at RAM speeds, minus the overhead of encoding, decoding and actually operating the ramdisk. It's some time since I did this (vintage 386 computers), but the software abstraction worked then, and I see no reason why it should not now – assuming one gets the right ramdisk software! Costs include buying the RAM, and results are a bit indeterminate.

Method 4: This requires one to recode the critical routines in assembler. John Campbell indicated where 90+ % of his problems lie: simple dot product and vector addition/subtraction. In the book “8087 Applications and Programming for the IBM PC, XT, and AT” (Brady, 1985), Richard Startz shows how to code various things in assembler to speed them up. His idea was to produce subroutines to interface with compiled MS Basic, but they worked with MS Fortran too. Taking John's multiplication of two vectors to get a scalar, which I presume is coded in Fortran as something like:
Code:
SUM = 0.0
DO I=1,N
 SUM = SUM + A(I)*B(I)
END DO

then all that Startz did was load the A and B terms into the 8087 coprocessor stack, multiply them, and add the result to an accumulator that was another register in the coprocessor stack. This saved the operations to re-store SUM every time, and to get it back for the next addition. The speed-up was colossal – at least 10x when I tried it. Since then, 25 years ago, all sorts of new hardware facilities have been introduced on CPUs, and no doubt judicious reading up on the facilities of SSE, 3DNow etc. would pay dividends. If truly more than 90% of a program is spent in a few simple operations, then speed gains there are worth having. If one can't do it oneself, then it might not be too expensive to hire an assembler programmer to hand-code the dot product for you.

Method 5: Use a second computer, and don’t worry how long the analysis takes. I’ve had spare computers since shortly after I got the first one. If one only has one computer, then clogging it up for one hour is nearly as bad as two. Worse, getting the run time down to maybe 10 minutes might make you sit and watch the egg-timer, instead of going off and doing something useful.

Method 6: Using several spares. Since one’s spare computer(s) can be networked, they don’t even need their own keyboards and monitors once they have booted up, and they can be commanded to run (say) a finite element code from a remote machine. I had a go at this, and the way I did it was to set a program running on each of the slave machines that kept looking in their own subdirectory on a network attached storage disk (actually, I used another computer with its own hard drive networked) for a particular file. This file contained the location of the input file as a directory path, and a similar path showing where to store the results. The analysis program was then invoked by START_PROCESS@. My monitor application looked for the command file periodically based on delay settings done with Clearwin’s %dl, and closed itself after launching the analysis program. I toyed with restarting the monitor program again with START_PROCESS@ before closing the analysis program, and also with making the monitoring part of the analysis program, but didn’t bother when I found that the general idea worked. Presumably, a real programmer and not a Fortran dilettante like me would look at network broadcast messages.

(continued)


Last edited by LitusSaxonicum on Sun Nov 28, 2010 7:28 pm; edited 2 times in total
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Sun Nov 28, 2010 7:25 pm

The callback to %dl checks for the existence of the startup command file, opens it, reads the paths, and checks that they are valid before launching the analysis program. If errors are found, it simply doesn’t launch. I’m sure that I could send a message back somehow, but I haven’t looked into that. Instead, I just got the analysis program to put a message in a file where I could find it, saying “Machine No. X has finished”.
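In outline, the check boils down to the sketch below (the file names are placeholders, and I have left the actual launch as a comment because my real code does it with START_PROCESS@):
Code:
      LOGICAL FUNCTION CHECK_AND_LAUNCH ()
!
!     One poll: look for the command file, read the two paths,
!     validate them, and launch the analysis program.  Returns
!     .TRUE. only if a launch was made.
!
      CHARACTER*256 INPATH, OUTPATH
      LOGICAL FOUND
      CHECK_AND_LAUNCH = .FALSE.
      INQUIRE (FILE='run.cmd', EXIST=FOUND)
      IF (.NOT. FOUND) RETURN
      OPEN  (UNIT=1, FILE='run.cmd', STATUS='OLD')
      READ  (1,'(A)') INPATH
      READ  (1,'(A)') OUTPATH
      CLOSE (UNIT=1, STATUS='DELETE')
      INQUIRE (FILE=INPATH, EXIST=FOUND)
      IF (.NOT. FOUND) RETURN
!     ... launch the analysis here (my real code uses START_PROCESS@,
!     passing INPATH and OUTPATH to the analysis program) ...
      CHECK_AND_LAUNCH = .TRUE.
      END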

My local computer shop sells second-user Windows XP system units for around £50 each, containing a 2.4 to 2.8 GHz single-core CPU, and had I not had a bunch of old computers doing nothing already, then I might have bought a couple. Nobody wants old computers, and they are cheap. The critical thing with using “just a bunch of old computers” is not to generate too much heat, and not to use too much electricity (same thing really), so one needs to chuck out superfluous floppy and CD drives, and think a lot about ventilation. The last time I did serious finite element analysis, I ran the program about 8 times with modified geometry and properties. Running this on a single computer meant doing a run, thinking about the results, modifying the input and re-running. Having a bunch of computers running in parallel means that the computer array can get on and work on several of the obvious permutations all at the same time. I've been able to prove the idea using old junk, and the throughput of (say) 10 computers running different problems is potentially huge. After an initial time while the first run is done, the results crop up thick and fast. It did strike me that the most efficient time management was to set the whole bunch running last thing in the working day, and then the results of at least one of the runs are there the following morning!

Because my mental picture is of a big FE analysis, I couldn't help but wonder if part of the problem is inefficient storage of the coefficients. This leads to the solution algorithm working on lots of zeroes. It's a big job switching to a different solver, but in FE terms, the bigger the global stiffness matrix becomes, the sparser it is likely to get. Time could be spent beneficially in optimising the storage layout of the coefficients. The term “bandwidth” was used in the discussion. Back in the 70's I used an FE system that stored coefficients based on the “node number”. I found that it was easy to reduce the bandwidth enormously by giving each node an alias number, finding which connections corresponded to the biggest bandwidth, and swapping the aliases. (I was never able to get down to the theoretical minimum for ring-shaped meshes, but the impact on bandwidth was huge for a small investment of optimising time.) I scanned and rescanned an array of connectivities to do this. The actual analysis used the aliases, but the results were mapped back to the original mesh numbering so that I could find my way round the results. Optimisation is a key part of the frontal solution approach.
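The measuring part of that process is trivial; something along these lines (names invented for the sketch) finds the worst node-number difference under a trial alias numbering:
Code:
      SUBROUTINE MESH_BANDWIDTH (NODES, NELEM, NPE, ALIAS, MAXBW)
!
!     Finds the maximum node-number difference within any element,
!     using the alias numbering.  NODES(NPE,NELEM) holds the element
!     connectivities, ALIAS maps original to trial node numbers.
!     (Sketch only - my original was rather more long-winded.)
!
      integer*4 nelem, npe, maxbw
      integer*4 nodes(npe,nelem), alias(*)
      integer*4 ie, i, j, idiff
      maxbw = 0
      do ie = 1,nelem
         do i = 1,npe
            do j = i+1,npe
               idiff = abs (alias(nodes(i,ie)) - alias(nodes(j,ie)))
               if (idiff > maxbw) maxbw = idiff
            end do
         end do
      end do
      END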

The answer, I think, lies in much more than just the choice of compiler.

Eddie
EKruck



Joined: 09 Jan 2010
Posts: 224
Location: Aalen, Germany

Posted: Tue Nov 30, 2010 5:04 am

Eddie,

Thank you for all your explanations and the time you have invested to write it down.
You are of course right with your conclusion that the compiler is only part of the solution. To avoid any misunderstanding: I love my FTN95 with its WINIO@. All our software and nearly all our GUIs are currently produced with it. The best solution would be an FTN95 compiler producing 64-bit code.

To understand my background: I have been doing professional software development for more than 40 years, starting on the Zuse Z11 and IBM 1130. For my doctoral thesis in 1983 I was using mainframe computers like the CDC Cyber 76 and Cray 1S. One important part of this thesis was to establish and optimize the processing of large equation systems in photogrammetry. I evaluated all kinds of storage schemes and processing algorithms. For our typical sparse matrices I ended up with the profile storage scheme of A. Jennings, 1966, and for profile optimization with a modified Banker's algorithm from R.A. Snay, 1976 (that is why I was talking about “average bandwidth” in my response). This is best combined with a Cholesky equation solution. On vector computers like the Cray 1S and HP1000 this was a perfect solution - of course using assembler programming for the inner loops. Because the matrices did not fit into memory, I also optimized the disk I/O.
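For readers who do not know it: profile (skyline) storage keeps, for each row, only the entries from the first non-zero up to the diagonal in one long vector, plus a pointer array of diagonal addresses. A sketch of the addressing (my variable names, not Jennings') is:
Code:
      REAL*8 FUNCTION PROF_GET (A, KDIAG, I, J)
!
!     Element (I,J), J <= I, of a symmetric matrix in profile
!     (skyline) storage: A is the packed vector, KDIAG(I) is the
!     position of diagonal I in A.  Entries to the left of the
!     profile are zero by definition.
!
      integer*4 kdiag(*), i, j, nrow
      real*8    a(*)
      if (i == 1) then
         nrow = 1
      else
         nrow = kdiag(i) - kdiag(i-1)
      end if
      if (i-j >= nrow) then
         prof_get = 0.0d0
      else
         prof_get = a(kdiag(i) - (i-j))
      end if
      END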

Typically in photogrammetry we have Taylor-linearised equation systems to solve. In optimization terms: the target function is least squares, and the boundary conditions are mainly the geometric equations of taking serial aerial photos of the ground, plus many additional types of equations, e.g. for ground control points, GPS in the plane, INS data (Inertial Navigation System) and others. Because our equations are linearised, we have to iterate to get a final result, so the Cholesky equation solution is the most time-consuming part of the software. After the last iteration, part of the inverse is estimated, plus some more matrices for statistics. All this is very similar to the solution of finite elements.

This program is part of our big software package, together with many graphical programs and many other tools with nice GUIs. The main program itself is a pure batch program running in the background. Because it can take hours, we have to look at optimization. The compiler itself is not the principal question; the possibility of using large memory areas is, and for that a 64-bit compiler is needed. RAM disks are only crutches.

Fast processing is for us not a “holy grail quest”, but simply a question of commercial arguments; our customers ask for it frequently. For all the tools around this main program, which have GUIs, processing speed is not important (except for our interactive graphical 3D viewer).
BTW: Our software is used for Bing Maps production – not for user interaction.

A few years ago we ported our whole software package to SuSE Linux using Qt and LF95. We invested one man-year, and the speed was better, but we did not sell a single license. In our business almost only Windows is in use.

The prices for software tools, compilers and workstation hardware are negligible for us; we play in another price range. Our development and testing workstations of course have fast RAID disk arrays with 4 to 6 disks, using 64-bit RAID controllers from 3Ware and Adaptec. The same is true for our customers. When you know that an aerial survey camera starts at a price of about US$ 600,000, it is clear that only high-end workstations are used for data processing. And speed is very important if you have to process thousands of large photos (up to 750 MB per image). While the image processing can be distributed to several workstations, the Cholesky algorithm CANNOT be distributed, because the next dot product needs the result of the prior dot product. If you look at the discussion from last May, “64-Bit FTN95 Compiler”, you can see that not even multi-core is helpful for this purpose - except when you can hold the complete matrix in memory.

see continuation


Last edited by EKruck on Tue Nov 30, 2010 5:40 am; edited 1 time in total
EKruck



Joined: 09 Jan 2010
Posts: 224
Location: Aalen, Germany

Posted: Tue Nov 30, 2010 5:08 am

continuation

Then Cholesky decomposition can be used, which is probably the basis for the software offered on equation.com.

As you can see, we have evaluated ALL the possibilities you mention, including assembler programming on a 386/387 PC years ago. I found that my own dot product is the best; it is also slightly faster than the DOT_PRODUCTs from Salford and Intel. It uses “unrolling”; the best factor is 7, probably because the inner loop then still fits into the instruction cache of the CPU.
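As a sketch of the idea (unroll factor 4 here for brevity, and not my actual code):
Code:
      REAL*8 FUNCTION DOT_UNROLLED (A, B, N)
!
!     Dot product with the inner loop unrolled by a factor of 4;
!     the remainder is handled first so the main loop is clean.
!
      integer*4 n, i, m
      real*8    a(*), b(*), s
      s = 0.0d0
      m = mod (n, 4)
      do i = 1, m
         s = s + a(i)*b(i)
      end do
      do i = m+1, n, 4
         s = s + a(i  )*b(i  ) + a(i+1)*b(i+1)
         s = s + a(i+2)*b(i+2) + a(i+3)*b(i+3)
      end do
      dot_unrolled = s
      END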

My software package has now been on the market for 25 years. Over the years we have had to keep improving speed because of growing requirements (more photos, more complex data types).

My final conclusion now: We need 64-bit software to use as much memory as possible. Then we can take advantage of multi-core. Currently we are using only 25% of the power of our quad-core CPUs.

Erwin
EKruck



Joined: 09 Jan 2010
Posts: 224
Location: Aalen, Germany

Posted: Tue Nov 30, 2010 5:41 am

John,

One comment on your Celeron / Xeon test: as far as I know, the Celeron has no floating point unit, and FTN95 is probably not designed for integer processing. FORTRAN means FORmula TRANslation - of course with floating point values. It looks as though Lahey has invested more in software for floating point computation using integer arithmetic.

Empirical testing on PCs is still a very useful method. I do it as well, because I cannot do any scientific research for processing anymore (no turnover from that sort of work).

Years ago I tried “/optimize” with FTN95 and my software did not run. Now I have tried /P6 /optimize once again. The differences are not huge - in the range of 5 to 10%.

Of course our sparse matrices are stored in a vector.

Dot products and vector multiplications are the most time-consuming operations in our software as well. You mentioned routines like Vec_Sum and a skyline/Crout solver. Could you give me a hint on where to find them, so that I can call them from DLLs?

I try to avoid memory paging completely. I think there should be an API function to keep the whole process and its data in memory, but so far I have found only “Lock pages in memory”, which seems to be a different function. I have removed the paging file from the system, but I am not sure that this is sufficient.

Some time ago I saw a Microsoft Windows program that visualises all process activity graphically, like an enhanced Task Manager, but I cannot find it any more. That would be very helpful.

Erwin