Silverfrost Forums

Salford FTN95 run time performance

3 Nov 2010 11:26 #7115

For many years I have investigated the run-time performance of the Fortran code I write, especially for the solution of large sets of linear equations, as found in finite element calculations. In this calculation there are two main vector operations: the dot product, constant = Vector_A . Vector_B, and the vector subtraction, Vector_A = Vector_A - constant * Vector_B. In recent years with FTN95, the vector subtraction has shown very poor performance, with run times of up to 4 times those of other compilers. I have a sample problem where the total time for these calculations is 50 seconds with LF95 and 206 seconds with FTN95. It puzzles me what is happening for the other 156 seconds, as clearly it is not the floating point calculations, which can be done in 50 seconds. The published benchmark results for Salford FTN95 on polyhedron.com have also consistently shown FTN95 lagging behind most of the others.

All this has now changed !!

I have recently obtained a new desktop (HP Z400) which has dual Xeon processors. While the vector subtraction time has changed from 50 to 47 seconds for LF95, it has changed from 206 to 33 seconds for FTN95. By comparison, dot_product has changed from 51/97 seconds (LF95/FTN95) to 31/31 on the Xeon. The old times are from my Centrino notebook; my old desktop was a Core 2.

From an operation-count analysis, the dot_product and vector subtraction run times should be similar. I explain the differences by how the processor optimises (or hampers) the calculation. My best estimate is that there has been a significant shift in the optimisation approach within the Xeon processor compared with the other processors I have tested. The change in LF95's vector subtraction performance shows that the change in the Xeon does not suit it, while what I think of as the "forward calculation optimisation" in the Intel processors appears to have changed in the Xeon to the benefit of FTN95. If others are aware of similar changes or can offer a more accurate explanation of what I have observed, I would appreciate your comments, as I don't think I fully understand this change. It would appear that FTN95's poor run-time performance may have a reprieve. It would be good if we knew why!

John

4 Nov 2010 8:54 #7116

John,

That's an excellent result.

I downloaded the Polyhedron benchmarks and noticed that they had all been mangled by being run through their code prettifier program (no criticism intended, as I'm sure the resulting code suits some people's programming style). On nothing more than a hunch, I concluded that this introduced some constructs that do not work well with FTN95. As I have always been comfortable with FTN95 performance, but program in a style much closer to classical Fortran 77, I felt that my hunch was perhaps supported by my own experience.

What you are showing now is that the FTN95 EXE may suit some breeds of processor far better than others.

However, I mainly use FTN95 for Clearwin, and there comes a point - when dealing with a human user - where further increases in speed on the target computer (for the sort of things I'm doing) are of little or no benefit. Indeed, when I try the optimisation settings, my programs don't run, so I know that I'm operating suboptimally anyway.

What operating system(s) are you using, and do you think that, or memory size, has an effect?

Eddie

7 Nov 2010 1:10 (Edited: 7 Nov 2010 1:15) #7123

John, you should investigate whether parallelization will work for you. Optimization has a potential speedup of about 4 times; parallelization with 12-core AMD chips times 2-4 processors gives a potential 24-48 times. It can use DLLs made with many other compilers, so you would get your up-to-4-times gain on top of that from day one. Also, to my surprise, 2-3 years ago AMD was faster than Intel in parallel tasks by a factor of ~1.5 (JohnHorspool was doing the AMD tests for me, while I was doing Intel, using some parallel libraries from equation.com).

25 Nov 2010 3:12 #7157

Dan and John,

The solution of large linear equation systems and inversion are very important parts of our software package. Therefore your points are of great interest to me.

I would like to know some more about your Xeon processor and the workstation specification. I will test this as well. Perhaps I shall then recommend a Xeon workstation to our customers.

Please remember as well our discussion '64-Bit FTN95 Compiler' earlier this year. In the meantime I have transferred our big FTN95 program - which includes the equation solution - to Intel Fortran, where I can use my 8 GB of RAM in full. It was finished 10 days ago.

My first experience with run-times: The Intel-compiled program is considerably faster. Run-time is reduced from 3 h 15 min. to 1 h 58 min. without special optimisation or multicore usage. Watching the Windows task manager, I see possibilities of further time reduction.

Furthermore, it would be great to use parallel processing. Does anyone have an idea how to use multiple cores? The libraries on equation.com are of no help either, because my matrices still do not fit into memory:

1 900 000 unknowns
1 420 000 000 matrix elements (profile storage)
760 average bandwidth
roughly 540 000 000 000 multiplications (pre-estimated)
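(For orientation, these figures hang together: 1 900 000 equations times an average bandwidth of 760 gives about 1.4 x 10^9 stored coefficients, and a profile Cholesky factorisation needs roughly n x b^2 / 2 = 1 900 000 x 760 x 760 / 2, i.e. about 5.5 x 10^11 multiplications, which is the order of the pre-estimate above.)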

Future jobs will become bigger. This is definitely no job for 32-bit Fortran anymore.

Paul, Salford will be left behind if you do not move to 64 bits. Toshiba has announced that it already has 256 GByte DRAM modules! The next Windows version will probably not have a 32-bit edition any more.

Microsoft will also be very happy if we can process faster. They have several licences of our software running 24 hours a day in their computer centre to produce BING MAPS 😃

Erwin

26 Nov 2010 12:17 #7159

Erwin,

The main problem I have is that the performance testing I have done to identify run-time problems is very empirical. Certainly, the recent tests, where I ran .exe files produced by two compilers on both Core 2 and Xeon processors, gave significantly different relative results. This has led me to conclude that the optimisation in the processors must be different, with the Xeon more suited to FTN95.

For optimisation with FTN95 I have found a number of strategies for time-critical code:

I only use /OPT for the small amount of code that requires optimised performance. For some reason, if I use /OPT for all the code, I get different answers, especially in my other non-FEA programs; I have yet to track this down.

Never use array syntax in FTN95: it is just too slow. I typically compile with /DEBUG so that I can get the source code address if the program crashes, but this makes array syntax very slow.

I write the basic vector calculations as simple functions and use /P6 /OPT for these few routines. These simple routines include:

Vec_Sum : const = Vec_A . Vec_B (dot_product)
Vec_Sub : Vec_A = Vec_A - const * Vec_B
Vec_Mlt : Vec_A = Vec_B * Vec_C
Vec_Mov : Vec_A = Vec_B

The first two account for 99.9% of the run time in my direct equation solver. They are all simple three-line routines with a single DO loop, and they do respond to /P6 /OPT.
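For reference, the first of these looks roughly like the sketch below (the interface is assumed from the description above; it is a sketch, not the exact production routine):

      REAL*8 FUNCTION VEC_SUM (A, B, N)
!
!     Dot product:  VEC_SUM = [A] . [B]
!
      integer*4 n, i
      real*8    a(*), b(*), s
!
      s = 0.0d0
      do i = 1, n
         s = s + a(i) * b(i)
      end do
      vec_sum = s
!
      END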

Vec_Sub is the most surprising one: going from the Centrino (notebook) to the Xeon, FTN95 went from 200 seconds to 33 seconds, while LF95 went from 50 to 47. Based on an operation-count approach and comparing with dot_product, the times should be about 50 and 33 seconds on the two processors. I have never been able to explain why this routine runs so slowly, and now the two .exe files exhibit unexplained delays on one processor or the other.

Your example of 2 million equations with a bandwidth of only 760 is a relatively small bandwidth, but 760 is still a significant length for the vector operations. Do you use a direct or iterative solver? I use a direct skyline/Crout solver (from the 1980s). To me, your comment about the number of multiplications reflects an approach that is no longer relevant to modern processors, as it does not explain the relative run times of some blocks of code, especially the Vec_Sub example above. Even changing simple vector syntax:

TEMP(1:NEQ) = Eigen_Vectors(1:NEQ,K)
T2(1:NEQ) = TMASS(1:NEQ) * TEMP(1:NEQ)

to:

call vec_move (Eigen_Vectors(1,K), TEMP, NEQ)
call vecmlt (T2, TMASS, TEMP, NEQ)

had a significant effect on run time (say a 30% reduction) in one part of the code. The number of real*8 calculations is identical in both cases, but the other work (word moves etc.) can take longer than the real*8 calculations.

I am about to try a 64-bit compiler and, hopefully, a multi-processor implementation of these few vector routines. I'm interested to see what I learn from this next stage.

The processor type is a new component of the optimiser mix.

Another problem I anticipate with 64-bit is if the processes start to share memory and you have to page out, say, an 8 GB process. This could slow things down! Something new to test. You need lots of memory to avoid this.

John

26 Nov 2010 10:08 #7161

Another problem I anticipate with 64-bit is if the processes start to share memory and you have to page out, say, an 8 GB process.

Paging works on a per-page basis (4096 bytes); it is not a per-process action. But this doesn't seem to imply any difference from a 32-bit application.

If you can pinpoint one specific piece of code (such as a subroutine that is timed to use 90% of the total run time) that is reasonably small and differs in speed by an order of magnitude, you could try to analyse the object files of these routines. It would be interesting to see whether FTN95 uses some CPU feature that works very well on Xeon processors, or whether the other compilers can be made to generate similar code.

26 Nov 2010 11:02 #7162

Sebastian,

Thanks for your thoughts. In relation to paging: I'd expect the program I'd run on 64-bit Windows to access very large arrays, frequently touching many pages rather than using localised memory. I think the answer is not to run multiple large processes, and to install a lot of memory.

I indicated that there are two routines that do most of the calculating. For the routine that causes most of the unpredictable performance, VECSUB, I don't think there are many alternatives, but there is a difference between FTN95 and LF95 for:

      SUBROUTINE VECSUB (A, B, FAC, N)
!
!     Performs the vector operation  [A] = [A] - [B] * FAC
!
      integer*4 n, i
      real*8    a(*), b(*), fac, c
!
      c = -fac
      do i = n,1,-1        ! -ve step is faster
         a(i) = a(i) + b(i) * c
      end do
      return
!
      END

How the processor uses the resulting assembler code is a mystery to me. I could try mixing .obj files and see what happens.

John

27 Nov 2010 1:03 #7164

frequently touching many pages rather than using localised memory.

This should not be a problem as long as the actively used part of the arrays fits into physical memory (the operating system will keep this working set in memory, so paging rarely kicks in). If you are scanning your whole arrays linearly most of the time, you are right that this will cause lots of disk accesses when the physical memory is not sufficient.

28 Nov 2010 6:20 (Edited: 28 Nov 2010 6:23) #7166

I follow the threads about faster processing, of which the need for a 64-bit FTN95 is a sub-theme, with great interest, as for four decades it has been a holy grail quest of my own. I use the term "holy grail quest" because it is pursued with quasi-religious fervour, while suspecting that the object of desire can never truly be obtained, and indeed, that the closer one approaches the goal, the more likely it is to elude one's grasp!

One of the recourses one has is to move to a faster compiler. When the FTN compiler was the fastest of the bunch, it was the ultimate. Of course, if one wants Clearwin, then one sticks to FTN95. To be honest, most of the things done in response to a user “click” are so close to instant, and a modern PC is so fast, that this type of program gains no practical benefit from being faster, and quite probably would function perfectly adequately if running significantly slower. Most of what I am doing these days falls into this category, and I have to declare no interest in switching compilers, and would only look to 64-bit FTN95 if 32-bit codes were no longer supported by the current Windows.

To other classes of user, the quest for speed is still an urge. I was taken by the idea that some runs might take one or two hours: I well remember running a PC for just short of 24 hours and before that a mainframe overnight. These types of codes don’t have much in the way of user interactivity, generally read their data and write their results to disk files, and could well be written in such standards conforming code that the choice of compiler is immaterial. Clearly, the Bing Maps response is not measured in hours, and one is talking there of shaving off perhaps fractions of a second. Most of what I do takes me a long time to think about, but rarely any measurable run times.

With the assumption that one could actually get 64-bit FTN95, I am sure that it would be an extra cost upgrade. Failing that, one would need another compiler. The Intel compiler has been mentioned. It looked like more than a grand (£1000+) to me – so I stopped looking! Then one needs a PC that has enough RAM to make 64-bit windows sensible. I found 12 and 16 Gb memory kits at £300 to £400. Much more RAM won’t fit on a standard mainboard, Windows 64 doesn’t always use unlimited RAM, and much less than 12Gb (say 6 or 8) is not much on an upgrade. Presumably one also needs to cost in Windows 64 and other pieces of hardware. 64-bit Linux is a further option, but still needs the appropriate hardware. I therefore looked for a way to get better performance for a lot less than £1000.

I’m going to turn my attention to a SoHo user (like myself) who wants to run software he has developed for his own purposes (which might be commercial) on his own hardware. I’ll make the assumption that the software runs for some hours, and could run faster in a 64-bit environment, but because of RAM limitations, the software needs to keep backing up data. My mental picture is a big finite element application, where the global stiffness matrix is too big to store and operate on in RAM. (I think that this is what John Campbell is doing).

Method 1: An immediate improvement is obtained if a dedicated fast hard drive is used for storage. I’ve used the WD Velociraptor 10k rpm drives, and they are much faster than normal hard drives. A single drive costs less than £100 now, is plug and play, and is likely to produce a bigger speed up than a compiler change costing 10x as much. SSD drives are smaller and more expensive, but they are faster too. The ultimate limit on speed up is the interface, so speeds might also be increased by using SCSI drives.

Method 2: Speedups can also be gained by installing multiple drives in a RAID0 array (I’ve done that too). This could be cheaper than a Velociraptor or SSD to get 4 standard drives, and of course, one can do RAID arrays with Velociraptors or SSD drives.

(continued)

28 Nov 2010 6:22 (Edited: 28 Nov 2010 6:28) #7167

Method 3: Assuming that one sticks with 32-bit Windows, large amounts of RAM can still be installed. The problem is that it isn't addressed by Windows. Back in the days of MS-DOS, extra RAM could be turned into the equivalent of a solid state disk using "RAM disk" software. This type of utility is still produced, and provided that the RAM unused by Windows is enough for the "disk image", it could speed up a program significantly, as the "read and write to disk" is done at RAM speed, minus the overhead of encoding and decoding and of actually operating the RAM disk. It's some time since I did this (vintage 386 computers), but the software abstraction worked then, and I see no reason why it should not now - assuming one gets the right RAM disk software! Costs include buying the RAM, and results are a bit indeterminate.

Method 4: This requires one to recode the critical routines in assembler. John Campbell indicated where 90+ % of his problems lie: simple dot product and vector addition/subtraction. In the book “8087 Applications and Programming for the IBM PC, XT, and AT”, published back in 1985 by Brady, and written by Richard Startz, he shows how to code various things in assembler to speed them up. His idea was to produce subroutines to interface with compiled MS Basic, but they worked with MS Fortran too. Taking John’s multiplication of two vectors to get a scalar, which I presume is coded in Fortran as something like:

SUM = 0.0
DO I=1,N
 SUM = SUM + A(I)*B(I)
END DO

then all that Startz did was load the A and B terms onto the 8087 coprocessor stack, multiply them, and add the result to an accumulator held in another register on the coprocessor stack. This saved the operations of storing SUM back to memory every time and fetching it again for the next addition. The speed-up was colossal - at least 10x when I tried it. Since then, 25 years ago, all sorts of new hardware facilities have been introduced on CPUs, and no doubt judicious reading up on the facilities of SSE, 3DNow etc. would pay dividends. If truly more than 90% of a program's time is spent in a few simple operations, then speed gains there are worth having. If one can't do it oneself, then it might not be too expensive to hire an assembler programmer to hand-code the dot product for you.

Method 5: Use a second computer, and don’t worry how long the analysis takes. I’ve had spare computers since shortly after I got the first one. If one only has one computer, then clogging it up for one hour is nearly as bad as two. Worse, getting the run time down to maybe 10 minutes might make you sit and watch the egg-timer, instead of going off and doing something useful.

Method 6: Using several spares. Since one’s spare computer(s) can be networked, they don’t even need their own keyboards and monitors once they have booted up, and they can be commanded to run (say) a finite element code from a remote machine. I had a go at this, and the way I did it was to set a program running on each of the slave machines that kept looking in their own subdirectory on a network attached storage disk (actually, I used another computer with its own hard drive networked) for a particular file. This file contained the location of the input file as a directory path, and a similar path showing where to store the results. The analysis program was then invoked by START_PROCESS@. My monitor application looked for the command file periodically based on delay settings done with Clearwin’s %dl, and closed itself after launching the analysis program. I toyed with restarting the monitor program again with START_PROCESS@ before closing the analysis program, and also with making the monitoring part of the analysis program, but didn’t bother when I found that the general idea worked. Presumably, a real programmer and not a Fortran dilettante like me would look at network broadcast messages.

(continued)

28 Nov 2010 6:25 #7168

The callback to %dl checks for the existence of the startup command file, opens it, reads the paths, and checks that they are valid before launching the analysis program. If errors are found, it simply doesn’t launch. I’m sure that I could send a message back somehow, but I haven’t looked into that. Instead, I just got the analysis program to put a message in a file where I could find it, saying “Machine No. X has finished”.
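In case it helps anyone trying the same trick, the polling callback amounts to something like the sketch below. This is only an illustration of the idea, not my actual code: the file names and paths are made up, and the exact argument forms of %dl and START_PROCESS@ should be checked against the Clearwin+ documentation.

      INTEGER FUNCTION CHECK_FOR_JOB()
!
!     Called periodically via Clearwin's %dl timer: looks for a command
!     file, reads the input and result paths from it, checks them, and
!     launches the analysis program if everything is in order.
!
      INCLUDE <windows.ins>                  ! Clearwin+ declarations
      CHARACTER*260 INPATH, OUTPATH
      LOGICAL EXISTS
      INTEGER K
      CHECK_FOR_JOB = 1                      ! keep the monitor running
      INQUIRE (FILE='Z:\slave01\command.txt', EXIST=EXISTS)
      IF (.NOT. EXISTS) RETURN
      OPEN (UNIT=10, FILE='Z:\slave01\command.txt', STATUS='OLD')
      READ (10,'(A)') INPATH                 ! path of the input file
      READ (10,'(A)') OUTPATH                ! path for the results
      CLOSE (UNIT=10, STATUS='DELETE')       ! consume the command file
      INQUIRE (FILE=INPATH, EXIST=EXISTS)
      IF (.NOT. EXISTS) RETURN               ! bad paths: do not launch
      K = START_PROCESS@ ('analysis.exe', INPATH)
      CHECK_FOR_JOB = 0                      ! close the monitor window
      END

The function would be registered with something like i = winio@('%dl&', delay, CHECK_FOR_JOB); again, see the %dl documentation for the exact form.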

My local computer shop sells second-user Windows XP system units for around £50 each, containing a 2.4 to 2.8 GHz single-core CPU, and had I not had a bunch of old computers doing nothing already, I might have bought a couple. Nobody wants old computers, and they are cheap. The critical thing when using "just a bunch of old computers" is not to generate too much heat, and not to use too much electricity (the same thing really), so one needs to chuck out superfluous floppy and CD drives, and think a lot about ventilation. The last time I did serious finite element analysis, I ran the program about 8 times with modified geometry and properties. Running this on a single computer meant doing a run, thinking about the results, modifying the input and re-running. Having a bunch of computers running in parallel means that the computer array can get on and work on several of the obvious permutations all at the same time. I've been able to prove the idea using old junk, and the throughput of (say) 10 computers running different problems is potentially huge. After an initial wait while the first run is done, the results crop up thick and fast. It did strike me that the most efficient time management was to set the whole bunch running last thing in the working day, so that the results of at least one of the runs are there the following morning!

Because my mental picture is of a big FE analysis, I couldn't help wondering whether part of the problem is inefficient storage of the coefficients, which leads to the solution algorithm working on lots of zeroes. It's a big job switching to a different solver, but in FE terms, the bigger the global stiffness matrix becomes, the sparser it tends to get. Time could be spent beneficially in optimising the storage layout of the coefficients. The term "bandwidth" was used in the discussion. Back in the 70s I used an FE system that stored coefficients based on the node number. I found that it was easy to reduce the bandwidth enormously by giving each node an alias number, finding which connections corresponded to the biggest bandwidth and swapping the aliases. (I was never able to get down to the theoretical minimum for ring-shaped meshes, but the impact on bandwidth was huge for a small investment of optimising time.) I scanned and rescanned an array of connectivities to do this. The actual analysis used the aliases, but the results were mapped back to the original mesh numbering so that I could find my way round the results. Optimisation is a key part of the frontal solution approach.
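To make that concrete, the quantity being reduced is simply the largest difference between (alias) node numbers that appear together in any element. A minimal sketch of measuring it from the connectivity array (the array names and layout are illustrative, not from the old program):

      INTEGER FUNCTION MAX_BANDWIDTH (NELEM, NNE, CONN, ALIAS)
!
!     Largest difference between alias node numbers within any element.
!     CONN(1:NNE,IE) holds the node numbers of element IE, and ALIAS
!     maps each original node number to its renumbered alias.
!
      INTEGER NELEM, NNE, CONN(NNE,NELEM), ALIAS(*)
      INTEGER IE, I, J, IDIFF
      MAX_BANDWIDTH = 0
      DO IE = 1, NELEM
         DO I = 1, NNE
            DO J = I+1, NNE
               IDIFF = ABS (ALIAS(CONN(I,IE)) - ALIAS(CONN(J,IE)))
               IF (IDIFF .GT. MAX_BANDWIDTH) MAX_BANDWIDTH = IDIFF
            END DO
         END DO
      END DO
      END

The renumbering then amounts to trying swaps of pairs of aliases and keeping any swap that reduces this figure.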

The answer, I think, lies in much more than just the choice of compiler.

Eddie

30 Nov 2010 4:04 (Edited: 30 Nov 2010 4:40) #7175

Eddie,

Thank you for all your explanations and the time you have invested in writing them down. You are of course right with your conclusion that the compiler is only part of the solution. To avoid any misunderstanding: I love my FTN95 with its WINIO@. All our software and nearly all our GUIs are currently produced with it. The best thing would be an FTN95 compiler producing 64-bit code.

To understand my background: I have been doing professional software development for more than 40 years, starting on the Zuse Z11 and IBM 1130. For my doctoral thesis in 1983 I used mainframe computers such as the CDC Cyber 76 and Cray 1S. One important point of that thesis was to establish and optimise the processing of large equation systems in photogrammetry. I evaluated all kinds of storage schemes and processing algorithms. For our typical sparse matrices I ended up with the profile storage scheme of A. Jennings (1966) and, for profile optimisation, a modified Banker's algorithm from R.A. Snay (1976) (that is why I spoke of "average bandwidth" in my response!). This is best combined with a Cholesky equation solution. On vector computers like the Cray 1S and HP 1000 this was a perfect solution - of course using assembler programming for the inner loops. Because the matrices did not fit into memory, I also optimised the disk I/O.

Typically in photogrammetry we have Taylor-linearised equation systems to solve. In optimisation terms: the target function is least squares, and the conditions are mainly the geometric equations of taking serial aerial photographs of the ground, together with many additional equation types, e.g. for ground control points, GPS in the aircraft, INS (Inertial Navigation System) data and others. Because our equations are linearised, we have to iterate to get a final result, so the Cholesky equation solution is the most time-consuming part of the software. After the last iteration, part of the inverse is computed, along with some further matrices for statistics. All this is very similar to the solution of finite element problems.

This program is part of our big software package, with many graphical programs and many other tools having nice GUIs. Our main program is a pure batch program running in the background. Because it can take hours, we have to look at optimisation. The compiler itself is not the principal question; what matters is the ability to use large memory areas, and for that a 64-bit compiler is needed. RAM disks are only a crutch.

For us, fast processing is not a "holy grail quest" but simply a commercial argument: our customers ask for it frequently. All the tools around this main program have GUIs, and there processing speed is not important (except for our interactive graphical 3D viewer). BTW: our software is used for Bing Maps production, not for user interaction.

A few years ago we ported our whole software package to SuSE Linux using Qt and LF95. We invested one man-year, and the speed was better, but we did not sell a single licence. In our business almost only Windows is in use.

The prices of software tools, compilers and workstation hardware are negligible for us; we play in another price range. Our development and testing workstations naturally have fast RAID disk arrays with 4 to 6 disks using 64-bit RAID controllers from 3ware and Adaptec, and the same is true for our customers. When you know that an aerial survey camera starts at a price of about US$ 600,000, it is clear that only high-end workstations are used for data processing. And speed is very important if you have to process thousands of large photos (up to 750 MB per image). While the image processing can be distributed to several workstations, the Cholesky algorithm CANNOT be distributed, because each dot product needs the result of the previous one. If you look at the discussion from last May, "64-Bit FTN95 Compiler", you can see that not even multi-core helps for this purpose - except when you can hold the complete matrix in memory.

see continuation

30 Nov 2010 4:08 #7176

continuation

Then Cholesky decomposition can be used, which is probably the basis of the software offered on equation.com.

As you can see, we have evaluated ALL the possibilities you mentioned, including assembler programming on a 386/387 PC years ago. I found that my own dot product is the best; it is slightly faster than the DOT_PRODUCTs from Salford and Intel. It uses "unrolling". The best factor is 7, probably because the inner loop then still fits into the program stack of the CPU.
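For anyone who has not used the technique, the structure is roughly as follows (a sketch only, with an unrolling factor of 4 for brevity; a factor of 7 follows the same pattern with seven partial sums, and note that separate partial sums can change the last bits of the rounded result):

      REAL*8 FUNCTION DOT_UNROLLED (A, B, N)
!
!     Dot product unrolled by 4 with independent partial sums, so that
!     several multiply-adds can be in flight in the CPU at once.
!
      INTEGER N, I, M
      REAL*8  A(*), B(*), S1, S2, S3, S4
      S1 = 0.0D0
      S2 = 0.0D0
      S3 = 0.0D0
      S4 = 0.0D0
      M = MOD (N, 4)
      DO I = 1, M                      ! remainder elements first
         S1 = S1 + A(I)*B(I)
      END DO
      DO I = M+1, N, 4                 ! unrolled main loop
         S1 = S1 + A(I  )*B(I  )
         S2 = S2 + A(I+1)*B(I+1)
         S3 = S3 + A(I+2)*B(I+2)
         S4 = S4 + A(I+3)*B(I+3)
      END DO
      DOT_UNROLLED = S1 + S2 + S3 + S4
      END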

My software package has now been on the market for 25 years. Over all those years we have had to improve speed because of growing requirements (more photos, more complex data types).

My final conclusion now: We need 64-bit software to use as much memory as possible. Then we can take advantage of multi-core. Currently we are using only 25% of the power of our quad-core CPUs.

Erwin

30 Nov 2010 4:41 #7177

John,

One comment on your Celeron / Xeon test: as far as I know, the Celeron has no floating point unit. FTN95 is probably not designed for doing its floating point with integer processing. FORTRAN means FORmula TRANslation - of course with floating point values. It looks as if Lahey has invested more in software floating point computation using integer instructions.

Empirical testing on PCs is still a very useful method. I do it as well, because I cannot do any scientific research for processing anymore (no turnover from that sort of work).

Years ago I tried "/optimize" with FTN95 and my software did not run. Now I have tried /P6/optimize once more. The differences are not huge, but in the range of 5 to 10%.

Of course our sparse matrices are stored in a vector.

Dot products and vector multiplications are the most time-consuming operations in our software as well. You mentioned routines such as Vec_Sum and a skyline/Crout solver. Could you give me a hint where to find them, so that I can call them from DLLs?

I try to avoid memory paging completely. I think there should be an API function to keep the whole process and its data in memory, but up to now I have found only "Lock pages in memory", which seems to be something different. I have removed the paging file from the system, but I am not sure that it is sufficient.

Some time ago I saw a Microsoft Windows program that visualises all process activity graphically, like an enhanced Task Manager, but I cannot find it any more. That would be very helpful.

Erwin

30 Nov 2010 8:01 #7179

Eddie and Erwin,

Thank you very much for your responses. Your points are spot on.

I agree that the quest for faster run times is not entirely logical. It is certainly blinkered: for my programs, once the equation matrix does not fit into memory, disk I/O becomes a significant part of the run-time performance. However, problems requiring disk I/O should be avoided where possible; when the matrix is fully in memory the processor speed governs, and iterations are much faster. Hence the need for 64-bit.

Eddie, direct solution of linear equations is not suited to distributed processing, while iterative solvers (such as the Gauss-Seidel and conjugate gradient methods) are. I don't have the time to understand how they work !!

Erwin, my equation solver is based on the skyline equation solver, which is better suited to variable-bandwidth problems. It dates back to the late 70s and draws on papers by Powell and Taylor from UCB. You basically have a profile array holding the bandwidth of each equation of a symmetric banded matrix. The matrix is stored in a vector, with the profile array pointing to the position of each diagonal. It can't do worse than a fixed-band solver, and I have always preferred its simplicity to the frontal solver. The associated bandwidth optimisers are based on papers by Hoit and Sloan (1980s).

For linear equation solution, the matrix reduction is based on dot_product; the 'load vector' is forward-reduced with dot_product, but backward substitution uses vec_sub, which is where I have all the processor unpredictability. It would be interesting to see how equation.com could be used to apply multiple processors to dot_product; it could give up to a 2x or 4x improvement. Dan may know about this, although I'm not sure of the cost of using that software. I am going to see if I can produce a multi-processor version of dot_product to run on a dual- and quad-core machine.
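For anyone not familiar with the scheme, the way the two kernels slot into the solve phase is sketched below. This is only an outline of the standard skyline idea, not my actual code: it assumes the upper triangle of the symmetric matrix is stored by columns in the vector A, each column running from its first non-zero entry down to the diagonal, with IDIAG(j) holding the position of diagonal j, so column j has IH = IDIAG(j) - IDIAG(j-1) - 1 entries above the diagonal.

      REAL*8   A(*), X(*), VEC_SUM
      INTEGER  IDIAG(*), N, J, IH
!
!     Forward reduction of the load vector:
!     x(j) = x(j) - (column j above the diagonal) . x
      DO J = 2, N
         IH = IDIAG(J) - IDIAG(J-1) - 1
         IF (IH .GT. 0) THEN
            X(J) = X(J) - VEC_SUM (A(IDIAG(J-1)+1), X(J-IH), IH)
         END IF
      END DO
!
!     Scale by the factorised diagonal terms
      DO J = 1, N
         X(J) = X(J) / A(IDIAG(J))
      END DO
!
!     Backward substitution:
!     x(j-ih:j-1) = x(j-ih:j-1) - (column j above the diagonal) * x(j)
      DO J = N, 2, -1
         IH = IDIAG(J) - IDIAG(J-1) - 1
         IF (IH .GT. 0) THEN
            CALL VECSUB (X(J-IH), A(IDIAG(J-1)+1), X(J), IH)
         END IF
      END DO

The dot products dominate the forward pass (and the factorisation itself), and VECSUB dominates the backward pass, which is why those two routines account for nearly all of the run time.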

The theme of my latest posts is that the way modern processors optimise code such as dot_product makes old optimisation approaches based on operation counts and assembler less significant, as I think I am right in saying that the processor does other work internally to try to optimise performance. My latest test on the Xeon may have indicated this.

We'll keep trying !!

30 Nov 2010 8:03 #7180

I have removed the paging file from the system, but I am not sure that it is sufficient.

Disabling the swap files is enough.

Some time ago I saw a Microsoft Windows program that visualises all process activity graphically, like an enhanced Task Manager.

Probably Process Explorer and more from http://technet.microsoft.com/de-de/sysinternals/default.aspx

1 Dec 2010 2:57 #7185

Is anyone interested in such libraries? I also have some simple test benches for using these LIB files with FTN95. Let me know by PM.

My interest here is the following. I have two such libraries: one was built for Microsoft Fortran around 10 years ago (I may even keep versions for some other compilers somewhere) and is a bit slower. The latest one, about a year old, is faster, but it does not always work with FTN95 due to incompatibilities, since it was built with Intel Fortran, which is not fully compatible with FTN95. For example, the dense matrix solvers work fine, but some of the sparse ones don't, etc.

But that would be easy to fix for anyone who knows Intel Visual Fortran and how to build a DLL with it. I even have detailed instructions, personally from Intel's Steve Lionel, but I do not have the latest version of IVF and do not have the time to devote to all that; the older version of the library works fine and delivers a speedup I was almost happy with. Using the newer library/DLL would be nice, though, since it is built for modern processors.

If anyone can transform this library into a DLL, that would be a great help for me. The author refused to make DLLs himself, for reasons unknown, although it would be a great universal solution - one single DLL would work across all Windows compilers, instead of making five different LIBs, one for each. But you all know how hard it is to convince people to change something substantial.

Also, on the other matter in this thread, I agree that 64-bit FTN95 is urgently needed. But it seems that does not depend on us much if we cannot vote with our money, and the developers do not clarify the question. I am really puzzled as to where the difficulty lies in moving to 64 bits; many compilers have already done it.

1 Dec 2010 3:13 #7188

Thanks, Sebastian.

There are nice tools on that page.

But I could not find out how to disable the swap file from within a Fortran program?

Erwin

1 Dec 2010 8:14 #7190

I'm not aware that this is possible from an application, only using the operating system configuration (which requires a reboot).

By the way, be sure to check the English (en-us) Sysinternals page as well, since it links to even more programs.
