forums.silverfrost.com

jcherw · Joined: 27 Sep 2018 Posts: 57 Location: Australia

Could someone give me an indication of what sort of speed improvement to expect when moving from 32-bit to 64-bit for a program that spends most of its time solving a large, sparsely populated tri-diagonal matrix (see e.g. https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/WR025i003p00551)? The algorithms in my program are very robust but a bit dated (1970s - 1980s).

mecej4 · Joined: 31 Oct 2006 Posts: 1949 Location: USA

The answer depends very much on what you mean by "32-bit" and "64-bit".

If you use the FTN95 compiler, 32-bit programs use X87 instructions and 64-bit programs use SSE2 instructions. SSE2 arithmetic is considerably faster than X87 arithmetic.

Other Fortran compilers can generate SSE2 arithmetic in 32-bit as well as 64-bit programs. They may produce EXEs whose 32-bit version runs faster than the corresponding 64-bit EXEs, because there is less memory to CPU data movement in the 32-bit case.

If speed is important, I suggest that the computational part of the program be written in standard Fortran, checked and debugged using FTN95, and then recompiled with another compiler such as Intel, for speed.

jcherw · Joined: 27 Sep 2018 Posts: 57 Location: Australia

I have been using FTN95 ver 8.30.0 combined with Plato 4.83. In Plato I selected Release Win32 and Release x64 respectively, and then ran both executables from the command prompt. I am indeed planning to trial Intel for speed, as various posts on the web imply that it is optimal (see https://www.fortran.uk/fortran-compiler-comparisons/polyhedron-benchmarks-win64-on-intel/).

mecej4 · Joined: 31 Oct 2006 Posts: 1949 Location: USA

If, indeed, the tridiagonal solution is the main bottleneck, try using the MKL/Lapack routine ?GTSV instead of your own routine. You can call MKL routines from your FTN95 compiled program quite easily if you use the F77 interfaces.
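
For readers who want to see what such a tridiagonal solve involves, here is a minimal sketch in Python of the textbook Thomas algorithm (forward elimination followed by back substitution). It is an illustration only and not the LAPACK/MKL ?GTSV routine, which additionally applies partial pivoting. The function name `solve_tridiagonal` and the argument layout are assumptions made for this example.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm (no pivoting).

    lower: sub-diagonal, length n-1
    diag:  main diagonal, length n
    upper: super-diagonal, length n-1
    rhs:   right-hand side, length n
    Assumption: the matrix is safe to eliminate without pivoting,
    e.g. it is diagonally dominant.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination
    if n > 1:
        c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i - 1] * d[i - 1]) / denom

    # Back substitution
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

Each row costs only a few floating-point operations, which is one reason the run time of a solve like this can depend as much on memory traffic as on the instruction set.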

JohnCampbell · Joined: 16 Feb 2006 Posts: 2629 Location: Sydney

With FTN95 /64 you can also get access to SSE and AVX instructions if you use DOT_PRODUCT8@(x,y,n) and AXPY8@(y,x,n,a). These can produce a good performance improvement. (Note that n must be integer*8.)
You can switch between SSE and AVX via USE_AVX@(level).
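
As a rough illustration of the two kernels named above, the sketch below shows in Python the operations that a dot product and an AXPY conventionally perform. It assumes that AXPY8@(y,x,n,a) follows the usual BLAS meaning, y = y + a*x over the first n elements; check the FTN95 documentation for the exact behaviour. The Python function names are hypothetical, and nothing here uses SSE or AVX.

```python
def dot_product(x, y, n):
    """Return the sum of x[i] * y[i] over the first n elements."""
    total = 0.0
    for i in range(n):
        total += x[i] * y[i]
    return total


def axpy(y, x, n, a):
    """Update y in place: y[i] = y[i] + a * x[i] for the first n elements.

    Assumption: this mirrors the conventional BLAS AXPY definition.
    """
    for i in range(n):
        y[i] += a * x[i]
```

Both loops are simple streaming operations over contiguous vectors, which is the kind of inner loop that vector instructions are designed for.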

For improving the performance of vector calculations with AVX instructions, it is essential to have some understanding of the interaction of memory and cache. Ignoring a potential memory <> cache bottleneck can produce very disappointing performance. (this is the best advice I was once given)

You need to understand, and test, that any improvement in performance can be very sensitive to how arrays are used in cache, so for larger arrays or random memory referencing the improvement may not be automatic. This applies to both L1 and L2 cache, so it can be a bit of a dark art.

Faster computation requires faster memory transfer rates of large vectors so memory <> cache transfer rates can become a significant performance limiter. Note that 64-bit can imply larger arrays, so greater memory transfer demands, so slower performance.

You may need strategies to reduce the memory transfer rates so a greater proportion of arrays are already in cache, both L2 and L1. I have a pseudo blocked skyline solver which uses 0.5 * L2 cache size blocks to improve cache use efficiency with significant effect.
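
To make the idea of cache-sized blocks concrete, here is a small Python sketch that groups consecutive matrix columns into blocks whose storage fits within a fraction of the L2 cache. It is not the solver described above, whose details are not given in this thread. The function name `partition_columns`, the use of column heights as the storage measure, and the 8-byte value size are all assumptions for illustration.

```python
def partition_columns(column_heights, l2_bytes, fraction=0.5, bytes_per_value=8):
    """Group consecutive columns into blocks that fit a cache budget.

    column_heights: number of stored values in each column
    l2_bytes:       L2 cache size in bytes (hypothetical input)
    fraction:       share of L2 to use per block (0.5 as in the post)
    A column larger than the budget is placed in a block of its own.
    """
    budget = int(l2_bytes * fraction) // bytes_per_value  # values per block
    blocks = []
    current = []
    used = 0
    for col, height in enumerate(column_heights):
        # Close the current block if this column would overflow the budget
        if current and used + height > budget:
            blocks.append(current)
            current = []
            used = 0
        current.append(col)
        used += height
    if current:
        blocks.append(current)
    return blocks
```

A solver could then work through one block at a time, so that the data for that block tends to stay in cache while it is being reused.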

With FTN95 /64 it also depends on how well you can apply the SSE/AVX routines to your calculation.
Linear equation solution has localised code performance hot spots so for vector calculations it can be easy to apply.
For more complex calculations this might not be as easy to implement.

John

Regarding cache use efficiency: gFortran Ver 7 introduced a new version of MATMUL that is based on partitioning the matrices into 4x4 sub-matrices. These two 128-byte arrays fit into L1 cache and produce single-thread AVX performance better than what can be achieved with other multi-thread solutions. This produces amazing performance for large arrays.
Unfortunately MATMUL is rarely used in my calculations.
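
The general technique of partitioning a matrix product into small tiles can be sketched as follows. This is a plain Python illustration of loop tiling, not gfortran's actual MATMUL implementation; the function name `blocked_matmul` and the default tile size of 4 are assumptions chosen to match the 4x4 sub-matrices mentioned above (16 values of 8 bytes each is 128 bytes).

```python
def blocked_matmul(a, b, tile=4):
    """Multiply matrices a (n x m) and b (m x p) tile by tile.

    Matrices are lists of rows. The three outer loops walk over tiles;
    the three inner loops multiply one tile of a by one tile of b and
    accumulate into the matching tile of c.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, m, tile):
            for j0 in range(0, p, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, m)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c
```

The result is the same as an ordinary triple loop; only the order of the work changes, so that each small tile is reused while it is likely still in cache.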

jcherw · Joined: 27 Sep 2018 Posts: 57 Location: Australia

I fully agree that optimizing the code and using better algorithms are the best path to more speed. However, the question is why the 32-bit and 64-bit versions created from the same code with the same compiler (Silverfrost FTN95 v8, as per above) give very similar execution speeds for a calculation conducted in double precision (i.e. Real*8, a 64-bit float). I expected the 64-bit version to do better than the 32-bit one ...

jcherw · Joined: 27 Sep 2018 Posts: 57 Location: Australia

John -

I am fully on board with optimising models by making them sensible. My first geological flow model in 1982 was ~10,000 nodes. It took a lot of thinking to conceptualise a natural system (in the form of several scenarios of the unknown subsurface) and quite some work to get it running on a mainframe, but it resulted in some good insight. These days I am regularly exposed to multi-million node models put together on a whim with a graphical 3D model builder. And guess what, they often deliver much less understanding, mostly because insufficient time is spent understanding nature compared with the time spent computing.

Nevertheless, I'd like to understand the tool (compiler) I am using, and so I would like to understand the difference between the 32-bit and 64-bit compiler options. Is it just the extra memory addressing space that can be used? Or are there other differences?

PaulLaidler · Posted: Fri Aug 09, 2019 6:55 am

jcherw

The 64 bit Polyhedron benchmark tests for FTN95 use v8.05 but optimisation was not introduced until v8.10. As I recall, we forgot to disable the switch in v8.05 so this is not to criticise Polyhedron.

At some point I will aim to run the tests again to see how much difference this makes.

jcherw · Joined: 27 Sep 2018 Posts: 57 Location: Australia

Here is an interesting link on this subject which I found after lots of googling

https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/298526

This is in line with what some earlier posts mentioned.

So from a speed point of view, bigger (64-bit) is not necessarily (a lot) better. The extra memory addressing space is obviously the main upside.

As per some of the posts, I am currently looking into optimizing algorithms and, of course, I remain vigilant that most time is saved by thinking and building understanding before running complex modeling software.

Thanks

mecej4 · Joined: 31 Oct 2006 Posts: 1949 Location: USA

PaulLaidler · Posted: Fri Aug 09, 2019 1:17 pm

mecej4

The Polyhedron results for 64 bit FTN95 are without optimisation. The switch /opt was permitted at v8.05 but had no effect. Optimisation was introduced later at v8.10.

mecej4 · Joined: 31 Oct 2006 Posts: 1949 Location: USA

BUT...

Polyhedron did not build for X64 (at least on the page for which I gave a link above). Below the table, under "Compiler switches", you can see for FTN95:

PaulLaidler · Posted: Fri Aug 09, 2019 5:47 pm

OK thanks. Either way the results are not for FTN95 64 bit optimised code.

mecej4 · Joined: 31 Oct 2006 Posts: 1949 Location: USA

Jcherw: You may find this older thread relevant to your question regarding the performance of linear equation solvers.

http://forums.silverfrost.com/viewtopic.php?t=3063

In that thread, John Campbell, DanRRight and I ran tests on the performance of MKL, Pardiso and Laipe linear equation solvers during 2015 - 2017. Most of the posts in that thread predate the emergence of 64-bit FTN95, but whether the tests were run using 32-bit or 64-bit CPUs was not a major issue. Parallelism, FPU instruction sets and the ability to exploit matrix sparsity and structure were found to affect performance.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2629 Location: Sydney

mecej4,

You refer to an interesting thread. I should try to update this thread for using ftn95/64 and SSE/AVX instruction set routines. As a single thread solution, the approach of a "cache blocked" single thread solver should show significant improvement, due to being able to use these instructions in an efficient way.

Most of my applications are now running as 64-bit, although large memory solvers are not necessarily a faster solution, as they can be applied in a lazy, cache-inefficient way.

As you note, the choice of linear equation solver approach will always be based on which one best exploits the matrix sparsity. These are available in both 32-bit and 64-bit.

A lot of the advantage expected from converting to 64-bit applications has been mitigated by simply using a 64-bit O/S with improved disk buffering.

The key advantages in moving to FTN95 /64 are availability of SIMD instructions and coding simplicity with larger arrays, providing it is not done in a cache lazy way. The old virtual memory coding approaches of the 70's are still very useful for /64.