jcherw
Joined: 27 Sep 2018 Posts: 57 Location: Australia
|
Posted: Thu Aug 08, 2019 1:14 am Post subject: Speed improvement 32 vs 64 bit |
|
|
Could someone give me an indication of what sort of speed improvement to expect when moving from 32-bit to 64-bit for a program that spends most of its time solving a large, sparsely populated tri-diagonal matrix (see e.g. https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/WR025i003p00551)? The algorithms in my program are very robust but a bit dated (1970s - 1980s). |
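For reference, the kind of 1970s-style tridiagonal solve being described is typically the Thomas algorithm. A minimal sketch follows (not the poster's actual routine, and it assumes a diagonally dominant matrix so that no pivoting is needed):
Code: | ! Sketch of a classic Thomas-algorithm tridiagonal solve.
! Not the poster's actual code; assumes diagonal dominance (no pivoting).
      SUBROUTINE THOMAS(DL, D, DU, B, N)
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: N
      DOUBLE PRECISION, INTENT(IN)    :: DL(N), DU(N)   ! sub- and super-diagonals
      DOUBLE PRECISION, INTENT(INOUT) :: D(N), B(N)     ! diagonal; RHS -> solution
      DOUBLE PRECISION :: W
      INTEGER :: I
!     Forward elimination
      DO I = 2, N
         W = DL(I) / D(I-1)
         D(I) = D(I) - W * DU(I-1)
         B(I) = B(I) - W * B(I-1)
      END DO
!     Back substitution
      B(N) = B(N) / D(N)
      DO I = N-1, 1, -1
         B(I) = (B(I) - DU(I) * B(I+1)) / D(I)
      END DO
      END SUBROUTINE THOMAS |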
mecej4
Joined: 31 Oct 2006 Posts: 1899
|
Posted: Thu Aug 08, 2019 2:04 am Post subject: |
|
|
The answer depends very much on what you mean by "32-bit" and "64-bit".
If you use the FTN95 compiler, 32-bit programs use X87 instructions and 64-bit programs use SSE2 instructions. SSE2 arithmetic is considerably faster than X87 arithmetic.
Other Fortran compilers can generate SSE2 arithmetic in 32-bit as well as 64-bit programs, and they may produce EXEs whose 32-bit version runs faster than the corresponding 64-bit EXE, because there is less memory-to-CPU data movement in the 32-bit case.
If speed is important, I suggest writing the computational part of the program in standard Fortran, checking and debugging it with FTN95, and then recompiling it with another compiler, such as Intel, for speed. |
jcherw
Joined: 27 Sep 2018 Posts: 57 Location: Australia
|
mecej4
Joined: 31 Oct 2006 Posts: 1899
|
Posted: Thu Aug 08, 2019 3:23 am Post subject: |
|
|
If, indeed, the tridiagonal solution is the main bottleneck, try using the MKL/LAPACK routine ?GTSV instead of your own routine. You can call MKL routines from your FTN95-compiled program quite easily if you use the F77 interfaces. |
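For illustration, a minimal sketch of calling DGTSV (the double-precision instance of ?GTSV) through the F77 interface; the test matrix and right-hand side below are hypothetical:
Code: | ! Minimal sketch: solve a tridiagonal system with LAPACK's DGTSV via the
! F77 interface (available in MKL or any LAPACK build).
      PROGRAM TRIDEMO
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 1000
      DOUBLE PRECISION :: DL(N-1), D(N), DU(N-1), B(N)
      INTEGER :: INFO
!     Diagonally dominant test matrix and right-hand side
      DL = -1.0D0
      DU = -1.0D0
      D  =  4.0D0
      B  =  1.0D0
!     DGTSV overwrites DL, D, DU and returns the solution in B
      CALL DGTSV(N, 1, DL, D, DU, B, N, INFO)
      IF (INFO /= 0) PRINT *, 'DGTSV failed, INFO =', INFO
      END PROGRAM TRIDEMO |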
JohnCampbell
Joined: 16 Feb 2006 Posts: 2615 Location: Sydney
|
Posted: Thu Aug 08, 2019 5:16 am Post subject: |
|
|
With FTN95 /64 you can also get access to the SSE and AVX instructions if you use DOT_PRODUCT8@(x,y,n) and AXPY8@(y,x,n,a). These can produce a good performance improvement. (Note that n is INTEGER*8.)
You can switch between SSE and AVX via USE_AVX@(level).
For improving the performance of vector calculations with AVX instructions, it is essential to have some understanding of the interaction between memory and cache. Ignoring a potential memory <> cache bottleneck can produce very disappointing performance. (This is the best advice I was once given.)
You need to understand, and test, that the possible performance improvement can be very sensitive to cached array usage, so for larger arrays or random memory references the improvement may not be automatic. This applies to both the L1 and L2 cache, so it can be a bit of a dark art.
Faster computation requires faster memory transfer of large vectors, so memory <> cache transfer rates can become a significant performance limiter. Note that 64-bit can imply larger arrays, hence greater memory transfer demands and slower performance.
You may need strategies to reduce the memory transfer rates so that a greater proportion of the arrays is already in cache, both L2 and L1. I have a pseudo-blocked skyline solver which uses blocks of 0.5 * L2 cache size to improve cache efficiency, with significant effect.
With FTN95 /64 it also depends on how well you can apply the SSE/AVX routines to your calculation.
Linear equation solution has localised performance hot spots, so for vector calculations these routines can be easy to apply.
For more complex calculations this might not be as easy to implement.
John
Regarding cache efficiency: gFortran version 7 introduced a new version of MATMUL that partitions the matrices into 4x4 sub-matrices. These two 128-byte blocks fit into the L1 cache and give single-thread AVX performance better than what other, multi-threaded, solutions can achieve. This produces amazing performance for large arrays.
Unfortunately MATMUL is rarely used in my calculations. |
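A rough sketch of what John describes, assuming DOT_PRODUCT8@(x,y,n) returns the dot product of x and y and AXPY8@(y,x,n,a) performs a DAXPY-like update y = y + a*x, with the argument orders as quoted above (check the FTN95 documentation before relying on these semantics):
Code: | ! Sketch only: FTN95 /64 SIMD helper routines as described above.
! Assumed semantics: DOT_PRODUCT8@(x,y,n) = dot product of x and y;
! AXPY8@(y,x,n,a) performs y = y + a*x.  Note the INTEGER*8 length.
      SUBROUTINE SIMD_DEMO(X, Y, N, ALPHA, S)
      IMPLICIT NONE
      INTEGER*8, INTENT(IN) :: N
      DOUBLE PRECISION, INTENT(IN)    :: X(N), ALPHA
      DOUBLE PRECISION, INTENT(INOUT) :: Y(N)
      DOUBLE PRECISION, INTENT(OUT)   :: S
      DOUBLE PRECISION DOT_PRODUCT8@
!     s = sum(x*y), computed with SSE/AVX
      S = DOT_PRODUCT8@(X, Y, N)
!     y = y + alpha*x, computed with SSE/AVX
      CALL AXPY8@(Y, X, N, ALPHA)
      END SUBROUTINE SIMD_DEMO |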
jcherw
Joined: 27 Sep 2018 Posts: 57 Location: Australia
|
Posted: Thu Aug 08, 2019 6:46 am Post subject: |
|
|
I fully agree that optimizing the code and better algorithms are the best path to more speed. However, the question is why the 32-bit and 64-bit versions built from the same code with the same compiler (Silverfrost FTN95 v8, as per above) give very similar execution speeds for a calculation conducted in double precision (i.e. REAL*8, a 64-bit float). I expected the 64-bit version to do better than the 32-bit one ... |
jcherw
Joined: 27 Sep 2018 Posts: 57 Location: Australia
|
Posted: Thu Aug 08, 2019 11:07 am Post subject: |
|
|
John -
I am fully on board with optimising models by making them sensible. My first geological flow model, in 1982, was ~10,000 nodes. It took a lot of thinking to conceptualise a natural system (in the form of several scenarios of the unknown subsurface) and quite some work to get it running on a mainframe, but it resulted in some good insight. These days I regularly get exposed to multi-million node models put together on a whim with a graphical 3D model builder. And guess what: they often deliver much less understanding, mostly because insufficient time is spent understanding nature compared with the time spent doing computing.
Nevertheless, I'd like to understand the tool (compiler) I am using, so I would like to understand the difference between the 32-bit and 64-bit compiler options. Is it just the extra memory address space that can be used, or are there other differences? |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8210 Location: Salford, UK
|
Posted: Fri Aug 09, 2019 6:55 am Post subject: |
|
|
jcherw
The 64-bit Polyhedron benchmark tests for FTN95 used v8.05, but optimisation was not introduced until v8.10. As I recall, we forgot to disable the switch in v8.05, so this is not a criticism of Polyhedron.
At some point I will aim to run the tests again to see how much difference this makes. |
jcherw
Joined: 27 Sep 2018 Posts: 57 Location: Australia
|
Posted: Fri Aug 09, 2019 9:48 am Post subject: |
|
|
Here is an interesting link on this subject which I found after lots of googling
https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/298526
This is in line with what some earlier posts mentioned.
So from a speed point of view, bigger (64-bit) is not necessarily (a lot) better. The extra memory address space is obviously the main upside.
As per some of the posts, I am currently looking into optimizing the algorithms and, of course, as always remaining vigilant that most time is saved by thinking and building understanding before running complex modelling software.
Thanks |
mecej4
Joined: 31 Oct 2006 Posts: 1899
|
Posted: Fri Aug 09, 2019 12:07 pm Post subject: Re: |
|
|
PaulLaidler wrote: |
The 64 bit Polyhedron benchmark tests for FTN95 use v8.05 but optimisation was not introduced until v8.10. As I recall, we forgot to disable the switch in v8.05 so this is not to criticise Polyhedron.
|
The Polyhedron results at https://www.fortran.uk/fortran-compiler-comparisons/polyhedron-benchmarks-win64-on-intel/ were obtained with /P6 /OPT using FTN95-8.05, so I find Paul's comment about optimisation puzzling.
I ran some of the Polyhedron benchmarks on a Win-10 PC with an I5-8400 CPU, using FTN95 8.51. Here are some results, all obtained with /OPT.
Code: | TEST 32-bit 64-bit
---- --- ----
AC 8.8 9.7
Aermod sqrt(-) 18.1
Air 5.7 7.6
Capacita 28.6 32.0
Channel2 137.3 194.1
Doduc 22.1 21.6 +
Fatigue2 180.7 211.3
Gas_Dyn2 120.2 75.2 +
Induct2 335.5 164.6 +
Linpk 4.1 4.6
MDBX 10.6 11.1
MP_Prop 532.3 583.4
NF 11.8 12.0
Protein 26.5 31.5
Rnflow 31.8 22.0 +
TestFPU2 156.8 111.6 +
TFFT2 46.1 54.7 |
The lines ending with '+' are the only cases where /64 gave faster runs. For Jcherw, the implication is that /64 will probably produce slightly slower EXEs. Little effort is needed to verify this assertion with his own application -- compile, run and time a test case with and without /64.
The AERMOD test is a strange case. The 32-bit EXE produced by FTN95 8.51 crashes with SQRT(-ve arg), but this does not happen if /OPT is not used. No such problem occurs with Version 7.20, so I suspect there is a new bug in 32-bit optimized compilations with versions 8.20 and later for this program. Given that the source file is over 50,000 lines, I have no incentive to track this down.
Last edited by mecej4 on Sun Aug 11, 2019 1:51 pm; edited 2 times in total |
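As a concrete sketch of the compile-and-time comparison suggested above (the source file name is hypothetical; /OPT, /64 and /LINK are the switches already used elsewhere in this thread; time each run with a stopwatch or a CPU_TIME call inside the program):
Code: | rem Build and run the same source as 32-bit and then as 64-bit, both with /OPT
ftn95 mysolver.f90 /opt /link
mysolver.exe
rem Rebuild as 64-bit (this overwrites the previous EXE)
ftn95 mysolver.f90 /opt /64 /link
mysolver.exe |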
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8210 Location: Salford, UK
|
Posted: Fri Aug 09, 2019 1:17 pm Post subject: |
|
|
mecej4
The Polyhedron results for 64 bit FTN95 are without optimisation. The switch /opt was permitted at v8.05 but had no effect. Optimisation was introduced later at v8.10. |
mecej4
Joined: 31 Oct 2006 Posts: 1899
|
Posted: Fri Aug 09, 2019 1:31 pm Post subject: |
|
|
BUT...
Polyhedron did not build for X64 (at least on the page for which I gave a link above). Below the table, under "Compiler switches", you can see for FTN95:
Quote: | FTN95 ftn95 /p6 /optimize (slink was used to increase the stack size) |
Note the presence of /p6. Therefore, they only produced and ran a 32-bit EXE. That the OS is reported as W64 is probably of no concern for comparison purposes.
Or, Paul, do you have a different Polyhedron page in mind?
PS: Some points that you made about 8.05 did not agree with my vague recollections, so I re-installed that old version from a backup that I had. I find that the 8.05 compiler aborts compilation when given /opt /64 :
Code: | S:\PolyHed\pb11\win\source>ftn95 /opt /64 ac.f90 /link
[FTN95/Win32 Ver. 8.05.0 Copyright (c) Silverfrost Ltd 1993-2016]
*** /OPTIMISE is not available in FTN95/64
1 ERROR [] - Compilation failed. |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8210 Location: Salford, UK
|
Posted: Fri Aug 09, 2019 5:47 pm Post subject: |
|
|
OK thanks. Either way the results are not for FTN95 64 bit optimised code. |
mecej4
Joined: 31 Oct 2006 Posts: 1899
|
Posted: Sun Aug 11, 2019 2:05 pm Post subject: |
|
|
Jcherw: You may find this older thread relevant to your question regarding the performance of linear equation solvers.
http://forums.silverfrost.com/viewtopic.php?t=3063
In that thread, John Campbell, DanRRight and I ran tests on the performance of the MKL, Pardiso and Laipe linear equation solvers during 2015 - 2017. Most of the posts in that thread predate the emergence of 64-bit FTN95, but whether the tests were run as 32-bit or 64-bit was not a major issue. Parallelism, FPU instruction sets and the ability to exploit matrix sparsity and structure were found to affect performance. |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2615 Location: Sydney
|
Posted: Sun Aug 18, 2019 5:21 am Post subject: |
|
|
mecej4,
You refer to an interesting thread. I should try to update it for FTN95 /64 and the SSE/AVX instruction set routines. As a single-thread solution, the approach of a "cache blocked" single-thread solver should show significant improvement, due to being able to use these instructions in an efficient way.
Most of my applications are now running as 64-bit, although large-memory solvers are not necessarily faster, as they can be applied in a lazy, cache-inefficient way.
As you note, the choice of linear equation solver will always be based on which one best exploits the matrix sparsity and structure. These solvers are available in both 32-bit and 64-bit.
A lot of the advantage expected from converting applications to 64-bit is already provided by simply using a 64-bit O/S with its improved disk buffering.
The key advantages of moving to FTN95 /64 are the availability of SIMD instructions and the coding simplicity of larger arrays, provided it is not done in a cache-lazy way. The old virtual memory coding approaches of the '70s are still very useful for /64. |
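As an illustration of the cache-blocking idea discussed above, a minimal sketch of a blocked matrix multiply; NB is a hypothetical tuning parameter, and sizing the blocks so they stay resident in roughly half the L2 cache is the sort of choice John describes:
Code: | ! Illustrative cache-blocked matrix multiply C = C + A*B.  Sketch only;
! NB is a tuning parameter chosen so the blocks stay resident in cache.
      SUBROUTINE BLOCKED_MATMUL(A, B, C, N, NB)
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: N, NB
      DOUBLE PRECISION, INTENT(IN)    :: A(N,N), B(N,N)
      DOUBLE PRECISION, INTENT(INOUT) :: C(N,N)
      INTEGER :: I, J, K, II, JJ, KK
      DO JJ = 1, N, NB                       ! block columns of C and B
        DO KK = 1, N, NB                     ! block the inner (k) dimension
          DO II = 1, N, NB                   ! block rows of C and A
            DO J = JJ, MIN(JJ+NB-1, N)
              DO K = KK, MIN(KK+NB-1, N)
                DO I = II, MIN(II+NB-1, N)
                  C(I,J) = C(I,J) + A(I,K) * B(K,J)
                END DO
              END DO
            END DO
          END DO
        END DO
      END DO
      END SUBROUTINE BLOCKED_MATMUL |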