forums.silverfrost.com Forum Index

Speed improvement 32 vs 64 bit
jcherw



Joined: 27 Sep 2018
Posts: 54
Location: Australia

Posted: Thu Aug 08, 2019 1:14 am    Post subject: Speed improvement 32 vs 64 bit

Could someone give me an indication of what sort of speed improvement to expect when moving from 32 bit to 64 bit for a program that spends most of its time solving a large, sparsely populated tri-diagonal matrix (see e.g. https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/WR025i003p00551)? The algorithms in my program are very robust but a bit dated (1970s - 1980s).
mecej4



Joined: 31 Oct 2006
Posts: 1198

Posted: Thu Aug 08, 2019 2:04 am

The answer depends very much on what you mean by "32-bit" and "64-bit".

If you use the FTN95 compiler, 32-bit programs use X87 instructions and 64-bit programs use SSE2 instructions. SSE2 arithmetic is considerably faster than X87 arithmetic.

Other Fortran compilers can generate SSE2 arithmetic in 32-bit as well as 64-bit programs. They may produce EXEs whose 32-bit version runs faster than the corresponding 64-bit EXEs, because there is less memory-to-CPU data movement in the 32-bit case.

If speed is important, I suggest that the computational part of the program be written in standard Fortran, checked and debugged using FTN95, and then recompiled with another compiler such as Intel, for speed.
jcherw



Joined: 27 Sep 2018
Posts: 54
Location: Australia

Posted: Thu Aug 08, 2019 2:57 am

I have been using FTN95 ver 8.30.0 combined with Plato 4.83. In Plato I selected the Release Win32 and the Release x64 configurations respectively, and then ran both executables from the command prompt. I am indeed planning to trial Intel for speed, as various posts on the web imply that it is optimal (see https://www.fortran.uk/fortran-compiler-comparisons/polyhedron-benchmarks-win64-on-intel/)
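
Editorial sketch: one way to make the command-prompt comparison repeatable is a small timing harness. The Python below is only an illustration. The executable paths are hypothetical placeholders for whatever Plato's Release Win32 and Release x64 configurations produce, and `time_command` is a helper invented for this example.

```python
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Run cmd `repeats` times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Hypothetical paths: substitute the EXEs built by the two Plato configurations.
    builds = {
        "32-bit": [r"Release\Win32\model.exe"],
        "64-bit": [r"Release\x64\model.exe"],
    }
    for label, cmd in builds.items():
        try:
            print(f"{label}: {time_command(cmd):.2f} s")
        except (OSError, subprocess.CalledProcessError) as err:
            print(f"{label}: could not run {cmd[0]} ({err})")
```

Taking the best of several runs reduces the influence of disk caching and background activity on the comparison.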
mecej4



Joined: 31 Oct 2006
Posts: 1198

Posted: Thu Aug 08, 2019 3:23 am

If, indeed, the tridiagonal solution is the main bottleneck, try using the MKL/Lapack routine ?GTSV instead of your own routine. You can call MKL routines from your FTN95 compiled program quite easily if you use the F77 interfaces.
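
Editorial sketch: for readers unfamiliar with what a tridiagonal solve involves, here is the textbook Thomas algorithm in Python. It is not the LAPACK/MKL ?GTSV implementation (which uses Gaussian elimination with partial pivoting) and it is not the poster's routine. It does no pivoting, so it assumes a well-conditioned system such as a diagonally dominant one. The function name and argument layout are choices made for this example.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve A x = rhs for tridiagonal A using the Thomas algorithm.

    lower[i-1] is A[i][i-1], diag[i] is A[i][i], upper[i] is A[i][i+1].
    No pivoting is done, so a zero pivot raises ZeroDivisionError.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination.
    if n > 1:
        c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i - 1] * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


if __name__ == "__main__":
    # 3x3 system whose exact solution is [1, 2, 3].
    print(solve_tridiagonal([-1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0], [0.0, 0.0, 4.0]))
```

The work is O(n) with one division-heavy loop that carries a dependency from row to row, which is part of why such solvers gain less from vector instructions than dense kernels do.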
JohnCampbell



Joined: 16 Feb 2006
Posts: 2143
Location: Sydney

Posted: Thu Aug 08, 2019 5:16 am

With FTN95 /64 you can also get access to SSE and AVX instructions if you use DOT_PRODUCT8@(x,y,n) and AXPY8@(y,x,n,a). These can produce a good performance improvement (note that n must be integer*8).
You can switch between SSE and AVX via USE_AVX@(level).
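
Editorial sketch: the two routines named above correspond to the classic dot-product and AXPY vector kernels. The Python below shows the scalar arithmetic such kernels replace. It assumes the usual BLAS meaning of AXPY (y := y + a*x) and keeps the argument order quoted in the post. Neither assumption has been checked against the FTN95 documentation, and the function names here are illustrative only.

```python
def dot_product(x, y, n):
    """Return the sum of x[i] * y[i] over the first n elements."""
    total = 0.0
    for i in range(n):
        total += x[i] * y[i]
    return total


def axpy(y, x, n, a):
    """Update y in place: y[i] = y[i] + a * x[i] for the first n elements."""
    for i in range(n):
        y[i] += a * x[i]


if __name__ == "__main__":
    print(dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], 3))
    row = [1.0, 1.0, 1.0]
    axpy(row, [1.0, 2.0, 3.0], 3, 2.0)
    print(row)
```

Row updates in Gaussian elimination and inner products in back substitution have exactly this shape, which is why a solver's hot loops can often be handed to such routines.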

For improving the performance of vector calculations with AVX instructions, it is essential to have some understanding of the interaction of memory and cache. Ignoring a potential memory <> cache bottleneck can produce very disappointing performance. (this is the best advice I was once given)

You need to understand, and test, that any performance gain can be very sensitive to how well the arrays stay in cache, so for larger arrays or random memory referencing the gain is not automatic. This applies to both L1 and L2 cache, so it can be a bit of a dark art.

Faster computation requires faster memory transfer rates of large vectors so memory <> cache transfer rates can become a significant performance limiter. Note that 64-bit can imply larger arrays, so greater memory transfer demands, so slower performance.

You may need strategies to reduce the memory transfer rates so a greater proportion of arrays are already in cache, both L2 and L1. I have a pseudo blocked skyline solver which uses 0.5 * L2 cache size blocks to improve cache use efficiency with significant effect.
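
Editorial sketch: the blocking idea can be illustrated by sizing blocks to half the L2 cache and reusing each block for several operations before moving on. The Python below is not the skyline solver described above. The 256 KiB cache size is an assumed example value, and Python lists do not behave like contiguous Fortran arrays, so only the loop order is meaningful here.

```python
def block_elements(l2_cache_bytes, bytes_per_element=8, fraction=0.5):
    """Number of elements that fit in `fraction` of the L2 cache."""
    return max(1, int(l2_cache_bytes * fraction) // bytes_per_element)


def blocked_multi_dot(x, vectors, block):
    """Dot x with every vector in `vectors`, one block of x at a time.

    Each block of x is used against all vectors before the next block is
    touched, so it can stay resident in cache while it is being reused.
    """
    totals = [0.0] * len(vectors)
    for start in range(0, len(x), block):
        stop = min(start + block, len(x))
        for k, v in enumerate(vectors):
            partial = 0.0
            for i in range(start, stop):
                partial += x[i] * v[i]
            totals[k] += partial
    return totals


if __name__ == "__main__":
    # Assumed example: 256 KiB of L2 cache and 8-byte reals give 16384 elements.
    print(block_elements(256 * 1024))
    print(blocked_multi_dot([1, 2, 3, 4, 5], [[1, 1, 1, 1, 1], [0, 1, 0, 1, 0]], 2))
```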

With FTN95 /64 it also depends on how well you can apply the SSE/AVX routines to your calculation.
Linear equation solution has localised code performance hot spots so for vector calculations it can be easy to apply.
For more complex calculations this might not be as easy to implement.

John

Regarding cache use efficiency: gFortran Ver 7 introduced a new version of MATMUL that is based on partitioning the matrices into 4x4 sub-matrices. These two 128 byte arrays fit into L1 cache and produce single thread AVX performance better than what can be achieved with other multi-thread solutions. This produces amazing performance for large arrays.
Unfortunately MATMUL is rarely used in my calculations.
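
Editorial sketch: the partitioning idea mentioned for MATMUL can be shown with a tiled matrix multiply. This Python version is not gFortran's implementation. It only shows the loop structure, with a tile size of 4 chosen to match the 4x4 sub-matrices in the post (16 reals of 8 bytes each is 128 bytes per tile).

```python
def matmul_blocked(a, b, tile=4):
    """Multiply matrices a (n x m) and b (m x p) tile by tile."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, m, tile):
            for j0 in range(0, p, tile):
                # Multiply one tile of a by one tile of b into a tile of c.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, m)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c


def matmul_naive(a, b):
    """Reference triple loop, used here only to check the tiled version."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


if __name__ == "__main__":
    a = [[i + k for k in range(6)] for i in range(5)]
    b = [[k * j + 1 for j in range(7)] for k in range(6)]
    print(matmul_blocked(a, b) == matmul_naive(a, b))
```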
jcherw



Joined: 27 Sep 2018
Posts: 54
Location: Australia

Posted: Thu Aug 08, 2019 6:46 am

I fully agree that optimizing the code and using better algorithms are the best path to more speed. However, the question is why the 32 bit version and the 64 bit version built from the same code with the same compiler (Silverfrost FTN95 v8, as per above) give very similar execution speeds for a calculation conducted in double precision (i.e. Real*8, i.e. 64 bit float). I expected the 64 bit version to do better than the 32 bit one ...
John-Silver



Joined: 30 Jul 2013
Posts: 1226
Location: Aerospace Valley

Posted: Thu Aug 08, 2019 8:33 am

... already a fascinating post this one, thank you jcherw for raising the question.

As a simpleton FE user it's all 'below the bonnet' as far as I'm concerned.
It probably doesn't make a damn bit of difference for we 'mundane' 95% of users but for you high-octane FE developers and the like it's obviously 'critical' in these modern times obsessed with performance statistics.

but I do notice the trend in recent times for FE vendors to be obsessed with speed improvements they achieve.

I guess there are really x3 questions here.
1. Should x64 always be faster than x32 ? (which is the obvious layman's 'expectation')
2a. Is x64 faster than x32 ?
2b ... and if not, why not

As to whether or not we mundanies can contribute to the discussion, maybe there's a SF (or other) benchmark program we can all run to help focus on the actual performance figures , which after all is the reality ?
What do SF use internally to benchmark before release ?

Or maybe that would put too many feeble pigeons amongst the big cats and just create confusion ?


The Pragmatic (Common Sense) Approach to a Solution of FE Solving Speed
Of course, as far as FE is concerned, there's always been a much simpler way to DRAMATICALLY increase the productivity of FE models ..... you limit the maximum size of the damn models being created !!!!

The only reason (most) engineers create ridiculously large models is because the FE salesman tells them they can !!!

I blame the universities too.
Does no one teach them that a 'refined mesh' doesn't mean make the overall model mesh density the same as that small region you need to hypermesh because of a local stress concentration ?
Does no one teach them these days that you mesh globally and then multiply the result by 3 or 5 and you get the same accuracy, or better, than trying to model every tiny 'feature' in CAD model ?
'Idealisation' is not taught these days at all.

As a result of this 2-pronged assault on their intelligence, a brain-washing if you like, there's no longer any regard by young engineers for the size of the beasts they create.

Let alone the computing speed problems it creates, the post-processing of data becomes an absolute nightmare !

All young FE engineers should be forced to wear a T-shirt at work which says 'I made the mistake of making it that large because I can' with a picture of their latest 'creation' below it.

Of course the best way is to not allow them to touch a computer until they pass a rigid test to prove they're not one of the 'damn the size of the model, it looks brill' brigade.

About 15 years ago I started a new job and my first task was to simplify an FE model of a full spacecraft, in part to make it run faster for dynamic analyses. The model had been 'developed' under the control of a stress engineer and just used for stress. The 2 are incompatible ... bring back separate models for statics and dynamics ! is the rallying cry to be learned here !!!

When I first plotted it it was literally just a black blob on the paper !!!

When I'd finished the model was visible when plotted and the runtime dropped from about 3 hours to 10 minutes or so ! ... with very good correlation achieved with the original stress model.

10 years before that I once witnessed a whole mechanical department brought to its knees on a Friday afternoon.
I think I mentioned this before in a post on here a long while back.
An engineer (yes she was young, yes she was a wo- ....) rushed through the department asking everyone to cancel all jobs so that her model wouldn't crash - she had to have the results by Monday morning.

Of course, everyone complied with the demand of the fairer sex, but on further 'interrogation' (in the nicest possible way) one asked ...

'how big is the model ?'

the answer ... 500000 nodes (that was huge at the time, for any model)

'what's it a model of ?'.

the answer ........ a black box !!!!!!!

1998 - 500000 nodes for a black box !!!!!

Cue a mixture of hysterical laughter and/or open-mo
_________________
"Computers (HAL and MARVIN excepted) are incredibly rigid. They question nothing. Especially input data. Human beings are incredibly trusting of computers and don't check input data. Together cocking up even the simplest calculation ... :-)"
John-Silver



Joined: 30 Jul 2013
Posts: 1226
Location: Aerospace Valley

Posted: Thu Aug 08, 2019 8:35 am

(DAMN that post limit !) ...

Now where was I, oh yes ....

Cue a mixture of hysterical laughter and/or open-mouthed gaping.

The FE vendors' obsession with trying to tell users that they can model everything in the finest detail to get perfect results (especially stress results) directly ... from an FE model, winning the day again !

In 1980 when I first started work we had 2000 node complete spacecraft models, and nothing ever went wrong because of it !
... because the structural analysts had the time (and inclination) to ... well, analyze the results.
Structural analysts for the most part today are no longer analysts, they are just number-crunchers, automatons programmed to churn out results, often without examining them in any detail whatsoever. Stuck in the groove believing whatever the 'master', the machine, tells them.

They all need to be forced to wear another T-shirt with my signature below on it too !

End of Part I of the philosophical argument.
jcherw



Joined: 27 Sep 2018
Posts: 54
Location: Australia

Posted: Thu Aug 08, 2019 11:07 am

John -

I am fully on board with optimising models by making them sensible. My first geological flow model in 1982 was ~10,000 nodes. It took a lot of thinking to conceptualise a natural system (in the form of several scenarios of the unknown subsurface) and quite some work to get it running on a mainframe, but it resulted in some good insight. These days I am regularly exposed to multi-million node models put together on a whim with a graphical 3D model builder. And guess what, they often deliver much less understanding, mostly because too little time is spent understanding nature compared with the time spent computing.

Nevertheless, I'd like to understand the tool (compiler) I am using. Thus, I do like to understand the difference between using the 32 bit and 64 bit compiler option. Is it just the extra memory addressing space that can be used? or are there other additional differences?
John-Silver



Joined: 30 Jul 2013
Posts: 1226
Location: Aerospace Valley

Posted: Fri Aug 09, 2019 12:52 am

Fundamentally the basic interest is in the huge potential memory accessibility.
Potential because much depends on the hardware and OS also.

The difficulty then is having the correct memory management strategy in place.

Speed wise the general advice is not to expect any significant improvement, but also don't expect any significant speed reduction either !

Lots of small programs claim they're 64 bit but in fact are nothing of the sort, they're just 64bit computer compatible (which all 32bit programs are) and are installed in the 32bit programs directory !
Those versions provide none of the advantages of 'true' 64bit programs.

It's a bit like USB memory sticks, they're all USB compatible, some are USB & USB2 compatible and some are USB3, USB2 and USB compatible.

My own personal opinion is that except in the specialised high-end development categories (like you're involved in) 64bit is a bit of a red-herring for 95% of users.
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 6069
Location: Salford, UK

Posted: Fri Aug 09, 2019 6:55 am

jcherw

The 64 bit Polyhedron benchmark tests for FTN95 use v8.05 but optimisation was not introduced until v8.10. As I recall, we forgot to disable the switch in v8.05 so this is not to criticise Polyhedron.

At some point I will aim to run the tests again to see how much difference this makes.
jcherw



Joined: 27 Sep 2018
Posts: 54
Location: Australia

Posted: Fri Aug 09, 2019 9:48 am

Here is an interesting link on this subject which I found after lots of googling

https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/298526

This is in line with what some earlier posts mentioned.

So from a speed point of view bigger (64 b) is not necessarily (a lot) better. The extra memory addressing space is obviously the main upside.
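
Editorial sketch: the address-space upside can be put into numbers with a little arithmetic. The Python below computes theoretical upper bounds only. The memory actually usable by a process is lower and depends on the operating system and hardware, and the helper names are invented for this example.

```python
import math

BYTES_PER_REAL8 = 8


def max_real8_elements(address_bits):
    """Upper bound on the REAL*8 values addressable with `address_bits` bits."""
    return 2 ** address_bits // BYTES_PER_REAL8


def max_dense_order(address_bits):
    """Largest n such that a dense n x n REAL*8 matrix fits within that bound."""
    return math.isqrt(max_real8_elements(address_bits))


if __name__ == "__main__":
    for bits in (32, 64):
        print(f"{bits}-bit: up to {max_real8_elements(bits)} reals, "
              f"dense matrix order up to {max_dense_order(bits)}")
```

For 32 bits this gives 536870912 reals, or a dense matrix of order 23170 at most, before any operating system limits are taken into account.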

As per some of the posts, I am currently looking into optimizing algorithms and am, of course, as always mindful that most time is saved by thinking and building understanding before running complex modeling software.

Thanks
mecej4



Joined: 31 Oct 2006
Posts: 1198

Posted: Fri Aug 09, 2019 12:07 pm

PaulLaidler wrote:

The 64 bit Polyhedron benchmark tests for FTN95 use v8.05 but optimisation was not introduced until v8.10. As I recall, we forgot to disable the switch in v8.05 so this is not to criticise Polyhedron.

The Polyhedron results at https://www.fortran.uk/fortran-compiler-comparisons/polyhedron-benchmarks-win64-on-intel/ were obtained with /P6 /OPT using FTN95-8.05, so I find Paul's comment about optimisation puzzling.

I ran some of the Polyhedron benchmarks on a Win-10 PC with an I5-8400 CPU, using FTN95 8.51. Here are some results, all obtained with /OPT.

Code:
  TEST     32-bit   64-bit
  ----       ---     ----
AC            8.8     9.7
Aermod       sqrt(-) 18.1
Air           5.7     7.6
Capacita     28.6    32.0
Channel2    137.3   194.1
Doduc        22.1    21.6 +
Fatigue2    180.7   211.3
Gas_Dyn2    120.2    75.2 +
Induct2     335.5   164.6 +
Linpk         4.1     4.6
MDBX         10.6    11.1
MP_Prop     532.3   583.4
NF           11.8    12.0
Protein      26.5    31.5
Rnflow       31.8    22.0 +
TestFPU2    156.8   111.6 +
TFFT2        46.1    54.7


The lines ending with '+' are the only cases where /64 gave faster runs. For Jcherw, the implication is that /64 will probably produce slightly slower EXEs. Little effort is needed to verify this assertion with his own application -- compile, run and time a test case with and without /64.
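
Editorial sketch: the table is easier to read as ratios. The Python below copies the run times from the table above and computes the 64-bit to 32-bit ratio for each test. Aermod is left out because its 32-bit run failed, and the function names are illustrative.

```python
# Run times copied from the table above as (32-bit, 64-bit), FTN95 8.51 with /OPT.
TIMES = {
    "AC": (8.8, 9.7),
    "Air": (5.7, 7.6),
    "Capacita": (28.6, 32.0),
    "Channel2": (137.3, 194.1),
    "Doduc": (22.1, 21.6),
    "Fatigue2": (180.7, 211.3),
    "Gas_Dyn2": (120.2, 75.2),
    "Induct2": (335.5, 164.6),
    "Linpk": (4.1, 4.6),
    "MDBX": (10.6, 11.1),
    "MP_Prop": (532.3, 583.4),
    "NF": (11.8, 12.0),
    "Protein": (26.5, 31.5),
    "Rnflow": (31.8, 22.0),
    "TestFPU2": (156.8, 111.6),
    "TFFT2": (46.1, 54.7),
}


def speed_ratios(times):
    """Return {test: t64 / t32}; values below 1.0 mean /64 was faster."""
    return {name: t64 / t32 for name, (t32, t64) in times.items()}


def faster_with_64(times):
    """Names of the tests whose 64-bit run time is lower than the 32-bit one."""
    return [name for name, (t32, t64) in times.items() if t64 < t32]


if __name__ == "__main__":
    for name, ratio in speed_ratios(TIMES).items():
        print(f"{name:10s} {ratio:5.2f}")
    print("faster with /64:", ", ".join(faster_with_64(TIMES)))
```

Five of the sixteen tests that ran in both modes come out faster with /64, matching the lines marked '+'.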

The AERMOD test is a strange case. The 32-bit EXE produced by FTN95 8.51 crashes with SQRT(-ve arg), but this does not happen if /OPT is not used. No such problem occurs with Version 7.20, so I suspect there is a new bug in 32-bit optimized compilations with versions 8.20 and later for this program. Given that the source file is over 50,000 lines, I have no incentive to track this down.


Last edited by mecej4 on Sun Aug 11, 2019 1:51 pm; edited 2 times in total
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 6069
Location: Salford, UK

Posted: Fri Aug 09, 2019 1:17 pm

mecej4

The Polyhedron results for 64 bit FTN95 are without optimisation. The switch /opt was permitted at v8.05 but had no effect. Optimisation was introduced later at v8.10.
mecej4



Joined: 31 Oct 2006
Posts: 1198

Posted: Fri Aug 09, 2019 1:31 pm

BUT...

Polyhedron did not build for X64 (at least on the page for which I gave a link above). Below the table, under "Compiler switches", you can see for FTN95:
Quote:
FTN95 ftn95 /p6 /optimize (slink was used to increase the stack size)

Note the presence of /p6. Therefore, they only produced and ran a 32-bit EXE. That the OS is reported as W64 is probably of no concern for comparison purposes.

Or, Paul, do you have a different Polyhedron page in mind?

PS: Some points that you made about 8.05 did not agree with my vague recollections, so I re-installed that old version from a backup that I had. I find that the 8.05 compiler aborts compilation when given /opt /64 :

Code:
S:\PolyHed\pb11\win\source>ftn95 /opt /64 ac.f90 /link
[FTN95/Win32 Ver. 8.05.0 Copyright (c) Silverfrost Ltd 1993-2016]
*** /OPTIMISE is not available in FTN95/64

    1 ERROR [] - Compilation failed.
All times are GMT + 1 Hour