Fortran modernisation workshop
LitusSaxonicum
Posted: Fri Oct 14, 2016 2:18 pm

Thanks, Mecej4. I took the hint from Dave Bailey's thread that the 64-bit version didn't support REAL*10 and that SSEx and AVX instructions were in it. Even today, I suspect that some idea of relative performance would be useful: for example, if FTN95 already runs faster in 64-bit than in 32-bit, then it can only get better. If it is a lot slower, then it will be an uphill struggle. There was also a hint in another thread that at the end of a long calculation the results diverged; I have no idea what the implications of that are.

Eddie
JohnCampbell
Posted: Sat Oct 15, 2016 1:50 am

Eddie,

The present release of FTN95 /64 has removed support for 80-bit REAL*10.
My understanding is that the only support for SSE/AVX instructions is via library functions, so these instructions are not generated for ordinary compiled code, such as the code used in the Polyhedron benchmarks.
As a consequence, the present /64 compiler is slow for numerically intensive calculations, as optimisation is not yet provided, and it is significantly slower than FTN95 /32 for numerically intensive REAL*8 work. The Polyhedron benchmarks would not look good!

It will be interesting to see what optimisation is provided in /64, as on other compilers significant performance gains come from SSE/AVX vectorisation of inner DO loops (a loop of the kind that benefits is sketched below). We will see whether the FTN95 /64 /OPT project introduces this kind of optimisation, or is limited to replicating the /32 optimisation approaches.
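To show what is meant, here is a minimal sketch of a vectorisable inner loop (not FTN95-specific, and the routine name is made up): unit stride and no dependence between iterations let a vectorising compiler process 2 (SSE2) or 4 (AVX) REAL*8 elements per packed instruction.
Code:
      SUBROUTINE AXPY (N, ALPHA, X, Y)
      INTEGER N, I
      REAL*8 ALPHA, X(N), Y(N)
!     Unit stride and no loop-carried dependence: each iteration is
!     independent, so the loop maps onto packed SSE/AVX multiply-adds
      DO I = 1, N
         Y(I) = Y(I) + ALPHA*X(I)
      END DO
      END
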

My experience is that there is an insignificant performance penalty (effectively none) in moving from 32-bit to 64-bit addressing, although more generally the changed instruction set has both positive and negative effects. Changing from INTEGER*4 to INTEGER*8 is not a performance problem.
Certainly, using the larger arrays that /64 makes possible can result in more memory-to-cache delays, especially if arrays are not addressed with a simple unit stride. Poor array addressing and cache overflow are penalised more with /64.

My understanding is that FTN95 /32 uses x87 instructions, although I am not sure to what extent it exploits their 80-bit precision (the x87 registers are still 80 bits wide). Most REAL*8 calculations retain only 64 bits, but if the 80-bit registers are used, say for the accumulator in a dot_product, higher precision can be available.
My understanding is that for REAL*8 the Fortran 90/95 standard requires calculations to be done to 64 bits, which negates the 80-bit precision. My recollection is that moving from F77 to F95 saw a loss of precision.
If you ran a program that genuinely needed REAL*10 (80-bit) precision, you could notice a slight loss of accuracy under /64. Such cases are unusual.
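The idea can be illustrated with the REAL*10 extension that /32 still supports (a sketch; the function name is made up):
Code:
      REAL*8 FUNCTION DOT80 (A, B, N)
      INTEGER N, I
      REAL*8 A(N), B(N)
      REAL*10 S                ! 80-bit x87 accumulator
      S = 0.0
      DO I = 1, N
         S = S + A(I)*B(I)     ! partial sums kept to 80 bits
      END DO
      DOT80 = S                ! rounded to 64 bits only once, here
      END
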

As I say, the present FTN95 /64 is significantly slower than FTN95 /32 for numerically intensive REAL*8 calculations. What is interesting is that I have not seen any complaints about this; I suspect most users are more interested in the extra memory available.
I know that, for graphics, I will not be returning to 32-bit ClearWin+.
PaulLaidler (Site Admin)
Posted: Sat Oct 15, 2016 8:52 am

I have run the Polyhedron benchmark tests, and the current non-optimised 64-bit results are on a par with, or somewhat faster than, the corresponding non-optimised 32-bit results.
JohnCampbell
Posted: Sun Oct 16, 2016 1:55 am

Hi Paul,

I was mistaken in my comments above: I was recalling comparison tests of FTN95 /opt against FTN95 /64, both of which are very slow.

I have adapted a test I have been using recently to compare the performance of a dot_product routine. I tested dot_product_new(a,b,n) with different compilation options and N in the range 1 to 50 (for SSE instructions, larger values of N give better performance).

FTN95 /32    0.335 Gflop/s   (10^9 floating-point multiplies per second)
FTN95 /64    0.423
FTN95 /OPT   0.743
FTN95 /64    1.225           (using DOT_PRODUCT8@)

I should resurrect the FTN95 /32 test with davidb's SSE routines, which I'd expect to reach about 1.2 to 1.5 Gflop/s.

Certainly, rates below 1.0 Gflop/s are slow.
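For anyone who wants to run a similar comparison, the harness can be as simple as the following free-format sketch (the repeat count, the timer, and the body of dot_product_new are my assumptions, not the actual test code):
Code:
program bench
   implicit none
   integer, parameter :: n = 50, reps = 10000000
   real*8  :: a(n), b(n), s, t0, t1
   integer :: k
   call random_number(a)
   call random_number(b)
   s = 0.0d0
   call cpu_time(t0)
   do k = 1, reps
      s = s + dot_product_new(a, b, n)   ! use s so the loop survives
   end do
   call cpu_time(t1)
   ! n multiplies per call, reps calls
   print *, s, '  rate =', dble(n)*dble(reps)/(t1-t0)/1.0d9, ' Gflop/s'
contains
   real*8 function dot_product_new(a, b, n)
      integer :: n, i
      real*8  :: a(n), b(n)
      dot_product_new = 0.0d0
      do i = 1, n
         dot_product_new = dot_product_new + a(i)*b(i)
      end do
   end function dot_product_new
end program bench
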

John
PaulLaidler (Site Admin)
Posted: Sun Oct 16, 2016 8:52 am

John

Hopefully ftn95 /64 /opt will become fast enough. Otherwise (now that ClearWin+ is available separately) users could do their development using ftn95/ClearWin+ and then switch to a faster third-party compiler for release.
mecej4
Posted: Sun Oct 16, 2016 12:01 pm

The Polyhedron benchmarks have become more and more artificial, since they have simply increased repeat counts or increased the number of mesh points to keep up with increased processor speeds. It is more useful to take an actual code that is of current interest/concern and run timings on such a code.

There are instances where I have noticed that code compiled by FTN95 is slower than is reasonable. Here is one, relating to an engineering code that I am currently looking at: the USGS package HST3D ( wwwbrr.cr.usgs.gov/projects/GW_Solute/hst/ ). I found one bug in FTN95 8.05, which I reported ( forums.silverfrost.com/viewtopic.php?t=3343 ), and I modified the data files to work around it. Here are the results (in seconds; Dell laptop with an i7-2720QM, Windows 10 Pro x64, network disconnected and antivirus turned off for the runs).

Software: HST3D 2.2.16

Code:
HST3D 2.2.16
                LF95     GFTN5.4(64) SILV(32)  SILV(64)  IFC17(64)  CVF6.6C
            
Elder_heat      3.810     7.427                            3.097     8.975
Elder_solute    3.176     6.313      178.374     2.876     7.727
Henry           0.407     0.767        1.613     2.565     0.407     1.061
Huyakorn       26.241    60.100                           20.493    73.923
Hydrocoin      39.295    86.114                           35.389   107.848


The compilers and options:
LF95    : Lahey LF95 7.1 (32-bit only), -Kfast,PENTIUM4,SSE2
GFTN5.4 : Gfortran 5.4 (64-bit, Cygwin-64), -O2
SILV    : FTN95 8.05, with /opt /p6 for the 32-bit runs
IFC17   : Intel Parallel Studio 2017 (64-bit), -O2 -Qxhost
CVF6.6C : Compaq Visual Fortran 6.6C (32-bit only), /fast

I have blanked out entries that were over 200 seconds.

It would be very helpful to have tools to hunt down and localize the bottlenecks in the EXEs that FTN95 produces.


JohnCampbell
Posted: Sun Oct 16, 2016 12:47 pm

mecej4,

I find the /timing option to be a good approach. It reports the elapsed time spent in each routine compiled with the /timing option.
From the source code I generate two lists of files, according to whether they are utility routines or code whose delays I want to find. I gather each list into a file of INCLUDE statements, then compile the utility list with /debug and the other with /timing.
This encourages you to break up large subroutines into smaller pieces and get timings for the pieces, which can be a good thing when isolating code to improve.
Basically, I compile with /timing only the source code that I want to review and that does not incur too much overhead (e.g. I exclude functions that are called millions of times). There is a timing-call overhead on entry to and exit from each routine (based on CPU_CLOCK@/RDTSC_VAL@).

The following is the batch file I used for a large simulation of mine:
Code:
now                             >ftn95.tce
del *.obj                      >>ftn95.tce
del *.mod                      >>ftn95.tce
SET TIMINGOPTS=/TMO /DLM ,
ftn95 sim_ver1_tim     /timing >>ftn95.tce
ftn95 sutil            /debug  >>ftn95.tce
ftn95 util             /debug  >>ftn95.tce
slink  main_tim.txt            >>ftn95.tce
type ftn95.tce
dir aaa_tim.exe
rem run aaa_tim.exe
aaa_tim IH_2009_AB_g40_C205.txt  >sim_tim.tce
 


sim_ver1_tim.f95 is simply a list of INCLUDE 'xxx.f95' statements.
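For example, it might contain nothing but (file names made up):
Code:
include 'solver.f95'
include 'loads.f95'
include 'results.f95'
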
main_tim.txt is the link list, which lists the .obj files plus some libraries.
Code:
lo sim_ver1_tim.obj
lo sutil.obj
lo util.obj
le \clearwin\saplib.mem\saplib.lib
map aaa_tim.map
file aaa_tim.exe

The timing output is two files, .tmo and .tmr. The first, aaa_tim.tmo, is a .csv file of accumulated elapsed times, and is easy to review in Excel.
You can see where all the time is being taken and may identify where the code has problems. I find it provides a lot of information at the routine level, which is more helpful than the /profile approach.

I would recommend this approach as worth testing. (I have not yet used this with /64.)

FTN95 is not good with array sections and long strides in array addressing. It can benefit from including SSE vector routines where available.

John
mecej4
Posted: Sun Oct 16, 2016 1:31 pm

Thanks, John. Indeed, /timing provides a nicely formatted report with a lot of useful information. Unfortunately, if I compile the same program (HST3D) with /timing, it enters a timer calibration loop and then aborts with the following message before doing any real calculation.
Code:
Access Violation.
The instruction at address 004bc598 attempted to read from location ace0930c

I suppose I could try /timing without /opt.
mecej4
Posted: Sun Oct 16, 2016 6:34 pm

Quote:
FTN95 is not good with array sections and long strides in array addressing.

John, that was a perfect diagnosis.

After running with /timing (without /opt), I found that 98 percent of the time was consumed in an "envelope storage" solver for positive definite matrices. The solver consists of three subroutines, and the main solver passes pointers to array sections to the subsidiary subroutines. I examined the code and found that all the array sections had unit stride, which means that it would suffice to pass just the first element of each section as the subroutine argument.

These changes reduced the run time from 178 s to 4.6 s for the Elder_solute problem. The new runs showed that FTN95 32-bit produced code that was consistently comparable in speed to that produced by Gfortran.

Perhaps there is scope for improvement in the code that FTN95 generates for passing array sections. I doubt that I would have believed such a drastic slowdown if I had not experienced it myself. Had the sections not had unit stride, the conversion would have been more difficult, so help from the compiler would be valuable.
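To show the kind of change involved, here is a sketch with made-up names (not the actual HST3D source):
Code:
!  Before: the caller passes an array section, for which FTN95
!  apparently builds a temporary copy on every call:
!      CALL TRISOL ( ENV(J1:J2), N )
!  After: the section has unit stride, so pass its first element
!  and let the callee declare an ordinary F77-style dummy array:
!      CALL TRISOL ( ENV(J1), N )
      SUBROUTINE TRISOL (E, N)
      INTEGER N, I
      REAL*8 E(N)              ! aliases ENV(J1:J2) in the caller
      DO I = 2, N              ! placeholder body
         E(I) = E(I) - E(I-1)
      END DO
      END
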
JohnCampbell
Posted: Sun Oct 16, 2016 9:52 pm

It is often the case that using an F77-style wrapper can dramatically improve the run-time performance of these types of calls in FTN95.
I find the opposite with ifort, where such wrappers often increase the run time.
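By a wrapper I mean something like the following minimal sketch (names made up): the section-style entry point receives the whole array plus bounds, and hands the kernel a simple contiguous array.
Code:
      SUBROUTINE SUM_SEC (A, I1, I2, S)
      INTEGER I1, I2
      REAL*8 A(*), S
!     pass the first element of the unit-stride section
      CALL SUM_KERNEL (A(I1), I2-I1+1, S)
      END
      SUBROUTINE SUM_KERNEL (A, N, S)
      INTEGER N, I
      REAL*8 A(N), S
      S = 0.0D0
      DO I = 1, N
         S = S + A(I)
      END DO
      END
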

Years of using FTN95 have biased my programming style towards simple F77-style calls, which work well. The KISS principle certainly got thrown out with F03/F08. Perhaps Eddie will agree?

I think the "bug" in FTN95 is not recognising when array sections are contiguous and that temporary copies are not required, although I'm not sure of cases that break this rule.

Ironic post, given the title of this thread.

John
LitusSaxonicum
Posted: Mon Oct 17, 2016 10:53 am

As you challenged me, John, I will reply, although apart from my previous posting I was keeping my head down on this. And before I start, I freely confess that I am a programming dinosaur. My needs were entirely met by Fortran 77 complemented by a graphics package and a few routines to access DOS functions; for quite a while that graphics package was assembled from a handful of commands to program a plotter and a simple system for putting graphics on a VGA screen. ClearWin+ satisfies my needs for an extension to Fortran, and as 77 is a subset of 95, that's fine by me.
I am not surprised that the genetics of FTN95 mean that the Fortran 77 way of doing things works better than the Fortran 9x style, nor that IFORT is the other way around.
As far as optimisation is concerned, this is again a function of programming style. I think that common subexpression removal (for example) is best done by the programmer, although where the common subexpression is a simple variable it is probably best done by the compiler when it manages registers.
What I understand from your explanation to be the mechanism for passing array subsections seems clumsy in the extreme, requiring big chunks of stack; no wonder it is slow and inefficient. I'd program that with three parts:
Code:
(array_name, lower_limit, limit_higher)

as this seems simple. But if
Code:
array_name (lower_limit..limit_higher)

is your preference, then why it isn't implemented the same way underneath is a puzzle to me.
I particularly wanted to talk about 80-bit precision, round-off and efficiency. It's about 30 years since I understood 8086/7 assembler, but I do remember playing around with an idea that came out of Richard Startz's book on programming the 8087. Take the very common requirement to do something like this:

Code:
      C=0.0D0
      DO 100 I=1,NUMBER
      C = C + A(I)*B(I)
 100  CONTINUE


One could not avoid incrementing I, nor fetching A(I) and B(I) and multiplying them together, but one could avoid storing the running total back in RAM on every pass, which was not only slow but also truncated the result from 80 bits in the 8087 registers to 64 bits (assuming the temporary copy was REAL*8). It only took sensible management of the 8087 stack to hold C, and then there was a tiny overhead instead of a big one. In the days of the 8086/7 even a quite short loop took appreciable time to execute, and my now distant recollection is that doing it the Startz way was at least 10 times faster than the way Microsoft Fortran did it.
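From distant memory, the shape of the idea was roughly this (an illustrative hand-written sketch, certainly not the exact code from the book):
Code:
      FLDZ                     ; C = 0, kept on the 8087 stack
NEXT: FLD   QWORD PTR A[SI]    ; push A(I)
      FMUL  QWORD PTR B[SI]    ; multiply by B(I)
      FADDP ST(1),ST           ; add the product into C and pop
      ADD   SI,8               ; advance 8 bytes (REAL*8)
      LOOP  NEXT               ; decrement CX, repeat while non-zero
      FSTP  QWORD PTR C        ; store C, rounding to 64 bits just once
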
Microsoft Fortran also at one stage had two libraries, one in which an 8087 was assumed present, and one where the functions were done in software. It didn't take a very complicated calculation for the two to produce different answers, and this is all down to round-off.
I’ve no doubt that things are different with on-chip cache RAM and modern processor architectures, but I was left with a very cautious attitude to round-off, and a belief that there were productivity gains available if compiler writers were prepared to take them.
We then went into a period of incredibly rapid development in raw processor speed, so that if one wanted to do things faster it was a matter of buying an updated PC, and the gains from that outstripped what one could get by playing with the software.
That was not always true in the past: at one time it was possible to find oneself using the same computer for 8 to 10 years, and without an optimising compiler. In those days hand-optimisation using simple rules always gave significant run-time improvements, and one just got used to programming in that way.
I also discovered that straightforward programming with lots of white space made source codes easy to understand many years after they were written.
Eddie
mecej4
Posted: Mon Oct 17, 2016 11:57 am

Eddie, if you wish to compile snippets of code (such as your dot-product code) and see the assembly output, there are sites such as www.godbolt.org that enable you to do so in a browser window, without having to install compilers, etc. Since godbolt.org only has C/C++ support, I tried

Code:
double ddot(double *a,double *b,int *n){
double s=0.0;
for(int i=0; i<*n; i++)s+=*a++ * *b++;
return s;
}

with gcc -O2 and obtained this X64-SSE2 assembly listing, which is notably short (comments added by me):
Code:
 mov    edx,DWORD PTR [rdx]           # vector length
 test   edx,edx
 jle    L1
 pxor   xmm0,xmm0                     # s = 0
 xor    eax,eax                       # i = 0
 nop    DWORD PTR [rax+0x0]           # pad for alignment?
 L0:
 movsd  xmm1,QWORD PTR [rdi+rax*8]    # load a(i)
 mulsd  xmm1,QWORD PTR [rsi+rax*8]    # multiply by b(i)
 add    rax,0x1                       # increment i
 cmp    edx,eax                       # test if done
 addsd  xmm0,xmm1                     # update s
 jg     L0
 repz ret
 L1:
 pxor   xmm0,xmm0
 ret

The body of the loop contains only six instructions: a load of a(i), a multiply that fetches b(i) straight from memory, an add that updates s, and three more to increment, test and branch on the index i. The running sum is kept, and finally returned, in xmm0.

This is not yet optimal code, since it is not "vectorized".
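With gcc, adding -ffast-math (which permits reassociation of the sum) should allow the loop to be vectorised at -O3. The core of such a loop would look roughly like this hand-written sketch (illustrative only, not actual compiler output; it assumes an even count, and the two partial sums in xmm0 still have to be combined after the loop):
Code:
 L0:
 movupd xmm1,XMMWORD PTR [rdi+rax*8]   # load a(i), a(i+1)
 movupd xmm2,XMMWORD PTR [rsi+rax*8]   # load b(i), b(i+1)
 mulpd  xmm1,xmm2                      # two products at once
 addpd  xmm0,xmm1                      # two running partial sums
 add    rax,0x2                        # i += 2
 cmp    edx,eax
 jg     L0
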
LitusSaxonicum
Posted: Mon Oct 17, 2016 2:35 pm

Hi Mecej4,
Thanks for the useful link.
It does seem to me that compilers can always be improved, but so too can programmers' stylistic efforts. I'm not sure that computer speeds are going up as fast as they did a few years ago, but I remember my first PC costing about four months' income, whereas the fastest one I could buy retail today costs less than a day's income. If it weren't for the fact that I program in a relatively straightforward, if old-fashioned, style, I certainly wouldn't waste time on hand optimisation today.
Whereas for you, and perhaps John Campbell, every speed gain is worth it, I'm not sure that's the case for everybody, and it is not normally so for me these days. If a response to user interaction is, as far as I can tell, instantaneous, then halving the time taken is rather meaningless. There are also other ways to get the job done: for example, in a structural analysis program solving multiple load cases, it is probably cheaper to run each load case on a separate computer than to labour for months making it faster on a single one.
Round-off and all the issues of finite precision arithmetic continue to perplex many folk (me included, generally speaking), but using the SSEx vectorised arithmetic instead of x87 will give different results for many algorithms, of that I’m sure.
Eddie
DanRRight
Posted: Wed Oct 19, 2016 2:47 pm

But did you notice how Mecej4 improved the performance of FTN95 on one of the examples, making it even 2-3 times faster than Intel VF and GFortran? That means there is still a lot of potential for the developers to make this compiler fly at superspeeds.
DanRRight
Posted: Sat Oct 22, 2016 12:33 am

But I will add: please make the debugger first, and port the SIMPLEPLOT %pl graphics to 64-bit ClearWin+.