forums.silverfrost.com Welcome to the Silverfrost forums
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2390 Location: Yateley, Hants, UK
|
Posted: Fri Oct 14, 2016 2:18 pm Post subject: |
|
|
Thanks Mecej4. I took the hint from Dave Bailey's thread that the 64-bit version didn't support REAL*10 and that SSEx and AVX were in it. Even today, I suspect that some idea of relative performance would be useful: for example, if FTN95 already runs faster in 64-bit than in 32-bit, then it can only get better. If it is a lot slower, then it will be an uphill struggle. There was also a hint in another thread that, at the end of a long calculation, the results diverged. I have no idea what the implications of that are.
Eddie |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
|
Posted: Sat Oct 15, 2016 1:50 am Post subject: |
|
|
Eddie,
The present release of FTN95 /64 has removed support for 80-bit REAL*10.
My understanding is that the only support for SSE/AVX instructions is via library functions, so these are not currently available in code such as that used for the Polyhedron benchmarks.
As a consequence, the present /64 compiler is slow for numerically intensive calculations, as optimisation is not yet provided. The present FTN95 /64 is significantly slower than FTN95 /32 for numerically intensive REAL*8 calculations. The Polyhedron benchmarks would not look good!
What optimisation will be provided in /64 will be interesting, as significant performance gains are available on other compilers via SSE/AVX vectorisation of inner DO loops. We will see whether the FTN95 /64 /OPT project introduces this optimisation, or is limited to replicating the /32 optimisation approaches.
My experience is that there is an insignificant performance penalty (none) in transferring from 32-bit to 64-bit addressing, although more generally the changed instruction set has both positive and negative effects. Changing from INTEGER*4 to INTEGER*8 is not a performance problem.
Certainly, with /64, using larger arrays can result in more memory <> cache delays, especially if arrays are not addressed with a single-variable stride. Poor array addressing and cache overflow are penalised more with /64.
My understanding is that FTN95 /32 uses x87 instructions, although I am not sure to what extent it provides 80-bit precision (are there still 80-bit registers?). Most REAL*8 calculations retain only 64 bits, but if 80-bit registers are used, say for accumulators in a dot product, higher precision can be available.
My understanding is that, for REAL*8, the Fortran 90/95 standard requires calculations to be done to 64 bits, which negates the 80-bit precision. My recollection is that moving from F77 to F95 saw a loss of precision.
If you ran a program that needed REAL*10 (80-bit) precision, then you could notice a slight loss of accuracy. These are unusual cases.
What is interesting is that I have not seen any complaints about the /64 slowdown; I suspect most users are more interested in the extra memory available.
I know that, for graphics, I will not return to 32-bit ClearWin+. |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7933 Location: Salford, UK
|
Posted: Sat Oct 15, 2016 8:52 am Post subject: |
|
|
I have run the Polyhedron benchmark tests and the current non-optimised 64-bit results are on a par with, or somewhat faster than, the corresponding non-optimised 32-bit results. |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
|
Posted: Sun Oct 16, 2016 1:55 am Post subject: |
|
|
Hi Paul,
I was mistaken in my comments, as I was recalling comparison tests of FTN95 /opt with FTN95 /64, both of which are very slow.
I have adapted a test I have been using recently to compare the performance of a dot-product routine. I have tested dot_product_new (a,b,n) with different compilation options and N in the range 1 to 50. (For SSE instructions, larger N values give better performance.)
FTN95 /32   0.335 Gflop/s (10^9 floating-point multiplies per second)
FTN95 /64   0.423
FTN95 /OPT  0.743
FTN95 /64   1.225   (using DOT_PRODUCT8@)
I should resurrect the FTN95 /32 test with davidb's SSE routines, which I'd expect to reach about 1.2 to 1.5.
Certainly, Gflop/s values below 1.0 are slow.
John |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7933 Location: Salford, UK
|
Posted: Sun Oct 16, 2016 8:52 am Post subject: |
|
|
John
Hopefully ftn95 /64 /opt will become fast enough; otherwise (now that ClearWin+ is available separately) users could do their development using ftn95/ClearWin+ and then switch to a faster third-party compiler for release. |
mecej4
Joined: 31 Oct 2006 Posts: 1891
|
Posted: Sun Oct 16, 2016 12:01 pm Post subject: |
|
|
The Polyhedron benchmarks have become more and more artificial, since they have simply increased repeat counts or the number of mesh points to keep up with increasing processor speeds. It is more useful to take an actual code that is of current interest or concern and run timings on that.
There are instances where I have noticed that code compiled by FTN95 is slower than is reasonable. Here is one instance, relating to an engineering code that I am currently looking at, namely the package HST3D ( wwwbrr.cr.usgs.gov/projects/GW_Solute/hst/ ). I found one bug in FTN95 8.05, which I reported ( forums.silverfrost.com/viewtopic.php?t=3343 ), and I modified the data files to work around it. Here are the results (in seconds; Dell laptop with i7-2720QM, Windows 10 Pro x64, network disconnected and antivirus turned off for the runs).
Code: | HST3D 2.2.16   (blanked entries shown as -)
               LF95   GFTN5.4(64)  SILV(32)  SILV(64)  IFC17(64)  CVF6.6C
Elder_heat     3.810     7.427        -         -        3.097      8.975
Elder_solute   3.176     6.313     178.374      -        2.876      7.727
Henry          0.407     0.767       1.613    2.565      0.407      1.061
Huyakorn      26.241    60.100        -         -       20.493     73.923
Hydrocoin     39.295    86.114        -         -       35.389    107.848
|
The compilers:
LF95    : Lahey LF95 7.1 (32-bit only), -Kfast,PENTIUM4,SSE2
GFTN5.4 : Gfortran 5.4, 64-bit, Cygwin-64, -O2
SILV    : FTN95 8.05, /opt /p6 for 32-bit
IFC17   : Intel Parallel Studio 2017, 64-bit, -O2 -Qxhost
CVF6.6C : Compaq Visual Fortran (32-bit only), /fast
I have blanked out entries that were over 200 seconds.
It would be very helpful to have tools to hunt down and localize the bottlenecks in the EXEs that FTN95 produces.
Last edited by mecej4 on Mon Oct 17, 2016 12:43 pm; edited 5 times in total |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
|
Posted: Sun Oct 16, 2016 12:47 pm Post subject: |
|
|
mecej4,
I find the /timing option to be a good approach. It reports the elapsed time associated with each routine compiled with the /timing option.
From the source code, I generate two lists of files, based on whether they are utility routines or code whose delays I want to investigate. I then include these files as a list of INCLUDE statements, and compile the first with /debug and the second with /timing.
This encourages you to break up large subroutines into smaller pieces and get timings for each, which can be a good thing when isolating code to improve.
Basically, I only compile with /timing the source code that I want to review and that does not take too long to run (e.g. excluding functions that are called millions of times), since there is a timing-call overhead on entry to and exit from each routine (based on cpu_clock@/RDTSC_VAL@).
The following is a batch file I used for a large simulation I have
Code: | now >ftn95.tce
del *.obj >>ftn95.tce
del *.mod >>ftn95.tce
SET TIMINGOPTS=/TMO /DLM ,
ftn95 sim_ver1_tim /timing >>ftn95.tce
ftn95 sutil /debug >>ftn95.tce
ftn95 util /debug >>ftn95.tce
slink main_tim.txt >>ftn95.tce
type ftn95.tce
dir aaa_tim.exe
rem run aaa_tim.exe
aaa_tim IH_2009_AB_g40_C205.txt >sim_tim.tce
|
sim_ver1_tim.f95 is a list of INCLUDE 'xxx.f95' statements.
main_tim.txt is the link list, which lists the .obj files plus some libraries.
Code: | lo sim_ver1_tim.obj
lo sutil.obj
lo util.obj
le \clearwin\saplib.mem\saplib.lib
map aaa_tim.map
file aaa_tim.exe |
The timing output is two files, .tmo and .tmr; one is aaa_tim.tmo, which is a .csv file of accumulated elapsed times. It is easy to review in Excel.
You can see where all the time is being taken and may identify where the code has problems. I find it provides a lot of information at the routine level, which is more helpful than the /profile approach.
I would recommend this approach as worth testing. (I have not yet used it with /64.)
FTN95 is not good with array sections and long strides in array addressing. It can benefit from including SSE vector routines where available.
John |
mecej4
Joined: 31 Oct 2006 Posts: 1891
|
Posted: Sun Oct 16, 2016 1:31 pm Post subject: |
|
|
Thanks, John. Indeed, /timing provides a nicely formatted report with a lot of useful information. Unfortunately, if I compile the same program (HST3D) with /timing, it enters a timer-calibration loop and then aborts, before doing any real calculation, with the following message.
Code: | Access Violation.
The instruction at address 004bc598 attempted to read from location ace0930c |
I suppose I could try /timing without /opt. |
mecej4
Joined: 31 Oct 2006 Posts: 1891
|
Posted: Sun Oct 16, 2016 6:34 pm Post subject: |
|
|
Quote: | FTN95 is not good with array sections and long strides in array addressing. |
John, that was a perfect diagnosis.
After running with /timing (without /opt), I found that 98 percent of the time was consumed in an "envelope storage" solver for positive definite matrices. The solver consists of three subroutines, and the main solver passes pointers to array sections to the subsidiary subroutines. I examined the code and found that all the array sections had unit stride, which means that it would suffice to pass just the first element of each section as the subroutine argument.
These changes reduced the run time from 178 s to 4.6 s for the Elder_solute problem. The new runs showed that FTN95 32-bit produced code that was consistently comparable in speed to that produced by Gfortran.
Perhaps, there is scope for improvement in the code that FTN95 generates for passing array sections. I doubt that I would have believed the drastic slowdown if I had not experienced it myself. Had the sections been non-unit-stride sections, the conversion would have been more difficult, so help from the compiler would be valuable. |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
|
Posted: Sun Oct 16, 2016 9:52 pm Post subject: |
|
|
It is often the case that using an F77-style wrapper can dramatically improve the run-time performance of these types of calls in FTN95.
I find the opposite with ifort, where such wrappers will often increase the run time.
Years of using FTN95 have biased my programming style towards simple F77-style calls, which work well. The KISS principle certainly got thrown out with F03/08. Perhaps Eddie will agree?
I think the "bug" in FTN95 is not recognising when array sections are contiguous and temporary copies are not required, although I'm not sure which cases break this rule.
An ironic post, given the title of this thread.
John |
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2390 Location: Yateley, Hants, UK
|
Posted: Mon Oct 17, 2016 10:53 am Post subject: |
|
|
As you challenged me, John, I will reply, although apart from my previous posting I was keeping my head down on this. And before I start, I freely confess that I am a programming dinosaur. My needs were entirely met by Fortran 77 complemented by a graphics package and a few routines to access DOS functions, and for quite a while that graphics package was assembled from a handful of commands to program a plotter and a simple system for putting graphics on a VGA screen. ClearWin+ satisfies my needs for an extension to Fortran, and as 77 is a subset of 95, that's fine by me.
I am not surprised that the genetics of FTN95 mean that the Fortran 77 way of doing things works better than the Fortran 9x style, nor that IFORT is the other way around.
As far as optimisation is concerned, this is again a function of programming style. I think that common subexpression removal (for example) is best done by the programmer, although where the common subexpression is a simple variable, it is probably best done by the compiler when it manages registers.
What I understand from your explanation to be the mechanism for passing array subsections seems clumsy in the extreme, requiring big chunks of stack, and no wonder it's slow and inefficient. I'd program that with three parts:
Code: | (array_name, lower_limit, limit_higher) |
as this seems simple. But if
Code: | array_name (lower_limit..limit_higher) |
is your preference, then why it isn't implemented the same way internally just causes puzzlement in my mind.
I particularly wanted to talk about 80-bit precision, round-off and efficiency. It's about 30 years since I understood 8086/8087 assembler, but I do remember playing around with an idea that came out of Richard Startz's book on programming the 8087. Take the very common requirement to do something like this:
Code: | C=0.0D0
DO 100 I=1,NUMBER
C = C + A(I)*B(I)
100 CONTINUE |
One could not avoid incrementing I, nor fetching A(I) and B(I) and multiplying them together, but one could avoid storing the result back in RAM, which was not only slow but also truncated the result from 80 bits in the 8087 registers to 64 bits (assuming the temporary copy was REAL*8). It only took sensible management of the 8087 stack to hold C, and then there was only a tiny overhead instead of a big one. In the days of the 8086/8087, even a quite short loop took appreciable time to execute, and my now distant recollection is that doing it the Startz way was at least 10 times faster than the way Microsoft Fortran did it.
Microsoft Fortran also at one stage had two libraries, one in which an 8087 was assumed present, and one where the floating-point operations were done in software. It didn't take a very complicated calculation for the two to produce different answers, and this is all down to round-off.
I've no doubt that things are different with on-chip cache RAM and modern processor architectures, but I was left with a very cautious attitude to round-off, and a belief that there were productivity gains available if compiler writers were prepared to take them.
We then went into a period of incredibly rapid development in raw processor speed, so that if one wanted to do things faster it was a matter of buying an updated PC, and the gains from that outstripped what one could get by playing with the software.
That was not always true in the past: at one time it was possible to find oneself using the same computer for 8 to 10 years, and without an optimising compiler. In those days, hand-optimisation using simple rules always gave significant run-time improvements, and one just got used to programming in that way.
I also discovered that straightforward programming with lots of white space made source codes easy to understand many years after they were written.
Eddie |
mecej4
Joined: 31 Oct 2006 Posts: 1891
|
Posted: Mon Oct 17, 2016 11:57 am Post subject: |
|
|
Eddie, if you wish to compile snippets of code (such as your dot-product loop) and see the assembly output, there are sites such as www.godbolt.org that enable you to do so in a browser window, without having to install compilers, etc. Since godbolt.org only has C/C++ support, I tried
Code: | double ddot(double *a,double *b,int *n){
double s=0.0;
for(int i=0; i<*n; i++)s+=*a++ * *b++;
return s;
} |
with gcc -O2 and obtained this X64-SSE2 assembly listing, which is notably short (comments added by me):
Code: | mov edx,DWORD PTR [rdx] # vector length
test edx,edx
jle L1
pxor xmm0,xmm0 # s = 0
xor eax,eax # i = 0
nop DWORD PTR [rax+0x0] # pad for alignment?
L0:
movsd xmm1,QWORD PTR [rdi+rax*8] # load a(i)
mulsd xmm1,QWORD PTR [rsi+rax*8] # multiply by b(i)
add rax,0x1 # increment i
cmp edx,eax # test if done
addsd xmm0,xmm1 # update s
jg L0
repz ret
L1:
pxor xmm0,xmm0
ret |
The body of the loop contains only four working instructions, including the memory fetches for a(i) and b(i) and the multiply-and-accumulate, plus two more to increment and test the index i. The result is kept and returned in xmm0.
This is not yet optimal code, since it is not "vectorized". |
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2390 Location: Yateley, Hants, UK
|
Posted: Mon Oct 17, 2016 2:35 pm Post subject: |
|
|
Hi Mecej4,
Thanks for the useful link.
It does seem to me that compilers can always be improved, but so too can programmers' stylistic efforts. I'm not sure that computer speeds are going up as fast as they did a few years ago, but I remember my first PC costing about four months' income, whereas the fastest one I could buy retail today costs me less than a day's income; and if it wasn't for the fact that I program in a relatively straightforward, if old-fashioned, style I certainly wouldn't waste time on hand optimisation today.
Whereas for you, and perhaps John Campbell, every speed gain is worth it, I'm not sure that's always the case for everybody, and it is not normally so for me these days. If a response to user interaction is, as far as I can tell, instantaneous, then halving the time taken is rather meaningless. There are also other ways to get the job done: for example, in a structural analysis program solving multiple load cases, it is probably cheaper to run each load case on a separate computer than to labour for months to make it faster on a single computer.
Round-off and all the issues of finite-precision arithmetic continue to perplex many folk (me included, generally speaking), but using SSEx vectorised arithmetic instead of x87 will give different results for many algorithms, of that I'm sure.
Eddie |
DanRRight
Joined: 10 Mar 2008 Posts: 2826 Location: South Pole, Antarctica
|
Posted: Wed Oct 19, 2016 2:47 pm Post subject: |
|
|
But have you noticed how mecej4 improved the performance of FTN95 on one of the examples, making it even 2-3 times faster than Intel VF and GFortran? That means there is still a lot of potential for the developers to make this compiler fly at super speeds. |
DanRRight
Joined: 10 Mar 2008 Posts: 2826 Location: South Pole, Antarctica
|
Posted: Sat Oct 22, 2016 12:33 am Post subject: |
|
|
But I will add: please make the debugger first, and port Simpleplot %pl to 64-bit ClearWin+. |
Powered by phpBB © 2001, 2005 phpBB Group
|