forums.silverfrost.com Welcome to the Silverfrost forums
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2390 Location: Yateley, Hants, UK
|
Posted: Fri Oct 14, 2016 2:18 pm Post subject: |
|
|
Thanks Mecej4. I took the hint from Dave Bailey's thread that the 64-bit version didn't support REAL*10 and that SSEx and AVX were in it. Even today, I suspect that some idea of relative performance would be useful: for example, if FTN95 already runs faster in 64-bit than in 32-bit, then it can only get better. If it is a lot slower, then it will be an uphill struggle. There was also a hint in another thread that, at the end of a long calculation, the results diverged. I have no idea what the implications of that are.
Eddie |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
|
Posted: Sat Oct 15, 2016 1:50 am Post subject: |
|
|
Eddie,
The present release of FTN95 /64 has removed support for 80-bit REAL*10.
My understanding is that the only support for SSE/AVX instructions is via library functions, so these are not currently available in code such as that used for the Polyhedron benchmarks.
As a consequence, the present /64 compiler is slow for numerically intensive calculations, as optimisation is not yet provided. The present FTN95 /64 is significantly slower than FTN95 /32 for numerically intensive REAL*8 calculations. The Polyhedron benchmarks would not look good!
What optimisation will be provided in /64 will be interesting, as significant performance gains are available on other compilers via SSE/AVX vectorisation of inner DO loops. We will see whether the FTN95 /64 /OPT project introduces this optimisation, or is limited to replicating the /32 optimisation approaches.
My experience is that there is an insignificant performance penalty (none) in transferring from 32-bit to 64-bit addressing, although more generally the changed instruction set has both positive and negative effects. Changing from INTEGER*4 to INTEGER*8 is not a performance problem.
Certainly, with /64, using larger arrays can result in more memory <> cache delays, especially if arrays are not addressed with a single-variable stride. Poor array addressing and cache overflow are penalised more with /64.
My understanding is that FTN95 /32 uses x87 instructions, although I am not sure to what extent it provides 80-bit precision (are there still 80-bit registers?). Most REAL*8 calculations retain only 64 bits, but if 80-bit registers are used, say for accumulators in a dot product, higher precision can be available.
My understanding is that, for REAL*8, the Fortran 90/95 standard requires calculations to be done to 64 bits, which negates the 80-bit precision. My recollection is that moving from F77 to F95 saw a loss of precision.
If you ran a program that needed REAL*10 (80-bit) precision, then you could notice a slight loss of accuracy. These are unusual cases.
What is interesting is that I have not seen any complaints about the /64 slowdown; I suspect most users are more interested in the extra memory available.
I know that, for graphics, I will not return to 32-bit ClearWin+. |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7933 Location: Salford, UK
|
Posted: Sat Oct 15, 2016 8:52 am Post subject: |
|
|
I have run the Polyhedron benchmark tests and the current non-optimised 64-bit results are on a par with, or somewhat faster than, the corresponding non-optimised 32-bit results. |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
|
Posted: Sun Oct 16, 2016 1:55 am Post subject: |
|
|
Hi Paul,
I was mistaken in my comments, as I was recalling comparison tests of FTN95 /opt with FTN95 /64, both of which are very slow.
I have adapted a test I have been using recently to compare the performance of a dot-product routine. I have tested dot_product_new (a,b,n) with different compilation options and N in the range 1 to 50. (For SSE instructions, larger N values give better performance.)
FTN95 /32   0.335 Gflop/s (10^9 floating-point multiplies per second)
FTN95 /64   0.423
FTN95 /OPT  0.743
FTN95 /64   1.225   (using DOT_PRODUCT8@)
I should resurrect the FTN95 /32 test with davidb's SSE routines, which I'd expect to reach about 1.2 to 1.5.
Certainly, Gflop/s values below 1.0 are slow.
John |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7933 Location: Salford, UK
|
Posted: Sun Oct 16, 2016 8:52 am Post subject: |
|
|
John
Hopefully ftn95 /64 /opt will become fast enough; otherwise (now that ClearWin+ is available separately) users could do their development using ftn95/ClearWin+ and then switch to a faster third-party compiler for release. |
mecej4
Joined: 31 Oct 2006 Posts: 1891
|
Posted: Sun Oct 16, 2016 12:01 pm Post subject: |
|
|
The Polyhedron benchmarks have become more and more artificial, since they have simply increased repeat counts or the number of mesh points to keep up with increasing processor speeds. It is more useful to take an actual code that is of current interest or concern and run timings on that.
There are instances where I have noticed that code compiled by FTN95 is slower than is reasonable. Here is one instance, relating to an engineering code that I am currently looking at, namely the package HST3D ( wwwbrr.cr.usgs.gov/projects/GW_Solute/hst/ ). I found one bug in FTN95 8.05, which I reported ( forums.silverfrost.com/viewtopic.php?t=3343 ), and I modified the data files to work around it. Here are the results (in seconds; Dell laptop with i7-2720QM, Windows 10 Pro x64, network disconnected and antivirus turned off for the runs).
Code: | HST3D 2.2.16   (blanked entries shown as -)
               LF95   GFTN5.4(64)  SILV(32)  SILV(64)  IFC17(64)  CVF6.6C
Elder_heat     3.810     7.427        -         -        3.097      8.975
Elder_solute   3.176     6.313     178.374      -        2.876      7.727
Henry          0.407     0.767       1.613    2.565      0.407      1.061
Huyakorn      26.241    60.100        -         -       20.493     73.923
Hydrocoin     39.295    86.114        -         -       35.389    107.848
|
The compilers:
LF95    : Lahey LF95 7.1 (32-bit only), -Kfast,PENTIUM4,SSE2
GFTN5.4 : Gfortran 5.4, 64-bit, Cygwin-64, -O2
SILV    : FTN95 8.05, /opt /p6 for 32-bit
IFC17   : Intel Parallel Studio 2017, 64-bit, -O2 -Qxhost
CVF6.6C : Compaq Visual Fortran (32-bit only), /fast
I have blanked out entries that were over 200 seconds.
It would be very helpful to have tools to hunt down and localize the bottlenecks in the EXEs that FTN95 produces.
Last edited by mecej4 on Mon Oct 17, 2016 12:43 pm; edited 5 times in total |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
|
Posted: Sun Oct 16, 2016 12:47 pm Post subject: |
|
|
mecej4,
I find the /timing option to be a good approach. It reports the elapsed time associated with each routine compiled with the /timing option.
From the source code, I generate two lists of files, based on whether they are utility routines or code whose delays I want to investigate. I then include these files as a list of INCLUDE statements, and compile the first with /debug and the second with /timing.
This encourages you to break up large subroutines into smaller pieces and get timings for each, which can be a good thing when isolating code to improve.
Basically, I only compile with /timing the source code that I want to review and that does not take too long to run (e.g. excluding functions that are called millions of times), since there is a timing-call overhead on entry to and exit from each routine (based on cpu_clock@/RDTSC_VAL@).
The following is a batch file I used for a large simulation I have
Code: | now >ftn95.tce
del *.obj >>ftn95.tce
del *.mod >>ftn95.tce
SET TIMINGOPTS=/TMO /DLM ,
ftn95 sim_ver1_tim /timing >>ftn95.tce
ftn95 sutil /debug >>ftn95.tce
ftn95 util /debug >>ftn95.tce
slink main_tim.txt >>ftn95.tce
type ftn95.tce
dir aaa_tim.exe
rem run aaa_tim.exe
aaa_tim IH_2009_AB_g40_C205.txt >sim_tim.tce
|
sim_ver1_tim.f95 is a list of INCLUDE 'xxx.f95' statements.
main_tim.txt is the link list, which lists the .obj files plus some libraries.
Code: | lo sim_ver1_tim.obj
lo sutil.obj
lo util.obj
le \clearwin\saplib.mem\saplib.lib
map aaa_tim.map
file aaa_tim.exe |
The timing output is two files, .tmo and .tmr; one is aaa_tim.tmo, which is a .csv file of accumulated elapsed times. It is easy to review in Excel.
You can see where all the time is being taken and may identify where the code has problems. I find it provides a lot of information at the routine level, which is more helpful than the /profile approach.
I would recommend this approach as worth testing. (I have not yet used it with /64.)
FTN95 is not good with array sections and long strides in array addressing. It can benefit from including SSE vector routines where available.
John |
mecej4
Joined: 31 Oct 2006 Posts: 1891
|
Posted: Sun Oct 16, 2016 1:31 pm Post subject: |
|
|
Thanks, John. Indeed, /timing provides a nicely formatted report with a lot of useful information. Unfortunately, if I compile the same program (HST3D) with /timing, it enters a timer-calibration loop and then aborts, before doing any real calculation, with the following message.
Code: | Access Violation.
The instruction at address 004bc598 attempted to read from location ace0930c |
I suppose I could try /timing without /opt. |
mecej4
Joined: 31 Oct 2006 Posts: 1891
|
Posted: Sun Oct 16, 2016 6:34 pm Post subject: |
|
|
Quote: | FTN95 is not good with array sections and long strides in array addressing. |
John, that was a perfect diagnosis.
After running with /timing (without /opt), I found that 98 percent of the time was consumed in an "envelope storage" solver for positive definite matrices. The solver consists of three subroutines, and the main solver passes pointers to array sections to the subsidiary subroutines. I examined the code and found that all the array sections had unit stride, which means that it would suffice to pass just the first element of each section as the subroutine argument.
These changes reduced the run time from 178 s to 4.6 s for the Elder_solute problem. The new runs showed that FTN95 32-bit produced code that was consistently comparable in speed to that produced by Gfortran.
Perhaps, there is scope for improvement in the code that FTN95 generates for passing array sections. I doubt that I would have believed the drastic slowdown if I had not experienced it myself. Had the sections been non-unit-stride sections, the conversion would have been more difficult, so help from the compiler would be valuable. |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
|
Posted: Sun Oct 16, 2016 9:52 pm Post subject: |
|
|
It is often the case that using an F77-style wrapper can dramatically improve the run-time performance of these types of calls in FTN95.
I find the opposite with ifort, where such wrappers will often increase the run time.
Years of using FTN95 have biased my programming style towards simple F77-style calls, which work well. The KISS principle certainly got thrown out with F03/08. Perhaps Eddie will agree?
I think the "bug" in FTN95 is not recognising when array sections are contiguous and temporary copies are not required, although I'm not sure which cases break this rule.
An ironic post, given the title of this thread.
John |
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2390 Location: Yateley, Hants, UK
|
Posted: Mon Oct 17, 2016 10:53 am Post subject: |
|
|
As you challenged me, John, I will reply, although apart from my previous posting I was keeping my head down on this. And before I start, I freely confess that I am a programming dinosaur. My needs were entirely met by Fortran 77 complemented by a graphics package and a few routines to access DOS functions, and for quite a while that graphics package was assembled from a handful of commands to program a plotter and a simple system for putting graphics on a VGA screen. ClearWin+ satisfies my needs for an extension to Fortran, and as 77 is a subset of 95, that's fine by me.
I am not surprised that the genetics of FTN95 mean that the Fortran 77 way of doing things works better than the Fortran 9x style, nor that IFORT is the other way around.
As far as optimisation is concerned, this is again a function of programming style. I think that common subexpression removal (for example) is best done by the programmer, although where the common subexpression is a simple variable, it is probably best done by the compiler when it manages registers.
What I understand from your explanation to be the mechanism for passing array subsections seems clumsy in the extreme, requiring big chunks of stack, and no wonder it's slow and inefficient. I'd program that with three parts:
Code: | (array_name, lower_limit, limit_higher) |
as this seems simple. But if
Code: | array_name (lower_limit..limit_higher) |
is your preference, then why it isn't implemented the same way internally just causes puzzlement in my mind.
I particularly wanted to talk about 80-bit precision, round-off and efficiency. It's about 30 years since I understood 8086/8087 assembler, but I do remember playing around with an idea that came out of Richard Startz's book on programming the 8087. Take the very common requirement to do something like this:
Code: | C=0.0D0
DO 100 I=1,NUMBER
C = C + A(I)*B(I)
100 CONTINUE |
One could not avoid incrementing I, nor fetching A(I) and B(I) and multiplying them together, but one could avoid storing the result back in RAM, which was not only slow but also truncated the result from 80 bits in the 8087 registers to 64 bits (assuming the temporary copy was REAL*8). It only took sensible management of the 8087 stack to hold C, and then there was only a tiny overhead instead of a big one. In the days of the 8086/8087, even a quite short loop took appreciable time to execute, and my now distant recollection is that doing it the Startz way was at least 10 times faster than the way Microsoft Fortran did it.
Microsoft Fortran also at one stage had two libraries, one in which an 8087 was assumed present, and one where the floating-point operations were done in software. It didn't take a very complicated calculation for the two to produce different answers, and this is all down to round-off.
I've no doubt that things are different with on-chip cache RAM and modern processor architectures, but I was left with a very cautious attitude to round-off, and a belief that there were productivity gains available if compiler writers were prepared to take them.
We then went into a period of incredibly rapid development in raw processor speed, so that if one wanted to do things faster it was a matter of buying an updated PC, and the gains from that outstripped what one could get by playing with the software.
That was not always true in the past: at one time it was possible to find oneself using the same computer for 8 to 10 years, and without an optimising compiler. In those days, hand-optimisation using simple rules always gave significant run-time improvements, and one just got used to programming in that way.
I also discovered that straightforward programming with lots of white space made source codes easy to understand many years after they were written.
Eddie |
mecej4
Joined: 31 Oct 2006 Posts: 1891
|
Posted: Mon Oct 17, 2016 11:57 am Post subject: |
|
|
Eddie, if you wish to compile snippets of code (such as your dot-product loop) and see the assembly output, there are sites such as www.godbolt.org that enable you to do so in a browser window, without having to install compilers, etc. Since godbolt.org only has C/C++ support, I tried
Code: | double ddot(double *a,double *b,int *n){
double s=0.0;
for(int i=0; i<*n; i++)s+=*a++ * *b++;
return s;
} |
with gcc -O2 and obtained this X64-SSE2 assembly listing, which is notably short (comments added by me):
Code: | mov edx,DWORD PTR [rdx] # vector length
test edx,edx
jle L1
pxor xmm0,xmm0 # s = 0
xor eax,eax # i = 0
nop DWORD PTR [rax+0x0] # pad for alignment?
L0:
movsd xmm1,QWORD PTR [rdi+rax*8] # load a(i)
mulsd xmm1,QWORD PTR [rsi+rax*8] # multiply by b(i)
add rax,0x1 # increment i
cmp edx,eax # test if done
addsd xmm0,xmm1 # update s
jg L0
repz ret
L1:
pxor xmm0,xmm0
ret |
The body of the loop contains only four working instructions, including the memory fetches for a(i) and b(i) and the multiply-and-accumulate, plus two more to increment and test the index i. The result is kept and returned in xmm0.
This is not yet optimal code, since it is not "vectorized". |
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2390 Location: Yateley, Hants, UK
|
Posted: Mon Oct 17, 2016 2:35 pm Post subject: |
|
|
Hi Mecej4,
Thanks for the useful link.
It does seem to me that compilers can always be improved, but so too can programmers' stylistic efforts. I'm not sure that computer speeds are going up as fast as they did a few years ago, but I remember my first PC costing about four months' income, whereas the fastest one I could buy retail today costs me less than a day's income; and if it wasn't for the fact that I program in a relatively straightforward, if old-fashioned, style I certainly wouldn't waste time on hand optimisation today.
Whereas for you, and perhaps John Campbell, every speed gain is worth it, I'm not sure that's always the case for everybody, and it is not normally so for me these days. If a response to user interaction is, as far as I can tell, instantaneous, then halving the time taken is rather meaningless. There are also other ways to get the job done: for example, in a structural analysis program solving multiple load cases, it is probably cheaper to run each load case on a separate computer than to labour for months to make it faster on a single computer.
Round-off and all the issues of finite-precision arithmetic continue to perplex many folk (me included, generally speaking), but using SSEx vectorised arithmetic instead of x87 will give different results for many algorithms, of that I'm sure.
Eddie |
DanRRight
Joined: 10 Mar 2008 Posts: 2826 Location: South Pole, Antarctica
|
Posted: Wed Oct 19, 2016 2:47 pm Post subject: |
|
|
But have you noticed how mecej4 improved the performance of FTN95 on one of the examples, making it even 2-3 times faster than Intel VF and GFortran? That means there is still a lot of potential for the developers to make this compiler fly at super speeds. |
DanRRight
Joined: 10 Mar 2008 Posts: 2826 Location: South Pole, Antarctica
|
Posted: Sat Oct 22, 2016 12:33 am Post subject: |
|
|
But I will add: please make the debugger first, and port Simpleplot %pl to 64-bit ClearWin+. |
Powered by phpBB © 2001, 2005 phpBB Group
|