forums.silverfrost.com

mecej4 · Joined: 31 Oct 2006 Posts: 1886

Paul's response addresses one of the two related problems that we reported. The other problem, i.e., a cost of over 10,000 CPU cycles to handle an underflow interrupt with default settings, remains to be addressed.

More poking around has caused me to stumble on another work-around. Isolate the subprograms with the highest incidence of underflows. Compile the corresponding source file(s) separately, build a DLL from them, and export the subroutine/function entry points. In your application, instead of linking with the OBJ file(s) for those routines, link with the DLL (SLink can link directly with DLLs). Here is a test driver:

PaulLaidler · Posted: Tue Feb 24, 2015 5:36 pm

There may be some advantage to using SSE instructions but the primary question may relate to the handling of underflows.

It may be that the third party compiler does not handle underflows and hence has no associated overhead (or maybe the SSE instructions do not generate underflows). This raises the question: is it safe to ignore underflows? The authors of FTN95 took the view that, for safety, underflows need to be handled (by clearing the relevant processor flags and setting the value to zero). Having said this, it does appear that the cost is unduly high, suggesting that there is more to it than this.

I am not sure that these issues can be resolved in the short term, and they disappear with 64 bit FTN95, which for now is our primary focus.

For those who would prefer not to wait, run times might be reduced by programming out the underflows before they happen.
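One way to read "programming out the underflows" is to test the operands before a multiplication and substitute zero whenever the product could fall below the normal range, so the hardware never raises the exception. The sketch below illustrates the idea for IEEE doubles. It is written in Python only for brevity, `safe_mul` is a hypothetical helper name, and in Fortran the EXPONENT and MINEXPONENT intrinsics would presumably play the roles of `math.frexp` and `sys.float_info.min_exp`.

```python
import math
import sys

# For IEEE doubles this is -1021: 2**(MIN_EXP - 1) is the smallest normal value.
MIN_EXP = sys.float_info.min_exp

def safe_mul(a: float, b: float) -> float:
    """Return a*b, or 0.0 if the product might be smaller than the smallest normal double."""
    if a == 0.0 or b == 0.0:
        return 0.0
    # frexp gives x = m * 2**e with 0.5 <= |m| < 1, so |a*b| >= 2**(ea + eb - 2).
    _, ea = math.frexp(a)
    _, eb = math.frexp(b)
    if ea + eb <= MIN_EXP:
        # The product could be subnormal; zero it without performing the multiply.
        # This is conservative: a few tiny but still normal products are zeroed too.
        return 0.0
    return a * b

print(safe_mul(2.0, 3.0))        # ordinary case, multiplied as usual
print(safe_mul(1e-200, 1e-200))  # would underflow, so 0.0 is returned instead
```

The test costs a few integer operations per multiply, so it only pays off in loops where underflows are frequent and each one is expensive to handle.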

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

Paul,

You stated:

PaulLaidler · Posted: Wed Feb 25, 2015 8:23 am

John

As I understand it, the problem presented in the code provided has been identified as occurring with WRITE and INTEGER*8.

If there is another problem then we would need different code that illustrates this.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

Paul,

The error report I am getting with fpe_example.zip is:

mecej4 · Joined: 31 Oct 2006 Posts: 1886

John, as of now it appears that your best recourse is to (i) avoid using the environment variable SALFENVAR during compilation or execution, (ii) compile using the options /opt /p6, leaving out sse_lib.f90 (see below for its replacement), (iii) link with the command slink *.obj sse_sal.dll /out:profile, and (iv) run with /l:5 as much as possible.

I think that at this time your test codes have fulfilled their purpose. Using the lessons learned from this exercise, you could go ahead and modify your real FEM application to call the routines within SSE_SAL.DLL. If the FEM code needs additional BLAS routines, SSE2 versions of those could be prepared.

The assembly source for sse_sal.dll, the assembled OBJ file and a DLL built from it are contained in https://dl.dropboxusercontent.com/u/88464747/sse_sal.zip . The DLL will serve as the replacement for sse_lib.f90. I derived the assembly source from DavidB's source file, and modified it to remove all X87 instructions except the FLD instructions mandated by the Microsoft 32-bit ABI for functions that return float/double values.

If you follow this procedure, the only underflows that will remain a problem will be those generated from code within your Fortran sources.

Following this procedure, I obtained the following timing results for /l:5 /m:-1, with various values of /d:nnn.

d   t_gauss   t_red1b
-   -------   -------
2    16.85     11.07
3    17.40     11.29
4    17.68     11.37
5    17.98     11.45
6    17.83     11.53

These times do not exceed twice the values that I was able to obtain with optimizing compilers.

DanRRight · Posted: Wed Feb 25, 2015 5:52 pm

"These times do not exceed twice the values that I was able to obtain with optimizing compilers"

Mecej4, I am still curious why your tests show a loss of a factor of two to three versus other compilers. If you look at the Polyhedron benchmark

http://www.polyhedron.com/fortran-compiler-comparisons/fortran-execution-time-benchmarks-64-bit-windows-7-intel-core-i5-2500k

you can see that sometimes (and specifically on linear algebra, which is what John is doing) FTN95 performance is within ten to twenty percent of the top. Can you guys find where specifically this compiler loses steam versus other compilers? The FFT benchmark is also as fast as the others, which means that the compiler can go very fast.

There are a few examples where the loss of speed does reach an outrageous 700%, and one example trails by 30x (!!!). Was it due to underflows? Other terrible things? The site has the Fortran sources for all the benchmarks; please look if you have time. Is it possible to make a few short test examples for Silverfrost to focus on when optimizing? Are those speedups of other compilers due to SSE? If not, then high precision timers can easily identify the slow part of the code, which is typically very small. I remember that when Intel became very fast in the Polyhedron benchmarks a decade ago, Lahey and Absoft dramatically improved the speeds of their compilers within just a few weeks or months.

You could probably contribute to the TEST_FPU2 test there, which is also a set of different linear algebra methods for different matrix sizes, by adding an SSE2 option.

Paul, what is the primary focus in developing the 64-bit compiler: is it mostly to get a larger address space while keeping the FTN95 core features and speed intact, to add 2003/2008 features, or to improve overall performance (as on those Polyhedron tests)? As a comment, it was the right decision by the developers to consider underflow an error. Like /undef, an underflow can hint at a hidden error. But they should also provide a switch to ignore underflows.

PaulLaidler · Posted: Wed Feb 25, 2015 6:15 pm

John

It is quite possible that this fault arises because the numbers being output are no longer valid. For example IO_convert_long_double_to_ascii may not be designed to handle denormals.
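For readers unfamiliar with the term, a denormal (subnormal) number is a nonzero value whose magnitude is below the smallest normalized number of its type. The check below is a minimal sketch for IEEE doubles, written in Python for brevity. `is_subnormal` is a hypothetical helper name, and the x87 long double that IO_convert_long_double_to_ascii presumably receives has a different threshold.

```python
import sys

def is_subnormal(x: float) -> bool:
    """True when x is nonzero but smaller in magnitude than the smallest normal double."""
    return 0.0 < abs(x) < sys.float_info.min

product = 1e-300 * 1e-10      # the exact result, 1e-310, lies below the normal range
print(is_subnormal(product))  # True
print(is_subnormal(1.0))      # False
```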

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

Paul,

The numbers being output are expected to be valid. In earlier cases of this error, I have previously printed out the variables that are used to calculate the real*8 value.
Now, the runs without SALFENVAR=MASK_UNDERFLOW give an indication of the expected value.
You could even place the following before line 122:

PaulLaidler · Posted: Thu Feb 26, 2015 8:44 am

John

It seems to me that we could spend a lot of time investigating these matters for you but it would be at the expense of focusing on porting to 64 bit.

There is an underlying issue here which needs to be investigated, namely the apparent high cost of handling underflows in the default unmasked state. The questions for us to consider are: (a) is this an issue that widely affects users, or is it specific to this type of computation, and (b) if it is widespread, can the overhead be reduced?

Refining 32 bit FTN95 so that it becomes generally robust when underflows are masked really does not seem to be a sensible way forward. Hopefully we are not that far from 64 bit FTN95 where (I understand) this will not be an issue.

In the meantime I will log the above two questions for investigation.

mecej4 · Joined: 31 Oct 2006 Posts: 1886

Paul, (I hope that it is not inappropriate to ask!), should we expect in FTN95-64 the IEEE modules and intrinsic functions that are described in the Fortran 200X standards? If the answer is "yes", it would be possible to query and control FP exceptions in a standard way, instead of using functions such as UNDERFLOW_COUNT@, etc.

PaulLaidler · Posted: Thu Feb 26, 2015 7:43 pm

The initial aim is to port the existing FTN95 release mode to 64 bits but we can keep this request in mind.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

mecej4 stated

mecej4 · Joined: 31 Oct 2006 Posts: 1886

Here is an update on the handling of underflows in John Campbell's test program in the newly released FTN95 64-bit Beta-3.

Using the modified source files in https://www.dropbox.com/s/0zc3khiou4lvywx/jctest.zip?dl=0 , which do not use any FTN95 extensions or non-standard functions, I find that the problem is completely done away with by using the new 64-bit compiler.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

mecej4,

Thanks for the update.
I have reviewed your tests and confirm that ftn95 /64 does not have the fpe slowdown problem. I tested both beta_2 and now beta_3.

This is a real improvement with /64.

Unfortunately I don't share your euphoria, as the floating point performance of ftn95 /64 is not very good.
for beta_2 and /p6 /opt (/32) I get 0.6 mflops ( fpe problem )
for beta_2 and /64 I get 141 mflops
for beta_3 and /64 I get 152 mflops
for gFortran 4.92 -O3 -ffast-math I get 262 mflops

This is a "bit" misleading as gauss_old.f90 is the old version that has very poor cache usage.

If you correct the cache usage, by using gauss_tran.f90, then gFortran should change to about 2,000 mflops, while ftn95 /64 changes to about 300 mflops (it has been a while since I did these tests)
Using FTN95 /32 and the SSE routines, these get about 1,800 mflops.
Using FTN95 /32 /opt and no fpe errors, these get about 950 mflops.
Using FTN95 /32 /debug and no fpe errors, these get about 500 mflops.
It appeared to me that /64 performance was comparable to /32/debug, which is not good.
I have been asking for some SSE vector routines for /64 to fix this problem.

I do find FTN95 to be a significant step forward in other areas. I have been using /64 on my ClearWin+ programs and am very impressed with the improvements that extra memory provides.

I consider FTN95 /64 handling of COMMON as a significant step forward, in comparison to the other 64-bit Fortran Windows compilers I have tried, which appear to have a 2 GB COMMON limit. FTN95 provides a way to expand arrays up to the limits of their integer*4 array subscripts, without significant code reworking for 64-bit.
I could be wrong, but the other compiler developers appear to say "Use modules and allocated arrays" but ignore all the existing F77 codes that could benefit from larger working arrays in COMMON. There is an arrogance that old Fortran users don't know how to use Fortran, when all they want to do is disguise Fortran as some variant of C.

Thanks again for the update.

(Note: my definition of flops is multiply operations per second. If I include adds, then all the mflops figures double.)
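The counting convention in that note can be made explicit with a small calculation. The sketch below is in Python for brevity, with a hypothetical helper `mflops` and made-up operation counts (not figures from the runs above). It assumes one add accompanies each multiply, as in dot-product style loops, which is why including adds doubles the result.

```python
def mflops(multiplies: int, seconds: float, count_adds: bool = False) -> float:
    """Millions of floating point operations per second.

    By default only multiplies are counted; count_adds=True assumes
    one add per multiply and therefore doubles the operation count.
    """
    ops = 2 * multiplies if count_adds else multiplies
    return ops / seconds / 1.0e6

# Hypothetical run: 3.0e8 multiplies completed in 2.0 seconds.
print(mflops(300_000_000, 2.0))                   # 150.0
print(mflops(300_000_000, 2.0, count_adds=True))  # 300.0
```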