Silverfrost Forums

FTN95 run time with allocatable arrays

22 Feb 2015 11:05 #15703

Paul and Mecej4,

I am developing an updated version of the program which demonstrates the FPE error associated with calling mask_underflow@(). This program can be compiled and run with /check.

If I call mask_underflow@() at the start of the program, then I get problems at the end of the testing, when outputting a valid real*8 number after outputting an integer*8 number. There were previous writes of real*8 values that did not generate an error, and all numbers being reported are valid (in the range 0.01 to 1000). I am not getting the error during the test associated with RedCol_Stats(), but at the reporting stage at the end of the main do ieq loop.

Alternatively, if I first call mask_underflow@() at the start of the main loop and then call unmask_underflow@() at the end of the main loop, before the write statements, then no error is generated. The FTN95 documentation recommends that the call to mask_underflow@() be the first executable statement?

I also tried a test in the inner loop:

      if ( abs(Col(Jeq+I0)) < 1.0d-90 ) Col(Jeq+I0) = 0

This removes most of the small numbers being generated in colsol and removes the FP exceptions, but only for one of the solution methods. It may be that a well conditioned finite element matrix will not generate FPEs. This is disappointing, as I was hoping that a source of this chronic delay problem had been found. I shall send the link in a PM, together with documentation of how to generate the error.

John

22 Feb 2015 2:39 #15704

Keep looking at that issue. That way underflow corruption will finally be addressed. It caused a lot of lost time in the past in one of my subroutines, which was doing a lot of exp(-a) with a exceeding -log(1e-37) ≈ 85. It essentially killed Jalih's (and I think Paul's latest too) parallel method for me, since it is very sensitive to underflow, crashing immediately. Maybe even denormal numbers cause the problem. We discussed that last year here and even had a demo reproducer.
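
For reference, one common way to avoid this particular underflow is to test the argument before calling EXP and return zero when the result would vanish anyway. A double precision sketch with an illustrative threshold, not taken from the code discussed here:

      ! Illustrative sketch only: clamp the argument so EXP never underflows.
      ! 700 is safely below the double-precision limit (exp(-708) is about
      ! the smallest normal double).
      double precision function safe_exp_neg(a)
         double precision, intent(in) :: a
         double precision, parameter  :: exp_arg_max = 700.0d0
         if (a > exp_arg_max) then
            safe_exp_neg = 0.0d0        ! would underflow; flush to zero instead
         else
            safe_exp_neg = exp(-a)
         end if
      end function safe_exp_neg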

24 Feb 2015 12:16 #15721

Paul,

The following link provides a cut-down example of the FPE failure.

https://www.dropbox.com/s/066ghzblgcmca9s/fpe_example.zip?dl=0

To demonstrate the problem, unzip the archive and run do_tests.bat. The final failure, with SALFENVAR=MASK_UNDERFLOW set, shows the floating point stack fault occurring.

I am using FTN95 Ver 7.10.0

Only the final run of the program uses set SALFENVAR=MASK_UNDERFLOW.

If you change line 23 of prof.f90 to eqn_option = 2000, you will then see the FPE delays become more significant. The last column of the report is the incremental count of FPEs occurring.

Thanks to Mecej4 for his assistance in identifying this error.

John

24 Feb 2015 10:04 #15722

Please note that the cut down example contains only about 240 lines and does not need any command line arguments to be supplied, whereas the original version had close to 3000 lines, and had (i) provisions for many alternative code paths and (ii) extensive instrumentation to time the program.

With the shortened example code and the batch file that he provides, John has made it easy to run and exhibit the two problems with the compiler: X87 stack overflow in a WRITE statement, and excessive time consumed in processing underflows. It is possible to work around only one of these problems.

24 Feb 2015 1:06 #15726

The immediate problem occurs when WRITE is used for an INTEGER*8 value. This has not been fixed yet but the temporary work-around is to avoid this situation.

To get this code working I kept op_count and last_count as INTEGER*8 but assigned these to INTEGER*4 values before calling WRITE with the INTEGER*4 values.
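
In code terms, the work-around looks something like this (a sketch only; the counter names come from the code under discussion, the WRITE statement is assumed):

      integer*8 :: op_count, last_count   ! keep the 64-bit counters
      integer*4 :: op_count_4             ! temporary copy for output

      op_count_4 = op_count               ! assumes the count fits in 32 bits
      write (*,*) 'operation count = ', op_count_4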

If I understand it correctly, in this context the non-zero underflow count is caused by the calls to WRITE when using an INTEGER*8 value.

I have logged this for further investigation.

24 Feb 2015 3:15 #15727

Paul's response addresses one of the two related problems that we reported. The other problem, i.e., a cost of over 10,000 CPU cycles to handle an underflow interrupt with default settings, remains to be addressed.

More poking around caused me to stumble on another work-around. Isolate the subprograms with the highest incidence of underflows. Compile the corresponding source file(s) separately, build a DLL with them and export the subroutine/function entry points. In your application, instead of linking with the OBJ file(s) for those routines, link with the DLL (SLink can link directly with DLLs). Here is a test driver:

program tstvecsse
implicit none
integer, parameter :: N=Z'0FFFF'
double precision C,X(N),Y(N),Z(N),U(N),difnrm,vec_sum_sse
external vec_add_sse
integer ucnt,it

C=-acos(-1d0)                 ! use negative C to do VEC_SUB using VEC_SUM
do it=1,100
   call Random_Number(X)
   Call Random_Number(Y)
   X=X*9D-307                 ! set up to make trouble with underflows
   Y=Y*9D-307
   Z=Y+C*X                    ! F90 vector operation
   call vec_add_sse(Y,X,C,N)  ! SSE in assembler
   U=Y-Z                      ! difference
   difnrm=sqrt(vec_sum_sse(U,U,N))
   if(difnrm > 1d-12)write(*,*)it,difnrm
end do
call underflow_count@(ucnt)
write(*,*)' Underflow count = ',ucnt
end program

I compiled this driver and tested it by (i) linking with the OBJ file for DavidB's SSE BLAS routines ( https://forums.silverfrost.com/Forum/Topic/1894&postdays=0&postorder=asc&start=75, page 6, posted June 2, 2012 ), and (ii) with a DLL built from that OBJ file, as described above. The run times:

 1. OBJ used :  about 3 million underflows, 23 seconds run time
 2. DLL used :  0 underflows reported, 0.9 second run time.

My suspicion is that the DLL initialization sets up a dummy 'ignore and proceed' handler for FPU underflows, whereas a slow handler is included in the program initialization section of the EXE. The DLL startup code probably does not look for and process the SALFENVAR environment variable, either.

24 Feb 2015 4:36 #15728

There may be some advantage to using SSE instructions but the primary question may relate to the handling of underflows.

It may be that the third party compiler does not handle underflows and hence has no associated overhead (or maybe the SSE instructions do not generate underflows). This raises the question: is it safe to ignore underflows? The authors of FTN95 took the view that, for safety, underflows need to be handled (by clearing the relevant processor flags and setting the value to zero). Having said this, it does appear that the cost is unduly high, suggesting that there is more to it than this.

I am not sure that these issues can be resolved in the short term, and they disappear with 64-bit FTN95, which for now is our primary focus.

For those who would prefer not to wait, run times might be reduced by programming out the underflows before they happen.
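
For example, one way to program out the underflows is to flush tiny intermediate values to zero before they are used again. A sketch only, with an arbitrary threshold and illustrative names:

      ! Illustrative only: values heading into the denormal range are set to
      ! zero so that later operations never generate an underflow.
      double precision, parameter :: flush_cut = 1.0d-100
      integer :: i
      do i = 1, n
         if ( abs(col(i)) < flush_cut ) col(i) = 0.0d0
      end do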

24 Feb 2015 8:59 #15729

Paul,

You stated:

The immediate problem occurs when WRITE is used for an INTEGER*8 value.

I thought the problem presents when writing a real*8 after writing an integer*8. I tried a very cut-down version of this, but it did not fail, so I suspect the problem is more complex. I also noted in an earlier post that the appearance of the problem could be delayed by the use of format statements. I hope this problem can be located.

Thanks for your assistance.

John

25 Feb 2015 7:23 #15730

John

As I understand it, the problem presented in the code provided has been identified as occurring with WRITE and INTEGER*8.

If there is another problem then we would need different code that illustrates this.

25 Feb 2015 11:55 #15731

Paul,

The error report I am getting with fpe_example.zip is:

Runtime error from program: c:\temp\forum\fpe\prof_und.exe
Floating point stack fault
Floating point stack fault at address 10100cff

 10100c8b IO_convert_long_double_to_ascii [+0074]

 1010d0a2 R_WSF_main [+07e9]

 1010d03a D8__WSF [+0024]

 SK_ZERO -  in file gtran.f90 at line 122 [+07f8]

 GAUSSEAN_REDUCTION -  in file gtran.f90 at line 80 [+1109]

 DO_TESTS -  in file prof.f90 at line 46 [+01cc]

 PROFILE_V6 -  in file prof.f90 at line 30 [+0092]

 0040406a SALFStart [+06ff]


eax=00004005   ebx=0008fb8e   ecx=00000001
edx=fffffffa   esi=101dff0c   edi=00000000
ebp=0360cf38   esp=0360cedc   IOPL=0
ds=002b   es=002b   fs=0053
gs=002b   cs=0023   ss=002b
flgs=00010203 [CA OP NZ SN DN NV]

 10100cff  qfld     [esi] 
 10100d01  fmulp    st(1) 
 10100d03  add      esi,0xa 

Line 122 is a real*8 write, write (*,*) 'Smallest diagonal =', d, and it is preceded by an integer*4 write.

In the previous longer example, the error was on a real*8 write following an integer*8 write, while earlier real*8 writes were successful. There had been previous integer*8 writes in file gtran.f90 at line 44. In the larger example, the equivalent of line 44 included both real*8 and integer*8 values and did not fail. The location of the error has changed in the cut-down example, which can happen with a stack corruption problem.

I hope this clarifies my earlier comments.

John

25 Feb 2015 12:17 #15734

John, as of now it appears that your best recourse is to (i) avoid using the environment variable SALFENVAR during compilation or execution, (ii) compile using the options /opt /p6, leaving out sse_lib.f90 (see below for its replacement), (iii) link with the command slink *.obj sse_sal.dll /out:profile, and (iv) run with /l:5 as much as possible.

I think that at this time your test codes have fulfilled their purpose. Using the lessons learned from this exercise, you could go ahead and modify your real FEM application to call the routines within SSE_SAL.DLL. If the FEM code needs additional BLAS routines, SSE2 versions of those could be prepared.

The assembly source for sse_sal.dll, the assembled OBJ file and a DLL built from it are contained in https://dl.dropboxusercontent.com/u/88464747/sse_sal.zip . The DLL will serve as the replacement for sse_lib.f90. I derived the assembly source from DavidB's source file, and modified it to remove all X87 instructions except the FLD instructions mandated by the Microsoft 32-bit ABI for functions that return float/double values.

If you follow this procedure, the only underflows that will remain a problem will be those generated from code within your Fortran sources.

Following this procedure, I obtained the following timing results for /l:5 /m:-1, with various values of /d:nnn.

 d    t_gauss   t_red1b
 2     16.85     11.07
 3     17.40     11.29
 4     17.68     11.37
 5     17.98     11.45
 6     17.83     11.53

These times do not exceed twice the values that I was able to obtain with optimizing compilers.

25 Feb 2015 4:52 (Edited: 25 Feb 2015 8:30) #15741

'These times do not exceed twice the values that I was able to obtain with optimizing compilers'

Mecej4, I am still curious why you see a factor of two to three loss in your tests versus other compilers. If you look at the Polyhedron benchmark

http://www.polyhedron.com/fortran-compiler-comparisons/fortran-execution-time-benchmarks-64-bit-windows-7-intel-core-i5-2500k

you can see that sometimes (and specifically on the linear algebra that John is doing) FTN95 performance is within ten to twenty percent of the top. Can you guys find where specifically this compiler loses its steam versus other compilers? FFT is also as fast as the others, which means that the compiler can go very fast.

There are a few examples where the loss of speed indeed reaches an outrageous 700%, and one example trails 30x (!!!) slower. Was that due to underflows? Other terrible things? The site has the Fortran source texts for all benchmarks; please look if you have time. Is it possible to make a few short test examples for Silverfrost to pay attention to in optimization? Are those speedups of other compilers due to SSE? If not, then high precision timers can easily identify the slow part of the code, which is typically very small. I remember that when Intel became very fast in the Polyhedron benchmarks a decade ago, Lahey and Absoft dramatically improved the speeds of their compilers within just a few weeks and months.

You probably can contribute to the TEST_FPU2 test there which is also a set of different linear algebra methods for different matrix sizes adding SSE2 option.

Paul, what is the primary focus of developing the 64-bit compiler: is it mostly to get a larger address space while keeping FTN95 core features and speeds intact, to add 2003/2008 features, or to improve overall performance (like on those Polyhedron tests)? As a comment, it was the right decision of the developers to consider underflow as an error. Like /undef, underflow can hint at a hidden error. But they should also provide a switch to ignore underflows.

25 Feb 2015 5:15 #15743

John

It is quite possible that this fault arises because the numbers being output are no longer valid. For example, IO_convert_long_double_to_ascii may not be designed to handle denormals.

25 Feb 2015 11:20 #15751

Paul,

The numbers being output are expected to be valid. In earlier cases of this error, I printed out the variables that are used to calculate the real*8 value. Now, the runs without SALFENVAR=MASK_UNDERFLOW give an indication of the expected value. You could even place the following before line 122:

      if ( d < 99. .or. d > 100. ) then
         write (*,*) 'unexpected value of d'
      else
         write (*,*) 'value of d is in range'
      end if

If the number becomes invalid, there is something unexplained that is causing this result.

John

26 Feb 2015 7:44 #15752

John

It seems to me that we could spend a lot of time investigating these matters for you, but it would be at the expense of focusing on porting to 64 bits.

There is an underlying issue here which needs to be investigated, namely the apparent high cost of handling underflows in the default unmasked state. The questions for us to consider are: (a) is this an issue that widely affects users, or is it specific to this type of computation, and (b) if it is widespread, can the overhead be reduced?

Refining 32 bit FTN95 so that it becomes generally robust when underflows are masked really does not seem to be a sensible way forward. Hopefully we are not that far from 64 bit FTN95 where (I understand) this will not be an issue.

In the meantime I will log the above two questions for investigation.

26 Feb 2015 2:26 (Edited: 27 Feb 2015 12:09) #15758

Paul, (I hope that it is not inappropriate to ask!), should we expect in FTN95-64 the IEEE modules and intrinsic functions that are described in the Fortran 200X standards? If the answer is 'yes', it would be possible to query and control FP exceptions in a standard way, instead of using functions such as UNDERFLOW_COUNT@, etc.
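
For illustration, with the standard modules the query and control would look something like this (a sketch, assuming full support for IEEE_EXCEPTIONS):

      program ieee_underflow_demo
      use, intrinsic :: ieee_exceptions
      implicit none
      logical :: raised
      double precision :: x

      call ieee_set_halting_mode(ieee_underflow, .false.)   ! do not trap on underflow
      call ieee_set_flag(ieee_underflow, .false.)           ! clear the flag first

      x = tiny(1.0d0) / 1.0d10                              ! deliberately underflow
      call ieee_get_flag(ieee_underflow, raised)
      if (raised) write (*,*) 'underflow was flagged, x = ', x
      end program ieee_underflow_demo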

26 Feb 2015 6:43 #15761

The initial aim is to port the existing FTN95 release mode to 64 bits but we can keep this request in mind.

27 Feb 2015 4:16 #15769

mecej4 stated

If the FEM code needs additional BLAS routines, SSE2 versions of those could be prepared.

I am sure that a BLAS library using SSE, SSE2 or AVX instructions that can be linked with FTN95_32 and FTN95_64 would be a very welcome improvement.

Even SSE replacements for the DOT_PRODUCT and SUM intrinsics would be a very welcome addition (they could be in salflibc.dll).
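
As an illustration only, the kind of drop-in routine meant here might have roughly this shape (names and interface are assumptions; a real version would implement the loop in SSE2):

      ! Illustrative shape only: a reference Fortran version of the kind of
      ! routine that an SSE2/assembler implementation would replace.
      function dot_product_sse(a, b, n) result(s)
         integer,          intent(in) :: n
         double precision, intent(in) :: a(n), b(n)
         double precision             :: s
         integer :: i
         s = 0.0d0
         do i = 1, n
            s = s + a(i)*b(i)
         end do
      end function dot_product_sse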

Often it is only 1 or 2 inner DO loops that require SSE to make things better, although there is always the problem that slight changes are required. Then there is !$OMP. It never ends.

There are a number of FTN95 users who use other compilers for compute intensive programs to avoid the x87 deficiencies. In my work, there are only a few vector instructions that I require. Some sort of BLAS.DLL or blas.lib functionality distributed with FTN95 should please these users.

John

6 Jan 2016 2:52 #17113

Here is an update on the handling of underflows in John Campbell's test program in the newly released FTN95 64-bit Beta-3.

Using the modified source files in https://www.dropbox.com/s/0zc3khiou4lvywx/jctest.zip?dl=0 , which do not use any FTN95 extensions or non-standard functions, I find that the problem is completely done away with by the new 64-bit compiler.

FTN95 7.2 32-bit /opt /p6 : 316.  s
FTN95 8.0 Beta-3 64-bit   :   1.7 s   * different CPU
Gfortran 4.9, -O2         :   7.9 s 
Ifort 16, -O2             :   2.4 s
Absoft 16, -Ofast         :  11.0 s

Except for the FTN95 Beta-3 result, which was obtained on a newer CPU by a friend, the timings were obtained on a laptop with a T4300 CPU. For comparison purposes, you may roughly double the Beta-3 time (the second line) to 3.4 s.

Impressive performance from FTN95 8.0 Beta-3, so congratulations to the Silverfrost team! Note that when the new compiler is released it will allow /opt, which the Beta-3 version does not.

John (Campbell), if you read this and you have the Beta-3 compiler, please run all the tests on a single PC and share the results.

6 Jan 2016 6:54 #17114

mecej4,

Thanks for the update. I have reviewed your tests and confirm that ftn95 /64 does not have the fpe slowdown problem. I tested both beta_2 and now beta_3.

This is a real improvement with /64.

Unfortunately I don't share your euphoria, as the floating point performance of ftn95 /64 is not very good:

for beta_2 and /p6 /opt (/32) I get 0.6 mflops (fpe problem)
for beta_2 and /64 I get 141 mflops
for beta_3 and /64 I get 152 mflops
for gFortran 4.92 with -O3 -ffast-math I get 262 mflops

This is a 'bit' misleading as gauss_old.f90 is the old version that has very poor cache usage.

If you correct the cache usage by using gauss_tran.f90, then gFortran should change to about 2,000 mflops, while ftn95 /64 changes to about 300 mflops (it has been a while since I did these tests). Using FTN95 /32 and the SSE routines, these get about 1,800 mflops. Using FTN95 /32 /opt and no fpe errors, these get about 950 mflops. Using FTN95 /32 /debug and no fpe errors, these get about 500 mflops. It appeared to me that /64 performance was comparable to /32 /debug, which is not good. I have been asking for some SSE vector routines for /64 to fix this problem.

I do find FTN95 /64 to be a significant step forward in other areas. I have been using /64 on my clearwin+ programs and am very impressed with the improvements that the extra memory provides.

I consider FTN95 /64 handling of COMMON a significant step forward in comparison to the other 64-bit Fortran Windows compilers I have tried, which appear to have a 2 GB COMMON limit. FTN95 provides a way to expand arrays up to the limits of their integer*4 array subscripts, without significant code reworking for 64-bit. I could be wrong, but the other compiler developers appear to say 'use modules and allocated arrays' and ignore all the existing F77 codes that could benefit from larger working arrays in COMMON. There is an arrogance that old Fortran users don't know how to use Fortran, when all they want to do is disguise Fortran as some variant of C.

Thanks again for the update.

(note: my definition of flop is multiply operations per second. If I include add then all the mflop reports double)
