forums.silverfrost.com Forum Index forums.silverfrost.com
Welcome to the Silverfrost forums
 

FTN95 run time with allocatable arrays
mecej4



Joined: 31 Oct 2006
Posts: 1886

Posted: Tue Feb 24, 2015 4:15 pm

Paul's response addresses one of the two related problems that we reported. The other problem, i.e., a cost of over 10,000 CPU cycles to handle an underflow interrupt with default settings, remains to be addressed.

More poking around has caused me to stumble on another work-around. Isolate the subprograms with the highest incidence of underflows. Compile the corresponding source file(s) separately, build a DLL from them and export the subroutine/function entry points. In your application, instead of linking with the OBJ file(s) for those routines, link with the DLL (SLink can link directly with DLLs). Here is a test driver:
Code:

program tstvecsse
implicit none
integer, parameter :: N=Z'0FFFF'
double precision C,X(N),Y(N),Z(N),U(N),difnrm,vec_sum_sse
external vec_add_sse
integer ucnt,it

C=-acos(-1d0)                 ! use negative C to do VEC_SUB using VEC_SUM
do it=1,100
   call Random_Number(X)
   Call Random_Number(Y)
   X=X*9D-307                 ! set up to make trouble with underflows
   Y=Y*9D-307
   Z=Y+C*X                    ! F90 vector operation
   call vec_add_sse(Y,X,C,N)  ! SSE in assembler
   U=Y-Z                      ! difference
   difnrm=sqrt(vec_sum_sse(U,U,N))
   if(difnrm > 1d-12)write(*,*)it,difnrm
end do
call underflow_count@(ucnt)
write(*,*)' Underflow count = ',ucnt
end program

I compiled this driver and tested it by (i) linking with the OBJ file for DavidB's SSE BLAS routines ( http://forums.silverfrost.com/viewtopic.php?t=2176&postdays=0&postorder=asc&start=75, page 6, posted June 2, 2012 ), and (ii) with a DLL built from that OBJ file, as described above. The run times:

1. OBJ used : about 3 million underflows, 23 seconds run time
2. DLL used : 0 underflows reported, 0.9 second run time.

My suspicion is that the DLL initialization sets up a dummy "ignore and proceed" handler for FPU underflows, whereas a slow handler is included in the program initialization section of the EXE. The DLL startup code probably does not look for and process the SALFENVAR environment variable, either.
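For readers who do not have DavidB's SSE library at hand, the driver above can be linked against plain Fortran stand-ins. The sketch below is inferred only from how the driver calls the two routines: vec_add_sse(Y,X,C,N) is assumed to update Y to Y + C*X, and vec_sum_sse(U,V,N) is assumed to return the dot product of U and V. These are not the SSE implementations, and they are not expected to reproduce the timings or underflow counts quoted above.

```fortran
! Plain Fortran stand-ins for the two routines used by the driver.
! Semantics are ASSUMED from the driver's usage, not taken from DavidB's code.
subroutine vec_add_sse(y, x, c, n)
   implicit none
   integer, intent(in) :: n
   double precision, intent(inout) :: y(n)
   double precision, intent(in) :: x(n), c
   integer :: i
   do i = 1, n
      y(i) = y(i) + c*x(i)                    ! assumed: Y := Y + C*X
   end do
end subroutine vec_add_sse

double precision function vec_sum_sse(u, v, n)
   implicit none
   integer, intent(in) :: n
   double precision, intent(in) :: u(n), v(n)
   integer :: i
   vec_sum_sse = 0d0
   do i = 1, n
      vec_sum_sse = vec_sum_sse + u(i)*v(i)   ! assumed: dot product of U and V
   end do
end function vec_sum_sse
```

Compiling these in a separate source file and linking the resulting OBJ (or a DLL built from it) with the driver gives a baseline for comparing the two linking routes described above.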
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7925
Location: Salford, UK

Posted: Tue Feb 24, 2015 5:36 pm

There may be some advantage to using SSE instructions but the primary question may relate to the handling of underflows.

It may be that the third party compiler does not handle underflows and hence has no associated overhead (or maybe the SSE instructions do not generate underflows). This raises the question, is it safe to ignore underflows? The authors of FTN95 took the view that, for safety, underflows need to be handled (by clearing the relevant processor flags and setting the value to zero). Having said this, it does appear that the cost is unduly high suggesting that there is more to it than this.

I am not sure that these issues can be resolved in the short term and they disappear with 64 bit FTN95 which for now is our primary focus.

For those who would prefer not to wait, run times might be reduced by programming out the underflows before they happen.
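One way to read "programming out the underflows" is to zero small values before they enter the inner loops. The following is a minimal sketch of that idea, not a recommendation from Silverfrost: the routine name flush_small and the choice of threshold are hypothetical, and a suitable cut-off depends on the computation.

```fortran
! Sketch: set entries to zero when they are small enough that later
! products would underflow. Names and threshold are illustrative only.
module flush_small_mod
   implicit none
contains
   subroutine flush_small(a, threshold)
      double precision, intent(inout) :: a(:)
      double precision, intent(in) :: threshold
      where (abs(a) < threshold) a = 0d0
   end subroutine flush_small
end module flush_small_mod

program demo_flush
   use flush_small_mod
   implicit none
   double precision :: x(4)
   x = (/ 1d0, 3d-200, -2d-180, 5d-3 /)
   ! The product of two values that are both below sqrt(tiny) in
   ! magnitude falls below tiny, so sqrt(tiny) is used as the cut-off here.
   call flush_small(x, sqrt(tiny(1d0)))
   write (*,*) x      ! the two middle entries have been set to zero
end program demo_flush
```

Whether discarding such values is acceptable is the same question raised above about ignoring underflows, so the threshold should be chosen with the accuracy requirements of the application in mind.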
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Tue Feb 24, 2015 9:59 pm

Paul,

You stated:
Quote:
The immediate problem occurs when WRITE is used for an INTEGER*8 value.

I thought the problem appears when writing a real*8 after writing an integer*8. I tried a very cut-down version of this, but it did not fail, so I suspect the problem is more complex.
I also noted in an earlier post that the onset of the problem could be delayed by using format statements.
I hope this problem can be located.

Thanks for your assistance.

John
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7925
Location: Salford, UK

Posted: Wed Feb 25, 2015 8:23 am

John

As I understand it, the problem presented in the code provided has been identified as occurring with WRITE and INTEGER*8.

If there is another problem then we would need different code that illustrates this.
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Wed Feb 25, 2015 12:55 pm

Paul,

The error report I am getting with fpe_example.zip is:
Code:
Runtime error from program:c:\temp\forum\fpe\prof_und.exe
Floating point stack fault
Floating point stack fault at address 10100cff

 10100c8b IO_convert_long_double_to_ascii [+0074]
 1010d0a2 R_WSF_main [+07e9]
 1010d03a D8__WSF [+0024]
 SK_ZERO -  in file gtran.f90 at line 122 [+07f8]
 GAUSSEAN_REDUCTION -  in file gtran.f90 at line 80 [+1109]
 DO_TESTS -  in file prof.f90 at line 46 [+01cc]
 PROFILE_V6 -  in file prof.f90 at line 30 [+0092]
 0040406a SALFStart [+06ff]


eax=00004005   ebx=0008fb8e   ecx=00000001
edx=fffffffa   esi=101dff0c   edi=00000000
ebp=0360cf38   esp=0360cedc   IOPL=0
ds=002b   es=002b   fs=0053
gs=002b   cs=0023   ss=002b
flgs=00010203 [CA OP NZ SN DN NV]

 10100cff  qfld     [esi]
 10100d01  fmulp    st(1)
 10100d03  add      esi,0xa


Line 122 is a real*8 write, which is preceded by an integer*4 write:
Code:
      write (*,*) 'Smallest diagonal         =',d


In the previous longer example, the error occurred on a real*8 write following an integer*8 write, while earlier real*8 writes were successful.
There had been previous integer*8 writes in file gtran.f90 at line 44. In the larger example, the equivalent of line 44 wrote both real*8 and integer*8 values and did not fail.
The location of the error has changed in the cut-down example, which can happen with a stack corruption problem.

I hope this clarifies my earlier comments.

John
mecej4



Joined: 31 Oct 2006
Posts: 1886

Posted: Wed Feb 25, 2015 1:17 pm

John, as of now it appears that your best recourse is to (i) avoid using the environment variable SALFENVAR during compilation or execution, (ii) compile using the options /opt /p6, leaving out sse_lib.f90 (see below for its replacement), (iii) link with the command slink *.obj sse_sal.dll /out:profile, and (iv) run with /l:5 as much as possible.

I think that at this time your test codes have fulfilled their purpose. Using the lessons learned from this exercise, you could go ahead and modify your real FEM application to call the routines within SSE_SAL.DLL. If the FEM code needs additional BLAS routines, SSE2 versions of those could be prepared.

The assembly source for sse_sal.dll, the assembled OBJ file and a DLL built from it are contained in https://dl.dropboxusercontent.com/u/88464747/sse_sal.zip . The DLL will serve as the replacement for sse_lib.f90. I derived the assembly source from DavidB's source file, and modified it to remove all X87 instructions except the FLD instructions mandated by the Microsoft 32-bit ABI for functions that return float/double values.

If you follow this procedure, the only underflows that will remain a problem will be those generated from code within your Fortran sources.

Following this procedure, I obtained the following timing results for /l:5 /m:-1, with various values of /d:nnn.

d   t_gauss   t_red1b
-   -------   -------
2     16.85     11.07
3     17.40     11.29
4     17.68     11.37
5     17.98     11.45
6     17.83     11.53

These times do not exceed twice the values that I was able to obtain with optimizing compilers.
DanRRight



Joined: 10 Mar 2008
Posts: 2816
Location: South Pole, Antarctica

Posted: Wed Feb 25, 2015 5:52 pm

"These times do not exceed twice the values that I was able to obtain with optimizing compilers"

Mecej4, I am still curious why your tests show a factor of two to three loss versus other compilers. If you look at the Polyhedron benchmark

http://www.polyhedron.com/fortran-compiler-comparisons/fortran-execution-time-benchmarks-64-bit-windows-7-intel-core-i5-2500k

you can see that sometimes (and specifically on linear algebra, which John is doing) FTN95 performance is within ten to twenty percent of the top. Can you guys find where specifically this compiler loses steam versus other compilers? FFT is also as fast as the others, which means that the compiler can go very fast.

There are a few examples where the loss of speed indeed reaches an outrageous 700%, and one example trails by 30x (!!!). Was it due to underflows? Other terrible things? The site has Fortran source texts for all benchmarks; please look if you have time. Is it possible to make a few short test examples for Silverfrost to pay attention to in optimization? Are those speedups of other compilers due to SSE? If not, then high precision timers can easily identify the slow part of the code, which is typically very small. I remember that when Intel became very fast in the Polyhedron benchmarks a decade ago, Lahey and Absoft dramatically improved the speeds of their compilers within just a few weeks and months.

You could probably contribute to the TEST_FPU2 test there, which is also a set of different linear algebra methods for different matrix sizes, by adding an SSE2 option.

Paul, what is the primary focus in developing the 64-bit compiler: is it mostly to get a larger address space while keeping FTN95 core features and speeds intact, to add 2003/2008 features, or to improve overall performance (as on those Polyhedron tests)? As a comment: it was the right decision of the developers to consider underflow an error. Like /undef, an underflow can hint at a hidden error. But they should also provide a switch to ignore underflow.


Last edited by DanRRight on Wed Feb 25, 2015 9:30 pm; edited 9 times in total
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7925
Location: Salford, UK

Posted: Wed Feb 25, 2015 6:15 pm

John

It is quite possible that this fault arises because the numbers being output are no longer valid. For example IO_convert_long_double_to_ascii may not be designed to handle denormals.
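Whether a value such as d is a denormal can be checked before it is written. The sketch below is one way to do so in standard Fortran; the helper name is_subnormal is hypothetical, and the VOLATILE attribute (Fortran 2003) is used only to keep the division from being evaluated at compile time. It has not been tried with FTN95.

```fortran
! Sketch: detect a subnormal (denormal) double precision value.
program check_subnormal
   implicit none
   double precision, volatile :: t   ! volatile: force the division to happen at run time
   double precision :: d
   t = tiny(1d0)
   d = t / 4d0                       ! below tiny(): subnormal if gradual underflow is in effect
   if (is_subnormal(d)) then
      write (*,*) 'd is subnormal'
   else
      write (*,*) 'd is not subnormal'
   end if
contains
   logical function is_subnormal(v)
      double precision, intent(in) :: v
      ! Nonzero but smaller in magnitude than the smallest normal number
      is_subnormal = (v /= 0d0) .and. (abs(v) < tiny(v))
   end function is_subnormal
end program check_subnormal
```

Placing such a test immediately before the failing WRITE would show whether the value being converted is a denormal at that point.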
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Thu Feb 26, 2015 12:20 am

Paul,

The numbers being output are expected to be valid. In earlier cases of this error, I printed out the variables that are used to calculate the real*8 value.
Now, the runs without SALFENVAR=MASK_UNDERFLOW give an indication of the expected value.
You could even place the following before line 122:
Code:
      if ( d < 99. .or. d > 100. ) then
         write (*,*) 'unexpected value of d'
      else
         write (*,*) 'value of d is in range'
      end if

If the number becomes invalid, there is something unexplained that is causing this result.

John
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7925
Location: Salford, UK

Posted: Thu Feb 26, 2015 8:44 am

John

It seems to me that we could spend a lot of time investigating these matters for you but it would be at the expense of focusing on porting to 64 bit.

There is an underlying issue here which needs to be investigated, namely the apparent high cost of handling underflows in the default unmasked state. The questions for us to consider are: a) is this an issue that widely affects users or is it specific to this type of computation, and b) if it is widespread, can the overhead be reduced?

Refining 32 bit FTN95 so that it becomes generally robust when underflows are masked really does not seem to be a sensible way forward. Hopefully we are not that far from 64 bit FTN95 where (I understand) this will not be an issue.

In the meantime I will log the above two questions for investigation.
mecej4



Joined: 31 Oct 2006
Posts: 1886

Posted: Thu Feb 26, 2015 3:26 pm

Paul, (I hope that it is not inappropriate to ask!), should we expect in FTN95-64 the IEEE modules and intrinsic functions that are described in the Fortran 200X standards? If the answer is "yes", it would be possible to query and control FP exceptions in a standard way, instead of using functions such as UNDERFLOW_COUNT@, etc.
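For reference, querying and controlling the underflow exception in the standard way looks roughly like the sketch below, which uses the intrinsic module IEEE_EXCEPTIONS from Fortran 2003. It assumes a compiler that provides that module; nothing in this thread indicates that FTN95 does, so treat it as an illustration of the request rather than something that can be run with FTN95.

```fortran
! Sketch: standard (Fortran 2003) control of the underflow exception.
program ieee_underflow_demo
   use, intrinsic :: ieee_exceptions
   implicit none
   double precision, volatile :: x   ! volatile: keep the multiply at run time
   double precision :: y
   logical :: raised

   if (.not. ieee_support_flag(ieee_underflow)) stop 'underflow flag not supported'

   ! Ask for underflow to set a flag instead of halting, where that is allowed
   if (ieee_support_halting(ieee_underflow)) then
      call ieee_set_halting_mode(ieee_underflow, .false.)
   end if

   call ieee_set_flag(ieee_underflow, .false.)    ! start with the flag clear

   x = tiny(1d0)
   y = x * 1d-10                                  ! result lies below tiny()

   call ieee_get_flag(ieee_underflow, raised)     ! query instead of counting traps
   write (*,*) 'y = ', y, '  underflow signalled: ', raised
end program ieee_underflow_demo
```

The flag query plays the role that UNDERFLOW_COUNT@ plays in the test driver, except that it reports whether an underflow occurred rather than how many.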

Last edited by mecej4 on Fri Feb 27, 2015 1:09 am; edited 1 time in total
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7925
Location: Salford, UK

Posted: Thu Feb 26, 2015 7:43 pm

The initial aim is to port the existing FTN95 release mode to 64 bits but we can keep this request in mind.
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Fri Feb 27, 2015 5:16 am

mecej4 stated:
Quote:
If the FEM code needs additional BLAS routines, SSE2 versions of those could be prepared.


I am sure that a BLAS library that can be linked into FTN95_32 and FTN95_64 that uses SSE, SSE2 or AVX instructions would be a very welcome improvement.

Even a SSE DOT_PRODUCT and SUM Intrinsic replacement would be a very welcome addition. (could be in salflibc.dll)

Often it is only 1 or 2 inner DO loops that require SSE to make things better, although there is always the problem that slight changes are required. Then there is !$OMP. It never ends.

There are a number of FTN95 users who use other compilers for compute intensive programs to avoid the x87 deficiencies. In my work, there are only a few vector instructions that I require. Some sort of BLAS.DLL or blas.lib functionality distributed with FTN95 should please these users.

John
mecej4



Joined: 31 Oct 2006
Posts: 1886

Posted: Wed Jan 06, 2016 3:52 am

Here is an update on the handling of underflows in John Campbell's test program in the newly released FTN95 64-bit Beta-3.

Using the modified source files in https://www.dropbox.com/s/0zc3khiou4lvywx/jctest.zip?dl=0 , which do not use any FTN95 extensions and non-standard functions, I find that the problem is completely done away with by using the new 64-bit compiler.
Code:

FTN95 7.2 32-bit /opt /p6 : 316.  s
FTN95 8.0 Beta-3 64-bit   :   1.7 s   * different CPU
Gfortran 4.9, -O2         :   7.9 s
Ifort 16, -O2             :   2.4 s
Absoft 16, -Ofast         :  11.0 s

Except for the FTN95 Beta-3 result, which was obtained on a newer CPU by a friend, the timings were obtained on a laptop with a T4300 CPU. For comparison purposes, you may double the second duration to 3.4 s.

Impressive performance from FTN95 8.0 Beta-3, so congratulations to the Silverfrost team! Note that when the new compiler is released it will allow /opt, which the Beta-3 version does not.

John (Campbell), if you read this and you have the Beta-3 compiler, please run all the tests on a single PC and share the results.
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Wed Jan 06, 2016 7:54 am

mecej4,

Thanks for the update.
I have reviewed your tests and confirm that ftn95 /64 does not have the fpe slowdown problem. I tested both beta_2 and now beta_3.

This is a real improvement with /64

Unfortunately I don't share your euphoria, as the floating point performance of ftn95 /64 is not very good.
For beta_2 and /p6 /opt (/32) I get 0.6 mflops (fpe problem)
For beta_2 and /64 I get 141 mflops
For beta_3 and /64 I get 152 mflops
For gFortran 4.92 -O3 -ffast-math I get 262 mflops

This is a "bit" misleading as gauss_old.f90 is the old version that has very poor cache usage.

If you correct the cache usage, by using gauss_tran.f90, then gFortran should change to about 2,000 mflops, while ftn95 /64 changes to about 300 mflops (it has been a while since I did these tests)
Using FTN95 /32 and the SSE routines, these get about 1,800 mflops.
Using FTN95 /32 /opt and no fpe errors, these get about 950 mflops.
Using FTN95 /32 /debug and no fpe errors, these get about 500 mflops.
It appeared to me that /64 performance was comparable to /32/debug, which is not good.
I have been asking for some SSE vector routines for /64 to fix this problem.

I do find FTN95 to be a significant step forward in other areas. I have been using /64 on my clearwin+ programs and am very impressed with the improvements that extra memory provides.

I consider FTN95 /64 handling of COMMON as a significant step forward, in comparison to the other 64-bit Fortran windows compilers I have tried, which appear to have a 2gb COMMON limit. FTN95 provides a way to expand arrays up to the limits of their integer*4 array subscripts, without significant code reworking for 64-bit.
I could be wrong, but the other compiler developers appear to say "Use modules and allocated arrays" but ignore all the existing F77 codes that could benefit from larger working arrays in COMMON. There is an arrogance that old Fortran users don't know how to use Fortran, when all they want to do is disguise Fortran as some variant of C.

Thanks again for the update.

(note: my definition of flop is multiply operations per second. If I include add then all the mflop reports double)
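The counting convention in the note above can be made concrete with a small timing sketch. This is not John's benchmark: the program name, array size and repeat count are arbitrary choices for illustration, and DOT_PRODUCT is used only because each pass performs n multiplies and n adds.

```fortran
! Sketch: mflops counted as multiplies only versus multiplies plus adds.
program mflops_demo
   implicit none
   integer, parameter :: n = 1000000, nrep = 100
   double precision, allocatable :: a(:), b(:)
   double precision :: s, seconds, mflops_mult
   integer :: t0, t1, rate, k

   allocate (a(n), b(n))
   call random_number(a)
   call random_number(b)

   call system_clock(t0, rate)
   if (rate <= 0) stop 'no system clock available'
   s = 0d0
   do k = 1, nrep
      a(1) = 1d0 / dble(k)             ! vary the data so each pass must be computed
      s = s + dot_product(a, b)        ! n multiplies and n adds per pass
   end do
   call system_clock(t1)
   seconds = max(dble(t1 - t0) / dble(rate), 1d-6)   ! guard against a zero interval

   mflops_mult = dble(n) * dble(nrep) / seconds / 1d6
   write (*,*) 'checksum            = ', s            ! uses the result so the loop is kept
   write (*,*) 'mflops (mult only)  = ', mflops_mult
   write (*,*) 'mflops (mult + add) = ', 2d0 * mflops_mult
end program mflops_demo
```

The second figure is exactly twice the first, which is the doubling described in the note; the absolute numbers depend on the compiler, options and cache behaviour.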
All times are GMT + 1 Hour