forums.silverfrost.com Welcome to the Silverfrost forums
mecej4
Joined: 31 Oct 2006 Posts: 1886
|
Posted: Tue Feb 24, 2015 4:15 pm Post subject: |
|
|
Paul's response addresses one of the two related problems that we reported. The other problem, i.e., a cost of over 10,000 CPU cycles to handle an underflow interrupt with default settings, remains to be addressed.
More poking around has caused me to stumble on another work-around. Isolate the subprograms with the highest incidence of underflows. Compile the corresponding source file(s) separately, build a DLL from them and export the subroutine/function entry points. In your application, instead of linking with the OBJ file(s) for those routines, link with the DLL (SLink can link directly with DLLs). Here is a test driver:
Code: |
program tstvecsse
   implicit none
   integer, parameter :: N=Z'0FFFF'
   double precision C,X(N),Y(N),Z(N),U(N),difnrm,vec_sum_sse
   external vec_add_sse
   integer ucnt,it
   C=-acos(-1d0) ! use negative C to do VEC_SUB using VEC_SUM
   do it=1,100
      call Random_Number(X)
      call Random_Number(Y)
      X=X*9D-307 ! set up to make trouble with underflows
      Y=Y*9D-307
      Z=Y+C*X ! F90 vector operation
      call vec_add_sse(Y,X,C,N) ! SSE in assembler
      U=Y-Z ! difference
      difnrm=sqrt(vec_sum_sse(U,U,N))
      if(difnrm > 1d-12)write(*,*)it,difnrm
   end do
   call underflow_count@(ucnt)
   write(*,*)' Underflow count = ',ucnt
end program
|
I compiled this driver and tested it by (i) linking with the OBJ file for DavidB's SSE BLAS routines ( http://forums.silverfrost.com/viewtopic.php?t=2176&postdays=0&postorder=asc&start=75, page 6, posted June 2, 2012 ), and (ii) with a DLL built from that OBJ file, as described above. The run times:
1. OBJ used : about 3 million underflows, 23 seconds run time
2. DLL used : 0 underflows reported, 0.9 second run time.
My suspicion is that the DLL initialization sets up a dummy "ignore and proceed" handler for FPU underflows, whereas a slow handler is included in the program initialization section of the EXE. The DLL startup code probably does not look for and process the SALFENVAR environment variable, either. |
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7925 Location: Salford, UK
|
Posted: Tue Feb 24, 2015 5:36 pm Post subject: |
|
|
There may be some advantage to using SSE instructions but the primary question may relate to the handling of underflows.
It may be that the third party compiler does not handle underflows and hence has no associated overhead (or maybe the SSE instructions do not generate underflows). This raises the question: is it safe to ignore underflows? The authors of FTN95 took the view that, for safety, underflows need to be handled (by clearing the relevant processor flags and setting the value to zero). Having said this, the cost does appear to be unduly high, which suggests that there is more to it than this.
I am not sure that these issues can be resolved in the short term, and they disappear with 64-bit FTN95, which for now is our primary focus.
For those who would prefer not to wait, run times might be reduced by programming out the underflows before they happen. |
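A minimal sketch of what "programming out" underflows could look like in standard Fortran, assuming the very small values carry no useful information for the application. The module name flush_small, the routine flush_to_zero and the threshold sqrt(tiny) are illustrative choices, not something FTN95 or this thread prescribes. Entries below the threshold are replaced by exact zero, so the product of two surviving entries cannot fall below tiny(1.0_dp). This only guards products; sums and differences of values near tiny are not covered.

```fortran
module flush_small
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Set every entry with magnitude below floor_val to exact zero.
  pure subroutine flush_to_zero(v, floor_val)
    real(dp), intent(inout) :: v(:)
    real(dp), intent(in)    :: floor_val
    where (abs(v) < floor_val) v = 0.0_dp
  end subroutine flush_to_zero
end module flush_small

program demo_flush
  use flush_small
  implicit none
  real(dp) :: x(4), y(4), floor_val
  ! Illustrative threshold (an assumption): two values at or above
  ! sqrt(tiny) multiply to at least tiny, i.e. a normalised number.
  floor_val = sqrt(tiny(1.0_dp))
  x = (/ 1.0_dp, 1.0e-200_dp, -3.0e-250_dp, 2.5_dp /)
  y = (/ 2.0_dp, 4.0e-180_dp,  1.0e-100_dp, 1.0e-300_dp /)
  call flush_to_zero(x, floor_val)
  call flush_to_zero(y, floor_val)
  write (*,*) x*y      ! products involving flushed entries are exact zeros
end program demo_flush
```

The price is that perfectly normal but small numbers (1.0e-300 above) are also discarded, so the threshold has to be chosen with the scaling of the real data in mind.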
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Tue Feb 24, 2015 9:59 pm Post subject: |
|
|
Paul,
You stated: Quote: | The immediate problem occurs when WRITE is used for an INTEGER*8 value. |
I thought the problem appears when writing a real*8 after writing an integer*8. I tried a very cut-down version of this, but it did not fail, so I suspect the problem is more complex.
I also noted in an earlier post that the onset of the problem could be delayed by the use of format statements.
I hope this problem can be located.
Thanks for your assistance.
John |
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7925 Location: Salford, UK
|
Posted: Wed Feb 25, 2015 8:23 am Post subject: |
|
|
John
As I understand it, the problem presented in the code provided has been identified as occurring with WRITE and INTEGER*8.
If there is another problem then we would need different code that illustrates this. |
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Wed Feb 25, 2015 12:55 pm Post subject: |
|
|
Paul,
The error report I am getting with fpe_example.zip is:
Code: |
Runtime error from program:c:\temp\forum\fpe\prof_und.exe
Floating point stack fault
Floating point stack fault at address 10100cff
10100c8b IO_convert_long_double_to_ascii [+0074]
1010d0a2 R_WSF_main [+07e9]
1010d03a D8__WSF [+0024]
SK_ZERO - in file gtran.f90 at line 122 [+07f8]
GAUSSEAN_REDUCTION - in file gtran.f90 at line 80 [+1109]
DO_TESTS - in file prof.f90 at line 46 [+01cc]
PROFILE_V6 - in file prof.f90 at line 30 [+0092]
0040406a SALFStart [+06ff]
eax=00004005 ebx=0008fb8e ecx=00000001
edx=fffffffa esi=101dff0c edi=00000000
ebp=0360cf38 esp=0360cedc IOPL=0
ds=002b es=002b fs=0053
gs=002b cs=0023 ss=002b
flgs=00010203 [CA OP NZ SN DN NV]
10100cff qfld [esi]
10100d01 fmulp st(1)
10100d03 add esi,0xa
|
Line 122 is a real*8 write, which is preceded by an integer*4 write:
Code: |
write (*,*) 'Smallest diagonal =',d
|
In the previous longer example, the error occurred at a real*8 write following an integer*8 write, while earlier real*8 writes were successful.
There had been previous integer*8 writes in file gtran.f90 at line 44. In the larger example, the equivalent of line 44 included both real*8 and integer*8 and did not fail.
The location where the error occurs has changed in the cut-down example, which can happen with a stack corruption problem.
I hope this clarifies my earlier comments.
John |
|
mecej4
Joined: 31 Oct 2006 Posts: 1886
|
Posted: Wed Feb 25, 2015 1:17 pm Post subject: |
|
|
John, as of now it appears that your best recourse is to (i) avoid using the environment variable SALFENVAR during compilation or execution, (ii) compile using the options /opt /p6, leaving out sse_lib.f90 (see below for its replacement), (iii) link with the command slink *.obj sse_sal.dll /out:profile, and (iv) run with /l:5 as much as possible.
I think that at this time your test codes have fulfilled their purpose. Using the lessons learned from this exercise, you could go ahead and modify your real FEM application to call the routines within SSE_SAL.DLL. If the FEM code needs additional BLAS routines, SSE2 versions of those could be prepared.
The assembly source for sse_sal.dll, the assembled OBJ file and a DLL built from it are contained in https://dl.dropboxusercontent.com/u/88464747/sse_sal.zip . The DLL will serve as the replacement for sse_lib.f90. I derived the assembly source from DavidB's source file, and modified it to remove all X87 instructions except the FLD instructions mandated by the Microsoft 32-bit ABI for functions that return float/double values.
If you follow this procedure, the only underflows that will remain a problem will be those generated from code within your Fortran sources.
Following this procedure, I obtained the following timing results for /l:5 /m:-1, with various values of /d:nnn.
d t_gauss t_red1b
- ------- -------
2 16.85 11.07
3 17.40 11.29
4 17.68 11.37
5 17.98 11.45
6 17.83 11.53
These times do not exceed twice the values that I was able to obtain with optimizing compilers. |
|
DanRRight
Joined: 10 Mar 2008 Posts: 2816 Location: South Pole, Antarctica
|
Posted: Wed Feb 25, 2015 5:52 pm Post subject: |
|
|
"These times do not exceed twice the values that I was able to obtain with optimizing compilers"
Mecej4, I am still curious why your tests show a factor of two to three loss versus other compilers. If you look at the Polyhedron benchmark
http://www.polyhedron.com/fortran-compiler-comparisons/fortran-execution-time-benchmarks-64-bit-windows-7-intel-core-i5-2500k
you can see that sometimes (and specifically on linear algebra, which is what John is doing) FTN95 performance is within ten to twenty percent of the top. Can you guys find where specifically this compiler loses steam versus other compilers? FFT is also as fast as the others, which means that the compiler can go very fast.
There are a few examples where the loss of speed does indeed reach an outrageous 700%, and one example is 30x (!!!) slower. Was it due to underflows? Other terrible things? The site has the Fortran source for all benchmarks, so please look if you have time. Is it possible to make a few short test examples for Silverfrost to focus on in optimization? Are the speedups of the other compilers due to SSE? If not, then high precision timers can easily identify the slow part of the code, which is typically very small. I remember that when Intel became very fast in the Polyhedron benchmarks a decade ago, Lahey and Absoft dramatically improved the speed of their compilers within just a few weeks or months.
You could probably contribute to the TEST_FPU2 test there, which is also a set of different linear algebra methods for different matrix sizes, by adding an SSE2 option.
Paul, what is the primary focus in developing the 64-bit compiler: is it mostly to get a larger address space while keeping FTN95's core features and speed intact, to add 2003/2008 features, or to improve overall performance (as on those Polyhedron tests)? As a comment, it was the right decision of the developers to treat underflow as an error. Like /undef, an underflow can hint at a hidden error. But they should also provide a switch to ignore underflow.
Last edited by DanRRight on Wed Feb 25, 2015 9:30 pm; edited 9 times in total |
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7925 Location: Salford, UK
|
Posted: Wed Feb 25, 2015 6:15 pm Post subject: |
|
|
John
It is quite possible that this fault arises because the numbers being output are no longer valid. For example IO_convert_long_double_to_ascii may not be designed to handle denormals. |
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Thu Feb 26, 2015 12:20 am Post subject: |
|
|
Paul,
The numbers being output are expected to be valid. In earlier cases of this error, I have previously printed out the variables that are used to calculate the real*8 value.
Now, the runs without SALFENVAR=MASK_UNDERFLOW give an indication of the expected value.
You could even place the following before line 122:
Code: |
if ( d < 99. .or. d > 100. ) then
   write (*,*) 'unexpected value of d'
else
   write (*,*) 'value of d is in range'
end if
|
If the number becomes invalid, there is something unexplained that is causing this result.
John |
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7925 Location: Salford, UK
|
Posted: Thu Feb 26, 2015 8:44 am Post subject: |
|
|
John
It seems to me that we could spend a lot of time investigating these matters for you but it would be at the expense of focusing on porting to 64 bit.
There is an underlying issue here which needs to be investigated, namely the apparent high cost of handling underflows in the default unmasked state. The questions for us to consider are: a) is this an issue that widely affects users or is it specific to this type of computation, and b) if it is widespread, can the overhead be reduced?
Refining 32-bit FTN95 so that it becomes generally robust when underflows are masked really does not seem to be a sensible way forward. Hopefully we are not that far from 64-bit FTN95 where (I understand) this will not be an issue.
In the meantime I will log the above two questions for investigation. |
|
mecej4
Joined: 31 Oct 2006 Posts: 1886
|
Posted: Thu Feb 26, 2015 3:26 pm Post subject: |
|
|
Paul, (I hope that it is not inappropriate to ask!), should we expect in FTN95-64 the IEEE modules and intrinsic functions that are described in the Fortran 200X standards? If the answer is "yes", it would be possible to query and control FP exceptions in a standard way, instead of using functions such as UNDERFLOW_COUNT@, etc.
Last edited by mecej4 on Fri Feb 27, 2015 1:09 am; edited 1 time in total |
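For reference, a minimal sketch of the standard approach, assuming a compiler that provides the intrinsic module ieee_exceptions from Fortran 2003 (whether FTN95-64 will do so is not settled in this thread). The module name underflow_probe and the function product_underflows are hypothetical; the ieee_* procedures and ieee_underflow are from the standard. The function clears the underflow flag, forms a product and then queries the flag, which is the standard counterpart of polling UNDERFLOW_COUNT@.

```fortran
module underflow_probe
  use, intrinsic :: ieee_exceptions
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Return .true. if forming a*b raises the IEEE underflow flag.
  function product_underflows(a, b) result(raised)
    real(dp), intent(in) :: a, b
    logical :: raised
    real(dp), volatile :: p   ! volatile: discourage removal of the product
    call ieee_set_flag(ieee_underflow, .false.)   ! clear the sticky flag
    p = a*b
    call ieee_get_flag(ieee_underflow, raised)
  end function product_underflows
end module underflow_probe

program demo_ieee
  use underflow_probe
  implicit none
  ! Request non-stop (no trap) behaviour on underflow where supported.
  if (ieee_support_halting(ieee_underflow)) then
     call ieee_set_halting_mode(ieee_underflow, .false.)
  end if
  write (*,*) 'tiny*1d-5 raises underflow: ', &
       product_underflows(tiny(1.0_dp), 1.0e-5_dp)
  write (*,*) '2*3 raises underflow:       ', &
       product_underflows(2.0_dp, 3.0_dp)
end program demo_ieee
```

Aggressive optimisation may evaluate such products at compile time, in which case no flag is raised at run time, so the result should be checked with the compiler options actually in use.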
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7925 Location: Salford, UK
|
Posted: Thu Feb 26, 2015 7:43 pm Post subject: |
|
|
The initial aim is to port the existing FTN95 release mode to 64 bits but we can keep this request in mind. |
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Fri Feb 27, 2015 5:16 am Post subject: |
|
|
mecej4 stated Quote: | If the FEM code needs additional BLAS routines, SSE2 versions of those could be prepared. |
I am sure that a BLAS library that can be linked into FTN95_32 and FTN95_64 that uses SSE, SSE2 or AVX instructions would be a very welcome improvement.
Even an SSE replacement for the DOT_PRODUCT and SUM intrinsics would be a very welcome addition (it could live in salflibc.dll).
Often it is only 1 or 2 inner DO loops that require SSE to make things better, although there is always the problem that slight changes are required. Then there is !$OMP. It never ends.
There are a number of FTN95 users who use other compilers for compute intensive programs to avoid the x87 deficiencies. In my work, there are only a few vector instructions that I require. Some sort of BLAS.DLL or blas.lib functionality distributed with FTN95 should please these users.
John |
|
mecej4
Joined: 31 Oct 2006 Posts: 1886
|
Posted: Wed Jan 06, 2016 3:52 am Post subject: |
|
|
Here is an update on the handling of underflows in John Campbell's test program in the newly released FTN95 64-bit Beta-3.
Using the modified source files in https://www.dropbox.com/s/0zc3khiou4lvywx/jctest.zip?dl=0 , which do not use any FTN95 extensions and non-standard functions, I find that the problem is completely done away with by using the new 64-bit compiler.
Code: |
FTN95 7.2 32-bit /opt /p6 : 316. s
FTN95 8.0 Beta-3 64-bit : 1.7 s * different CPU
Gfortran 4.9, -O2 : 7.9 s
Ifort 16, -O2 : 2.4 s
Absoft 16, -Ofast : 11.0 s
|
Except for the FTN95 Beta-3 result, which was obtained on a newer CPU by a friend, the timings were obtained on a laptop with a T4300 CPU. For comparison purposes, you may double the second duration to 3.4 s.
Impressive performance from FTN95 8.0 Beta-3, so congratulations to the Silverfrost team! Note that when the new compiler is released it will allow /opt, which the Beta-3 version does not.
John (Campbell), if you read this and you have the Beta-3 compiler, please run all the tests on a single PC and share the results. |
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Wed Jan 06, 2016 7:54 am Post subject: |
|
|
mecej4,
Thanks for the update.
I have reviewed your tests and confirm that ftn95 /64 does not have the fpe slowdown problem. I tested both beta_2 and now beta_3.
This is a real improvement with /64.
Unfortunately I don't share your euphoria, as the floating point performance of ftn95 /64 is not very good.
for beta_2 and /p6 /opt (/32) I get 0.6 mflops ( fpe problem )
for beta_2 and /64 I get 141 mflops
for beta_3 and /64 I get 152 mflops
for gFortran 4.92 -O3 -ffast-math I get 262 mflops
This is a "bit" misleading as gauss_old.f90 is the old version that has very poor cache usage.
If you correct the cache usage, by using gauss_tran.f90, then gFortran should change to about 2,000 mflops, while ftn95 /64 changes to about 300 mflops (it has been a while since I did these tests)
Using FTN95 /32 and the SSE routines, these get about 1,800 mflops.
Using FTN95 /32 /opt and no fpe errors, these get about 950 mflops.
Using FTN95 /32 /debug and no fpe errors, these get about 500 mflops.
It appeared to me that /64 performance was comparable to /32/debug, which is not good.
I have been asking for some SSE vector routines for /64 to fix this problem.
I do find FTN95 to be a significant step forward in other areas. I have been using /64 on my clearwin+ programs and am very impressed with the improvements that extra memory provide.
I consider FTN95 /64 handling of COMMON as a significant step forward, in comparison to the other 64-bit Fortran windows compilers I have tried, which appear to have a 2gb COMMON limit. FTN95 provides a way to expand arrays up to the limits of their integer*4 array subscripts, without significant code reworking for 64-bit.
I could be wrong, but the other compiler developers appear to say "Use modules and allocated arrays" but ignore all the existing F77 codes that could benefit from larger working arrays in COMMON. There is an arrogance that old Fortran users don't know how to use Fortran, when all they want to do is disguise Fortran as some variant of C.
Thanks again for the update.
(note: my definition of flops counts only multiply operations per second. If I include adds then all the mflops reports double) |
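To make the counting convention in the note concrete, here is a small sketch in standard Fortran. The module name flop_rate, the function mflops and the timing loop are illustrative and are not John's benchmark code; a dot product of length n is taken to cost n multiplies and roughly n adds.

```fortran
module flop_rate
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
contains
  ! Millions of operations per second for n_ops operations in 'seconds'.
  pure function mflops(n_ops, seconds) result(rate)
    integer(i8), intent(in) :: n_ops
    real(dp),    intent(in) :: seconds
    real(dp) :: rate
    if (seconds > 0.0_dp) then
       rate = real(n_ops, dp) / seconds / 1.0e6_dp
    else
       rate = 0.0_dp   ! timer too coarse to measure this run
    end if
  end function mflops
end module flop_rate

program demo_mflops
  use flop_rate
  implicit none
  integer, parameter :: n = 100000, reps = 200
  real(dp) :: x(n), y(n), s, secs
  integer(i8) :: t0, t1, ticks_per_sec, n_mult
  integer :: k
  call random_number(x)
  call random_number(y)
  s = 0.0_dp
  call system_clock(t0, ticks_per_sec)
  do k = 1, reps
     s = s + dot_product(x, y)
  end do
  call system_clock(t1)
  secs = real(t1 - t0, dp) / real(ticks_per_sec, dp)
  n_mult = int(n, i8) * int(reps, i8)
  write (*,*) 'checksum            ', s
  write (*,*) 'mflops (multiplies) ', mflops(n_mult, secs)
  write (*,*) 'mflops (mult + add) ', mflops(2*n_mult, secs)
end program demo_mflops
```

The second figure is exactly twice the first, which is the doubling mentioned in the note; when comparing mflops numbers from different sources it is worth checking which convention was used.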
|