Topic: Slow handling of FPU underflow interrupts in SALFLIBC.DLL in Support

mecej4

Posts: 1914

Back to Top

18 Feb 2015 12:40 (Edited: 18 Feb 2015 2:33) #15673

Here is a small code that displays the huge slowdowns caused by underflow interrupts occurring in DAXPY type code in Fortran. These slowdowns occur when underflow exceptions are handled in the default mode of code compiled with FTN95. Most of the issues were written up in another long thread, see https://forums.silverfrost.com/Forum/Topic/2171&postdays=0&postorder=asc&start=0, but here is a compact test code.

program xundfl
implicit none
integer, parameter :: NMAX=2000, NREP=50000
double precision, dimension(NMAX) :: X,Y,Z
double precision A,t1,t2
integer :: i,k,n,cnt1,cnt2

open(15,file='vecsub.bin',form='unformatted',status='old')
read(15)n,a,(x(i),y(i),i=1,n)
close(15)

call dclock@(t1)
call underflow_count@(cnt1)
do k=1,NREP
   do i = 1,n
      z(i) = y(i) - a*x(i)
   end do
end do
call underflow_count@(cnt2)
call dclock@(t2)
write(*,10)n,t2-t1,(cnt2-cnt1)/NREP
10 format('Vectors of size ',I4,2x,' time = ',f6.3,' ufls = ',i4)
end program

The data file for the test code is an unformatted Fortran data file that contains two double precision vectors x and y of length n =1679, and the test program performs the operation z = y - a x fifty thousand times. You can download the file (27 KB) from the public link https://dl.dropboxusercontent.com/u/88464747/vecsub.bin .

Compile with /opt /p6 and link. Run the program and record the output. Set the environment variable SALFENVAR=MASK_UNDERFLOW, and run again. You will see something similar to the following:

s:\\lang\\JCampBell\\SAL>set SALFENVAR=

s:\\lang\\JCampBell\\SAL>xundfl
Vectors of size 1679   time = 33.860 ufls =   88

s:\\lang\\JCampBell\\SAL>set SALFENVAR=MASK_UNDERFLOW

s:\\lang\\JCampBell\\SAL>xundfl
Vectors of size 1679   time =  0.688 ufls =    0

From these results we can estimate the time spent per one execution of the FPU underflow interrupt handler to be the equivalent of about 12,000 CPU cycles, which agrees with a similar estimate made in John Campbell's thread linked above, where the program can take hours to run with the default underflow processing versus about 20 seconds with SSE2 code.

mecej4

Posts: 1914

Back to Top

18 Feb 2015 2:28 #15675

Here is an even simpler example, for those who may be a bit worried about downloading a binary file from an 'unknown' person. This example does not need a data file, and does not do a DAXMY operation.

module globvars
logical :: uflseen = .false.
integer :: count = 0
end module

program testunderflow
use globvars
implicit none
integer i,k
!
real :: x,y,dx,s
double precision t1,t2,c1,c2
!
call dclock@(t1)
c1=cpu_clock@(c1)
do k=1,200000
   x=1e-30
   y=x+1e-36
   s=0
   do i=1,20
      dx=y-x
      s=s+dx
   !   write(*,*)i,dx
      y=y*0.5d0
      x=x*0.5d0
   end do
end do
call dclock@(t2)
call underflow_count@(count)
write(*,10)'count = ',count,' time = ',t2-t1
c2=cpu_clock@()
write(*,20)'CPU cycles used = ',c2-c1,',  cycles/underflow = ',(c2-c1)/count
10 format(1x,A,I20,3x,A,2x,F8.3,' s')
20 format(1x,A,EN10.2,A,F8.0)
end program

Make two runs, one with SALFENVAR set to MASK_UNDERFLOW, and one without.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

18 Feb 2015 4:19 #15676

Interesting. Why

call mask_underflow@()

does not do the same thing?

mecej4

Posts: 1914

Back to Top

18 Feb 2015 5:53 #15677

Exception handling is an arcane and complicated art. Interrupts/traps do not follow the orderly flow of logic in a program. If an exception is unmasked by calling a service routine within the program, the exception flags have to be restored before calling a library routine and when that routine returns. There may be OS conventions about saving exception control registers, setting your desired values and restoring the old values at subprogram return or other places. Conventions are also needed as to who (the user program or library routine that caused the trap, the OS, the default signal handler or the user-supplied replacement handler) is responsible for doing the saves and restores.

Finally, when your exception handler decides to increment a count of , say, underflow, it has to supply a reasonable value (such as zero) for the result of the expression whose evaluation caused the exception, return to the point in the user program or library routine where the exception was taken, and restore the program state; moreover, this cannot be done directly but through the OS and the compiler's runtime, and requires facilities and cooperation from the OS.

A scholarly article about this topic: https://docs.oracle.com/cd/E19957-01/806-3568/ncg_handle.html .

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

18 Feb 2015 10:49 #15681

One more interesting thing: when i changed env.variables as above the small code above was compiled OK but compilation of my own ones started producing this error

*** Internal compiler error - floating point exception

Only removing env.variable and rebooting computer helped...

mecej4

Posts: 1914

Back to Top

18 Feb 2015 11:11 #15682

Interesting.

The compiler itself depends on Salflibc.DLL. If you have a constant expression of type REAL in your program, the compiler may invoke FPU routines in the DLL to evaluate them once for all, and put the resulting value into the OBJ and EXE. If the code in the DLL checks the environment variable, and the evaluation of the expression causes an FPU exception, and the compiler expects exceptions to be masked and handled in the default way, it may get stuck.

I am speculating, of course, since I have no real information about the innards of the DLL.

JohnCampbell

Posts: 2526 Sydney

Back to Top

19 Feb 2015 12:47 #15683

Mecej4,

Thanks very much for the simplified example of the problem.

I enhanced the first example above to: include the use of MASK_UNDERFLOW@, and 'estimate' the delay due to floating point exception handling

the revised program is: program xundfl implicit none integer, parameter :: NMAX=2000, NREP=50000 double precision, dimension(NMAX) :: X,Y,Z double precision A,t1,t2 integer :: i,k,n,cnt1,cnt2, pass, nfpe real4 :: op_count, mflops real10 :: c1,c2

 open (15,file='vecsub.bin',form='unformatted',status='old') 
 read (15) n,a,(x(i),y(i),i=1,n) 
 close(15)
 write (*,*) 'n =',n

 do pass = 1,2
    write (*,*) ' '
    if (pass==2) then
      write (*,*) 'Mask Underflow enabled'
      call MASK_UNDERFLOW@
    end if
    
    call dclock@ (t1)                ! elapsed time (seconds)
    c1 = cpu_clock@ ()               ! elapsed time (CPU cycles)
    
    call underflow_count@(cnt1) 
    do k=1,NREP 
       do i = 1,n 
          z(i) = y(i) - a*x(i) 
       end do 
    end do 
    call underflow_count@(cnt2) 
    call dclock@ (t2) 
    c2 = cpu_clock@ ()
!
    c2       = c2 - c1                     ! clock cycles
    t2       = t2 - t1                     ! elapsed seconds
    nfpe     = cnt2 - cnt1                 ! number of FPE
    op_count = float(nrep) * float(n) * 2  ! floating operation count
    mflops   = op_count / t2 / 1.e6        ! million floating point operations per second
!
    write (*,10) n, t2, nfpe/NREP 
    write (*,*) ' mflops =',mflops
    if ( nfpe > 0) &
    write (*,*) ' Floating point exception takes', c2/float(nfpe),' cycles'
 end do ! pass
 10 format('Vectors of size ',I4,2x,' time = ',f6.3,' ufls = ',i4) 
 
 end program

The batch file I am using is: ftn95 xundfl /link set SALFENVAR=

xundfl 

set SALFENVAR=MASK_UNDERFLOW 

xundfl 

ftn95 xundfl /p6 /opt /link
set SALFENVAR= 

xundfl 

set SALFENVAR=MASK_UNDERFLOW 

xundfl

Unfortunately I an not generating the other reported errors *** Internal compiler error - floating point exception floating point stack fault in IO_convert_long_double_to_ascii

I am estimating the FPE takes about 22,000 CPU cycles to process !! The next stage will be to get statistics on how frequently this is happening in real programs.

John

ps: I would not recommend the use of dclock@ (based on GetTickCount) as it is accurate to only 64 ticks per second. The intrinsic System_Clock (based on QueryPerformanceCounter) is a more accurate timer with FTN95. real*10 cpu_clock@ is based on RDTSC and typically provides the time in processor cycles.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

19 Feb 2015 7:15 (Edited: 19 Feb 2015 9:44) #15684

John Are you getting the compiler crash too or not? I misunderstood.

I was changing env variable in ControlPanel/System/Settings and seems changing it back by removing the line did not work until computer reboot.

By the way if you remember it was these damn underflows which caused the crash of the running code if you do multitasking jalih' s or Paul's way. Masking them by calling Silverfrost's @ utility helped to avoid crashes, that was good suggestion by Paul.

JohnCampbell

Posts: 2526 Sydney

Back to Top

19 Feb 2015 7:59 #15685

Dan,

I am setting the environment variable in the batch file or at the cmd.exe prompt.

I did not get the compiler error response using Ver. 7.10.0, that you have reported. When I run my profile_v6.exe program, for both gauss reduction and skyline solution, the program: runs the gauss reduction reports the results runs the skyline reduction produces the error during the report when attempting to write real*8 values.

At first it failed with 'write (,) 'mflop =', mflop' so I replaced ',*)' with a format statement number and went further, but further into the reporting the use of format statement numbers also failed. It looks like there is a stack corruption when there are lots of FPE's. (looks can be deceiving!)

I am going to try the 'call MASK_UNDERFLOW@' in the program and see if that changes the behaviour. To test if it is related to the number of FP exceptions, I shall try a range of /d:n options which modifies the number of FPE's being generated.

It looks like my profile_v6.exe program is being slowed down by FPE's. I now need to test the FPE count in other programs that have slow DAXMY performance.

John

mecej4

Posts: 1914

Back to Top

19 Feb 2015 12:07 #15687

DanRRight wrote: 'Only removing env.variable and rebooting computer helped...'. If you are working from a command line, you can easily change the local environment with a one line statement. Changing the environment in the Windows System:Advanced:Environment panel makes the changes apply to [u:2ab3e1ea20]all[/u:2ab3e1ea20] applications from then on, and is probably [u:2ab3e1ea20]not a good idea[/u:2ab3e1ea20]. Furthermore, if you are working in a command window and change the environment in the control panel, the changes do not take effect in the command window.

When you think that you want to run programs with frequent changes to environment variables, as we did in this thread, it is safest to do so within a batch file which changes the environment as desired and then runs your program(s) in the changed environment. When your program is run from the batch file and quits, the batch file also quits and the old environment is 'restored'. Following this procedure also keeps you safe from the changed environment causing disruptions to subsequent compilations, as you experienced. You can have even finer control over changes to the environment within a batch file using the builtin SetLocal...EndLocal commands.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

19 Feb 2015 3:52 #15690

I cornered the problem of internal compiler crash in relatively short piece of code which probably Paul has to look at closely. because if we will start using masking routinely we will get eventually this problem. But first please confirm that you see the effect. The fortran file has two subroutines written 30 years ago which worked OK all the time. Together they produce compiler crash if underflows masked. If compiled separately they end up OK.

The ZIP file contains fortran code and batch file which compiles with and without masking underflows (thanks John and mecej4 for the hint)

When you run the batch it produces two files zzzMask and zzzNotMask where compilation process is logged. At the end of them you will see either compilation was OK or failed. Here is the link on Microsoft Onedrive website

https://onedrive.live.com/?cid=51A5F9B536E1368D&id=51A5F9B536E1368D!105

mecej4

Posts: 1914

Back to Top

19 Feb 2015 4:31 #15691

I can reproduce the compile time error. The error message came from the compiler rather than SALFLIBC.DLL.

I observe that in your program there is quite a bit of mixing of REAL4 and REAL8 variables and constants in expressions. For example, line 370 has

      if(aIz_09.le.0d0) aIz_09 = 1.d-100

This line of code does nothing useful if the previous value of aIz_09 is zero, because aIz_09 is a REAL4 variable, and conversion of 1.d-100 to REAL4 causes underflow. If underflow is masked, the conversion just yields zero. The right hand side expression in this statement is a constant and the compiler probably tried to convert and save the value as a REAL*4, since that is the type of the variable on the left hand side.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

19 Feb 2015 6:15 #15692

Thanks Mecej4, i think it was good explanation. Changing this variable to real*8 solved the problem.

The only what FTN95 missing in this case is to diagnose such things with exact position and clear message, not just crash usually in the remote place from the offending subroutine. That type of errors will be hard to anticipate by the programmer.

But have you noticed that if there is no second subroutine in the fortran file which is clean the error does not appear? How this can be explained?

mecej4

Posts: 1914

Back to Top

19 Feb 2015 6:22 #15693

I am glad that you were able to fix the problem based on my suspicions.

I agree that the compiler should issue a more helpful message, but often programs (including compilers) are not given the capability to respond to exceptions, and therefore they give give cryptic messages, or fail with not even a cryptic message, when they unexpectedly find themselves facing an exception.

The response of a compiler to a program with errors is often unpredictable.

But have you noticed that if there is no second subroutine in the fortran file which is clean the error does not appear? I could not understand that sentence. Will you please rephrase it?

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

19 Feb 2015 7:34 #15694

My wish always was that all recent generations of compilers must have to be way more verbose and suggestive. User-fiendly is another word. Specifically Fortran has to be very simple and user-friendly if we want it to be actively accepted by the new programmers

Returning to this error, hopefully that was right guess and not some other devilry 😃. The question i asked in previous messages was this: the R.FOR file has two subroutines. The second is fine and has no problems. All the problems are in the first one. But the compiler crash only happens if both are in the same file like it is now. So if i separate two these subroutines there is no compiler crash. Do you see that too and can you explain this devilry?

mecej4

Posts: 1914

Back to Top

20 Feb 2015 12:47 #15695

Curiouser and curiouser! Create a single file with the two subroutines in reverse order compared to your first version, and with SALFENVAR=MASK_UNDERFLOW set, no crash occurs.

s:\\FTN95> fsplit.exe r.f
 Writing program unit: IONEQ100.f
   Starting line:  6
   Ending line  :  522
 Writing program unit: IONEQ.f
   Starting line:  526
   Ending line  :  567

s:\\ftn95>copy IONEQ.f+IONEQ100.f r1.f
s:\\ftn95>set salfenvar=MASK_UNDERFLOW
s:\\ftn95> ftn95 r1.f

I stripped down your code to a 39-line reproducer and posted it in a compiler bug report at http://forums.silverfrost.com/viewtopic.php?p=17331.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

20 Feb 2015 11:15 #15697

Thanks, Mecej4, great, one more pretty subtle bug will be diagnozed by the compiler

mecej4

Posts: 1914

Back to Top

28 Feb 2015 2:29 #15772

Here is a short program, instrumented to measure the response time of the FTN95/SALFLLIBC.DLL underflow exception handler.

program test_underflow
implicit none
integer i,nund,ucount
!
real :: x,y,dx
integer*8 cnt1,cnt2,dt,sdt,rdtsc
external rdtsc
!
sdt=0
nund=0
x=1e-30
y=x+1e-34
write(*,*)'  i         x            y          y-x           Cycle_count'
write(*,*)
           ! successively halve x and y, take difference
do i=1,20
   cnt1=rdtsc()
   dx=y-x
   cnt2 = rdtsc()
   dt=cnt2-cnt1
   if(dt .gt. 1000)then
      write(*,77)i,x,y,dx,dt
      nund=nund+1
      sdt=sdt+dt
   endif
   y=y*0.5d0
   x=x*0.5d0
end do
           ! successively double x and y, take difference
write(*,*)
do i=21,40
   y=y*2.d0
   x=x*2.d0
   cnt1=rdtsc()
   dx=y-x
   cnt2 = rdtsc()
   dt=cnt2-cnt1
   if(dt .gt. 1000)then
      write(*,77)i,x,y,dx,dt
      nund=nund+1
      sdt=sdt+dt
   endif
end do
write(*,*)
write(*,78)nund,float(sdt)/nund
call underflow_count@(ucount)
write(*,79)ucount
77 format(1x,i3,2x,3ES13.5,2x,i15)
78 format(1x,i3,2x,' underflows, average cycles/underflow = ',F7.0)
79 format(1x,i3,2x,' underflows reported by underflow_count@')
end program

Here is the assembly routine for the RDTSC counter:

.686
option dotname
.model flat
public _RDTSC
.text SEGMENT PUBLIC 'CODE'
_RDTSC PROC NEAR
rdtsc
ret
_RDTSC ENDP
.text ENDS
END

Build using (i) ftn95 undfl.f90 (ii) ml /coff /c rdtsc.asm (iii) slink undfl.obj rdtsc.obj. The typical output from the program:

   i         x            y          y-x           Cycle_count

  15    6.10352E-35  6.10413E-35  0.00000E+00           123356
  16    3.05176E-35  3.05206E-35  0.00000E+00           100901
  17    1.52588E-35  1.52603E-35  0.00000E+00            99889
  18    7.62939E-36  7.63016E-36  0.00000E+00            48466
  19    3.81470E-36  3.81508E-36  0.00000E+00            43033
  20    1.90735E-36  1.90754E-36  0.00000E+00            40541

  21    1.90735E-36  1.90754E-36  0.00000E+00            33125
  22    3.81470E-36  3.81508E-36  0.00000E+00            37586
  23    7.62939E-36  7.63016E-36  0.00000E+00            78620
  24    1.52588E-35  1.52603E-35  0.00000E+00            55593
  25    3.05176E-35  3.05206E-35  0.00000E+00            70569
  26    6.10352E-35  6.10413E-35  0.00000E+00            46229

  12   underflows, average cycles/underflow =  64826.
  12   underflows reported by underflow_count@

Note that the cycle counts vary from run to run. When no underflow occurs, the cycle counts are about 50 per iteration of the loop. However, when underflow occurs, the exception handler is run and the cycle count jumps to about 50,000.

PaulLaidler

Posts: 7977 Salford, UK

Back to Top

1 Mar 2015 9:03 #15773

Try taking out the WRITE statements that are within the loop to see what effect this has on the underflow count. It is possible that underflows are also occurring within the standard output.

mecej4

Posts: 1914

Back to Top

1 Mar 2015 9:33 (Edited: 1 Mar 2015 5:22) #15774

Paul, I followed your suggestion, and did not see any change in the underflow counts reported by underflow_count@() with the WRITE statements commented out.

Let me also point out that the two statements that cause underflow are individually bracketed by calls to rdtsc(), so the RDTSC counts printed by the program do not cover any statements other than the two instances of dx = y - x .

By accident, I discovered something else that may help you with tracking down the speed issue: if the test program is compiled with /opt and run, the underflow count remains the same, but the cycle counts used up in underflow processing drop down from about 50000 (without /opt) to about 500 (with /opt) per underflow!

Earlier, when running John Campbell's much larger test code, I had always used /opt, and yet the underflow processing had consumed tens of of thousands of cycles per underflow. These conflicting findings cause me to wonder if, in compiling the larger and more complex code, there were instances of failure to optimize.

I will peek into the machine code generated with and without /opt to obtain some insight on what is happening, but I thought that I should let you know this strange fact without delay.

SDBG does not seem to support assembly level debugging, and the Registers display it can show is limited to CPU registers. On the other hand, debuggers such as WinDBG and the VS debugger do not understand the debug information provided by FTN95. Do you have any suggestions for resolving this?