forums.silverfrost.com Forum Index forums.silverfrost.com
Welcome to the Silverfrost forums
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Slow handling of FPU underflow interrupts in SALFLIBC.DLL
Goto page Previous  1, 2, 3  Next
 
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> Support
View previous topic :: View next topic  
Author Message
mecej4



Joined: 31 Oct 2006
Posts: 1917

PostPosted: Fri Feb 20, 2015 1:47 pm    Post subject: Reply with quote

Curiouser and curiouser! Create a single file with the two subroutines in reverse order compared to your first version, and with SALFENVAR=MASK_UNDERFLOW set, no crash occurs.
Code:
s:\FTN95> fsplit.exe r.f
 Writing program unit: IONEQ100.f
   Starting line:  6
   Ending line  :  522
 Writing program unit: IONEQ.f
   Starting line:  526
   Ending line  :  567

s:\ftn95>copy IONEQ.f+IONEQ100.f r1.f
s:\ftn95>set salfenvar=MASK_UNDERFLOW
s:\ftn95> ftn95 r1.f


I stripped down your code to a 39-line reproducer and posted it in a compiler bug report at http://forums.silverfrost.com/viewtopic.php?p=17331.
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2945
Location: South Pole, Antarctica

PostPosted: Sat Feb 21, 2015 12:15 am    Post subject: Reply with quote

Thanks, Mecej4, great, one more pretty subtle bug will be diagnozed by the compiler
Back to top
View user's profile Send private message
mecej4



Joined: 31 Oct 2006
Posts: 1917

PostPosted: Sat Feb 28, 2015 3:29 pm    Post subject: Reply with quote

Here is a short program, instrumented to measure the response time of the FTN95/SALFLLIBC.DLL underflow exception handler.
Code:

program test_underflow
implicit none
integer i,nund,ucount
!
real :: x,y,dx
integer*8 cnt1,cnt2,dt,sdt,rdtsc
external rdtsc
!
sdt=0
nund=0
x=1e-30
y=x+1e-34
write(*,*)'  i         x            y          y-x           Cycle_count'
write(*,*)
           ! successively halve x and y, take difference
do i=1,20
   cnt1=rdtsc()
   dx=y-x
   cnt2 = rdtsc()
   dt=cnt2-cnt1
   if(dt .gt. 1000)then
      write(*,77)i,x,y,dx,dt
      nund=nund+1
      sdt=sdt+dt
   endif
   y=y*0.5d0
   x=x*0.5d0
end do
           ! successively double x and y, take difference
write(*,*)
do i=21,40
   y=y*2.d0
   x=x*2.d0
   cnt1=rdtsc()
   dx=y-x
   cnt2 = rdtsc()
   dt=cnt2-cnt1
   if(dt .gt. 1000)then
      write(*,77)i,x,y,dx,dt
      nund=nund+1
      sdt=sdt+dt
   endif
end do
write(*,*)
write(*,78)nund,float(sdt)/nund
call underflow_count@(ucount)
write(*,79)ucount
77 format(1x,i3,2x,3ES13.5,2x,i15)
78 format(1x,i3,2x,' underflows, average cycles/underflow = ',F7.0)
79 format(1x,i3,2x,' underflows reported by underflow_count@')
end program

Here is the assembly routine for the RDTSC counter:
Code:

.686
option dotname
.model flat
public _RDTSC
.text SEGMENT PUBLIC 'CODE'
_RDTSC PROC NEAR
rdtsc
ret
_RDTSC ENDP
.text ENDS
END

Build using (i) ftn95 undfl.f90 (ii) ml /coff /c rdtsc.asm (iii) slink undfl.obj rdtsc.obj. The typical output from the program:
Code:

   i         x            y          y-x           Cycle_count

  15    6.10352E-35  6.10413E-35  0.00000E+00           123356
  16    3.05176E-35  3.05206E-35  0.00000E+00           100901
  17    1.52588E-35  1.52603E-35  0.00000E+00            99889
  18    7.62939E-36  7.63016E-36  0.00000E+00            48466
  19    3.81470E-36  3.81508E-36  0.00000E+00            43033
  20    1.90735E-36  1.90754E-36  0.00000E+00            40541

  21    1.90735E-36  1.90754E-36  0.00000E+00            33125
  22    3.81470E-36  3.81508E-36  0.00000E+00            37586
  23    7.62939E-36  7.63016E-36  0.00000E+00            78620
  24    1.52588E-35  1.52603E-35  0.00000E+00            55593
  25    3.05176E-35  3.05206E-35  0.00000E+00            70569
  26    6.10352E-35  6.10413E-35  0.00000E+00            46229

  12   underflows, average cycles/underflow =  64826.
  12   underflows reported by underflow_count@

Note that the cycle counts vary from run to run. When no underflow occurs, the cycle counts are about 50 per iteration of the loop. However, when underflow occurs, the exception handler is run and the cycle count jumps to about 50,000.
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 8283
Location: Salford, UK

PostPosted: Sun Mar 01, 2015 10:03 am    Post subject: Reply with quote

Try taking out the WRITE statements that are within the loop to see what effect this has on the underflow count.
It is possible that underflows are also occurring within the standard output.
Back to top
View user's profile Send private message AIM Address
mecej4



Joined: 31 Oct 2006
Posts: 1917

PostPosted: Sun Mar 01, 2015 10:33 am    Post subject: Reply with quote

Paul, I followed your suggestion, and did not see any change in the underflow counts reported by underflow_count@() with the WRITE statements commented out.

Let me also point out that the two statements that cause underflow are individually bracketed by calls to rdtsc(), so the RDTSC counts printed by the program do not cover any statements other than the two instances of dx = y - x .

By accident, I discovered something else that may help you with tracking down the speed issue: if the test program is compiled with /opt and run, the underflow count remains the same, but the cycle counts used up in underflow processing drop down from about 50000 (without /opt) to about 500 (with /opt) per underflow!

Earlier, when running John Campbell's much larger test code, I had always used /opt, and yet the underflow processing had consumed tens of of thousands of cycles per underflow. These conflicting findings cause me to wonder if, in compiling the larger and more complex code, there were instances of failure to optimize.

I will peek into the machine code generated with and without /opt to obtain some insight on what is happening, but I thought that I should let you know this strange fact without delay.

SDBG does not seem to support assembly level debugging, and the Registers display it can show is limited to CPU registers. On the other hand, debuggers such as WinDBG and the VS debugger do not understand the debug information provided by FTN95. Do you have any suggestions for resolving this?


Last edited by mecej4 on Sun Mar 01, 2015 6:22 pm; edited 1 time in total
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 8283
Location: Salford, UK

PostPosted: Sun Mar 01, 2015 4:13 pm    Post subject: Reply with quote

Thanks for the information.

SDBG did support assembly level debugging (F11) not long ago. I will log this as a regression and pass it on for investigation.
Back to top
View user's profile Send private message AIM Address
JohnCampbell



Joined: 16 Feb 2006
Posts: 2623
Location: Sydney

PostPosted: Sun Mar 01, 2015 10:58 pm    Post subject: Reply with quote

For those who do not have access to the assembler compiler, the following provides a FTN95 interface to rdtsc.
Code:
  integer*8 function rdtsc ()
   real*10 time
 
   time  = cpu_clock@ ()
   rdtsc = time
  end function rdtsc


this gives the following similar results on my i5-2300
Code:
Program entered
   i         x            y          y-x           Cycle_count

  15    6.10352E-35  6.10413E-35  0.00000E+00            71480
  16    3.05176E-35  3.05206E-35  0.00000E+00            40664
  17    1.52588E-35  1.52603E-35  0.00000E+00            37320
  18    7.62939E-36  7.63016E-36  0.00000E+00            35856
  19    3.81470E-36  3.81508E-36  0.00000E+00            40096
  20    1.90735E-36  1.90754E-36  0.00000E+00            31964

  21    1.90735E-36  1.90754E-36  0.00000E+00            31284
  22    3.81470E-36  3.81508E-36  0.00000E+00            30648
  23    7.62939E-36  7.63016E-36  0.00000E+00            30640
  24    1.52588E-35  1.52603E-35  0.00000E+00            30376
  25    3.05176E-35  3.05206E-35  0.00000E+00            30260
  26    6.10352E-35  6.10413E-35  0.00000E+00            30328

  12   underflows, average cycles/underflow =  36743.
  12   underflows reported by underflow_count@


I also change the write statements to be outside the if test, as
write(*,77)i,x,y,dx,dt
if(dt .gt. 1000)then

Now the program fails with /opt but runs without /opt.
Another related issue to the fpe problem ?

John
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2623
Location: Sydney

PostPosted: Mon Mar 02, 2015 12:11 am    Post subject: Reply with quote

More twists in this FPE problem !
I changed the test to:
Code:
program test_underflow
 implicit none
 integer i,nund,ucount
 !
 real :: x,y,dx
 integer*8 cnt1,cnt2,dt,sdt,rdtsc
 external rdtsc
 !
 sdt=0
 nund=0
 x=1e-30
 y=x+1e-34
 write(*,*)'  i         x            y          y-x           Cycle_count'
 write(*,*)
            ! successively halve x and y, take difference
 do i=1,20
    cnt1=rdtsc()
    dx=y-x
    cnt2 = rdtsc()
    dt=cnt2-cnt1
       write(*,77)i,x,y,dx,dt
    if(dt .gt. 400)then
       nund=nund+1
       sdt=sdt+dt
    endif
    y=y*0.5d0
    x=x*0.5d0
 end do
            ! successively double x and y, take difference
 write(*,*)
 do i=21,40
    y=y*2.d0
    x=x*2.d0
    cnt1=rdtsc()
    dx=y-x
    cnt2 = rdtsc()
    dt=cnt2-cnt1
       write(*,77)i,x,y,dx,dt
    if(dt .gt. 400)then
       nund=nund+1
       sdt=sdt+dt
    endif
 end do

 write (*,*) ' '
 write (*,*) ' nund=',nund
 write (*,78) nund, float(sdt)/nund
 call underflow_count@ (ucount)
 write (*,79) ucount

 77 format(1x,i3,2x,3ES13.5,2x,i15)
 78 format(1x,i3,2x,' underflows, average cycles/underflow = ',F7.0)
 79 format(1x,i3,2x,' underflows reported by underflow_count@')
 end program

  integer*8 function rdtsc ()
   real*10 time
 
   time  = cpu_clock@ ()
   rdtsc = time
  end function rdtsc


My test .bat file is
Code:
echo test /debug
ftn95 fpe2 /debug /link
fpe2
 
echo test /p6
ftn95 fpe2 /p6 /link
fpe2
 
echo test /opt
ftn95 fpe2 /opt /link
fpe2

echo test /opt /p6
ftn95 fpe2 /opt /p6 /link
fpe2


I now get /opt runs much faster, than with previous tests.
Could cpu_clock@ be messing with this latest test ?

John
Back to top
View user's profile Send private message
jalih



Joined: 30 Jul 2012
Posts: 196

PostPosted: Mon Mar 02, 2015 6:56 am    Post subject: Re: Reply with quote

JohnCampbell wrote:
For those who do not have access to the assembler compiler, the following provides a FTN95 interface to rdtsc.

Using RDTSC for high resolution timing on todays processors with multiple cores, hyperthreading support and out of order execution is not recommended and should be used with care. Preferred approach for high resolution timing on Windows platform is to use the QueryPerformanceCounter() and QueryPerformanceFrequency().
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 8283
Location: Salford, UK

PostPosted: Mon Mar 02, 2015 8:01 am    Post subject: Reply with quote

For FTN95, the standard intrinsic SYSTEM_CLOCK is based on QueryPerformanceCounter() and QueryPerformanceFrequency().


mecej4

SDBG and function key F11 gives assembly level debugging on my machine but only when

a) running SDBG from a command line and
b) running SDBG from Plato but with the option "Integrate with SDBG" selected.
Back to top
View user's profile Send private message AIM Address
JohnCampbell



Joined: 16 Feb 2006
Posts: 2623
Location: Sydney

PostPosted: Mon Mar 02, 2015 9:06 am    Post subject: Reply with quote

Jalih,

I don't agree with your comment on RDTSC.
My understanding is that this gives a value of the system clock, which is based on a CPU clock rate. No other timer gives this precision.
While there can be all the problems of multi-threading and multiple cores, this is still a high precision time estimate.
Apart from QueryPerformanceCounter, most other windows timers are accurate to 1/64 second, which is much worse than any error that rdtsc may provide.
The big advantage of rdtsc is that it gives an estimate of cpu cycles.
On my pc, QueryPerformanceCounter is accurate to 2,728,242 ticks per second, which is about 1,000 CPU cycles. given it's problems, rdtsc is typically much better than this.

It would be good if there was an Integer*8 variant of CPU_CLOCK@, say CPU_CLOCK_TICK@, so that real*10 conversion was not required and minimise the call overhead.
My estimate is that cpu_clock@ has about a 50 cpu cycle overhead, while QueryPerformanceCounter is about 200 cpu cycles. ( Gettickcount has a lower overhead, but the answer is to 1/64 seconds. )

John
Back to top
View user's profile Send private message
jalih



Joined: 30 Jul 2012
Posts: 196

PostPosted: Mon Mar 02, 2015 10:08 am    Post subject: Re: Reply with quote

JohnCampbell wrote:
Jalih,

I don't agree with your comment on RDTSC.
My understanding is that this gives a value of the system clock, which is based on a CPU clock rate. No other timer gives this precision.

Unfortunately it's not that easy. On prosessors supporting "out-of-order" execution, you require serializing instruction like cpuid before using rdtsc. And what about motherboard with multiple processors, are these syncronized? Also todays processors use constant rate time stamp counter, and with these processors the time stamp counter always reads the maximum rate regardless of the actual CPU running rate. Remember, on todays hardware even the cpu speed may vary due to power saving or system may be hibernated and later resumed resetting the time stamp counter.
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2623
Location: Sydney

PostPosted: Mon Mar 02, 2015 11:31 am    Post subject: Reply with quote

Jalih,

Yes, it is not that easy. Depending on what problems you are prepared to ignore, you may need a different clock. Unfortunately most of the other timers have very poor precision. I have always found the lack of precision of most alternatives to be the biggest problem. There is a lot of activity that can occur in .015 seconds.

John
Back to top
View user's profile Send private message
mecej4



Joined: 31 Oct 2006
Posts: 1917

PostPosted: Mon Mar 02, 2015 11:51 am    Post subject: Reply with quote

As far as the small test program is concerned, I do not think that we really care to time it accurately. It is enough to use some measure of the CPU cycles that get consumed when two numbers are loaded from memory into the X87 registers, the difference taken, and the result saved back to memory. Secondly, we are interested in how this measure, however approximate it may be or related to a non-constant rate clock, comes out for two event sequences : (i) load, subtract, store, measure dt when no underflow occurs, (ii) the same, when an FPU underflow exception handler is invoked.

The assembler code segment is as follows, where I have added numbers to indicate the instruction latency for a Haswell CPU:
Code:

call        _RDTSC                   
mov         dword ptr [ebp-500h],eax  2
mov         dword ptr [ebp-4FCh],edx  2
mov         edi,dword ptr [ebp-500h]  2
mov         ecx,dword ptr [ebp-4FCh]  2
mov         dword ptr [ebp-414h],ecx  2
mov         dword ptr [ebp-418h],edi  2
fld         dword ptr [ebp-4E4h]      3
fsub        dword ptr [ebp-4E8h]      3
fstp        dword ptr [ebp-10h]       4
call        _RDTSC

The two calls to _RDTSC could be replaced by inline RDTSC instructions. The total count is around 25 cycles. Now, on my laptop, which I have in "battery saving mode" the CPU may be running at half the maximum speed, but the RDTSC counter reports as if the CPU were running at the maximum speed. Therefore, the reported change in the RDTSC counter would be about 50 cycles, which agrees quite well with what I have seen from many runs of the program.

If an underflow and/or denormal exception occurs, the ISR is called between the FSUB and the RDTSC instructions. With the same program compiled with another compiler targeting X87, I see the count going to a few hundred when an exception is taken. With FTN95, we see around 20,000. The comparison between these three (50 w/o exception, 500 with exception-other compiler, 20000 with exception-FTN95), it seems to me, leads us to clear conclusions, even though individual numbers may be inaccurate or dependent on the specific CPU, timing technique, optimization, etc.
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2623
Location: Sydney

PostPosted: Mon Mar 02, 2015 10:37 pm    Post subject: Reply with quote

mecej4,

Try the adaptation I provided in the post above on Mon Mar 02, 2015 12:11 am.
Using the .bat file, I get different FPE cycle delay estimates, depending on the compile options.
Code:

/debug       12   underflows, average cycles/underflow =  31874.
no options  12   underflows, average cycles/underflow =  32019.
/p6            12   underflows, average cycles/underflow =  32023.
/opt           12   underflows, average cycles/underflow =    442.
/opt /p6     12   underflows, average cycles/underflow =    457.

This is not the case with different compile options with the larger test programs. Dan has often noted this problem of changing performance when moving to larger test programs.

John
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> Support All times are GMT + 1 Hour
Goto page Previous  1, 2, 3  Next
Page 2 of 3

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group