forums.silverfrost.com Forum Index -> Support
Slow handling of FPU underflow interrupts in SALFLIBC.DLL
JohnCampbell
Posted: Tue Mar 03, 2015 3:17 am

Paul,

Is my previous statement basically correct?
Quote:
It would be good if there were an Integer*8 variant of CPU_CLOCK@, say CPU_CLOCK_TICK@, so that the real*10 conversion is not required and the call overhead is minimised.
My estimate is that cpu_clock@ has about a 50 cpu cycle overhead, while QueryPerformanceCounter is about 200 cpu cycles.

Could a routine, say INTEGER*8 FUNCTION CPU_CLOCK_TICK@ (in salflibc.dll), provide faster access to rdtsc, as this is basically how it is used?
This could be a useful addition, including for FTN95 /timing, if not already in use. I also note that CPU_CLOCK@ is treated as a special function in FTN95, which knows that it is real*10 and reports on potential problems, none of which I have ever identified. Is this warning still valid?

John

PaulLaidler (Site Admin)
Posted: Tue Mar 03, 2015 8:40 am

John

I see no reason why this should not be added and I will put it on the wish list.

This function is expanded inline by the compiler (see the /explist that it produces). I don't have any information about the reason for the warning.

mecej4
Posted: Tue Mar 03, 2015 2:34 pm

JohnCampbell wrote:
Dan has often noted this problem of changing performance when moving to larger test programs.
Yes. With a larger program, there can be more places where exceptions can arise. In your larger test program, there are a few places where FP underflow occurs, and other places where "denormal" exceptions may occur.

If the statements concerned are located in the midst of complex code, the compiler's optimiser may fail to honor /opt, and the exceptions are going to be taken regardless of whether or not /opt was specified.

This is why I wrote about "missed optimizations".

PaulLaidler (Site Admin)
Posted: Tue Mar 03, 2015 4:53 pm

John

I have had a look at your request for CPU_CLOCK_TICK@. It turns out that there are two problems with this.

1) The new function would still need to create a temporary, so it would still require 5 machine instructions. There may be little or no improvement.

2) 32 bit FTN95 has used the floating point stack for this 64 bit integer so that it can hold the result in a single register. To do this in a different way would be a non-trivial task.

JohnCampbell
Posted: Wed Mar 04, 2015 5:13 am

Paul,

Thanks for reviewing this. It looks like the existing cpu_clock@ provides a good interface to rdtsc. Your description of "the use of the floating point stack for this 64 bit integer" may be the basis of the FTN95 warning.

I was not able to assemble the alternative assembler version, rdtsc.asm, that mecej4 provided, using " ml /coff /c rdtsc.asm ".
Could a .obj version of it, compatible with FTN95 and SLINK, be made available to see if it improves on the real*10 interface?
I am always worried about mixing .obj files from other compilers, in case FTN95 does not have a compatible call interface.

John

PaulLaidler (Site Admin)
Posted: Wed Mar 04, 2015 7:29 am

John

I will have another look at this when I get a moment. Maybe you only need the lower DWORD.
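
One caveat with keeping only the lower DWORD: a 32-bit counter wraps every 2**32 ticks, which is roughly 1.4 s if the counter advances at 3 GHz (that rate is an assumption, not a figure from this thread). Differences between two readings stay usable for shorter intervals, provided they are taken modulo 2**32. A minimal sketch in standard Fortran, with hypothetical names and made-up readings:

```fortran
program low_dword_wrap
   use iso_fortran_env, only: int64
   implicit none
   integer(int64), parameter :: two32 = 2_int64**32
   integer(int64) :: lo1, lo2, elapsed

   ! Hypothetical low-DWORD readings, held in 64-bit integers so that the
   ! arithmetic below cannot overflow. The counter wrapped between them.
   lo1 = two32 - 100_int64    ! 100 ticks before the 32-bit counter wraps
   lo2 = 150_int64            ! 150 ticks after the wrap

   ! modulo() returns a non-negative result for a positive divisor, so the
   ! negative raw difference is mapped back to the true tick count.
   elapsed = modulo(lo2 - lo1, two32)

   write (*,*) 'elapsed ticks =', elapsed                 ! 250
   write (*,*) 'wrap period at an assumed 3 GHz (s) =', real(two32) / 3.0e9
end program low_dword_wrap
```

For intervals that may exceed one wrap period, the full 64-bit value is still needed.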

mecej4
Posted: Wed Mar 04, 2015 7:47 am

John: FTN95 uses X87 instructions to process ALL variables of type INTEGER*8. This design complicates using RDTSC in the middle of X87 code, which is also likely to cause X87 exceptions to occur.

If you do not have/want to use the MASM assemblers ML and ML64, which come with Visual Studio/VC and SDKs from Microsoft, you can download and unpack the NASM assembler (just 500K w/o documentation) from www.nasm.us. Here is code for the INTEGER*8 RDTSC function:
Code:

SECTION .text
global _RDTSC
_RDTSC:
      RDTSC          ; read the time-stamp counter into EDX:EAX
      RET            ; the 64-bit result is returned in EDX:EAX

You can assemble this using the command nasm -f coff rdtsc.asm -o rdtsc.obj.

The following inline assembler code works in a limited fashion -- limited because its usage appears to inhibit optimization, so if you want to track bugs that occur only when /OPT is used, you cannot use this.
Code:

INTEGER*8 cnt1
...
code
   rdtsc
   mov cnt1,eax
   mov cnt1[4],edx
edoc

is equivalent to
Code:

INTEGER*8 cnt1,rdtsc
...
cnt1=rdtsc()

but does not use the FPU.
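
For readers unfamiliar with the convention: RDTSC leaves the low 32 bits of the counter in EAX and the high 32 bits in EDX, and the two mov instructions above store them at byte offsets 0 and 4 of cnt1. The same composition can be sketched in standard Fortran. The input values are made up for illustration, and combine_halves is a hypothetical helper, not an FTN95 routine.

```fortran
program combine_dwords
   use iso_fortran_env, only: int32, int64
   implicit none
   integer(int32) :: lo, hi

   lo = -1_int32        ! bit pattern FFFFFFFF, i.e. low DWORD = 4294967295
   hi = 2_int32         ! high DWORD
   write (*,*) combine_halves(lo, hi)   ! 2*2**32 + 4294967295 = 12884901887

contains

   ! Hypothetical helper: builds the 64-bit value hi*2**32 + lo, treating
   ! lo as an unsigned 32-bit quantity.
   pure function combine_halves(lo, hi) result(tick)
      integer(int32), intent(in) :: lo, hi
      integer(int64) :: tick
      integer(int64) :: low_bits
      ! int(lo, int64) sign-extends; masking keeps only the low 32 bits.
      low_bits = iand(int(lo, int64), 4294967295_int64)
      tick = ior(ishft(int(hi, int64), 32), low_bits)
   end function combine_halves
end program combine_dwords
```

The mask matters because a low DWORD with its top bit set would otherwise be sign-extended and corrupt the high half.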

JohnCampbell
Posted: Wed Mar 04, 2015 9:12 am

Thanks Paul and Mecej4,

The following code tests the alternatives I am considering. It looks like the code/edoc approach gives good performance as an integer*8 function. I tested the program with the following alternative FTN95 compile options:
/check
/debug
(none)
/opt

I shall see where this goes for reliability.

I would expect that having code/edoc isolated to a function call should not inhibit optimisation of inner loops?

John
Code:
!  program to test rdtsc
!
    integer*8 rdtsc_tick, itick
    integer*8 cpu_clock_tick, ctick, t0, jt, lt, st, ti, ctickl
    real*10   rtick
    integer*4 n
    external  cpu_clock_tick, rdtsc_tick
!
!  test accuracy
    ctickl = cpu_clock_tick ()
    write (*,*) '   n   clk_tik    clock@     rdtsc     write'
    lt = cpu_clock_tick ()
    do n = 1,10
      t0    = cpu_clock_tick ()
      rtick = cpu_clock@ ()
      itick = rdtsc_tick ()
      ctick = cpu_clock_tick ()
      jt    = rtick
      write (*,fmt='(i5,4i10)') n, t0-lt, jt-t0, itick-jt, lt-ctickl
      lt = cpu_clock_tick ()
      ctickl = ctick
    end do
!
! test call speed
    st = cpu_clock_tick ()
    t0 = cpu_clock_tick ()
    do n = 1,10000
      ti = cpu_clock_tick ()
    end do
    t0 = (ti - t0)/n
    lt = cpu_clock_tick ()
    write (*,*) 'cpu_clock_tick', (lt-st)/n, t0
!
    st = cpu_clock_tick ()
    t0 = cpu_clock@ ()
    do n = 1,10000
      rtick = cpu_clock@ ()
    end do
    t0 = (rtick-t0)/n
    lt = cpu_clock_tick ()
    write (*,*) 'cpu_clock@    ', (lt-st)/n, t0
!
    st = cpu_clock_tick ()
    t0 = rdtsc_tick ()
    do n = 1,10000
      itick = rdtsc_tick ()
    end do
    t0 = (itick - t0)/n
    lt = cpu_clock_tick ()
    write (*,*) 'rdtsc_tick    ', (lt-st)/n, t0
!
    end

integer*8 function rdtsc_tick ()
integer*8 cnt1
!
!  get rdtsc value
    code
      rdtsc
      mov cnt1,eax
      mov cnt1[4],edx
    edoc
!
    rdtsc_tick = cnt1
end function rdtsc_tick

integer*8 function cpu_clock_tick ()
    real*10 rtick
    rtick = cpu_clock@ ()
    cpu_clock_tick = rtick
end function cpu_clock_tick

mecej4
Posted: Wed Mar 04, 2015 10:36 am

John said:
Quote:
I would expect that having code/edoc isolated to a function call should not inhibit optimisation of inner loops?

Correct, and I had already tried that out. The problem, however, is that FTN95 adds a lot of routine entry and exit code:
Code:

push        ebp
mov         ebp,esp
push        ebx
push        esi
push        edi
push        eax
sub         esp,20h
rdtsc
mov         dword ptr [ebp-18h],eax
mov         dword ptr [ebp-14h],edx
mov         ecx,dword ptr [ebp-18h]
mov         eax,dword ptr [ebp-14h]
mov         dword ptr [ebp-1Ch],eax
mov         dword ptr [ebp-20h],ecx
mov         eax,dword ptr [ebp-20h]
mov         edx,dword ptr [ebp-1Ch]
lea         esp,[ebp-0Ch]
pop         edi
pop         esi
pop         ebx
pop         ebp
ret

where all but the "rdtsc" instruction and the "ret" instruction are useless. You can tally up the latency cost of all those instructions.

jalih
Posted: Wed Mar 04, 2015 11:37 am

mecej4 wrote:
where all but the "rdtsc" instruction and the "ret" instruction are useless. You can tally up the latency cost of all those instructions.

Before the code to be timed begins, the cpuid instruction should be issued before the rdtsc instruction to serialize instruction execution. Also, the result should probably be read using the rdtscp instruction, if it is supported (this can be tested with cpuid). The rdtscp instruction waits until all previous instructions have executed before reading the counter. Otherwise, out-of-order execution may skew the results.

Before the timing code, it may also be a good idea to set the processor affinity so execution stays on the same processor/core.

mecej4
Posted: Wed Mar 04, 2015 2:06 pm

Jalih's guidelines will become important when precise timing is desired, but as of now we are still at the phase of looking for ways in which we can do some timing in a program compiled with /opt, from source code containing INTEGER*8 variables, whose runs cause underflows. No reliable way of avoiding run-time and compile-time crashes (other than not using /opt) has yet been found.

PaulLaidler (Site Admin)
Posted: Wed Mar 04, 2015 3:50 pm

I have added code to the compiler so that

Code:

integer*8 tt
tt = cpu_clock@()


is expanded inline as

Code:
      rdtsc     
      mov       TT[4],edx
      mov       TT,eax


I don't know if this will have an adverse effect on /opt.
The change will be in the next release.

mecej4
Posted: Wed Mar 04, 2015 7:43 pm

Thanks, Paul. Here is a test program that you could use to ascertain whether using the new CPU_CLOCK@() function inhibits optimization.
Code:

program test_underflow
implicit none
integer i,nund,ucount
!
real :: x,y,dx,dxs
integer*8 cnt1,cnt2,dt
!
nund=0
x=1e-34
y=x+2e-37
dxs=0
write(*,*)'  i       x           y        y - x      CPU-cyc'
                      ! successively halve x and y, take difference
do i=1,10
   cnt1 = cpu_clock@()
   dx=y-x             ! dx is always non negative
   !dxs=dxs+dx         ! <<<=== this statement slows loop by a factor of 100
   cnt2 = cpu_clock@()
   dt=cnt2-cnt1
   if(dt .gt. 250)nund=nund+1
   write(*,77)i,x,y,dx,dt
   y=y*0.5d0
   x=x*0.5d0
end do
!
call underflow_count@(ucount)
write(*,78)nund,ucount

77 format(i3,3ES12.3,2x,I8)
78 format(1x,i3,' underflows',/, &
          1x,i3,' underflows reported by underflow_count@')
end program

With the 7.10 compiler, using /opt /p6, the output obtained is
Code:

   i       x           y        y - x      CPU-cyc
  1   1.000E-34   1.002E-34   2.000E-37        24
  2   5.000E-35   5.010E-35   1.000E-37        31
  3   2.500E-35   2.505E-35   5.000E-38        31
  4   1.250E-35   1.252E-35   2.500E-38        30
  5   6.250E-36   6.262E-36   1.250E-38        36
  6   3.125E-36   3.131E-36   0.000E+00       703
  7   1.563E-36   1.566E-36   0.000E+00       746
  8   7.813E-37   7.828E-37   0.000E+00       703
  9   3.906E-37   3.914E-37   0.000E+00       703
 10   1.953E-37   1.957E-37   0.000E+00       667
   5 underflows
   5 underflows reported by underflow_count@

For i = 6..10, underflow occurred and the interrupt service routine (ISR) took several hundred cycles, which is reasonable.
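
The jump at i = 6 coincides with the point where y - x falls below the smallest normal single-precision value, about 1.175E-38. The sketch below repeats the halving in standard Fortran and flags each iteration whose difference is smaller than tiny(). It uses no FTN95-specific routines, so it makes no claim about how FTN95 itself handles the exception. On an IEEE 754 single-precision system it should flag i = 6..10, matching the five underflows reported above.

```fortran
program predict_underflow
   implicit none
   integer :: i, nsub
   real :: x, y, dx

   nsub = 0
   x = 1e-34
   y = x + 2e-37
   do i = 1, 10
      dx = y - x
      ! tiny(dx) is the smallest positive normal number of this kind; a
      ! smaller magnitude means the result is subnormal (or was flushed
      ! to zero), which is the case that raises the underflow trap.
      if (abs(dx) < tiny(dx)) then
         nsub = nsub + 1
         write (*,'(a,i3)') 'result below tiny() at i =', i
      end if
      y = y * 0.5
      x = x * 0.5
   end do
   write (*,'(i3,a,es12.5)') nsub, ' results below tiny(x) = ', tiny(x)
end program predict_underflow
```

Note that x and y themselves stay normal throughout (x is still about 1.95E-37 at i = 10); only their difference underflows.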

If line 17 of the program source (the commented-out statement dxs=dxs+dx) is activated, one might expect the run with /opt to be unaffected, since the variable dxs is never used and could be optimised away. The output, however, is drastically different.
Code:

   i       x           y        y - x      CPU-cyc
  1   1.000E-34   1.002E-34   2.000E-37        24
  2   5.000E-35   5.010E-35   1.000E-37        28
  3   2.500E-35   2.505E-35   5.000E-38        28
  4   1.250E-35   1.252E-35   2.500E-38        24
  5   6.250E-36   6.262E-36   1.250E-38        24
  6   3.125E-36   3.131E-36   0.000E+00    106052
  7   1.563E-36   1.566E-36   0.000E+00     60368
  8   7.813E-37   7.828E-37   0.000E+00     47272
  9   3.906E-37   3.914E-37   0.000E+00     76106
 10   1.953E-37   1.957E-37   0.000E+00     74442
   5 underflows
   5 underflows reported by underflow_count@

Would you please try the unreleased new version of the compiler on this program?

PaulLaidler (Site Admin)
Posted: Wed Mar 04, 2015 8:12 pm

Thanks, but I am very pressed for time.
I am happy to provide the functionality so that users can decide to use it or not as they see fit.

mecej4
Posted: Thu Mar 05, 2015 1:34 am

Good, I shall run the tests when the new release becomes available. Thanks.
All times are GMT + 1 Hour
Page 3 of 3

 