 |
forums.silverfrost.com Welcome to the Silverfrost forums
JohnCampbell
Joined: 16 Feb 2006 Posts: 2623 Location: Sydney
|
Posted: Tue Mar 03, 2015 3:17 am Post subject: |
|
|
Paul,
Is my previous statement basically correct ?
| Quote: |
It would be good if there was an Integer*8 variant of CPU_CLOCK@, say CPU_CLOCK_TICK@, so that real*10 conversion was not required and minimise the call overhead.
My estimate is that cpu_clock@ has about a 50 cpu cycle overhead, while QueryPerformanceCounter is about 200 cpu cycles.
|
Could a routine, say INTEGER*8 FUNCTION CPU_CLOCK_TICK@ (in salflibc.dll), provide faster access to rdtsc, since this is basically how it is used?
This could be a useful addition, including for FTN95 /timing, if not already in use. I also note that FTN95 treats CPU_CLOCK@ as a special function: it knows that the result is real*10 and reports on potential problems, which I have never been able to identify. Is this warning still valid?
John |
|
 |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8283 Location: Salford, UK
|
Posted: Tue Mar 03, 2015 8:40 am Post subject: |
|
|
John
I see no reason why this should not be added and I will put it on the wish list.
This function is expanded inline by the compiler (see the /explist that it produces). I don't have any information about the reason for the warning. |
|
 |
mecej4
Joined: 31 Oct 2006 Posts: 1917
|
Posted: Tue Mar 03, 2015 2:34 pm Post subject: Re: |
|
|
| JohnCampbell wrote: |
Dan has often noted this problem of changing performance when moving to larger test programs.
|
Yes. With a larger program, there can be more places where exceptions can arise. In your larger test program, there are a few places where FP underflow occurs, and other places where "denormal" exceptions may occur.
If the statements concerned are located in the midst of complex code, the compiler's optimiser may fail to honor /opt, and the exceptions are going to be taken regardless of whether or not /opt was specified.
This is why I wrote about "missed optimizations". |
|
 |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8283 Location: Salford, UK
|
Posted: Tue Mar 03, 2015 4:53 pm Post subject: |
|
|
John
I have had a look at your request for CPU_CLOCK_TICK@. It turns out that there are two problems with this.
1) The new function would still need to create a temporary with the result that it would still require 5 machine instructions. There may be little or no improvement.
2) 32 bit FTN95 has used the floating point stack for this 64 bit integer so that it can hold the result in a single register. To do this in a different way would be a non-trivial task. |
|
 |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2623 Location: Sydney
|
Posted: Wed Mar 04, 2015 5:13 am Post subject: |
|
|
Paul,
Thanks for reviewing this. It looks like the existing cpu_clock@ provides a good interface to rdtsc. Your description of "the use of the floating point stack for this 64 bit integer" may be the basis of the FTN95 warning.
I was not able to assemble rdtsc.asm, the alternative assembler version that mecej4 provided, using "ml /coff /c rdtsc.asm".
Could a .obj version of it, compatible with FTN95 and SLINK, be made available, to see whether it improves on the real*10 interface?
I am always worried about mixing .obj from other compilers, in case FTN95 does not have a compatible call interface.
John |
|
 |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8283 Location: Salford, UK
|
Posted: Wed Mar 04, 2015 7:29 am Post subject: |
|
|
John
I will have another look at this when I get a moment. Maybe you only need the lower DWORD. |
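On the idea that the lower DWORD alone may be enough: the sketch below, in standard Fortran rather than anything FTN95-specific, shows one way an elapsed count could be recovered from two 32-bit readings. It assumes that fewer than 2**32 ticks pass between the readings (roughly 1.4 s if the counter ran at an assumed 3 GHz). The names `low_dword_timing` and `low_dword_delta` are hypothetical, and the readings are supplied as plain integers instead of coming from rdtsc.

```fortran
module low_dword_timing
  use iso_fortran_env, only: int32, int64
  implicit none
contains
  ! Elapsed ticks between two readings of the low 32 bits of a counter.
  ! Valid only if fewer than 2**32 ticks elapsed between the readings.
  pure function low_dword_delta(lo_start, lo_end) result(delta)
    integer(int32), intent(in) :: lo_start, lo_end
    integer(int64) :: delta
    integer(int64), parameter :: two32 = 2_int64**32
    ! Widen before subtracting so the subtraction cannot overflow,
    ! then reduce modulo 2**32 to undo a single wrap-around.
    delta = modulo(int(lo_end, int64) - int(lo_start, int64), two32)
  end function low_dword_delta
end module low_dword_timing

program demo_low_dword
  use iso_fortran_env, only: int32
  use low_dword_timing
  implicit none
  ! First reading just below the signed 32-bit limit, second reading
  ! taken after the bit pattern has rolled into the negative range.
  print *, low_dword_delta(huge(1_int32) - 5_int32, -huge(1_int32) + 4_int32)
end program demo_low_dword
```

The two sample readings are 11 ticks apart once the signed values are read as unsigned bit patterns, and the function returns 11. Intervals longer than one wrap of the low DWORD cannot be distinguished this way, which is the cost of dropping the upper half.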
|
 |
mecej4
Joined: 31 Oct 2006 Posts: 1917
|
Posted: Wed Mar 04, 2015 7:47 am Post subject: |
|
|
John: FTN95 uses X87 instructions to process ALL variables of type INTEGER*8. This design complicates using RDTSC in the middle of X87 code which is also likely to cause X87 exceptions to occur.
If you do not have/want to use the MASM assemblers ML and ML64, which come with Visual Studio/VC and SDKs from Microsoft, you can download and unpack the NASM assembler (just 500K w/o documentation) from www.nasm.us.
Here is code for the INTEGER*8 RDTSC function:
| Code: |
SECTION .text
global _RDTSC
_RDTSC:
RDTSC
RET
|
You can assemble this using the command nasm -f coff rdtsc.asm -o rdtsc.obj.
The following inline assembler code works in a limited fashion -- limited because its usage appears to inhibit optimization, so if you want to track bugs that occur only when /OPT is used, you cannot use this.
| Code: |
INTEGER*8 cnt1
...
code
rdtsc
mov cnt1,eax
mov cnt1[4],edx
edoc
|
is equivalent to
| Code: |
INTEGER*8 cnt1,rdtsc
...
cnt1=rdtsc()
|
but does not use the FPU. |
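RDTSC returns the low 32 bits of the counter in EAX and the high 32 bits in EDX, and on little-endian x86 the two stores above place them at byte offsets 0 and 4 of the 8-byte variable. As an illustration only, the same composition can be written in standard Fortran with integer operations. The names `tsc_combine` and `combine_halves` are hypothetical, and the halves are passed in as arguments because standard Fortran has no way to execute rdtsc.

```fortran
module tsc_combine
  use iso_fortran_env, only: int32, int64
  implicit none
contains
  ! Build a 64-bit count from two 32-bit halves, mirroring what the
  ! "mov cnt1,eax" / "mov cnt1[4],edx" pair does in memory.
  pure function combine_halves(lo, hi) result(cnt)
    integer(int32), intent(in) :: lo, hi   ! lo plays EAX, hi plays EDX
    integer(int64) :: cnt
    integer(int64), parameter :: mask32 = 2_int64**32 - 1
    ! Mask the widened low half so a set top bit is not sign-extended
    ! into the upper 32 bits.
    cnt = ior(ishft(int(hi, int64), 32), iand(int(lo, int64), mask32))
  end function combine_halves
end module tsc_combine

program demo_combine
  use iso_fortran_env, only: int32
  use tsc_combine
  implicit none
  ! Low half with all bits set (-1 as a signed value), high half 1.
  print *, combine_halves(-1_int32, 1_int32)
end program demo_combine
```

With a low half of all ones and a high half of 1, the result is 2**33 - 1 = 8589934591. The mask matters because Fortran integers are signed: without it, a low half with its top bit set would overwrite the high half when widened.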
|
 |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2623 Location: Sydney
|
Posted: Wed Mar 04, 2015 9:12 am Post subject: |
|
|
Thanks Paul and Mecej4,
The following code appears to test alternatives I am considering. It looks like the code/edoc approach is giving good performance as an integer*8 function. I tested the following program for alternative ftn95 compile options of:
/check
/debug
(none)
/opt
I shall see where this goes for reliability.
I would expect that having code/edoc isolated to a function call should not inhibit optimisation of inner loops ?
John
| Code: |
! program to test rdtsc
!
integer*8 rdtsc_tick, itick
integer*8 cpu_clock_tick, ctick, t0, jt, lt, st, ti, ctickl
real*10 rtick
integer*4 n
external cpu_clock_tick, rdtsc_tick
!
! test accuracy
ctickl = cpu_clock_tick ()
write (*,*) ' n clk_tik clock@ rdtsc write'
lt = cpu_clock_tick ()
do n = 1,10
t0 = cpu_clock_tick ()
rtick = cpu_clock@ ()
itick = rdtsc_tick ()
ctick = cpu_clock_tick ()
jt = rtick
write (*,fmt='(i5,4i10)') n, t0-lt, jt-t0, itick-jt, lt-ctickl
lt = cpu_clock_tick ()
ctickl = ctick
end do
!
! test call speed
st = cpu_clock_tick ()
t0 = cpu_clock_tick ()
do n = 1,10000
ti = cpu_clock_tick ()
end do
t0 = (ti - t0)/n
lt = cpu_clock_tick ()
write (*,*) 'cpu_clock_tick', (lt-st)/n, t0
!
st = cpu_clock_tick ()
t0 = cpu_clock@ ()
do n = 1,10000
rtick = cpu_clock@ ()
end do
t0 = (rtick-t0)/n
lt = cpu_clock_tick ()
write (*,*) 'cpu_clock@ ', (lt-st)/n, t0
!
st = cpu_clock_tick ()
t0 = rdtsc_tick ()
do n = 1,10000
itick = rdtsc_tick ()
end do
t0 = (itick - t0)/n
lt = cpu_clock_tick ()
write (*,*) 'rdtsc_tick ', (lt-st)/n, t0
!
end
integer*8 function rdtsc_tick ()
integer*8 cnt1
!
! get rdtsc value
code
rdtsc
mov cnt1,eax
mov cnt1[4],edx
edoc
!
rdtsc_tick = cnt1
end function rdtsc_tick
integer*8 function cpu_clock_tick ()
real*10 rtick
rtick = cpu_clock@ ()
cpu_clock_tick = rtick
end function cpu_clock_tick |
|
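One detail of the "test call speed" loops is worth noting: after a DO loop finishes normally, its index is one past the upper limit, so dividing by `n` after `do n = 1,10000` divides by 10001 rather than 10000. The sketch below shows the same averaging pattern in standard Fortran, with `system_clock` standing in for the rdtsc-based timers (an assumption made so the example runs on any compiler) and a hypothetical named constant `ncalls` as the divisor.

```fortran
program loop_average
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: ncalls = 10000
  integer :: n
  integer(int64) :: t_start, t_end, t_last, rate

  call system_clock(count_rate=rate)
  call system_clock(t_start)
  do n = 1, ncalls
    call system_clock(t_last)     ! the call whose cost is being averaged
  end do
  call system_clock(t_end)

  ! After normal completion the DO variable is one past the limit.
  print *, 'n after loop     =', n
  print *, 'ticks per second =', rate
  ! Divide by the named call count rather than by n.
  print *, 'ticks per call   =', real(t_end - t_start) / real(ncalls)
end program loop_average
```

For 10000 iterations the difference is only about 0.01%, so it does not change the conclusions drawn in the thread, but it becomes visible with short loops.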
|
 |
mecej4
Joined: 31 Oct 2006 Posts: 1917
|
Posted: Wed Mar 04, 2015 10:36 am Post subject: |
|
|
John said:
| Quote: |
I would expect that having code/edoc isolated to a function call should not inhibit optimisation of inner loops ?
|
Correct, and I had already tried that out. The problem, however, is that FTN95 adds a lot of routine entry and exit code:
| Code: |
push ebp
mov ebp,esp
push ebx
push esi
push edi
push eax
sub esp,20h
rdtsc
mov dword ptr [ebp-18h],eax
mov dword ptr [ebp-14h],edx
mov ecx,dword ptr [ebp-18h]
mov eax,dword ptr [ebp-14h]
mov dword ptr [ebp-1Ch],eax
mov dword ptr [ebp-20h],ecx
mov eax,dword ptr [ebp-20h]
mov edx,dword ptr [ebp-1Ch]
lea esp,[ebp-0Ch]
pop edi
pop esi
pop ebx
pop ebp
ret
|
where all but the "rdtsc" instruction and the "ret" instruction are useless. You can tally up the latency cost of all those instructions. |
|
 |
jalih
Joined: 30 Jul 2012 Posts: 196
|
Posted: Wed Mar 04, 2015 11:37 am Post subject: Re: |
|
|
| mecej4 wrote: |
| where all but the "rdtsc" instruction and the "ret" instruction are useless. You can tally up the latency cost of all those instructions. |
Before the code to be timed begins, the cpuid instruction should be issued ahead of the rdtsc instruction to serialize instruction execution. Also, the result should probably be read with the rdtscp instruction, if it is supported (this can be tested with cpuid). The rdtscp instruction waits until all previous instructions have executed before reading the counter. Otherwise, out-of-order execution may skew the results.
Before the timing code, it may also be a good idea to set the processor affinity so execution stays on the same processor/core. |
|
 |
mecej4
Joined: 31 Oct 2006 Posts: 1917
|
Posted: Wed Mar 04, 2015 2:06 pm Post subject: |
|
|
Jalih's guidelines will become important when precise timing is desired, but as of now we are still at the phase of looking for ways in which we can do some timing in a program compiled with /opt, from source code containing INTEGER*8 variables, whose runs cause underflows. No reliable way of avoiding run-time and compile-time crashes (other than not using /opt) has yet been found. |
|
 |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8283 Location: Salford, UK
|
Posted: Wed Mar 04, 2015 3:50 pm Post subject: |
|
|
I have added code to the compiler so that
| Code: |
integer*8 tt
tt = cpu_clock@()
|
is expanded inline as
| Code: |
rdtsc
mov TT[4],edx
mov TT,eax
|
I don't know if this will have an adverse effect on /opt.
The change will be in the next release. |
|
 |
mecej4
Joined: 31 Oct 2006 Posts: 1917
|
Posted: Wed Mar 04, 2015 7:43 pm Post subject: |
|
|
Thanks, Paul. Here is a test program that you could use to ascertain whether using the new CPU_CLOCK@() function inhibits optimization.
| Code: |
program test_underflow
implicit none
integer i,nund,ucount
!
real :: x,y,dx,dxs
integer*8 cnt1,cnt2,dt
!
nund=0
x=1e-34
y=x+2e-37
dxs=0
write(*,*)' i x y y - x CPU-cyc'
! successively halve x and y, take difference
do i=1,10
cnt1 = cpu_clock@()
dx=y-x ! dx is always non negative
!dxs=dxs+dx ! <<<=== this statement slows loop by a factor of 100
cnt2 = cpu_clock@()
dt=cnt2-cnt1
if(dt .gt. 250)nund=nund+1
write(*,77)i,x,y,dx,dt
y=y*0.5d0
x=x*0.5d0
end do
!
call underflow_count@(ucount)
write(*,78)nund,ucount
77 format(i3,3ES12.3,2x,I8)
78 format(1x,i3,' underflows',/, &
1x,i3,' underflows reported by underflow_count@')
end program
|
With the 7.10 compiler, using /opt /p6, the output obtained is
| Code: |
i x y y - x CPU-cyc
1 1.000E-34 1.002E-34 2.000E-37 24
2 5.000E-35 5.010E-35 1.000E-37 31
3 2.500E-35 2.505E-35 5.000E-38 31
4 1.250E-35 1.252E-35 2.500E-38 30
5 6.250E-36 6.262E-36 1.250E-38 36
6 3.125E-36 3.131E-36 0.000E+00 703
7 1.563E-36 1.566E-36 0.000E+00 746
8 7.813E-37 7.828E-37 0.000E+00 703
9 3.906E-37 3.914E-37 0.000E+00 703
10 1.953E-37 1.957E-37 0.000E+00 667
5 underflows
5 underflows reported by underflow_count@ |
For i = 6..10, underflow occurred and the ISR took several hundred cycles, and this is reasonable.
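Why the underflows start at i = 6 can be checked against the smallest normal single-precision value, about 1.175E-38: the exact difference at i = 5 is 1.25E-38, still normal, while at i = 6 it is 6.25E-39, which falls below that limit. The following is a minimal sketch in standard Fortran, not the FTN95 test itself. It assumes default REAL is IEEE single precision and makes no use of `cpu_clock@` or `underflow_count@`.

```fortran
program denormal_onset
  implicit none
  real :: x, y, dx
  integer :: i

  x = 1e-34
  y = x + 2e-37
  print *, 'smallest normal REAL:', tiny(dx)
  do i = 1, 10
    dx = y - x
    ! The logical column is T once the difference has dropped below
    ! the normal range (it may print as a denormal or as zero,
    ! depending on how the compiler and run time treat underflow).
    print '(i3, es12.3, 2x, l1)', i, dx, abs(dx) < tiny(dx)
    x = x*0.5
    y = y*0.5
  end do
end program denormal_onset
```

The logical column should switch from F to T at i = 6, the same row at which the cycle counts in the output above jump from a few tens to several hundred.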
If line 17 of the program source (the commented-out dxs=dxs+dx statement) is activated, then with /opt one might expect the run to be unaffected, since the variable dxs is never used and could be optimised away. The output, however, is drastically different.
| Code: |
i x y y - x CPU-cyc
1 1.000E-34 1.002E-34 2.000E-37 24
2 5.000E-35 5.010E-35 1.000E-37 28
3 2.500E-35 2.505E-35 5.000E-38 28
4 1.250E-35 1.252E-35 2.500E-38 24
5 6.250E-36 6.262E-36 1.250E-38 24
6 3.125E-36 3.131E-36 0.000E+00 106052
7 1.563E-36 1.566E-36 0.000E+00 60368
8 7.813E-37 7.828E-37 0.000E+00 47272
9 3.906E-37 3.914E-37 0.000E+00 76106
10 1.953E-37 1.957E-37 0.000E+00 74442
5 underflows
5 underflows reported by underflow_count@
|
Would you please try the unreleased new version of the compiler on this program? |
|
 |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8283 Location: Salford, UK
|
Posted: Wed Mar 04, 2015 8:12 pm Post subject: |
|
|
Thanks, but I am very pressed for time.
I am happy to provide the functionality so that users can decide to use it or not as they see fit. |
|
 |
mecej4
Joined: 31 Oct 2006 Posts: 1917
|
Posted: Thu Mar 05, 2015 1:34 am Post subject: |
|
|
Good, I shall run the tests when the new release becomes available. Thanks. |
|
 |