 |
forums.silverfrost.com Welcome to the Silverfrost forums
|
| View previous topic :: View next topic |
| Author |
Message |
mecej4
Joined: 31 Oct 2006 Posts: 1917
|
Posted: Fri Feb 20, 2015 1:47 pm Post subject: |
|
|
Curiouser and curiouser! Create a single file with the two subroutines in reverse order compared to your first version, and with SALFENVAR=MASK_UNDERFLOW set, no crash occurs.
| Code: |
s:\FTN95> fsplit.exe r.f
Writing program unit: IONEQ100.f
Starting line: 6
Ending line : 522
Writing program unit: IONEQ.f
Starting line: 526
Ending line : 567
s:\ftn95>copy IONEQ.f+IONEQ100.f r1.f
s:\ftn95>set salfenvar=MASK_UNDERFLOW
s:\ftn95> ftn95 r1.f
|
I stripped down your code to a 39-line reproducer and posted it in a compiler bug report at http://forums.silverfrost.com/viewtopic.php?p=17331. |
|
| Back to top |
|
 |
DanRRight
Joined: 10 Mar 2008 Posts: 2945 Location: South Pole, Antarctica
|
Posted: Sat Feb 21, 2015 12:15 am Post subject: |
|
|
| Thanks, Mecej4, great, one more pretty subtle bug will be diagnozed by the compiler |
|
| Back to top |
|
 |
mecej4
Joined: 31 Oct 2006 Posts: 1917
|
Posted: Sat Feb 28, 2015 3:29 pm Post subject: |
|
|
Here is a short program, instrumented to measure the response time of the FTN95/SALFLLIBC.DLL underflow exception handler.
| Code: |
program test_underflow
implicit none
integer i,nund,ucount
!
real :: x,y,dx
integer*8 cnt1,cnt2,dt,sdt,rdtsc
external rdtsc
!
sdt=0
nund=0
x=1e-30
y=x+1e-34
write(*,*)' i x y y-x Cycle_count'
write(*,*)
! successively halve x and y, take difference
do i=1,20
cnt1=rdtsc()
dx=y-x
cnt2 = rdtsc()
dt=cnt2-cnt1
if(dt .gt. 1000)then
write(*,77)i,x,y,dx,dt
nund=nund+1
sdt=sdt+dt
endif
y=y*0.5d0
x=x*0.5d0
end do
! successively double x and y, take difference
write(*,*)
do i=21,40
y=y*2.d0
x=x*2.d0
cnt1=rdtsc()
dx=y-x
cnt2 = rdtsc()
dt=cnt2-cnt1
if(dt .gt. 1000)then
write(*,77)i,x,y,dx,dt
nund=nund+1
sdt=sdt+dt
endif
end do
write(*,*)
write(*,78)nund,float(sdt)/nund
call underflow_count@(ucount)
write(*,79)ucount
77 format(1x,i3,2x,3ES13.5,2x,i15)
78 format(1x,i3,2x,' underflows, average cycles/underflow = ',F7.0)
79 format(1x,i3,2x,' underflows reported by underflow_count@')
end program
|
Here is the assembly routine for the RDTSC counter:
| Code: |
.686
option dotname
.model flat
public _RDTSC
.text SEGMENT PUBLIC 'CODE'
_RDTSC PROC NEAR
rdtsc
ret
_RDTSC ENDP
.text ENDS
END
|
Build using (i) ftn95 undfl.f90 (ii) ml /coff /c rdtsc.asm (iii) slink undfl.obj rdtsc.obj. The typical output from the program:
| Code: |
i x y y-x Cycle_count
15 6.10352E-35 6.10413E-35 0.00000E+00 123356
16 3.05176E-35 3.05206E-35 0.00000E+00 100901
17 1.52588E-35 1.52603E-35 0.00000E+00 99889
18 7.62939E-36 7.63016E-36 0.00000E+00 48466
19 3.81470E-36 3.81508E-36 0.00000E+00 43033
20 1.90735E-36 1.90754E-36 0.00000E+00 40541
21 1.90735E-36 1.90754E-36 0.00000E+00 33125
22 3.81470E-36 3.81508E-36 0.00000E+00 37586
23 7.62939E-36 7.63016E-36 0.00000E+00 78620
24 1.52588E-35 1.52603E-35 0.00000E+00 55593
25 3.05176E-35 3.05206E-35 0.00000E+00 70569
26 6.10352E-35 6.10413E-35 0.00000E+00 46229
12 underflows, average cycles/underflow = 64826.
12 underflows reported by underflow_count@
|
Note that the cycle counts vary from run to run. When no underflow occurs, the cycle counts are about 50 per iteration of the loop. However, when underflow occurs, the exception handler is run and the cycle count jumps to about 50,000. |
|
| Back to top |
|
 |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8283 Location: Salford, UK
|
Posted: Sun Mar 01, 2015 10:03 am Post subject: |
|
|
Try taking out the WRITE statements that are within the loop to see what effect this has on the underflow count.
It is possible that underflows are also occurring within the standard output. |
|
| Back to top |
|
 |
mecej4
Joined: 31 Oct 2006 Posts: 1917
|
Posted: Sun Mar 01, 2015 10:33 am Post subject: |
|
|
Paul, I followed your suggestion, and did not see any change in the underflow counts reported by underflow_count@() with the WRITE statements commented out.
Let me also point out that the two statements that cause underflow are individually bracketed by calls to rdtsc(), so the RDTSC counts printed by the program do not cover any statements other than the two instances of dx = y - x .
By accident, I discovered something else that may help you with tracking down the speed issue: if the test program is compiled with /opt and run, the underflow count remains the same, but the cycle counts used up in underflow processing drop down from about 50000 (without /opt) to about 500 (with /opt) per underflow!
Earlier, when running John Campbell's much larger test code, I had always used /opt, and yet the underflow processing had consumed tens of of thousands of cycles per underflow. These conflicting findings cause me to wonder if, in compiling the larger and more complex code, there were instances of failure to optimize.
I will peek into the machine code generated with and without /opt to obtain some insight on what is happening, but I thought that I should let you know this strange fact without delay.
SDBG does not seem to support assembly level debugging, and the Registers display it can show is limited to CPU registers. On the other hand, debuggers such as WinDBG and the VS debugger do not understand the debug information provided by FTN95. Do you have any suggestions for resolving this?
Last edited by mecej4 on Sun Mar 01, 2015 6:22 pm; edited 1 time in total |
|
| Back to top |
|
 |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8283 Location: Salford, UK
|
Posted: Sun Mar 01, 2015 4:13 pm Post subject: |
|
|
Thanks for the information.
SDBG did support assembly level debugging (F11) not long ago. I will log this as a regression and pass it on for investigation. |
|
| Back to top |
|
 |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2623 Location: Sydney
|
Posted: Sun Mar 01, 2015 10:58 pm Post subject: |
|
|
For those who do not have access to the assembler compiler, the following provides a FTN95 interface to rdtsc.
| Code: |
integer*8 function rdtsc ()
real*10 time
time = cpu_clock@ ()
rdtsc = time
end function rdtsc
|
this gives the following similar results on my i5-2300
| Code: |
Program entered
i x y y-x Cycle_count
15 6.10352E-35 6.10413E-35 0.00000E+00 71480
16 3.05176E-35 3.05206E-35 0.00000E+00 40664
17 1.52588E-35 1.52603E-35 0.00000E+00 37320
18 7.62939E-36 7.63016E-36 0.00000E+00 35856
19 3.81470E-36 3.81508E-36 0.00000E+00 40096
20 1.90735E-36 1.90754E-36 0.00000E+00 31964
21 1.90735E-36 1.90754E-36 0.00000E+00 31284
22 3.81470E-36 3.81508E-36 0.00000E+00 30648
23 7.62939E-36 7.63016E-36 0.00000E+00 30640
24 1.52588E-35 1.52603E-35 0.00000E+00 30376
25 3.05176E-35 3.05206E-35 0.00000E+00 30260
26 6.10352E-35 6.10413E-35 0.00000E+00 30328
12 underflows, average cycles/underflow = 36743.
12 underflows reported by underflow_count@
|
I also change the write statements to be outside the if test, as
write(*,77)i,x,y,dx,dt
if(dt .gt. 1000)then
Now the program fails with /opt but runs without /opt.
Another related issue to the fpe problem ?
John |
|
| Back to top |
|
 |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2623 Location: Sydney
|
Posted: Mon Mar 02, 2015 12:11 am Post subject: |
|
|
More twists in this FPE problem !
I changed the test to:
| Code: |
program test_underflow
implicit none
integer i,nund,ucount
!
real :: x,y,dx
integer*8 cnt1,cnt2,dt,sdt,rdtsc
external rdtsc
!
sdt=0
nund=0
x=1e-30
y=x+1e-34
write(*,*)' i x y y-x Cycle_count'
write(*,*)
! successively halve x and y, take difference
do i=1,20
cnt1=rdtsc()
dx=y-x
cnt2 = rdtsc()
dt=cnt2-cnt1
write(*,77)i,x,y,dx,dt
if(dt .gt. 400)then
nund=nund+1
sdt=sdt+dt
endif
y=y*0.5d0
x=x*0.5d0
end do
! successively double x and y, take difference
write(*,*)
do i=21,40
y=y*2.d0
x=x*2.d0
cnt1=rdtsc()
dx=y-x
cnt2 = rdtsc()
dt=cnt2-cnt1
write(*,77)i,x,y,dx,dt
if(dt .gt. 400)then
nund=nund+1
sdt=sdt+dt
endif
end do
write (*,*) ' '
write (*,*) ' nund=',nund
write (*,78) nund, float(sdt)/nund
call underflow_count@ (ucount)
write (*,79) ucount
77 format(1x,i3,2x,3ES13.5,2x,i15)
78 format(1x,i3,2x,' underflows, average cycles/underflow = ',F7.0)
79 format(1x,i3,2x,' underflows reported by underflow_count@')
end program
integer*8 function rdtsc ()
real*10 time
time = cpu_clock@ ()
rdtsc = time
end function rdtsc |
My test .bat file is
| Code: |
echo test /debug
ftn95 fpe2 /debug /link
fpe2
echo test /p6
ftn95 fpe2 /p6 /link
fpe2
echo test /opt
ftn95 fpe2 /opt /link
fpe2
echo test /opt /p6
ftn95 fpe2 /opt /p6 /link
fpe2 |
I now get /opt runs much faster, than with previous tests.
Could cpu_clock@ be messing with this latest test ?
John |
|
| Back to top |
|
 |
jalih
Joined: 30 Jul 2012 Posts: 196
|
Posted: Mon Mar 02, 2015 6:56 am Post subject: Re: |
|
|
| JohnCampbell wrote: |
| For those who do not have access to the assembler compiler, the following provides a FTN95 interface to rdtsc. |
Using RDTSC for high resolution timing on todays processors with multiple cores, hyperthreading support and out of order execution is not recommended and should be used with care. Preferred approach for high resolution timing on Windows platform is to use the QueryPerformanceCounter() and QueryPerformanceFrequency(). |
|
| Back to top |
|
 |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8283 Location: Salford, UK
|
Posted: Mon Mar 02, 2015 8:01 am Post subject: |
|
|
For FTN95, the standard intrinsic SYSTEM_CLOCK is based on QueryPerformanceCounter() and QueryPerformanceFrequency().
mecej4
SDBG and function key F11 gives assembly level debugging on my machine but only when
a) running SDBG from a command line and
b) running SDBG from Plato but with the option "Integrate with SDBG" selected. |
|
| Back to top |
|
 |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2623 Location: Sydney
|
Posted: Mon Mar 02, 2015 9:06 am Post subject: |
|
|
Jalih,
I don't agree with your comment on RDTSC.
My understanding is that this gives a value of the system clock, which is based on a CPU clock rate. No other timer gives this precision.
While there can be all the problems of multi-threading and multiple cores, this is still a high precision time estimate.
Apart from QueryPerformanceCounter, most other windows timers are accurate to 1/64 second, which is much worse than any error that rdtsc may provide.
The big advantage of rdtsc is that it gives an estimate of cpu cycles.
On my pc, QueryPerformanceCounter is accurate to 2,728,242 ticks per second, which is about 1,000 CPU cycles. given it's problems, rdtsc is typically much better than this.
It would be good if there was an Integer*8 variant of CPU_CLOCK@, say CPU_CLOCK_TICK@, so that real*10 conversion was not required and minimise the call overhead.
My estimate is that cpu_clock@ has about a 50 cpu cycle overhead, while QueryPerformanceCounter is about 200 cpu cycles. ( Gettickcount has a lower overhead, but the answer is to 1/64 seconds. )
John |
|
| Back to top |
|
 |
jalih
Joined: 30 Jul 2012 Posts: 196
|
Posted: Mon Mar 02, 2015 10:08 am Post subject: Re: |
|
|
| JohnCampbell wrote: |
Jalih,
I don't agree with your comment on RDTSC.
My understanding is that this gives a value of the system clock, which is based on a CPU clock rate. No other timer gives this precision.
|
Unfortunately it's not that easy. On prosessors supporting "out-of-order" execution, you require serializing instruction like cpuid before using rdtsc. And what about motherboard with multiple processors, are these syncronized? Also todays processors use constant rate time stamp counter, and with these processors the time stamp counter always reads the maximum rate regardless of the actual CPU running rate. Remember, on todays hardware even the cpu speed may vary due to power saving or system may be hibernated and later resumed resetting the time stamp counter. |
|
| Back to top |
|
 |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2623 Location: Sydney
|
Posted: Mon Mar 02, 2015 11:31 am Post subject: |
|
|
Jalih,
Yes, it is not that easy. Depending on what problems you are prepared to ignore, you may need a different clock. Unfortunately most of the other timers have very poor precision. I have always found the lack of precision of most alternatives to be the biggest problem. There is a lot of activity that can occur in .015 seconds.
John |
|
| Back to top |
|
 |
mecej4
Joined: 31 Oct 2006 Posts: 1917
|
Posted: Mon Mar 02, 2015 11:51 am Post subject: |
|
|
As far as the small test program is concerned, I do not think that we really care to time it accurately. It is enough to use some measure of the CPU cycles that get consumed when two numbers are loaded from memory into the X87 registers, the difference taken, and the result saved back to memory. Secondly, we are interested in how this measure, however approximate it may be or related to a non-constant rate clock, comes out for two event sequences : (i) load, subtract, store, measure dt when no underflow occurs, (ii) the same, when an FPU underflow exception handler is invoked.
The assembler code segment is as follows, where I have added numbers to indicate the instruction latency for a Haswell CPU:
| Code: |
call _RDTSC
mov dword ptr [ebp-500h],eax 2
mov dword ptr [ebp-4FCh],edx 2
mov edi,dword ptr [ebp-500h] 2
mov ecx,dword ptr [ebp-4FCh] 2
mov dword ptr [ebp-414h],ecx 2
mov dword ptr [ebp-418h],edi 2
fld dword ptr [ebp-4E4h] 3
fsub dword ptr [ebp-4E8h] 3
fstp dword ptr [ebp-10h] 4
call _RDTSC
|
The two calls to _RDTSC could be replaced by inline RDTSC instructions. The total count is around 25 cycles. Now, on my laptop, which I have in "battery saving mode" the CPU may be running at half the maximum speed, but the RDTSC counter reports as if the CPU were running at the maximum speed. Therefore, the reported change in the RDTSC counter would be about 50 cycles, which agrees quite well with what I have seen from many runs of the program.
If an underflow and/or denormal exception occurs, the ISR is called between the FSUB and the RDTSC instructions. With the same program compiled with another compiler targeting X87, I see the count going to a few hundred when an exception is taken. With FTN95, we see around 20,000. The comparison between these three (50 w/o exception, 500 with exception-other compiler, 20000 with exception-FTN95), it seems to me, leads us to clear conclusions, even though individual numbers may be inaccurate or dependent on the specific CPU, timing technique, optimization, etc. |
|
| Back to top |
|
 |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2623 Location: Sydney
|
Posted: Mon Mar 02, 2015 10:37 pm Post subject: |
|
|
mecej4,
Try the adaptation I provided in the post above on Mon Mar 02, 2015 12:11 am.
Using the .bat file, I get different FPE cycle delay estimates, depending on the compile options.
| Code: |
/debug 12 underflows, average cycles/underflow = 31874.
no options 12 underflows, average cycles/underflow = 32019.
/p6 12 underflows, average cycles/underflow = 32023.
/opt 12 underflows, average cycles/underflow = 442.
/opt /p6 12 underflows, average cycles/underflow = 457.
|
This is not the case with different compile options with the larger test programs. Dan has often noted this problem of changing performance when moving to larger test programs.
John |
|
| Back to top |
|
 |
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
Powered by phpBB © 2001, 2005 phpBB Group
|