Silverfrost Forums

Polyhedron Benchmark tests

22 Mar 2015 11:40 #15947

John, which high-res timer do you use here? I am now confused, since so many were discussed before. Can you please post its full text and usage again? I need it for tuning some of my own stuff (unfortunately I have no time even to bend the painful nail in the shoe, let alone for anything else like the Polyhedron stuff) 😦

Paul, what does the English word 'backend' mean? I have some very wrong associations with it, but I am not a native English speaker 😃. By the way, I have some third-party parallel algebra 32-bit libraries, compiled with an ancient MS Fortran and a recent Intel Fortran, which somehow work with FTN95. Will they work under 64-bit?

23 Mar 2015 10:03 #15954

There are a number of good timers available. Mecej4 recently posted a simple routine with integer*8 access to rdtsc, which works like CPU_CLOCK@. The good timers are:

call system_clock (count_start, count_rate, count_max)

  STDCALL   QUERYPERFORMANCECOUNTER 'QueryPerformanceCounter' (REF):LOGICAL*4
  STDCALL   QUERYPERFORMANCEFREQUENCY 'QueryPerformanceFrequency' (REF):LOGICAL*4

cpu_clock@ (which uses the rdtsc instruction)

   integer*8 function rdtsc_tick ()
      integer*8 cnt1
!
!  get rdtsc value
       code
         rdtsc
         mov cnt1,eax
         mov cnt1[4],edx
       edoc
!
       rdtsc_tick = cnt1
   end function rdtsc_tick

Both CPU_CLOCK@ and rdtsc_tick tick at the processor clock rate and have a small call overhead. The problem is that you need to calibrate them, which can be done by timing an interval with SYSTEM_CLOCK and counting the ticks accumulated over it.

With FTN95, SYSTEM_CLOCK is accurate and easy to use, although rdtsc is much better for timing shorter duration events. All these timers are elapsed time timers. For each timer routine, I have developed 3 function types:

  • integer*8 function RDTSC_TICK () returns the tick count
  • integer*8 function RDTSC_RATE () returns the tick rate in ticks per second
  • real*8 function RDTSC_SECONDS () returns the time in seconds

I have similar functions for SYSTEM_CLOCK_xxx and QUERYPERFORMANCE_xxx.
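A minimal sketch of what such a tick / rate / seconds trio could look like for SYSTEM_CLOCK. This is not the code from the post: the module name, the kind parameters i8 and r8, and the function names are assumptions for illustration, and it assumes the compiler accepts 64-bit integer arguments to SYSTEM_CLOCK and that a clock is actually present (rate > 0).

```fortran
module system_clock_timer
   implicit none
   ! Portable kinds: at least 18 decimal digits (64-bit) and double precision.
   integer, parameter :: i8 = selected_int_kind (18)
   integer, parameter :: r8 = selected_real_kind (15)
contains

   ! Current tick count of the SYSTEM_CLOCK counter.
   function system_clock_tick () result (tick)
      integer(i8) :: tick
      call system_clock (count=tick)
   end function system_clock_tick

   ! Tick rate in ticks per second (0 would mean no clock is available).
   function system_clock_rate () result (rate)
      integer(i8) :: rate
      call system_clock (count_rate=rate)
   end function system_clock_rate

   ! Elapsed-time reading in seconds; the rate is queried once and kept.
   function system_clock_seconds () result (secs)
      real(r8) :: secs
      integer(i8), save :: rate = -1
      if (rate <= 0) rate = system_clock_rate ()   ! assumes a clock exists
      secs = real (system_clock_tick (), r8) / real (rate, r8)
   end function system_clock_seconds

end module system_clock_timer
```

To time an event, take system_clock_seconds () before and after it and subtract; for very short events the tick form avoids the floating-point conversion until the end.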

All the other timers, including all the CPU timers, are hopeless. They all report a tick value that is updated only 64 times per second. If your event is of the order of seconds, then these would be OK, but for accurate timing they are no good. (It depends on what you want.)

These timers include:

  • cpu_time (intrinsic)
  • date_and_time (intrinsic)
  • high_res_clock@ (ftn95)
  • dclock@ (ftn95)
  • clock@ (ftn95)
  • GetLocalTime (winapi)
  • GetTickCount (winapi)
  • GetProcessTimes (winapi)
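One way to see a timer's update interval for yourself is to spin until its reported value changes and look at the size of the jump. The sketch below does this for the CPU_TIME and SYSTEM_CLOCK intrinsics. The module and function names are hypothetical, and the result depends on the compiler and operating system, so no particular output is claimed here.

```fortran
module timer_step
   implicit none
   integer, parameter :: i8 = selected_int_kind (18)
   integer, parameter :: r8 = selected_real_kind (15)
contains

   ! Spin until CPU_TIME reports a new value; return the jump in seconds.
   ! Returns -1 if the processor reports that no CPU clock is available.
   function cpu_time_step () result (step)
      real(r8) :: step, t0, t1
      step = -1.0_r8
      call cpu_time (t0)
      if (t0 < 0.0_r8) return
      do
         call cpu_time (t1)
         if (t1 > t0) exit
      end do
      step = t1 - t0
   end function cpu_time_step

   ! Same idea for SYSTEM_CLOCK: the jump in ticks divided by the tick rate.
   ! Returns -1 if no clock is available (rate <= 0).
   function system_clock_step () result (step)
      real(r8)    :: step
      integer(i8) :: c0, c1, rate
      step = -1.0_r8
      call system_clock (c0, rate)
      if (rate <= 0) return
      do
         call system_clock (count=c1)
         if (c1 /= c0) exit
      end do
      step = real (c1 - c0, r8) / real (rate, r8)
   end function system_clock_step

end module timer_step
```

A timer that updates 64 times per second would show a step of about 0.0156 seconds; a timer suited to short events should show a step several orders of magnitude smaller.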

I hope this answers your question.

John

23 Mar 2015 10:07 #15955

The following is my example of the function RDTSC_Rate, which shows how to utilise the tick rate of each routine. Only RDTSC and CPU_CLOCK@ give a different value at each call; all the others can be called faster than their clock ticks, so successive calls may return the same value.

      integer*8 rdtsc_rate, rate
      external  rdtsc_rate
!
      rate = rdtsc_rate (10)
      rate = rdtsc_rate (100)
      rate = rdtsc_rate (1000)
      rate = rdtsc_rate (2500)
      rate = rdtsc_rate (10000)
      rate = rdtsc_rate (25000)
      rate = rdtsc_rate (100000)
      rate = rdtsc_rate (1000000)
      end

   integer*8 function rdtsc_rate (num_cycle)
!
! initialises rdtsc pointer and estimates rdtsc tick rate
!
!  calibrate using num_cycle of QueryPerform
!
    integer*4 num_cycle
!    integer*4, parameter :: num_cycle = 1000  
    integer*8  rd_list(0:2), last_rdtsc, rd_tick, ticks, call_rate
    integer*8  qu_list(0:2), last_query, query_tick, query_rate
    real*8     secs
    integer*4  i, kk, nt, i_list(0:2),  calls
!
    integer*8 :: known_rate = -1   ! or 2666701126  ticks per second ~ processor clock rate
    integer*8 rdtsc_tick, QueryPerformance_tick, QueryPerformance_rate
    external  rdtsc_tick, QueryPerformance_tick, QueryPerformance_rate
!
!    if ( known_rate <= 0) then
      write (*,11) 'rdtsc_rate Initialise     ', known_rate
      write (*,11) 'target number of cycles   ', num_cycle
!
      Query_rate = QueryPerformance_rate ()
!
!  Run both clocks to get time
      last_rdtsc = rdtsc_tick ()
      last_Query = QueryPerformance_tick ()
      kk   = -1
      nt   = 0
      do i = 0, huge(i)
         rd_tick    = rdtsc_tick ()
         Query_tick = QueryPerformance_tick ()
         if ( Query_tick == last_Query ) cycle
         if ( kk < 2 ) then
           kk = kk+1
         else
           nt = nt+1
         end if
         i_list(kk)  = i
         rd_list(kk) = rd_tick
         last_rdtsc  = rd_tick  
         qu_list(kk) = Query_tick
         last_query  = Query_tick
         if ( nt > num_cycle ) exit
      end do
!
!   number of ticks of RDTSC
      calls      = i_list(2)  - i_list(1)
      ticks      = rd_list(2) - rd_list(1)
      query_tick = qu_list(2) - qu_list(1)
!        
      secs       = dble (query_tick) / dble(query_rate)
      rdtsc_rate = dble (ticks) / secs
      call_rate  = dble (calls) / secs
!
!      rdtsc_rate = known_rate
!
      write (*,11) 'rdtsc_tick cycles    =', calls,      ' calls'
      write (*,11) 'Number of cycles     =', nt,         ' ticks'
      write (*,11) 'query perform ticks  =', query_tick, ' ticks'
      write (*,12) 'initialise duration  =', secs,       ' seconds'
      write (*,11) 'rdtsc_tick duration  =', ticks,      ' ticks'
      write (*,11) 'rdtsc_tick rate      =', call_rate,  ' calls per second'
!
      write (*,11) 'rdtsc rate           =', rdtsc_rate, ' ticks per second'
      write (*,11) 'change in rate       =', rdtsc_rate-known_rate, ' ticks per second'
      write (*,10) 'rdtsc_tick initialised'
      write (*,10) ' '
      known_rate = rdtsc_rate
!    else
!      rdtsc_rate = known_rate
!    end if
10 format (3x,a)
11 format (3x,a,i15,a)
12 format (3x,a,f15.7,a)
!
end function rdtsc_rate

23 Mar 2015 10:09 #15956

The QueryPerformance routines are:

! QueryPerformanceCounter   Windows API routine
    real*8 function QueryPerformance_sec ()
      integer*8 :: tick
      real*8    :: tick_rate = -1
      integer*8 QueryPerformance_rate, QueryPerformance_tick
      external  QueryPerformance_rate, QueryPerformance_tick
!
      if (tick_rate < 0)  &
      tick_rate            = QueryPerformance_rate ()
      tick                 = QueryPerformance_tick ()
      QueryPerformance_sec = dble(tick) / tick_rate
    end function QueryPerformance_sec

    integer*8 function QueryPerformance_tick ()
      STDCALL   QUERYPERFORMANCECOUNTER 'QueryPerformanceCounter' (REF):LOGICAL*4
      logical*4 ll
      integer*8 tick
!
      ll    = QUERYPERFORMANCECOUNTER (tick)
      QueryPerformance_tick = tick
    end function QueryPerformance_tick

    integer*8 function QueryPerformance_rate ()
      STDCALL   QUERYPERFORMANCEFREQUENCY 'QueryPerformanceFrequency' (REF):LOGICAL*4
      logical*4 ll
      integer*8 tick_rate
!
      ll    = QUERYPERFORMANCEFREQUENCY (tick_rate)
      write (*,*) 'QueryPerformance', tick_rate,' ticks per second'
      QueryPerformance_rate = tick_rate
    end function QueryPerformance_rate

   integer*8 function rdtsc_tick ()
      integer*8 cnt1 
!
!  get rdtsc value
       code 
         rdtsc 
         mov cnt1,eax 
         mov cnt1[4],edx 
       edoc 
!
       rdtsc_tick = cnt1
   end function rdtsc_tick

Dan, I hope this answers your question.

23 Mar 2015 11:59 #15958

Dan,

Good enough explanations for front end and back end are in http://en.wikipedia.org/wiki/Compiler under the section 'Structure of a compiler'.

The back end is the bit that is machine- and OS-dependent, I think.

Eddie

24 Mar 2015 3:47 #15971

Eddie,

Well, I guess this post is about the FTN95 front end. I have now reviewed 9 of the test files and I shall update the results soon.

The last test example I have reviewed is mdbx.f90. This one has sapped my enthusiasm and I must admit I would prefer to have an optimising compiler to do all the changes this one needs. Dan is right and this is a job for the compiler.

mdbx.f90 is full of lines of lengthy calculations. It is like a finite element program, where the element stiffness matrices are being generated but not solved. The majority of the calculation time involves lengthy formulas. Applying a programming restructure to group repeated calculations would be a dangerous approach for such an extensive number of code lines and should probably not be done.

In this case, the formulas are coded just as they were defined, and the programmer has not optimised them by grouping repeated formula snippets or by moving calculations outside the inner loop. I actually agree with this programming approach, as it documents the theory being applied. I am not sure if this program requires optimum run time, but an optimising compiler would help.
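To illustrate the kind of restructuring being discussed, here is a small hypothetical sketch. The Lennard-Jones-style formula, the module name and the function names are all made up for the illustration and are not taken from mdbx.f90. The first function keeps the formula as the theory states it; the second groups the repeated snippet and moves the loop-invariant factor outside the loop, which is the job an optimising compiler would ideally do on its own.

```fortran
module hoist_demo
   implicit none
   integer, parameter :: r8 = selected_real_kind (15)
contains

   ! Formula coded as written in the theory: sigma/r(i) is raised to a
   ! power twice and 4*eps is re-evaluated on every pass of the loop.
   function energy_as_written (r, n, sigma, eps) result (e)
      integer,  intent(in) :: n
      real(r8), intent(in) :: r(n), sigma, eps
      real(r8) :: e
      integer  :: i
      e = 0.0_r8
      do i = 1, n
         e = e + 4*eps*( (sigma/r(i))**12 - (sigma/r(i))**6 )
      end do
   end function energy_as_written

   ! Hand-optimised form: the repeated snippet (sigma/r)**6 is computed
   ! once per pass and the invariant 4*eps is moved outside the loop.
   function energy_hoisted (r, n, sigma, eps) result (e)
      integer,  intent(in) :: n
      real(r8), intent(in) :: r(n), sigma, eps
      real(r8) :: e, s6, four_eps
      integer  :: i
      four_eps = 4*eps
      e = 0.0_r8
      do i = 1, n
         s6 = (sigma/r(i))**6
         e  = e + four_eps*(s6*s6 - s6)
      end do
   end function energy_hoisted

end module hoist_demo
```

The two forms agree to rounding error, but the second no longer reads like the published formula, which is the documentation cost mentioned above. Across hundreds of such lines the hand edit also becomes error-prone.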

I would expect that ifort's vectorisation, cache utilisation and inner loop smarts are very useful in this case.

The most important result from this program would be the correct answer, which FTN95 would provide.

I shall summarise the other tests in the next post. There are some useful results that identify the coding approaches that need better attention in FTN95 or can be easily avoided by the programmer.

John

24 Mar 2015 4:42 #15977

John,

The programmer is in front of the front end! (And sometimes slows everything down like the man with the red flag who was at one time required to walk in front of a steam-powered road vehicle).

I looked at some of the Polyhedron stuff ages ago, decided that I didn't like it and moved on. Perhaps you can tell us whether compiling for .NET slows things down further and, if it is a WINAPP, what impact that has. I find these things fascinating in a quasi-theological sense, because for me, I only need so many angels to be able to dance simultaneously on the head of a pin ... usually one, sometimes two, and theologians agree that the limit is higher even if they don't agree what it is.

The compiler I most hated that produced a different answer to the rest is one of the fastest now, but the version I have (but don't use) is lots of versions old, so I haven't named it - FTN77 with DBOS worked straight out of the box and Clearwin+ is without equal. I'm old enough that the phrase 'on different computers' actually means on radically different hardware, e.g. IBM, ICL, CDC, Univac, Burroughs, VAX, Elliott/NCR, Pr1me, PC ...

Eddie
