Polyhedron Benchmark tests
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7916
Location: Salford, UK

Posted: Sun Mar 22, 2015 8:59 am

Eddie
There is a good chance that your best salflibc.dll will work with FTN77.
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Sun Mar 22, 2015 10:36 am

Paul, I suspected as much. Now, I wonder what the benchmarks do with FTN77? (Or does it have the same back end as FTN95?)

Eddie
mecej4



Joined: 31 Oct 2006
Posts: 1885

Posted: Sun Mar 22, 2015 12:08 pm

LitusSaxonicum wrote:
Now, I wonder what the benchmarks do with FTN77? (Or does it have the same back end as FTN95?)
Eddie
The current Polyhedron benchmarks are in F90+, so if you want to use FTN77 you will need to dig up older versions of the benchmarks written in F77.

Of course, there are lots of other benchmarks, in F77 as well as in F90+.
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Sun Mar 22, 2015 12:11 pm

Dan wrote:
Quote:
John, You are doing and have done great job speeding up the not optimized codes but your idea that you can beat the compiler optimization goes against the basic trend

I think your point is valid. However I have always considered it useful to identify where compilers perform poorly and then understand why this is the case.

The latest test I have reviewed is FATIGUE2.
The best time reported was for Lahey GNU at 31.09 seconds
FTN95 is reported at 263.88 seconds
My test was 324.75 seconds
My revised test is 168.88 seconds.
Again, this is an improvement of about 50% (324.75 down to 168.88 seconds), but still significantly slower than the Lahey compilation.

For FTN95, I reviewed the run times and found that a significant amount of time is spent managing the call to subroutine perdida. There are 585,898,984 calls to this routine!

Code:
                 call perdida (dt, lambda, mu, yield_stress, R_infinity, b, X_infinity,     &
                               gamma, eta, plastic_strain_threshold, stress_tensor(:,:,n),  &
                               strain_tensor(:,:,n), plastic_strain_tensor(:,:,n),          &
                               strain_rate_tensor(:,:,n), accumulated_plastic_strain(n),    &
                               back_stress_tensor(:,:,n), isotropic_hardening_stress(n),    &
                               damage(n), failure_threshold, crack_closure_parameter)
...
      subroutine perdida (dt, lambda, mu, yield_stress, R_infinity, b, X_infinity, gamma,   &
                          eta, plastic_strain_threshold, stress_tensor, strain_tensor,      &
                          plastic_strain_tensor, strain_rate_tensor,                        &
                          accumulated_plastic_strain, back_stress_tensor,                   &
                          isotropic_hardening_stress, damage, failure_threshold,            &
                          crack_closure_parameter)
!
      real (kind = LONGreal), intent(in) :: dt, yield_stress, lambda, mu, R_infinity, b,    &
                                            X_infinity, gamma, eta, failure_threshold,      &
                                            plastic_strain_threshold,                       &
                                            crack_closure_parameter
      real (kind = LONGreal), dimension(:,:), intent(in) :: strain_rate_tensor,             &
                                                            strain_tensor
      real (kind = LONGreal), dimension(:,:), intent(inout) :: plastic_strain_tensor,       &
                                                               back_stress_tensor
      real (kind = LONGreal), dimension(:,:), intent(out) :: stress_tensor
      real (kind = LONGreal), intent(inout) :: damage, accumulated_plastic_strain,          &
                                               isotropic_hardening_stress
!

The main change I made was to declare the array-section arguments with explicit dimension (3,3), which they all are, and to use F77 addressing in the call.
What is interesting in this example is the combination: dimension(:,:), intent(in) ::
I assume FTN95 is making temporary copies of the arguments: for intent(in) the copy is not written back, while for intent(out) the copy is written back on return. FTN95 is using about 150 seconds of run time just manipulating these temporary copies of the 3x3 arrays. I am not sure whether FTN95 is enforcing the intent, or whether the intent is only a rule that should be checked.
The other compilers benefit from SSE instructions, which could bring the run time down to 80 seconds, but Lahey's 31 seconds must come from identifying other efficiencies.
Array sections are one of FTN95's Achilles' heels.
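
To make the change concrete, here is a minimal sketch of the two calling styles. It is not the FATIGUE2 code: the routine names, the constant mu and the array sizes are made up for illustration, and LONGreal is assumed to be a double precision kind. The first call passes array sections to assumed-shape dummies; the second passes the first element of each 3x3 slab to explicit (3,3) dummies, which is the F77 addressing described above.

```fortran
program section_call_sketch
   implicit none
   ! assumption: LONGreal is a double precision kind
   integer, parameter :: LONGreal = selected_real_kind (15, 90)
   integer, parameter :: n_points = 5                      ! hypothetical size
   real (kind = LONGreal), parameter :: mu = 2.5_LONGreal  ! hypothetical constant
   real (kind = LONGreal) :: strain(3,3,n_points)
   real (kind = LONGreal) :: stress_a(3,3,n_points), stress_b(3,3,n_points)
   integer :: i, j, n

   ! fill the input with recognisable values
   do n = 1, n_points
      do j = 1, 3
         do i = 1, 3
            strain(i,j,n) = real (i + 3*j + 10*n, LONGreal)
         end do
      end do
   end do

   do n = 1, n_points
      ! F90 style: array sections passed to assumed-shape dummies
      call scale_assumed (mu, strain(:,:,n), stress_a(:,:,n))
      ! F77 style: pass the first element, the dummy is explicit (3,3)
      call scale_explicit (mu, strain(1,1,n), stress_b(1,1,n))
   end do

   ! both styles must give the same numbers
   print *, 'results agree: ', all (stress_a == stress_b)

contains

   subroutine scale_assumed (mu, strain_tensor, stress_tensor)
      real (kind = LONGreal), intent(in) :: mu
      real (kind = LONGreal), dimension(:,:), intent(in)  :: strain_tensor
      real (kind = LONGreal), dimension(:,:), intent(out) :: stress_tensor
      stress_tensor = 2 * mu * strain_tensor
   end subroutine scale_assumed

   subroutine scale_explicit (mu, strain_tensor, stress_tensor)
      real (kind = LONGreal), intent(in) :: mu
      real (kind = LONGreal), dimension(3,3), intent(in)  :: strain_tensor
      real (kind = LONGreal), dimension(3,3), intent(out) :: stress_tensor
      stress_tensor = 2 * mu * strain_tensor
   end subroutine scale_explicit

end program section_call_sketch
```

The F77 form relies on each (:,:,n) slab being contiguous in memory, which holds here because n is the last index. Whether it is actually faster depends on the compiler.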

John
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Sun Mar 22, 2015 7:46 pm

Thanks MECEJ4, it's not obvious from snippets that they are Fortran 90.

Over half a billion calls to a routine with all those parameters? You'd make a really big improvement if they were in COMMON! And that's without all the Fortran 90 stuff that John seems to think is costing so much time.

Plus if you are making over half a billion calls to a subroutine in the first place, the program structure is probably all to cock ...
John-Silver



Joined: 30 Jul 2013
Posts: 1520
Location: Aerospace Valley

Posted: Sun Mar 22, 2015 8:10 pm

I was thinking along the same lines as you Eddie when I saw the 1/2 billion.

What came to my mind was that codes' overall runtimes are not linear, are they?
The case of half a billion loops is a pretty extreme example, for which the best optimisation may be very interesting for the 0.1% of the computing world that needs it, but for Joe Bloggs with a 'normal' size (whatever that is) program, FTN95 will probably perform much closer to the best.
I like your analogy of the 2 min and 20 min runtimes, by the way. That's reality!
I think the big practical thing missing is a good tome about optimising a program's construction. I've already learned a lot being on here, the most basic being getting the DO loops in the right order, something which hadn't even occurred to me before tbh, simply because it has never been important. A 30 sec runtime is the same as a 10 minute runtime for most people: both are a cup of coffee long, and if you're running, say, 50 of those a day then something is much more amiss than the program's optimisation.

It's analogous to FE modelling and the size of models ... people get lazy and start meshing like billy-o, just because they can, and end up with a mesh 10 times too fine and hence a model 100 times too big globally, and hence 1000+ times longer to run than need be. An extreme example, I know, but even a mesh 2 times too fine would easily result in around 20 times the runtime.
The problem with Fortran is that it's not always obvious where the reductions could be made.

I think less extreme benchmarks are equally valid for comparing compilers for this type of reason, because the aim should be to be 'optimum' for the highest percentage of users, not measured against the extreme programs only. It's unfair on the compilers to consider just the extreme case.
Of course the real problem is that, just as with FE models, computers are too powerful today, and most computing is way over the top as a result and creates more problems than it solves! That's a fact.
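
On the DO loop order point: Fortran stores arrays column by column, so the first index should normally vary in the innermost loop. The following is only a sketch with a made-up array size, timed with SYSTEM_CLOCK. It sums the same array twice, once in each loop order. The two sums must agree; the timings depend on the compiler, the optimisation level and the machine, and an optimising compiler may interchange the loops itself.

```fortran
program loop_order_sketch
   implicit none
   integer, parameter :: r8 = selected_real_kind (15)
   integer, parameter :: i8 = selected_int_kind (18)
   integer, parameter :: n = 2000                 ! hypothetical array size
   real (kind = r8), allocatable :: a(:,:)
   real (kind = r8) :: sum_fast, sum_slow
   integer (kind = i8) :: t0, t1, t2, rate
   integer :: i, j

   allocate (a(n,n))
   do j = 1, n
      do i = 1, n
         a(i,j) = real (mod (i + j, 7), r8)       ! small whole numbers, so sums are exact
      end do
   end do

   call system_clock (t0, rate)

   ! first index innermost: walks through memory contiguously
   sum_fast = 0
   do j = 1, n
      do i = 1, n
         sum_fast = sum_fast + a(i,j)
      end do
   end do
   call system_clock (t1)

   ! second index innermost: jumps n elements on every pass
   sum_slow = 0
   do i = 1, n
      do j = 1, n
         sum_slow = sum_slow + a(i,j)
      end do
   end do
   call system_clock (t2)

   print *, 'sums agree:        ', sum_fast == sum_slow
   if (rate > 0) then
      print *, 'first index inner: ', real (t1 - t0, r8) / real (rate, r8), ' seconds'
      print *, 'last index inner:  ', real (t2 - t1, r8) / real (rate, r8), ' seconds'
   end if
end program loop_order_sketch
```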
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Sun Mar 22, 2015 8:36 pm

John-Silver,

Kreitzberg & Shneiderman: 'The Elements of FORTRAN Style' (Harcourt Brace Jovanovich) is where I started. Still available on the internet. Probably less than half of it is relevant today, sadly. My first copy was loaned and never returned, then I got another ... I suspect you can't pop round to my house to borrow it!

There are needs for speed: you are about to crash onto the moon? You want tomorrow's weather forecast, and the run-time is 25 hours? You need speed then.

But mostly you don't, and the speed ratios of even 300 to one mean nothing if FTN95-compiled code executes in less than a second! Then, there's the business I already alluded to of useful speed.

Eddie
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

Posted: Mon Mar 23, 2015 12:40 am

John,
What high-res timer do you use here? I am now confused, so many were discussed before; can you please post its whole text and usage again? I need it for tuning some of my own stuff (unfortunately I have no time even to bend the painful nail in the shoe, let alone for anything else like the Polyhedron stuff) :(

Paul, what does this English word "backend" mean? I have some very wrong associations with it, but I am not a native English speaker :) By the way, I have some third-party parallel algebra 32-bit libraries, compiled by ancient MS and recent Intel Fortran, which somehow work with FTN95; will they work under 64-bit?
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Mon Mar 23, 2015 11:03 am

There are a number of good timers available. Mecej4 recently posted a simple routine with integer*8 access to rdtsc, which works like CPU_CLOCK@. The good timers are:

call system_clock (count_start, count_rate, count_max)

STDCALL QUERYPERFORMANCECOUNTER 'QueryPerformanceCounter' (REF):LOGICAL*4
STDCALL QUERYPERFORMANCEFREQUENCY 'QueryPerformanceFrequency' (REF):LOGICAL*4

cpu_clock@ (which uses the rdtsc instruction)

integer*8 function rdtsc_tick ()
integer*8 cnt1
!
! get rdtsc value
code
rdtsc
mov cnt1,eax
mov cnt1[4],edx
edoc
!
rdtsc_tick = cnt1
end function rdtsc_tick

Both CPU_CLOCK@ and rdtsc_tick tick at the processor clock rate and have a small call overhead. The problem is that you need to calibrate them, which can be achieved by timing an interval with SYSTEM_CLOCK and accumulating the ticks.

With FTN95, SYSTEM_CLOCK is accurate and easy to use, although rdtsc is much better for timing shorter duration events.
All these timers are elapsed time timers.
For each timer routine, I have developed 3 function types:
* integer*8 function RDTSC_TICK () returns the tick count
* integer*8 function RDTSC_RATE () returns the tick rate in ticks per second
* real*8 function RDTSC_SECONDS () returns the time in seconds
I have similar for SYSTEM_CLOCK_xxx and QUERYPERFORMANCE_xxx
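
As an illustration of the three function types, here is a sketch of what the SYSTEM_CLOCK variants could look like in standard Fortran. These are not the routines referred to above: the names system_clock_tick, system_clock_rate and system_clock_seconds are only assumed from the xxx pattern, and the module wrapper and the -1 return for a missing clock are my own choices.

```fortran
module system_clock_timers
   implicit none
   integer, parameter :: i8 = selected_int_kind (18)
   integer, parameter :: r8 = selected_real_kind (15)
contains

   ! returns the current tick count
   function system_clock_tick () result (tick)
      integer (kind = i8) :: tick
      call system_clock (count = tick)
   end function system_clock_tick

   ! returns the tick rate in ticks per second (0 if there is no clock)
   function system_clock_rate () result (rate)
      integer (kind = i8) :: rate
      call system_clock (count_rate = rate)
   end function system_clock_rate

   ! returns the time in seconds, or -1 if there is no clock
   function system_clock_seconds () result (secs)
      real (kind = r8) :: secs
      integer (kind = i8) :: rate
      rate = system_clock_rate ()
      if (rate > 0) then
         secs = real (system_clock_tick (), r8) / real (rate, r8)
      else
         secs = -1
      end if
   end function system_clock_seconds

end module system_clock_timers

program system_clock_demo
   use system_clock_timers
   implicit none
   real (kind = r8) :: t_start, t_end, x
   integer :: i

   print *, 'tick rate = ', system_clock_rate (), ' ticks per second'
   t_start = system_clock_seconds ()
   x = 0
   do i = 1, 5000000                      ! some work to time
      x = x + sqrt (real (i, r8))
   end do
   t_end = system_clock_seconds ()
   print *, 'sum       = ', x
   print *, 'elapsed   = ', t_end - t_start, ' seconds'
end program system_clock_demo
```

For very short intervals it is safer to subtract two tick counts first and divide by the rate afterwards, because converting a large tick count to real*8 seconds loses the low-order digits.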

------------------------------------------------------
All the other timers, including all CPU timers, are hopeless. They all report a tick value that is updated only 64 times per second. If your event is of the order of seconds, then these would be OK, but for accurate timing they are no good. (It depends on what you want.)

These timers include:
cpu_time (intrinsic)
date_and_time (intrinsic)
high_res_clock@ (ftn95)
dclock@ (ftn95)
clock@ (ftn95)
GetLocalTime (winapi)
GetTickCount (winapi)
GetProcessTimes (winapi)
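
One way to check the update interval of a timer for yourself is to call it in a tight loop and record the size of each step when the value changes. This is only a sketch using the CPU_TIME intrinsic, with a made-up number of samples and a spin limit as a guard. The step it reports depends on the compiler and operating system, so it will not necessarily show 1/64 second.

```fortran
program cpu_time_step_sketch
   implicit none
   integer, parameter :: r8 = selected_real_kind (15)
   integer, parameter :: n_steps = 5               ! hypothetical number of samples
   integer, parameter :: max_spins = 100000000     ! guard if the clock never moves
   real (kind = r8) :: t_last, t_now
   integer :: k, spins

   call cpu_time (t_last)
   do k = 1, n_steps
      ! spin until the reported value changes
      spins = 0
      do
         call cpu_time (t_now)
         spins = spins + 1
         if (t_now /= t_last .or. spins >= max_spins) exit
      end do
      write (*,'(a,i2,a,f12.9,a,i10,a)')  &
            ' step', k, ' =', t_now - t_last, ' seconds after', spins, ' calls'
      t_last = t_now
   end do
end program cpu_time_step_sketch
```

A large number of calls per step means the timer is being called much faster than it updates.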

I hope this answers your question.

John
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Mon Mar 23, 2015 11:07 am

The following is an example of my function RDTSC_Rate, which shows how to utilise the tick rate of each routine. Only RDTSC and CPU_CLOCK@ give a different value at each call; all the others can be called faster than their clock ticks, so successive calls can return the same value.
Code:
   integer*8 rdtsc_rate, rate
   external  rdtsc_rate
!
   rate = rdtsc_rate (10)
   rate = rdtsc_rate (100)
   rate = rdtsc_rate (1000)
   rate = rdtsc_rate (2500)
   rate = rdtsc_rate (10000)
   rate = rdtsc_rate (25000)
   rate = rdtsc_rate (100000)
   rate = rdtsc_rate (1000000)
   end
       
   integer*8 function rdtsc_rate (num_cycle)
!
! initialises rdtsc pointer and estimates rdtsc tick rate
!
!  calibrate using num_cycle of QueryPerform
!
    integer*4 num_cycle
!    integer*4, parameter :: num_cycle = 1000 
    integer*8  rd_list(0:2), last_rdtsc, rd_tick, ticks, call_rate
    integer*8  qu_list(0:2), last_query, query_tick, query_rate
    real*8     secs
    integer*4  i, kk, nt, i_list(0:2),  calls
!
    integer*8 :: known_rate = -1   ! or 2666701126  ticks per second ~ processor clock rate
    integer*8 rdtsc_tick, QueryPerformance_tick, QueryPerformance_rate
    external  rdtsc_tick, QueryPerformance_tick, QueryPerformance_rate
!
!    if ( known_rate <= 0) then
      write (*,11) 'rdtsc_rate Initialise     ', known_rate
      write (*,11) 'target number of cycles   ', num_cycle
!
      Query_rate = QueryPerformance_rate ()
!
!  Run both clocks to get time
      last_rdtsc = rdtsc_tick ()
      last_Query = QueryPerformance_tick ()
      kk   = -1
      nt   = 0
      do i = 0, huge(i)
         rd_tick    = rdtsc_tick ()
         Query_tick = QueryPerformance_tick ()
         if ( Query_tick == last_Query ) cycle
         if ( kk < 2 ) then
           kk = kk+1
         else
           nt = nt+1
         end if
         i_list(kk)  = i
         rd_list(kk) = rd_tick
         last_rdtsc  = rd_tick 
         qu_list(kk) = Query_tick
         last_query  = Query_tick
         if ( nt > num_cycle ) exit
      end do
!
!   number of ticks of RDTSC
      calls      = i_list(2)  - i_list(1)
      ticks      = rd_list(2) - rd_list(1)
      query_tick = qu_list(2) - qu_list(1)
!       
      secs       = dble (query_tick) / dble(query_rate)
      rdtsc_rate = dble (ticks) / secs
      call_rate  = dble (calls) / secs
!
!      rdtsc_rate = known_rate
!
      write (*,11) 'rdtsc_tick cycles    =', calls,      ' calls'
      write (*,11) 'Number of cycles     =', nt,         ' ticks'
      write (*,11) 'query perform ticks  =', query_tick, ' ticks'
      write (*,12) 'initialise duration  =', secs,       ' seconds'
      write (*,11) 'rdtsc_tick duration  =', ticks,      ' ticks'
      write (*,11) 'rdtsc_tick rate      =', call_rate,  ' calls per second'
!
      write (*,11) 'rdtsc rate           =', rdtsc_rate, ' ticks per second'
      write (*,11) 'change in rate       =', rdtsc_rate-known_rate, ' ticks per second'
      write (*,10) 'rdtsc_tick initialised'
      write (*,10) ' '
      known_rate = rdtsc_rate
!    else
!      rdtsc_rate = known_rate
!    end if
10 format (3x,a)
11 format (3x,a,i15,a)
12 format (3x,a,f15.7,a)
!
end function rdtsc_rate

JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Mon Mar 23, 2015 11:09 am

The QueryPerformance routines (and rdtsc_tick) are:
Code:
! QueryPerformanceCounter   Windows API routine
    real*8 function QueryPerformance_sec ()
      integer*8 :: tick
      real*8    :: tick_rate = -1
      integer*8 QueryPerformance_rate, QueryPerformance_tick
      external  QueryPerformance_rate, QueryPerformance_tick
!
      if (tick_rate < 0)  &
      tick_rate            = QueryPerformance_rate ()
      tick                 = QueryPerformance_tick ()
      QueryPerformance_sec = dble(tick) / tick_rate
    end function QueryPerformance_sec

    integer*8 function QueryPerformance_tick ()
      STDCALL   QUERYPERFORMANCECOUNTER 'QueryPerformanceCounter' (REF):LOGICAL*4
      logical*4 ll
      integer*8 tick
!
      ll    = QUERYPERFORMANCECOUNTER (tick)
      QueryPerformance_tick = tick
    end function QueryPerformance_tick

    integer*8 function QueryPerformance_rate ()
      STDCALL   QUERYPERFORMANCEFREQUENCY 'QueryPerformanceFrequency' (REF):LOGICAL*4
      logical*4 ll
      integer*8 tick_rate
!
      ll    = QUERYPERFORMANCEFREQUENCY (tick_rate)
      write (*,*) 'QueryPerformance', tick_rate,' ticks per second'
      QueryPerformance_rate = tick_rate
    end function QueryPerformance_rate

   integer*8 function rdtsc_tick ()
      integer*8 cnt1
!
!  get rdtsc value
       code
         rdtsc
         mov cnt1,eax
         mov cnt1[4],edx
       edoc
!
       rdtsc_tick = cnt1
   end function rdtsc_tick



Dan, I hope this answers your question.
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Mon Mar 23, 2015 12:59 pm

Dan,

Good enough explanations for front end and back end are in http://en.wikipedia.org/wiki/Compiler under the section 'Structure of a compiler'.

It's the bit that is machine and OS dependent I think.

Eddie
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Tue Mar 24, 2015 4:47 am

Eddie,

Well, I guess this post is about the FTN95 front end.
I have now reviewed 9 of the test files and I shall update the results soon.

The last test example I have reviewed is mdbx.f90. This one has sapped my enthusiasm and I must admit I would prefer to have an optimising compiler to do all the changes this one needs. Dan is right and this is a job for the compiler.

mdbx.f90 is full of lines of lengthy calculations. It is like a finite element program, where the element stiffness matrices are being generated but not solved. The majority of the calculation time involves lengthy formulas. Restructuring the code to group repeated calculations would be a risky approach across so many lines of code and should probably not be done.

In this case, the formulas are coded just as they were defined, and the programmer has not optimised them by grouping repeated formula snippets or by moving calculations that could sit outside the inner loop.
I actually agree with this programming approach, as it documents the theory being applied. I am not sure if this program requires optimum run time, but an optimising compiler would help.
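
For anyone unsure what this kind of hand optimisation looks like, here is a small sketch. The formula is made up for illustration and has nothing to do with mdbx.f90. The first loop is written straight from the formula, while the second computes the loop-invariant sin and cos terms once outside the loop, which is the sort of change an optimising compiler may well make by itself.

```fortran
program hoist_sketch
   implicit none
   integer, parameter :: r8 = selected_real_kind (15)
   integer, parameter :: n = 1000                  ! hypothetical loop length
   real (kind = r8) :: x(n), y_formula(n), y_hoisted(n)
   real (kind = r8) :: theta, a, b, s, c
   integer :: i

   theta = 0.3_r8
   a     = 2.0_r8
   b     = 0.5_r8
   do i = 1, n
      x(i) = real (i, r8) / real (n, r8)
   end do

   ! written as the formula is defined: sin and cos appear in every pass
   do i = 1, n
      y_formula(i) = a * sin (theta) * x(i) + b * cos (theta) * x(i)**2  &
                   + sin (theta) * cos (theta)
   end do

   ! hand-optimised: loop-invariant terms are computed once
   s = sin (theta)
   c = cos (theta)
   do i = 1, n
      y_hoisted(i) = a * s * x(i) + b * c * x(i)**2 + s * c
   end do

   ! the two versions should agree to rounding error
   print *, 'largest difference = ', maxval (abs (y_formula - y_hoisted))
end program hoist_sketch
```

The second form is faster on a compiler that does not hoist, but it no longer reads like the theory, which is the trade-off described above.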

I would expect that ifort's vectorisation, cache utilisation and inner loop smarts are very useful in this case.

The most important result from this program would be the correct answer, which FTN95 would provide.

I shall summarise the other tests in the next post. There are some useful results that identify the coding approaches that need better attention in FTN95 or can be easily avoided by the programmer.

John
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Tue Mar 24, 2015 5:42 pm

John,

The programmer is in front of the front end! (And sometimes slows everything down like the man with the red flag who was at one time required to walk in front of a steam-powered road vehicle).

I looked at some of the Polyhedron stuff ages ago, decided that I didn't like it and moved on. Perhaps you can tell us whether compiling for .NET slows things down further, and indeed, if it is a WINAPP, what impact that has. I find these things fascinating in a quasi-theological sense, because for me, I only need so many angels to be able to dance simultaneously on the head of a pin ... usually one, sometimes two, and theologians agree that the limit is higher even if they don't agree what it is.

The compiler I most hated that produced a different answer to the rest is one of the fastest now, but the version I have (but don't use) is lots of versions old, so I haven't named it - FTN77 with DBOS worked straight out of the box and Clearwin+ is without equal. I'm old enough that the phrase 'on different computers' actually means on radically different hardware, e.g. IBM, ICL, CDC, Univac, Burroughs, VAX, Elliott/NCR, Pr1me, PC ...

Eddie
All times are GMT + 1 Hour