forums.silverfrost.com Welcome to the Silverfrost forums
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7931 Location: Salford, UK
Posted: Sun Mar 22, 2015 8:59 am
Eddie,
There is a good chance that your best salflibc.dll will work with FTN77.
|
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Sun Mar 22, 2015 10:36 am
Paul, I suspected as much. Now, I wonder what the benchmarks do with FTN77? (Or does it have the same back end as FTN95?)
Eddie
|
mecej4
Joined: 31 Oct 2006 Posts: 1888
Posted: Sun Mar 22, 2015 12:08 pm
LitusSaxonicum wrote:
> Now, I wonder what the benchmarks do with FTN77? (Or does it have the same back end as FTN95?)

The current Polyhedron benchmarks are in F90+, so if you want to use FTN77 you will need to dig up older versions of the benchmarks written in F77.
Of course, there are lots of other benchmarks, in F77 as well as in F90+.
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Sun Mar 22, 2015 12:11 pm
Dan wrote:
> John, You are doing and have done great job speeding up the not optimized codes but your idea that you can beat the compiler optimization goes against the basic trend
I think your point is valid. However, I have always considered it useful to identify where compilers perform poorly and then understand why this is the case.
The latest test I have reviewed is FATIGUE2:
* The best time reported was for Lahey GNU, at 31.09 seconds.
* FTN95 is reported at 263.88 seconds.
* My test was 324.75 seconds.
* My revised test is 168.88 seconds.
Again this is an improvement of about 50%, but still significantly slower than the Lahey compilation.
For FTN95, I reviewed the run times and found that a significant amount of time is spent managing the call to subroutine perdida. There are 585,898,984 calls to this routine!
Code:
call perdida (dt, lambda, mu, yield_stress, R_infinity, b, X_infinity, &
gamma, eta, plastic_strain_threshold, stress_tensor(:,:,n), &
strain_tensor(:,:,n), plastic_strain_tensor(:,:,n), &
strain_rate_tensor(:,:,n), accumulated_plastic_strain(n), &
back_stress_tensor(:,:,n), isotropic_hardening_stress(n), &
damage(n), failure_threshold, crack_closure_parameter)
...
subroutine perdida (dt, lambda, mu, yield_stress, R_infinity, b, X_infinity, gamma, &
eta, plastic_strain_threshold, stress_tensor, strain_tensor, &
plastic_strain_tensor, strain_rate_tensor, &
accumulated_plastic_strain, back_stress_tensor, &
isotropic_hardening_stress, damage, failure_threshold, &
crack_closure_parameter)
!
real (kind = LONGreal), intent(in) :: dt, yield_stress, lambda, mu, R_infinity, b, &
X_infinity, gamma, eta, failure_threshold, &
plastic_strain_threshold, &
crack_closure_parameter
real (kind = LONGreal), dimension(:,:), intent(in) :: strain_rate_tensor, &
strain_tensor
real (kind = LONGreal), dimension(:,:), intent(inout) :: plastic_strain_tensor, &
back_stress_tensor
real (kind = LONGreal), dimension(:,:), intent(out) :: stress_tensor
real (kind = LONGreal), intent(inout) :: damage, accumulated_plastic_strain, &
isotropic_hardening_stress
!
The main change I made was to declare the arrays with explicit dimension (3,3), which they all are, and to use F77-style addressing in the call.
What is interesting in this example is the combination: dimension(:,:), intent(in).
I assume FTN95 makes copies of the intent(in) arguments and does not copy them back, while for intent(out) it updates from the copy on return. FTN95 spends about 150 seconds of run time just manipulating these temporary copies of the 3x3 arrays. I am not sure if FTN95 is enforcing the intent, or if the intent is a rule that should merely be checked.
The other compilers benefit from SSE instructions, which could bring the run time down to 80 seconds, but Lahey's 31 seconds must reflect other efficiencies as well.
Array sections are one of FTN95's Achilles heels.
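To make the change concrete, here is a minimal sketch (hypothetical routine and variable names; the real perdida has many more arguments) of the difference between an assumed-shape dummy, which obliges the compiler to pass a descriptor and possibly build temporary copies, and an explicit-shape (3,3) dummy, which needs only an address, F77 style:

```fortran
! Assumed-shape dummy: requires an explicit interface, and FTN95 may
! create a temporary copy of the 3x3 section at every call
subroutine scale_slow (a, s)
   real*8, dimension(:,:), intent(inout) :: a
   real*8, intent(in) :: s
   a = a * s
end subroutine scale_slow

! Explicit-shape dummy: only the array address crosses the call boundary
subroutine scale_fast (a, s)
   real*8, dimension(3,3), intent(inout) :: a
   real*8, intent(in) :: s
   a = a * s
end subroutine scale_fast
```

At the call site the F77-style form passes the first element of the section, call scale_fast (tensor(1,1,n), s), instead of the section itself, call scale_slow (tensor(:,:,n), s).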
John
|
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Sun Mar 22, 2015 7:46 pm
Thanks mecej4, it's not obvious from the snippets that they are Fortran 90.
Over half a billion calls to a routine with all those parameters? You'd make a really big improvement if they were in COMMON! And that's without all the Fortran 90 stuff that John thinks is costing so much time.
Plus, if you are making over half a billion calls to a subroutine in the first place, the program structure is probably all to cock ...
|
John-Silver
Joined: 30 Jul 2013 Posts: 1520 Location: Aerospace Valley
Posted: Sun Mar 22, 2015 8:10 pm
I was thinking along the same lines as you, Eddie, when I saw the half billion.
What came into my mind was that codes' overall runtimes are not linear, are they?
The case of half a billion calls is a pretty extreme example, for which the optimum optimisation may be very interesting to the 0.1% of the computing world that needs it, but for Joe Bloggs with a 'normal' size (whatever that is) program, FTN95 will probably perform much closer to the best.
I like your analogy of the 2 min and 20 min runtimes, by the way. That's reality!
I think the big practical thing missing is a good tome about optimising a program's construction. I've already learned a lot being on here, the most basic being getting the order of the DO loops right, something which hadn't even occurred to me before, to be honest, simply because it has never been important. A 30-second runtime is the same as a 10-minute runtime for most people; both are a cup of coffee long, and if you're running say 50 of those a day then something is much more amiss than the program's optimisation.
It's analogous to FE modelling and the size of models: people get lazy and start meshing like billy-o, just because they can, and end up with a mesh 10 times too fine, hence a model 100 times too big globally, and hence 1000+ times longer to run than need be. An extreme example, I know, but even a mesh 2 times too fine would easily result in around 20 times the runtime.
The problem with Fortran is that it's not always obvious where the reductions could be made.
I think less-extreme benchmarks are equally valid in comparing compilers for this sort of reason, because the aim should be to be 'optimum' for the highest percentage of users, not measured against the extreme programs only. It's unfair on the compilers to consider just the extreme case.
Of course the real problem is, just as for FE models, that computers are too powerful today; most computing is way over the top as a result and creates more problems than it solves! That's a fact.
|
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Sun Mar 22, 2015 8:36 pm
John-Silver,
Kreitzberg & Shneiderman: 'The Elements of FORTRAN Style' (Harcourt Brace Jovanovich) is where I started. It is still available on the internet. Probably less than half of it is relevant today, sadly. My first copy was loaned and never returned, then I got another ... I suspect you can't pop round to my house to borrow it!
There are needs for speed: you are about to crash onto the Moon? You want tomorrow's weather forecast, and the run time is 25 hours? You need speed then.
But mostly you don't, and speed ratios of even 300 to one mean nothing if FTN95-compiled code executes in less than a second! Then there's the business I already alluded to of useful speed.
Eddie
|
DanRRight
Joined: 10 Mar 2008 Posts: 2826 Location: South Pole, Antarctica
Posted: Mon Mar 23, 2015 12:40 am
John,
Which high-res timer do you use here? I am confused now; so many were discussed before. Can you please post its whole text and usage again? I need it for tuning some of my own stuff (unfortunately there is no time even to bend the painful nail in my shoe, let alone for anything else like the Polyhedron stuff).
Paul, what does the English word "backend" mean? I have some very wrong associations with it, but I am not a native English speaker. By the way, I have some third-party 32-bit parallel algebra libraries, compiled by ancient MS and recent Intel Fortran, which somehow work with FTN95; will they work under 64 bits?
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Mon Mar 23, 2015 11:03 am
There are a number of good timers available. mecej4 recently posted a simple routine giving integer*8 access to RDTSC, which works like CPU_CLOCK@. The good timers are:
* call system_clock (count_start, count_rate, count_max)
* STDCALL QUERYPERFORMANCECOUNTER 'QueryPerformanceCounter' (REF):LOGICAL*4
* STDCALL QUERYPERFORMANCEFREQUENCY 'QueryPerformanceFrequency' (REF):LOGICAL*4
* cpu_clock@ (which uses the RDTSC instruction)
integer*8 function rdtsc_tick ()
   integer*8 cnt1
!
!  get rdtsc value
   code
      rdtsc
      mov cnt1,eax
      mov cnt1[4],edx
   edoc
!
   rdtsc_tick = cnt1
end function rdtsc_tick
Both RDTSC and CPU_CLOCK@ tick at the processor clock rate and have a small call overhead. The problem is that you need to calibrate them, which can be achieved by timing against SYSTEM_CLOCK and accumulating the ticks.
With FTN95, SYSTEM_CLOCK is accurate and easy to use, although rdtsc is much better for timing shorter duration events.
All these timers are elapsed time timers.
For each timer routine, I have developed 3 function types:
* integer*8 function RDTSC_TICK () returns the tick count
* integer*8 function RDTSC_RATE () returns the tick rate in ticks per second
* real*8 function RDTSC_SECONDS () returns the time in seconds
I have similar functions for SYSTEM_CLOCK_xxx and QUERYPERFORMANCE_xxx.
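As a sketch, the seconds-returning function can be built from the other two, calibrating once on the first call (this assumes the RDTSC_TICK and RDTSC_RATE routines shown in this thread; note that the later version of RDTSC_RATE takes a cycle-count argument):

```fortran
real*8 function rdtsc_seconds ()
   integer*8 rdtsc_tick, rdtsc_rate
   external rdtsc_tick, rdtsc_rate
   integer*8 :: rate = -1       ! saved between calls (implicit SAVE)
!
!  calibrate on the first call only, then reuse the saved rate
   if ( rate <= 0 ) rate = rdtsc_rate ()
   rdtsc_seconds = dble (rdtsc_tick ()) / dble (rate)
end function rdtsc_seconds
```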
------------------------------------------------------
All the other timers, including all the CPU timers, are hopeless: they report a tick value that is updated only 64 times per second. If your event is of the order of seconds these would be OK, but for accurate timing they are no good. (It depends on what you want.)
These timers include:
cpu_time (intrinsic)
date_and_time (intrinsic)
high_res_clock@ (ftn95)
dclock@ (ftn95)
clock@ (ftn95)
GetLocalTime (winapi)
GetTickCount (winapi)
GetProcessTimes (winapi)
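The 64-updates-per-second granularity is easy to demonstrate: poll one of these timers until its value changes and print the step. A minimal sketch using the CPU_TIME intrinsic (the measured step may come out slightly under the full 1/64 s = 0.015625 s, since the first reading can land mid-tick):

```fortran
program probe_tick
   real*8 t0, t1
!
!  spin until the reported time changes; cpu_time advances while we spin
   call cpu_time (t0)
   do
      call cpu_time (t1)
      if ( t1 > t0 ) exit
   end do
   write (*,*) 'timer step =', t1 - t0, ' seconds'
end program probe_tick
```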
I hope this answers your question.
John
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Mon Mar 23, 2015 11:07 am
The following is my example of Function RDTSC_RATE, which shows how to utilise the tick rate of each routine. Only RDTSC and CPU_CLOCK@ give a different value on each call; all the other timers can be called faster than their tick values update.
Code:
integer*8 rdtsc_rate, rate
external rdtsc_rate
!
rate = rdtsc_rate (10)
rate = rdtsc_rate (100)
rate = rdtsc_rate (1000)
rate = rdtsc_rate (2500)
rate = rdtsc_rate (10000)
rate = rdtsc_rate (25000)
rate = rdtsc_rate (100000)
rate = rdtsc_rate (1000000)
end

integer*8 function rdtsc_rate (num_cycle)
!
!  initialises the rdtsc pointer and estimates the rdtsc tick rate
!
!  calibrate using num_cycle ticks of QueryPerformanceCounter
!
   integer*4 num_cycle
!  integer*4, parameter :: num_cycle = 1000
   integer*8 rd_list(0:2), last_rdtsc, rd_tick, ticks, call_rate
   integer*8 qu_list(0:2), last_query, query_tick, query_rate
   real*8 secs
   integer*4 i, kk, nt, i_list(0:2), calls
!
   integer*8 :: known_rate = -1    ! or 2666701126 ticks per second ~ processor clock rate
   integer*8 rdtsc_tick, QueryPerformance_tick, QueryPerformance_rate
   external rdtsc_tick, QueryPerformance_tick, QueryPerformance_rate
!
!  if ( known_rate <= 0 ) then
   write (*,11) 'rdtsc_rate Initialise ', known_rate
   write (*,11) 'target number of cycles ', num_cycle
!
   Query_rate = QueryPerformance_rate ()
!
!  Run both clocks to get time
   last_rdtsc = rdtsc_tick ()
   last_Query = QueryPerformance_tick ()
   kk = -1
   nt = 0
   do i = 0, huge(i)
      rd_tick = rdtsc_tick ()
      Query_tick = QueryPerformance_tick ()
      if ( Query_tick == last_Query ) cycle
      if ( kk < 2 ) then
         kk = kk+1
      else
         nt = nt+1
      end if
      i_list(kk) = i
      rd_list(kk) = rd_tick
      last_rdtsc = rd_tick
      qu_list(kk) = Query_tick
      last_query = Query_tick
      if ( nt > num_cycle ) exit
   end do
!
!  number of ticks of RDTSC
   calls = i_list(2) - i_list(1)
   ticks = rd_list(2) - rd_list(1)
   query_tick = qu_list(2) - qu_list(1)
!
   secs = dble (query_tick) / dble (query_rate)
   rdtsc_rate = dble (ticks) / secs
   call_rate = dble (calls) / secs
!
!  rdtsc_rate = known_rate
!
   write (*,11) 'rdtsc_tick cycles   =', calls, ' calls'
   write (*,11) 'Number of cycles    =', nt, ' ticks'
   write (*,11) 'query perform ticks =', query_tick, ' ticks'
   write (*,12) 'initialise duration =', secs, ' seconds'
   write (*,11) 'rdtsc_tick duration =', ticks, ' ticks'
   write (*,11) 'rdtsc_tick rate     =', call_rate, ' calls per second'
!
   write (*,11) 'rdtsc rate          =', rdtsc_rate, ' ticks per second'
   write (*,11) 'change in rate      =', rdtsc_rate-known_rate, ' ticks per second'
   write (*,10) 'rdtsc_tick initialised'
   write (*,10) ' '
   known_rate = rdtsc_rate
!  else
!     rdtsc_rate = known_rate
!  end if
10 format (3x,a)
11 format (3x,a,i15,a)
12 format (3x,a,f15.7,a)
!
end function rdtsc_rate
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Mon Mar 23, 2015 11:09 am
The QueryPerformance routines are:
Code:
! QueryPerformanceCounter Windows API routine
real*8 function QueryPerformance_sec ()
   integer*8 :: tick
   real*8 :: tick_rate = -1
   integer*8 QueryPerformance_rate, QueryPerformance_tick
   external QueryPerformance_rate, QueryPerformance_tick
!
   if ( tick_rate < 0 ) &
      tick_rate = QueryPerformance_rate ()
   tick = QueryPerformance_tick ()
   QueryPerformance_sec = dble(tick) / tick_rate
end function QueryPerformance_sec

integer*8 function QueryPerformance_tick ()
   STDCALL QUERYPERFORMANCECOUNTER 'QueryPerformanceCounter' (REF):LOGICAL*4
   logical*4 ll
   integer*8 tick
!
   ll = QUERYPERFORMANCECOUNTER (tick)
   QueryPerformance_tick = tick
end function QueryPerformance_tick

integer*8 function QueryPerformance_rate ()
   STDCALL QUERYPERFORMANCEFREQUENCY 'QueryPerformanceFrequency' (REF):LOGICAL*4
   logical*4 ll
   integer*8 tick_rate
!
   ll = QUERYPERFORMANCEFREQUENCY (tick_rate)
   write (*,*) 'QueryPerformance', tick_rate, ' ticks per second'
   QueryPerformance_rate = tick_rate
end function QueryPerformance_rate

integer*8 function rdtsc_tick ()
   integer*8 cnt1
!
!  get rdtsc value
   code
      rdtsc
      mov cnt1,eax
      mov cnt1[4],edx
   edoc
!
   rdtsc_tick = cnt1
end function rdtsc_tick
Dan, I hope this answers your question.
|
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Mon Mar 23, 2015 12:59 pm
Dan,
Good enough explanations of 'front end' and 'back end' are in http://en.wikipedia.org/wiki/Compiler under the section 'Structure of a compiler'.
The back end is the part that is machine- and OS-dependent, I think.
Eddie
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Tue Mar 24, 2015 4:47 am
Eddie,
Well, I guess this post is about the FTN95 front end.
I have now reviewed 9 of the test files and I shall update the results soon.
The last test example I have reviewed is mdbx.f90. This one has sapped my enthusiasm, and I must admit I would prefer an optimising compiler to do all the changes this one needs. Dan is right: this is a job for the compiler.
mdbx.f90 is full of lines of lengthy calculations. It is like a finite element program in which the element stiffness matrices are being generated but not solved. The majority of the calculation time involves lengthy formulas. Manually restructuring the code to group repeated calculations would be a dangerous approach for such an extensive number of code lines and should probably not be attempted.
In this case the formulas are replicated as they were originally defined, and the optimisation of grouping repeated formula snippets, and of moving loop-invariant calculations outside the inner loop, has not been done by the programmer.
I actually agree with this programming approach, as it documents the theory being applied. I am not sure whether this program requires optimum run time, but an optimising compiler would help.
I would expect that ifort's vectorisation, cache utilisation and inner loop smarts are very useful in this case.
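For what it is worth, the hand optimisation being described, hoisting a repeated subexpression out of the inner loop, looks like this in miniature (hypothetical names, not code from mdbx.f90):

```fortran
! As written: lambda + 2*mu is recomputed on every pass
do i = 1, n
   r(i) = (lambda + 2.0d0*mu) * strain(i)
end do

! Hoisted: the loop-invariant factor is computed once
c1 = lambda + 2.0d0*mu
do i = 1, n
   r(i) = c1 * strain(i)
end do
```

An optimising compiler does this automatically; doing it by hand across thousands of lines of mdbx.f90 is exactly the dangerous exercise described above.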
The most important result from this program would be the correct answer, which FTN95 would provide.
I shall summarise the other tests in the next post. There are some useful results that identify the coding approaches that need better attention in FTN95, or that can easily be avoided by the programmer.
John
|
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Tue Mar 24, 2015 5:42 pm
John,
The programmer is in front of the front end! (And sometimes slows everything down, like the man with the red flag who was at one time required to walk in front of a steam-powered road vehicle.)
I looked at some of the Polyhedron stuff ages ago, decided that I didn't like it, and moved on. Perhaps you can tell us whether compiling for .NET slows things down further, and indeed, if it is a WINAPP, what impact that has. I find these things fascinating in a quasi-theological sense, because I only need so many angels to be able to dance simultaneously on the head of a pin ... usually one, sometimes two, and theologians agree that the limit is higher even if they don't agree on what it is.
The compiler I most hated, the one that produced a different answer from the rest, is one of the fastest now, but the version I have (and don't use) is many versions old, so I haven't named it. FTN77 with DBOS worked straight out of the box, and Clearwin+ is without equal. I'm old enough that the phrase 'on different computers' actually means on radically different hardware, e.g. IBM, ICL, CDC, Univac, Burroughs, VAX, Elliott/NCR, Pr1me, PC ...
Eddie
|