Polyhedron Benchmark tests

JohnCampbell
Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Tue Mar 17, 2015 6:01 am    Post subject: Polyhedron Benchmark tests

Based on recent posts, it may be worth setting up a review of the Polyhedron benchmark test suite and providing some indication of why FTN95 is so slow on each of these tests.
I have (years ago) reviewed some of the tests, using the /timing or /profile options to identify the areas in the code where most of the time is being spent.
What I found for a few of the tests was that only a few lines were to blame for FTN95's poor performance, typically where FTN95 failed to optimise the inner loop. Often these could be corrected by "better" code structures.
Other tests exhibited poor indexing of large arrays or used large local arrays that exceeded the stack. Both problems can be addressed by better array usage, either subscript re-ordering or ALLOCATE. (I have always found that if you need a larger stack, the solution is never to use a larger stack, but to fix the problem.)
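For example, a large local array that can exceed the stack can instead be made allocatable (illustrative code only; the names are made up):
Code:

      subroutine assemble (n)
         integer, intent(in) :: n
!        real*8 :: work(4000000)            ! large local array: taken from the stack
         real*8, allocatable :: work(:)     ! allocatable array: taken from the heap
         allocate ( work(n) )
         work = 0.0d0                       ! ... use work exactly as before ...
         deallocate ( work )
      end subroutine assemble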
There were other examples where vector instructions or /Qparallel helped; these are beyond FTN95's capabilities.

If we can assemble this list of reasons, it may help FTN95 users to see what other compilers do automatically and what we, as FTN95 users, can do to improve our code to cope with some of these problems.
It was my general impression from my past reviews that most of the inefficiency attributed to FTN95 can be avoided or at least mitigated.

As a first step I shall generate a list of the latest tests in the suite and proceed to provide a summary of their status. Hopefully as this becomes fairly complete it will document the problems to be avoided.

If anyone has done similar reviews, please let me know.

I'll keep you updated.

John

ps: We need links to the test suites to review

LitusSaxonicum
Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Tue Mar 17, 2015 1:57 pm

John,

I gave up on the Polyhedron benchmarks because they didn't tell me anything I wanted to know. They tell me that FTN95 is 'slow', but I know that it is blindingly fast compared to most things I have used over the past 45 years, and they tell me that some compilers that I can't even get to run are faster still. Well, wasting weeks trying to get a compiler to run at all, or to produce correct results, strikes me as a poor trade for the few seconds saved here and there in run time.

Many years ago, I did a personal review of the 'Spag' utility, and it turned some carefully crafted source codes of mine into things that were unreadable to me. I decided that I'd carry on doing my things my way, and anyone who wanted to do it the Polyhedron way was welcome to it. I formed the view, rightly or wrongly, that once Spag had done its work, you ended up with a source code that would run badly with FTN95, and all the Polyhedron tests were Spagged!

Eddie

JohnCampbell
Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Wed Mar 18, 2015 4:28 am

I located the 17-program Polyhedron benchmark test suite contained in pb11.zip.

I have looked at Inducta.f90, one of the poor performers with FTN95.
Based on the Polyhedron test results for 64-bit Windows 7 on Intel Core i5 2500k, the results show:
Best : Intel 25.66 seconds
FTN95 test : 385.66 seconds
My FTN95 test : 469.25 seconds on an i5-2300 CPU @ 2.80GHz
My modified code test : 240.90 seconds

I had hoped for a better improvement when identifying code structures that were not well managed by FTN95, but the improvement was less than expected.
The analysis of performance in the 6,636 lines of code showed four near-identical inner loops that are responsible for 99% of the run time. FTN95 /timing initially showed the two routines that were responsible, and I used the rdtsc_tick() function to confirm the problem area in these routines.
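(rdtsc_tick() reads the CPU time-stamp counter; a portable stand-in using only the standard SYSTEM_CLOCK intrinsic would be something like this sketch, with illustrative names:)
Code:

      subroutine tick_report (label, t_last)
!        call once before the section of interest to initialise t_last, and again after it
         character(*), intent(in)    :: label
         integer,      intent(inout) :: t_last
         integer :: t_now, rate
         call system_clock ( t_now, rate )
         write (*,'(1x,a,f12.6,a)') label, real(t_now - t_last)/real(rate), ' seconds'
         t_last = t_now
      end subroutine tick_report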
The original code was:
Code:

      real (kind = longreal), dimension(:,:), intent(in) :: rotate_coil

          do j = 1, 9
              c_vector(3) = 0.5 * h_coil * z1gauss(j)
!
!       rotate coil vector into the global coordinate system and translate it
              rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx
              rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy
              rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz
!
              do k = 1, 9
                  q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
                  q_vector(2) = 0.5_longreal * b1 * (y2gauss(k) - 1.0_longreal)
                  q_vector(3) = 0.0_longreal
!
!       rotate quad vector into the global coordinate system
                  rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
                  rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
                  rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
!
!       compute and add in quadrature term
                  numerator = w1gauss(j) * w2gauss(k) *                                     &
                                                 dot_product(coil_current_vec,current_vector)
                  denominator = sqrt(dot_product(rot_c_vector-rot_q_vector,                 &
                                                                  rot_c_vector-rot_q_vector))
                  l12_lower = l12_lower + numerator/denominator
              end do
          end do

The code I identified as problematic for FTN95 is:

real (kind = longreal), dimension(:,:), intent(in) :: rotate_coil
This uses an assumed-shape dimension, where (3,3) is the only valid input.

dot_product(coil_current_vec,current_vector)
This dot_product is repeated in the inner loop but does not change, so it was taken out of the loops (probably identified by the compiler anyway).

dot_product(rotate_coil(1,:),c_vector(:))
This uses array sections in dot_product, which can be slow with FTN95 (a hand-expanded form is sketched below).

denominator = sqrt(dot_product(rot_c_vector-rot_q_vector, &
                               rot_c_vector-rot_q_vector))
This is a complex argument for dot_product, which I tried to clean up (no improvement).

Unfortunately, none of these changes produced significant improvements in my testing. I would expect that this is because FTN95 already recognises some of these cases.
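(For illustration, a hand-expanded form of the array-section dot_product, relying on rotate_coil being 3x3, would be along these lines:)
Code:

      real (kind = longreal), dimension(3,3), intent(in) :: rotate_coil    ! explicit shape instead of (:,:)
!
!       expand the array-section dot_product by hand
              rot_c_vector(1) = rotate_coil(1,1)*c_vector(1) + rotate_coil(1,2)*c_vector(2) + rotate_coil(1,3)*c_vector(3) + dx
              rot_c_vector(2) = rotate_coil(2,1)*c_vector(1) + rotate_coil(2,2)*c_vector(2) + rotate_coil(2,3)*c_vector(3) + dy
              rot_c_vector(3) = rotate_coil(3,1)*c_vector(1) + rotate_coil(3,2)*c_vector(2) + rotate_coil(3,3)*c_vector(3) + dz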


JohnCampbell
Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Wed Mar 18, 2015 4:30 am

The final change was to exploit the fact that “q_vector(3) = 0.0”, so that the calculation of “rot_c_vector-rot_q_vector” in the inner loop could be replaced with the explicit code:
Code:

                  rot_q_vector(1) = rot_c_vector(1) - (rotate_quad(1,1)*q_vector(1) + rotate_quad(1,2)*q_vector(2))
                  rot_q_vector(2) = rot_c_vector(2) - (rotate_quad(2,1)*q_vector(1) + rotate_quad(2,2)*q_vector(2))
                  rot_q_vector(3) = rot_c_vector(3) - (rotate_quad(3,1)*q_vector(1) + rotate_quad(3,2)*q_vector(2))

The net result was a 49% reduction in run time from changing 4 lines of code in 4 locations. Unfortunately this is still well above the run times of other compilers. I was hoping to demonstrate code syntax that was not suited to the FTN95 compiler, but this change is more related to inefficiency in the original code. I wonder if other compilers would identify this?

This loop calculation would be suitable for vector instructions and cache localisation, which are not available in FTN95 but assist other compilers.

I will look at some other tests and see if I can demonstrate other changes that assist FTN95.

John

JohnCampbell
Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Wed Mar 18, 2015 11:13 am

The next program I have looked at is mp_prop_design.f90.
This program is ONE routine; no subroutines! Before I realised this, I initially tried ftn95 /timing, only to find a single line of output in the .tmo file. I then tried ftn95 /profile (for the first time) and found it to work quite well in conjunction with SDBG.
This identified the DO loop with the highest line-usage count. The code in this loop is:
Code:
                     DO k = 1 , tip - 1
!
                        DO i = 1 , 1 + NINT(2.0D0*PI*trns/dphit) ,      &
     &                     NINT(ainc/(dphit*(180.0D0/PI)))
!
                           phit = phib + phie(k) + (REAL(i)-0.50D0)     &
     &                            *dphit
!
                           xt = re(k)*COS(phit)
!
                           yt = re(k)*SIN(phit)
!
                           zt = -(REAL(i)-0.50D0)*dphit*re(k)           &
     &                          *TAN(thetae(k)+alphie(k))
!
                           distx = x(j) - xt
!
                           disty = y(j) - yt
!
                           distz = -zt
!
                           dltphi = -re(k)*dphit
!
                           dltz = re(k)*dphit*TAN(thetae(k)+alphie(k))
!
                           dltx = -dltphi*SIN(phit)
!
                           dlty = dltphi*COS(phit)
!
                           dwtx = (dgame(k)/(4.0D0*PI))                 &
     &                            *((dlty*distz-dltz*disty)             &
     &                            /(SQRT(ABS(distx)**2.0D0+ABS(disty)   &
     &                            **2.0D0+ABS(distz)**2.0D0))**3.0D0)
!
                           dwty = (dgame(k)/(4.0D0*PI))                 &
     &                            *((dltz*distx-dltx*distz)             &
     &                            /(SQRT(ABS(distx)**2.0D0+ABS(disty)   &
     &                            **2.0D0+ABS(distz)**2.0D0))**3.0D0)
!
                           dwtphi = dwty*COS(phi(j)) - dwtx*SIN(phi(j))
!
                           dwtz = (dgame(k)/(4.0D0*PI))                 &
     &                            *((dltx*disty-dlty*distx)             &
     &                            /(SQRT(ABS(distx)**2.0D0+ABS(disty)   &
     &                            **2.0D0+ABS(distz)**2.0D0))**3.0D0)
!
                           wphi(j) = -ABS(wphi(j)+dwtphi)
!
                           wz(j) = wz(j) + dwtz
!
                        ENDDO
!
                     ENDDO


There are a number of problems in this code, the main one being repeated code and use of real exponents, such as in the calculation of dwtx, dwty and dwtz.
This can be changed to:
Code:
                           dist   = sqrt ( distx**2 + disty**2 + distz**2 )
                           dist3  = dist**3
!
                           tmp1   = dgame(k) /(4.0D0*PI * dist3)
                           dwtx   = tmp1 * (dlty*distz - dltz*disty)

Based on the Polyhedron test results for 64-bit Windows 7 on Intel Core i5 2500k, the results show:
Best : Intel 19.33 seconds
FTN95 test : 562.03 seconds
My FTN95 test : 689.36 seconds
My modified code test : 229.21 seconds on an i5-2300 CPU @ 2.80GHz, which is a reduction of 67%

JohnCampbell
Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Wed Mar 18, 2015 11:22 am

mp_prop_design.f90 ctd.

The main problem was:
Code:
                            dwtx = (dgame(k)/(4.0D0*PI))                 &
      &                            *((dlty*distz-dltz*disty)             &
      &                            /(SQRT(ABS(distx)**2.0D0+ABS(disty)   &
      &                            **2.0D0+ABS(distz)**2.0D0))**3.0D0)

My revised code for these loops is:
Code:
                     limit = NINT ( trns * pi*2.0D0   / dphit ) + 1
                     step  = NINT ( ainc * pi/180.0d0 / dphit )
                     cosj  = COS (phi(j))
                     sinj  = SIN (phi(j))
!
                     DO k = 1 , tip - 1                                         ! 2.815e6
!
                        tana   = TAN (thetae(k)+alphie(k))
                        dltphi = -re(k)*dphit
                        dltz   =  re(k)*dphit*tana
!
                        DO i = 1 , limit , step                                 ! 3.378e7
!
                           tmp1   = (REAL(i)-0.50D0) * dphit
                           phit   = phib + phie(k) + tmp1       ! 2.06e9
!
                           cosa   = COS (phit)
                           sina   = SIN (phit)
!
                           xt     =  re(k) * cosa
                           yt     =  re(k) * sina
                           zt     =  re(k) * tana * tmp1
!
                           distx  = x(j) - xt
                           disty  = y(j) - yt
                           distz  = zt
                           dist   = sqrt ( distx**2 + disty**2 + distz**2 )
                           dist3  = dist**3
!
                           dltx   = -dltphi*sina
                           dlty   =  dltphi*cosa
!
                           tmp1   = dgame(k) /(4.0D0*PI * dist3)
                           dwtx   = tmp1 * (dlty*distz - dltz*disty)
                           dwty   = tmp1 * (dltz*distx - dltx*distz)     
                           dwtz   = tmp1 * (dltx*disty - dlty*distx)
!
                           dwtphi = dwty*cosj - dwtx*sinj
!
                           wphi(j) = -ABS(wphi(j)+dwtphi)
                           wz(j)   = wz(j) + dwtz
!
                        ENDDO    ! i                                       2.06e9
!
                     ENDDO       ! k                                       3.37e7


The basic changes have been:
# change **2.0d0 to **2,
# provide some temporary variables to remove repeated code (e.g. dist3), and
# move some calculations out of the DO loop.
Note: the comment numbers past column 73 are line-use counts I recovered from /profile + SDBG.

I have reviewed this program to identify code that FTN95 does not handle well. The conclusion I am moving towards is: why should it? There are examples in this program of code that was developed without attention to run-time efficiency. In the few examples I have reviewed, there appears to be a bias in the test suite towards code that suits optimising compilers, while code that is developed with more attention to performance is not being considered.

Why should FTN95 be measured against code that did not require this efficiency in the first place?

I'll continue to update.

John

LitusSaxonicum
Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Wed Mar 18, 2015 12:37 pm

John,

Did you make any gains by doing repeated multiplication instead of raising to powers, e.g. dist*dist*dist instead of dist**3, or even by forming dist2 = dist*dist (since that is done earlier anyway) and then dist3 = dist2*dist?

I remember noting in a previous thread somewhere that raising to a REAL power, where the REAL is a constant that is clearly really an INTEGER, would be an easy optimisation for the compiler.
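
In code, the alternatives I have in mind are these (a sketch only):
Code:

      dist2 = dist*dist          ! already computed earlier
      dist3 = dist2*dist         ! repeated multiplication, re-using dist2
      dist3 = dist**3            ! integer power: a good compiler lowers this to multiplies
      dist3 = dist**3.0d0        ! real power: typically evaluated via exp/log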

Eddie

JohnCampbell
Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Wed Mar 18, 2015 1:06 pm

Eddie,

My view is that if d*d were faster than d**2, the compiler would probably make the change. However, d**2.0d0 is much slower than d**2, because
d**x becomes exp ( x * log(d) ) when x is not recognised as an integer value.
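(A quick, unscientific way to see this; a sketch only, not a proper benchmark:)
Code:

      program power_demo
      real*8 :: d
      d = 1.2345d0
      print *, d**2, d*d                        ! integer power: just a multiply
      print *, d**2.0d0, exp(2.0d0*log(d))      ! real power: typically evaluated via exp/log
      end program power_demo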

Using d**2.0d0 rather than d**2 is the programmer simply ignoring what the compiler must do with what it is given; bad programming in my opinion. I'm sure others will disagree.

John

LitusSaxonicum
Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Wed Mar 18, 2015 1:44 pm

John,

It isn't listed in the documentation as an optimisation.

While A*A is essentially faster than A**2 and substantially better than A**2.0, there comes a point where A*A*A* ... must run out of steam.

A lot of the optimisations listed in the documentation look like common subexpression removal in one form or another. As far as I'm aware, FTN95 does not examine A**X to see whether X is an integer expressed as a real.

Eddie

LitusSaxonicum
Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Wed Mar 18, 2015 6:26 pm

I had more thoughts.

(1) You get the biggest speedup on given hardware by choosing the best algorithm.
(2) You get a speedup with given hardware and compiler by writing the code properly.
(3) You get a speedup by having faster hardware (especially if you wait some years before trying again).
(4) You might get a speedup in execution time on given hardware by changing the compiler, but you'll never recover the time it took you to convert from one compiler to another.
(5) It probably doesn't matter anyway.

To illustrate (5): if you reduce a run time from 24 hours to 8, you still get the answer tomorrow. If you reduce the run time from 20 minutes to 2, you will waste those 2 minutes by waiting, whereas in the 20 minutes you could have gone and done something useful. This includes turning to a different computer, if you must.

(6) What is the point of a Windows interface on a code that runs anything other than nearly instantly? If a box pops up asking 'OK to continue?' a minute after you left the office, then you have wasted probably 16 hours anyway.

DanRRight
Joined: 10 Mar 2008
Posts: 2814
Location: South Pole, Antarctica

Posted: Fri Mar 20, 2015 6:23 pm

Eddie, of course speed matters. If, for example, life on Earth, made by trial and error actively developed over the last 10^9 years in a giant Monte Carlo simulation, ran just 10^-7 times faster, we would not need the concept of paradise, because we would already have found out how not to die!

There is one exception, used in the Quake III Arena game, which can speed up a**b calculations when b is exactly a half-integer: the fast inverse square root. We hit this just a few days back and are investigating how to implement it. Otherwise, adding some atomic data into the code, which uses a few regular a**b evaluations, makes our PIC code run times sky high, basically unmanageable, even though it all runs on supercomputers. I see on Wikipedia that SSE can do this even faster.
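(For reference, a single-precision Fortran sketch of that trick, assuming 32-bit default REAL and INTEGER; the magic constant and single Newton step follow the well-known Quake III code:)
Code:

      real function rsqrt_fast (x)
         real,    intent(in) :: x
         integer :: i
         real    :: y
         i = transfer ( x, i )               ! reinterpret the bits of x as an integer
         i = 1597463007 - ishft ( i, -1 )    ! magic constant 0x5F3759DF minus i/2
         y = transfer ( i, y )               ! first estimate of 1/sqrt(x)
         y = y * (1.5 - 0.5*x*y*y)           ! one Newton-Raphson refinement
         rsqrt_fast = y
      end function rsqrt_fast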

John, you are doing, and have done, a great job speeding up these unoptimized codes, but your idea that you can beat the compiler's optimization goes against the basic trend and is hardly realizable by the average user and, as you have shown, even by top-grade users. For that, someone needs to be a computer-science hacker who intimately knows compilers, assembler and processor architecture, on top of the language itself, plus top-level maths and more than 20 years of software optimization experience, all at the same time! You are like Garry Kasparov trying to beat a machine! The trend is exactly the opposite: the code can be written by total novices without any clue about optimization, and the compiler has to make it faster than anyone but the absolute top-level hackers could. And it seems IVF is so far doing its job pretty well: you optimized the codes and made them 2x and 3x faster, but IVF made them 30x faster! "Do not optimize your code, let the compiler do that for you." "Sometimes your hand optimizations will make it worse." These were the thoughts 20 years back, when the Polyhedron benchmarks were assembled.

Anyway, I thought that instead of fighting the almost unbeatable, you would just chop these codes into the smallest possible pieces with your timers, and select, simplify and bring to light the manageable grains of source that are responsible for FTN95 running slowly versus IVF. Ideally, you would also benchmark these snippets side by side with other compilers. Then I'd hope Silverfrost would quickly implement the optimizations in the compiler, because if this compiler were also as fast as the others it would be totally killer software.


Last edited by DanRRight on Sat Mar 21, 2015 6:02 am; edited 4 times in total

PaulLaidler
Site Admin
Joined: 21 Feb 2005
Posts: 7923
Location: Salford, UK

Posted: Fri Mar 20, 2015 8:43 pm

Future work on optimizing will focus on the 64 bit backend which will probably behave quite differently. The benchmark tests will be relevant but will probably have a different outcome.

John-Silver
Joined: 30 Jul 2013
Posts: 1520
Location: Aerospace Valley

Posted: Sat Mar 21, 2015 7:35 am

It'll be very interesting to see how JohnCampbell's excellent initiative in revisiting the benchmarks will be reflected in the equivalent x64 results, then. Hopefully all the differences will be positive ;-)
John's 'baseline' results and conclusions will be a good proving ground for showing the world how much better FTN95 is at that point.
Of course, keeping track of any optimisations already incorporated, or to be incorporated, during the development would make that comparison easier too.

By the way Paul, how is progress on the x64 version going? Any idea of an anticipated beta release date, maybe (provisional and without any commitment, of course)? I seem to remember that when you first talked about this last year the anticipation was for some time in 'early 2015', but of course things can arise which totally blow expectations off course.

PaulLaidler
Site Admin
Joined: 21 Feb 2005
Posts: 7923
Location: Salford, UK

Posted: Sat Mar 21, 2015 11:14 am

The development of the 64 bit backend is going very well, with a large majority of the basic work completed. We are now working through our test suite using the new backend, and it is difficult to predict how long this will take to complete. The initial release will not include optimisation.

LitusSaxonicum
Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Sat Mar 21, 2015 12:52 pm

While the 32 bit version of FTN77 is perfectly adequate for what I do (or would be if its ClearWin+ were today's), I welcome this development. There can hardly be an AMD/Intel CPU in regular use today that isn't capable of running Windows in 64 bit mode, and even Windows tablets have 2GB minimum.

Forget software optimisation: it may help, but you run the risk of going too far and getting wrong results. What is needed is to use modern opcodes. That may well be difficult, as the facilities of the x87 coprocessor stack were rarely exploited to the full.

Dan, you didn't read or fully understand what I wrote. I never said that speed never matters, just that there are times when it doesn't.

John C: if you make your changes to the propeller code, does it speed up comparably under iFort, or is it something that hits FTN95 particularly badly?

Eddie