Silverfrost Forums

Polyhedron Benchmark tests

17 Mar 2015 5:01 #15906

Based on recent posts, it may be worth setting up a review of the Polyhedron benchmark test suite and providing some indication of why FTN95 is so slow on each of these tests. I have (years ago) reviewed some of the tests using the /timing or /profile options to identify the areas in the code where most of the time is being used. What I found for a few of the tests was that only a few lines were to blame for FTN95's poor performance, typically where FTN95 failed to optimise the inner loop. Often these could be corrected by 'better' code structures. Other tests exhibited poor indexing of large arrays, or used large local arrays that exceeded the stack. Both problems can be addressed by better array usage, with subscript re-ordering or ALLOCATE. (I have always found that if you need a larger stack, the solution is never to use a larger stack, but to fix the problem.) There were other examples where vector instructions or /Qparallel helped, which is beyond FTN95's capability.
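As a hedged illustration of those two array fixes (the routine and the sizes here are hypothetical, not taken from any benchmark):

      ! Minimal sketch: a large local array that could overflow the stack is
      ! replaced by ALLOCATE, and the loops are ordered so the leftmost
      ! subscript varies fastest (Fortran is column-major, so the inner loop
      ! then walks contiguous memory).
      subroutine fill_demo (n)
          integer, intent(in) :: n
          real, allocatable :: a(:,:)        ! instead of: real :: a(5000,5000)
          integer :: i, j
          allocate (a(n,n))
          do j = 1, n                        ! outer loop over columns
              do i = 1, n                    ! inner loop runs down one column
                  a(i,j) = real(i+j)
              end do
          end do
          deallocate (a)
      end subroutine fill_demo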

If we can assemble this list of reasons, it may help FTN95 users to see what other compilers do automatically and what we, as FTN95 users, can do to improve our code to cope with some of these problems. It was my general impression from my past reviews that most of the inefficiency attributed to FTN95 can be avoided or at least mitigated.

As a first step I shall generate a list of the latest tests in the suite and proceed to provide a summary of their status. Hopefully as this becomes fairly complete it will document the problems to be avoided.

If anyone has done similar reviews, please let me know.

I'll keep you updated.

John

ps: We need links to the test suites to review

17 Mar 2015 12:57 #15910

John,

I gave up on the Polyhedron benchmarks because they didn't tell me anything I wanted to know. They tell me that FTN95 is 'slow', but I know that it is blindingly fast compared to most things I have used over the past 45 years, and they tell me that some compilers that I can't even get to run are faster still. Wasting weeks trying to get a compiler to run at all, or to produce correct results, strikes me as a poor way of compensating for a few seconds saved here and there in run time.

Many years ago, I did a personal review of the 'Spag' utility, and it turned some carefully crafted source codes of mine into things that were unreadable by me. I decided that I'd carry on doing my things my way, and anyone who wanted to do them the Polyhedron way was welcome to it. I formed the view, rightly or wrongly, that once Spag had done its work, you ended up with a source code that would run badly with FTN95, and all the Polyhedron tests were Spagged!

Eddie

18 Mar 2015 3:28 #15915

I located the 17-program Polyhedron Benchmark test suite contained in pb11.zip.

I have looked at Inducta.f90, one of the poor performers with FTN95. Based on the Polyhedron test results for 64-bit Windows 7 on an Intel Core i5 2500K, together with my own runs on an i5-2300 CPU @ 2.80GHz:

Best : Intel 25.66 seconds
FTN95 test : 385.66 seconds
My FTN95 test : 469.25 seconds
My modified code test : 240.90 seconds

I had hoped for a better improvement when identifying code structures that were not well managed by FTN95, but the gain was less than expected. Performance analysis of the 6,636 lines of code showed 4 near-identical inner loops that are responsible for 99% of the run time. FTN95 /timing initially showed the 2 routines responsible, and I used the rdtsc_tick() function to confirm the problem area within these routines. The original code was:

      real (kind = longreal), dimension(:,:), intent(in) :: rotate_coil
…
          do j = 1, 9
              c_vector(3) = 0.5 * h_coil * z1gauss(j)
!
!       rotate coil vector into the global coordinate system and translate it
              rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx
              rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy
              rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz
!
              do k = 1, 9
                  q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
                  q_vector(2) = 0.5_longreal * b1 * (y2gauss(k) - 1.0_longreal)
                  q_vector(3) = 0.0_longreal
!
!       rotate quad vector into the global coordinate system
                  rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
                  rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
                  rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
!
!       compute and add in quadrature term
                  numerator = w1gauss(j) * w2gauss(k) *                                     &
                                                 dot_product(coil_current_vec,current_vector)
                  denominator = sqrt(dot_product(rot_c_vector-rot_q_vector,                 &
                                                                  rot_c_vector-rot_q_vector))
                  l12_lower = l12_lower + numerator/denominator
              end do
          end do

The code constructs I identified as potential FTN95 problems are:

•  real (kind = longreal), dimension(:,:), intent(in) :: rotate_coil

This uses an assumed-shape array, where (3,3) is the only shape ever passed.
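A minimal sketch of the alternative declaration, given that (3,3) is the only shape ever passed (my assumption is that this lets FTN95 address the array directly rather than through an array descriptor):

      real (kind = longreal), intent(in) :: rotate_coil(3,3)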

dot_product(coil_current_vec,current_vector) : this dot_product is repeated in the inner loop but never changes, so it was taken out of the loops (the compiler had probably identified this already).
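A hedged sketch of that hoist (cc_dot is a temporary of my own; the rest follows the Inducta loops above):

      cc_dot = dot_product(coil_current_vec, current_vector)   ! computed once
      do j = 1, 9
          do k = 1, 9
!             ... loop body as in the original, except:
              numerator = w1gauss(j) * w2gauss(k) * cc_dot     ! reuses the hoisted value
          end do
      end do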

dot_product(rotate_coil(1,:),c_vector(:)) : this uses array sections in dot_product, which can be slow with FTN95.

•  denominator = sqrt(dot_product(rot_c_vector-rot_q_vector,                 &
                                  rot_c_vector-rot_q_vector))

This is a complex argument expression for dot_product, which I tried to clean up (no improvement).

Unfortunately, none of these changes produced significant improvements in my testing. I suspect this is because FTN95 already recognises some of these patterns.

The final change was to exploit the fact that q_vector(3) = 0.0, so that the calculation of rot_c_vector - rot_q_vector in the inner loop could be replaced with the explicit code:

                  rot_q_vector(1) = rot_c_vector(1) - (rotate_quad(1,1)*q_vector(1) + rotate_quad(1,2)*q_vector(2))
                  rot_q_vector(2) = rot_c_vector(2) - (rotate_quad(2,1)*q_vector(1) + rotate_quad(2,2)*q_vector(2))
                  rot_q_vector(3) = rot_c_vector(3) - (rotate_quad(3,1)*q_vector(1) + rotate_quad(3,2)*q_vector(2))


18 Mar 2015 3:30 #15916

The net result was a 49% reduction in run time, from changing 4 lines of code in 4 locations, but the result is still well above the run times of other compilers. I had hoped to demonstrate code syntax that was not suited to the FTN95 compiler, but this change relates more to inefficiency in the original code. I wonder if other compilers would identify this themselves?

This loop calculation would also be suitable for vector instructions and cache localisation, which are not available in FTN95 but assist the other compilers.

I will look at some other tests and see if I can demonstrate other changes that assist FTN95.

John

18 Mar 2015 10:13 #15918

The next program I have looked at is mp_prop_design.f90. This program is ONE routine; no subroutines! Before I realised this, I initially tried ftn95 /timing, only to find a single line of output in the .tmo file. I then tried ftn95 /profile (for the first time) and found it to work quite well in conjunction with SDBG. This identified the DO loop with the highest line-use counts. The code in this loop is:

                     DO k = 1 , tip - 1
!
                        DO i = 1 , 1 + NINT(2.0D0*PI*trns/dphit) ,      &
     &                     NINT(ainc/(dphit*(180.0D0/PI)))
!
                           phit = phib + phie(k) + (REAL(i)-0.50D0)     &
     &                            *dphit
!
                           xt = re(k)*COS(phit)
!
                           yt = re(k)*SIN(phit)
!
                           zt = -(REAL(i)-0.50D0)*dphit*re(k)           &
     &                          *TAN(thetae(k)+alphie(k))
!
                           distx = x(j) - xt
!
                           disty = y(j) - yt
!
                           distz = -zt
!
                           dltphi = -re(k)*dphit
!
                           dltz = re(k)*dphit*TAN(thetae(k)+alphie(k))
!
                           dltx = -dltphi*SIN(phit)
!
                           dlty = dltphi*COS(phit)
!
                           dwtx = (dgame(k)/(4.0D0*PI))                 &
     &                            *((dlty*distz-dltz*disty)             &
     &                            /(SQRT(ABS(distx)**2.0D0+ABS(disty)   &
     &                            **2.0D0+ABS(distz)**2.0D0))**3.0D0)
!
                           dwty = (dgame(k)/(4.0D0*PI))                 &
     &                            *((dltz*distx-dltx*distz)             &
     &                            /(SQRT(ABS(distx)**2.0D0+ABS(disty)   &
     &                            **2.0D0+ABS(distz)**2.0D0))**3.0D0)
!
                           dwtphi = dwty*COS(phi(j)) - dwtx*SIN(phi(j))
!
                           dwtz = (dgame(k)/(4.0D0*PI))                 &
     &                            *((dltx*disty-dlty*distx)             &
     &                            /(SQRT(ABS(distx)**2.0D0+ABS(disty)   &
     &                            **2.0D0+ABS(distz)**2.0D0))**3.0D0)
!
                           wphi(j) = -ABS(wphi(j)+dwtphi)
!
                           wz(j) = wz(j) + dwtz
!
                        ENDDO
!
                     ENDDO

There are a number of problems in this code, the main ones being repeated code and the use of real exponents, as in the calculation of dwtx, dwty and dwtz. This can be changed to:

                           dist   = sqrt ( distx**2 + disty**2 + distz**2 )
                           dist3  = dist**3
!
                           tmp1   = dgame(k) /(4.0D0*PI * dist3)
                           dwtx   = tmp1 * (dlty*distz - dltz*disty)

Based on the Polyhedron test results for 64-bit Windows 7 on an Intel Core i5 2500K, together with my own runs on an i5-2300 CPU @ 2.80GHz:

Best : Intel 19.33 seconds
FTN95 test : 562.03 seconds
My FTN95 test : 689.36 seconds
My modified code test : 229.21 seconds (a reduction of 67%)

18 Mar 2015 10:22 #15919

mp_prop_design.f90 ctd.

The main problem was:

                           dwtx = (dgame(k)/(4.0D0*PI))                 &
     &                            *((dlty*distz-dltz*disty)             &
     &                            /(SQRT(ABS(distx)**2.0D0+ABS(disty)   &
     &                            **2.0D0+ABS(distz)**2.0D0))**3.0D0)

My revised code for these loops is:

                     limit = NINT ( trns * pi*2.0D0   / dphit ) + 1
                     step  = NINT ( ainc * pi/180.0d0 / dphit )
                     cosj  = COS (phi(j))
                     sinj  = SIN (phi(j))
!
                     DO k = 1 , tip - 1                                         ! 2.815e6
!
                        tana   = TAN (thetae(k)+alphie(k))
                        dltphi = -re(k)*dphit
                        dltz   =  re(k)*dphit*tana
!
                        DO i = 1 , limit , step                                 ! 3.378e7
!
                           tmp1   = (REAL(i)-0.50D0) * dphit
                           phit   = phib + phie(k) + tmp1       ! 2.06e9
!
                           cosa   = COS (phit)
                           sina   = SIN (phit)
!
                           xt     =  re(k) * cosa
                           yt     =  re(k) * sina
                           zt     =  re(k) * tana * tmp1
!
                           distx  = x(j) - xt
                           disty  = y(j) - yt
                           distz  = zt
                           dist   = sqrt ( distx**2 + disty**2 + distz**2 )
                           dist3  = dist**3
!
                           dltx   = -dltphi*sina
                           dlty   =  dltphi*cosa
!
                           tmp1   = dgame(k) /(4.0D0*PI * dist3)
                           dwtx   = tmp1 * (dlty*distz - dltz*disty)
                           dwty   = tmp1 * (dltz*distx - dltx*distz)      
                           dwtz   = tmp1 * (dltx*disty - dlty*distx)
!
                           dwtphi = dwty*cosj - dwtx*sinj
!
                           wphi(j) = -ABS(wphi(j)+dwtphi)
                           wz(j)   = wz(j) + dwtz
!
                        ENDDO    ! i                                       2.06e9
!
                     ENDDO       ! k                                       3.37e7

The basic changes have been:

change **2.0d0 to **2,

provide some temporary variables to remove repeated code (e.g. dist3), and

move some calculations out of the DO loops.

Note: the comment numbers past col 73 are line use counts I recovered from /profile + sdbg.
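For reference, the command sequence was along these lines (a sketch from memory; check the FTN95 documentation for the exact switches):

      ftn95 mp_prop_design.f90 /profile /link
      sdbg mp_prop_design.exe

After a run of the instrumented executable, SDBG displays the per-line execution counts.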

I have reviewed this program to identify code that FTN95 does not handle well. The conclusion I am moving towards is: why should it? There are examples in this program of code developed without attention to run-time efficiency. In the few examples I have reviewed, there appears to be a bias in the test suite towards code that suits optimising compilers, while code developed with more attention to performance is not being considered.

Why should FTN95 be measured against code that did not receive this attention to efficiency in the first place?

I'll continue to update.

John

18 Mar 2015 11:37 #15921

John,

Did you make any gains by doing repeated multiplication instead of raising to powers, e.g. dist*dist*dist instead of dist**3, or even identifying 'dist2' from dist*dist, as this is done earlier, and then dist3 = dist2*dist?

I remember noting in a previous thread somewhere that raising to a REAL power where the REAL was a constant that was clearly really INTEGER would be an easy optimisation for the compiler.

Eddie

18 Mar 2015 12:06 #15922

Eddie,

My view is that if d*d is faster than d**2, the compiler would probably make the change. However, d**2.0d0 is much slower than d**2, as d**x becomes exp(x*log(d)) if x is not recognised as an integer value.

Using d**2.0d0 rather than d**2 is the programmer simply ignoring what the compiler must do with what it is given; bad programming in my opinion. I'm sure others will disagree.
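A small hedged demonstration of the difference (a standalone program with names of my choosing; the speed claim follows the reasoning above rather than measurements here):

      program pow_demo
          real(kind(1.0d0)) :: d, r1, r2, r3
          d  = 1.2345d0
          r1 = d*d              ! explicit multiply
          r2 = d**2             ! integer power: reduces to a multiply
          r3 = d**2.0d0         ! real power: generally exp(2.0d0*log(d)), far slower
          print *, r1, r2, r3   ! all three agree to within rounding
      end program pow_demo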

John

18 Mar 2015 12:44 #15923

John,

It isn't listed in the documentation as an optimisation.

While A*A is essentially faster than A**2 and substantially better than A**2.0, there comes a point where A*A*A*... must run out of steam.

A lot of the optimisations listed in the documentation look like common subexpression removal in one form or another. In FTN95 as far as I'm aware, A**X is not examined to see if X is an integer expressed as a real.
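For what it's worth, a hedged sketch of where the A*A*A*... chain runs out of steam: exponentiation by squaring needs O(log n) multiplies instead of n-1 (the function name is mine; it assumes n >= 0):

      pure function pow_int (a, n) result (p)
          real(kind(1.0d0)), intent(in) :: a
          integer, intent(in) :: n
          real(kind(1.0d0)) :: p, base
          integer :: m
          p = 1.0d0
          base = a
          m = n
          do while (m > 0)
              if (mod(m,2) == 1) p = p*base   ! this bit of the exponent is set
              base = base*base                ! square for the next bit
              m = m/2
          end do
      end function pow_int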

Eddie

18 Mar 2015 5:26 #15924

I had more thoughts.

(1) You get the biggest speedup on given hardware by choosing the best algorithm.

(2) You get a speedup with given hardware and compiler by writing the code properly.

(3) You get a speedup by having faster hardware (especially if you wait some years before trying again).

(4) You might get a speedup in execution time on given hardware by changing the compiler, but you'll never recover the time it took you to convert from one compiler to another.

(5) It probably doesn't matter anyway.

To illustrate (5): if you reduce a run time from 24 hours to 8, you still get the answer tomorrow. If you reduce the run time from 20 minutes to 2, you will waste those 2 minutes by waiting, whereas the 20-minute run would have let you go and do something useful in the meantime. This includes turning to a different computer, if you must.

(6) What is the point of a Windows interface on a code that runs anything other than nearly instantly? If a box pops up asking 'OK to continue?' a minute after you have left the office, then you have probably wasted 16 hours anyway.

20 Mar 2015 5:23 (Edited: 21 Mar 2015 5:02) #15934

Eddie, of course speed does matter. If, for example, the trial-and-error process that developed life on Earth over the last 10^9 years, in effect a giant Monte Carlo simulation, ran just 10^7 times faster, we would not need the concept of paradise, because we would already have found out how not to die!

There is one exception, used in the Quake III Arena game, which can speed up a**b calculations when b is exactly a half-integer: the fast inverse SQRT. We hit this just a few days back and are investigating how to implement it. Otherwise, adding some atomic data into the code, which uses a few regular a**b evaluations, makes our PIC code run times sky-high, basically unmanageable, despite all of it running on supercomputers. I see on Wikipedia that SSE does this even faster.
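For the record, a hedged Fortran transcription of that Quake III fast inverse square root (names are mine; it assumes default REAL and INTEGER are both 32-bit, with TRANSFER reinterpreting the bit pattern):

      function fast_rsqrt (x) result (y)
          real, intent(in) :: x
          real :: y
          integer :: i
          i = transfer(x, i)                ! read the float's bit pattern
          i = 1597463007 - ishft(i, -1)     ! magic constant 0x5F3759DF
          y = transfer(i, y)                ! back to a float: rough 1/sqrt(x)
          y = y * (1.5 - 0.5*x*y*y)         ! one Newton-Raphson refinement
      end function fast_rsqrt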

John, you are doing, and have done, a great job speeding up these unoptimised codes, but the idea that you can beat the compiler's optimisation goes against the basic trend, and is hardly realisable by the average user or, as you have shown, even by top-grade users. For that, one would need to be a computer-science hacker who intimately knows compilers, assembler and processor architecture, aside from the language itself, plus top-level maths and more than 20 years of software-optimisation experience, all simultaneously! You are like Garry Kasparov trying to beat a machine! The trend is exactly the opposite: the code may be written by people with no clue about optimisation, and the compiler has to make it faster than anyone but the absolute top-level hackers could. And it seems IVF is doing its job pretty well so far: you optimised codes and made them 2x and 3x faster, but IVF is 30x! 'Do not optimise your code, let the compiler do that for you.' 'Sometimes your hand optimisations will make it worse.' Those were the thoughts 20 years back when the Polyhedron benchmarks were assembled.

Anyway, I thought that instead of beating the almost unbeatable, you would chop these codes into the smallest possible pieces with your timers and just select, simplify and bring to light the manageable grains of source that are responsible for FTN95 running slowly versus IVF, ideally also benchmarking these snippets side by side with other compilers. Then I'd hope Silverfrost would quickly implement the optimisations in the compiler, because if this compiler were also as fast as the others, it would be totally killer software.

20 Mar 2015 7:43 #15935

Future work on optimisation will focus on the 64-bit backend, which will probably behave quite differently. The benchmark tests will still be relevant but will probably have a different outcome.

21 Mar 2015 10:14 #15936

The development of the 64-bit backend is going very well, with a large majority of the basic work completed. We are now working through our test suite using the new backend, and it is difficult to predict how long this will take to complete. The initial release will not include optimisation.

21 Mar 2015 11:52 #15937

While the 32-bit version of FTN77 is perfectly adequate for what I do (or would be if its Clearwin+ were today's), I welcome this development. There can hardly be an AMD/Intel CPU in regular use today that isn't capable of running Windows in 64-bit mode, and even Windows tablets have 2GB minimum.

Forget software optimisation: it may help, but you run the risk of going too far and getting the wrong results. What is needed is to use modern opcodes. That may well be difficult, as the facilities of the x87 coprocessor stack were rarely exploited to the full.

Dan, you didn't read or fully understand what I wrote. I never said speed doesn't matter under any circumstances, just that there are times when it doesn't.

John C: If you make your changes to the propeller code, does it speed up under iFort comparably, or is it something that hits FTN95 particularly badly?

Eddie

22 Mar 2015 7:59 #15938

Eddie, there is a good chance that your best salflibc.dll will work with FTN77.

22 Mar 2015 9:36 #15939

Paul, I suspected as much. Now, I wonder what the benchmarks do with FTN77? (Or does it have the same back end as FTN95?)

Eddie

22 Mar 2015 11:08 #15941

Quoted from LitusSaxonicum:

    Now, I wonder what the benchmarks do with FTN77? (Or does it have the same back end as FTN95?) Eddie

The current Polyhedron benchmarks are in F90+, so if you want to use FTN77 you will need to dig up older versions of the benchmarks written in F77.

Of course, there are lots of other benchmarks, in F77 as well as in F90+.

22 Mar 2015 11:11 #15942

Dan wrote:

John, you are doing, and have done, a great job speeding up these unoptimised codes, but the idea that you can beat the compiler's optimisation goes against the basic trend

I think your point is valid. However I have always considered it useful to identify where compilers perform poorly and then understand why this is the case.

The latest test I have reviewed is FATIGUE2:

Best reported : Lahey GNU 31.09 seconds
FTN95 reported : 263.88 seconds
My FTN95 test : 324.75 seconds
My revised test : 168.88 seconds

Again this is an improvement of about 50%, but still significantly slower than the Lahey compilation.

For FTN95, I reviewed the run times and found that a significant amount of time was spent managing the call to subroutine perdida. There are 585,898,984 calls to this routine!

                 call perdida (dt, lambda, mu, yield_stress, R_infinity, b, X_infinity,     &
                               gamma, eta, plastic_strain_threshold, stress_tensor(:,:,n),  &
                               strain_tensor(:,:,n), plastic_strain_tensor(:,:,n),          &
                               strain_rate_tensor(:,:,n), accumulated_plastic_strain(n),    &
                               back_stress_tensor(:,:,n), isotropic_hardening_stress(n),    &
                               damage(n), failure_threshold, crack_closure_parameter)
...
      subroutine perdida (dt, lambda, mu, yield_stress, R_infinity, b, X_infinity, gamma,   &
                          eta, plastic_strain_threshold, stress_tensor, strain_tensor,      &
                          plastic_strain_tensor, strain_rate_tensor,                        &
                          accumulated_plastic_strain, back_stress_tensor,                   &
                          isotropic_hardening_stress, damage, failure_threshold,            &
                          crack_closure_parameter)
!
      real (kind = LONGreal), intent(in) :: dt, yield_stress, lambda, mu, R_infinity, b,    &
                                            X_infinity, gamma, eta, failure_threshold,      &
                                            plastic_strain_threshold,                       &
                                            crack_closure_parameter
      real (kind = LONGreal), dimension(:,:), intent(in) :: strain_rate_tensor,             &
                                                            strain_tensor
      real (kind = LONGreal), dimension(:,:), intent(inout) :: plastic_strain_tensor,       &
                                                               back_stress_tensor
      real (kind = LONGreal), dimension(:,:), intent(out) :: stress_tensor
      real (kind = LONGreal), intent(inout) :: damage, accumulated_plastic_strain,          &
                                               isotropic_hardening_stress
!

The main change I made was to convert the array-section arguments to explicit dimension (3,3), which they all are, and to use F77-style addressing in the call. What is interesting in this example is the combination dimension(:,:), intent(in). I assume FTN95 makes copies of these arguments on entry, and for intent(out) updates from the copy on return. FTN95 spends about 150 seconds of run time just manipulating these temporary copies of the 3x3 arrays. I am not sure if FTN95 is enforcing the intent, or if intent is merely a rule that should be checked. The other compilers benefit from SSE instructions, which could bring the run time down to about 80 seconds, but Lahey's 31 seconds must reflect other efficiencies as well. Array sections are one of FTN95's Achilles heels.
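A hedged, cut-down sketch of the change (the subroutine is reduced to two arguments and given a hypothetical name; LONGreal follows the benchmark source):

      subroutine perdida_33 (stress_tensor, strain_tensor)
          integer, parameter :: LONGreal = kind(1.0d0)                ! as in the benchmark
          real (kind = LONGreal), intent(out) :: stress_tensor(3,3)  ! explicit shape:
          real (kind = LONGreal), intent(in)  :: strain_tensor(3,3)  ! no descriptor, no copy
          stress_tensor = 2.0_LONGreal * strain_tensor               ! placeholder body
      end subroutine perdida_33
!
!     the caller passes the first element, F77 sequence association:
!     call perdida_33 (stress_tensor(1,1,n), strain_tensor(1,1,n))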

John

22 Mar 2015 6:46 #15943

Thanks MECEJ4, it's not obvious from the snippets that they are Fortran 90.

Over half a billion calls to a routine with all those parameters? You'd make a really big improvement if they were in COMMON! And that's without all the Fortran 90 stuff that John seems to think is costing so much time. A sketch of the COMMON idea follows below.

Plus if you are making over half a billion calls to a subroutine in the first place, the program structure is probably all to cock ...
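A hedged sketch of that COMMON suggestion (the block name and the choice of scalars are hypothetical): the loop-invariant values are set once and never travel through the argument list again.

      COMMON /PCONST/ DT, LAMBDA, MU, YIELD_STRESS
      DOUBLE PRECISION DT, LAMBDA, MU, YIELD_STRESS
!     set once before the loop; perdida reads them from the block, so over
!     half a billion copies of these arguments disappear from the calls.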

22 Mar 2015 7:36 #15944

John-Silver,

Kreitzberg & Shneiderman: 'The Elements of FORTRAN Style' (Harcourt Brace Jovanovich) is where I started. It is still available on the internet. Probably less than half of it is relevant today, sadly. My first copy was loaned and never returned, then I got another ... I suspect you can't pop round to my house to borrow it!

There are needs for speed: you are about to crash onto the moon? You want tomorrow's weather forecast, and the run-time is 25 hours? You need speed then.

But mostly you don't, and speed ratios of even 300 to one mean nothing if the FTN95-compiled code executes in less than a second! Then there's the business of useful speed that I already alluded to.

Eddie
