Polyhedron Benchmark tests

JohnCampbell
Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Tue Mar 17, 2015 6:01 am    Post subject: Polyhedron Benchmark tests

Based on recent posts, it may be worth setting up a review of the Polyhedron benchmark test suite and providing some indication of why FTN95 is so slow on each of these tests.
I have (years ago) reviewed some of the tests, using the /timing or /profile options to identify the areas in the code where most of the time is being spent.
What I found for a few of the tests was that only a few lines were to blame for FTN95's poor performance, typically where FTN95 failed to optimise the inner loop. Often these could be corrected by "better" code structures.
Other tests exhibited poor indexing of large arrays or used large local arrays that exceeded the stack. Both problems can be addressed by better array usage, either subscript re-ordering or ALLOCATE. (I have always found that if you need a larger stack, the solution is never to use a larger stack, but to fix the problem.)
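For example, a large local array that can exceed the stack can instead be made allocatable (illustrative code only; the names are made up):
Code:

      subroutine assemble (n)
         integer, intent(in) :: n
!        real*8 :: work(4000000)            ! large local array: taken from the stack
         real*8, allocatable :: work(:)     ! allocatable array: taken from the heap
         allocate ( work(n) )
         work = 0.0d0                       ! ... use work exactly as before ...
         deallocate ( work )
      end subroutine assemble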
There were other examples where vector instructions or /Qparallel helped; these are beyond FTN95's capabilities.

If we can assemble this list of reasons, it may help FTN95 users to see what other compilers do automatically and what we, as FTN95 users, can do to improve our code to cope with some of these problems.
It was my general impression from my past reviews that most of the inefficiency attributed to FTN95 can be avoided or at least mitigated.

As a first step I shall generate a list of the latest tests in the suite and proceed to provide a summary of their status. Hopefully as this becomes fairly complete it will document the problems to be avoided.

If anyone has done similar reviews, please let me know.

I'll keep you updated.

John

ps: We need links to the test suites to review

LitusSaxonicum
Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Tue Mar 17, 2015 1:57 pm

John,

I gave up on the Polyhedron benchmarks because they didn't tell me anything I wanted to know. They tell me that FTN95 is 'slow', but I know that it is blindingly fast compared to most things I have used over the past 45 years, and they tell me that some compilers that I can't even get to run are faster still. Well, wasting weeks trying to get a compiler to run at all, or to produce correct results, strikes me as a poor trade for the few seconds saved here and there in run time.

Many years ago, I did a personal review of the 'Spag' utility, and it turned some carefully crafted source codes of mine into things that were unreadable to me. I decided that I'd carry on doing my things my way, and anyone who wanted to do it the Polyhedron way was welcome to it. I formed the view, rightly or wrongly, that once Spag had done its work, you ended up with a source code that would run badly with FTN95, and all the Polyhedron tests were Spagged!

Eddie

JohnCampbell
Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Wed Mar 18, 2015 4:28 am

I located the 17-program Polyhedron benchmark test suite contained in pb11.zip.

I have looked at Inducta.f90, one of the poor performers with FTN95.
Based on the Polyhedron test results for 64-bit Windows 7 on Intel Core i5 2500k, the results show:
Best : Intel 25.66 seconds
FTN95 test : 385.66 seconds
My FTN95 test : 469.25 seconds on an i5-2300 CPU @ 2.80GHz
My modified code test : 240.90 seconds

I had hoped for a better improvement when identifying code structures that were not well managed by FTN95, but the improvement was less than expected.
The analysis of performance in the 6,636 lines of code showed four near-identical inner loops that are responsible for 99% of the run time. FTN95 /timing initially showed the two routines that were responsible, and I used the rdtsc_tick() function to confirm the problem area in these routines.
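(rdtsc_tick() reads the CPU time-stamp counter; a portable stand-in using only the standard SYSTEM_CLOCK intrinsic would be something like this sketch, with illustrative names:)
Code:

      subroutine tick_report (label, t_last)
!        call once before the section of interest to initialise t_last, and again after it
         character(*), intent(in)    :: label
         integer,      intent(inout) :: t_last
         integer :: t_now, rate
         call system_clock ( t_now, rate )
         write (*,'(1x,a,f12.6,a)') label, real(t_now - t_last)/real(rate), ' seconds'
         t_last = t_now
      end subroutine tick_report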
The original code was:
Code:

      real (kind = longreal), dimension(:,:), intent(in) :: rotate_coil

          do j = 1, 9
              c_vector(3) = 0.5 * h_coil * z1gauss(j)
!
!       rotate coil vector into the global coordinate system and translate it
              rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx
              rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy
              rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz
!
              do k = 1, 9
                  q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
                  q_vector(2) = 0.5_longreal * b1 * (y2gauss(k) - 1.0_longreal)
                  q_vector(3) = 0.0_longreal
!
!       rotate quad vector into the global coordinate system
                  rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
                  rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
                  rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
!
!       compute and add in quadrature term
                  numerator = w1gauss(j) * w2gauss(k) *                                     &
                                                 dot_product(coil_current_vec,current_vector)
                  denominator = sqrt(dot_product(rot_c_vector-rot_q_vector,                 &
                                                                  rot_c_vector-rot_q_vector))
                  l12_lower = l12_lower + numerator/denominator
              end do
          end do

The code I identified as problematic for FTN95 is:

real (kind = longreal), dimension(:,:), intent(in) :: rotate_coil
This uses an assumed-shape dimension, where (3,3) is the only valid input.

dot_product(coil_current_vec,current_vector)
This dot_product is repeated in the inner loop but does not change, so it was taken out of the loops (probably identified by the compiler anyway).

dot_product(rotate_coil(1,:),c_vector(:))
This uses array sections in dot_product, which can be slow with FTN95 (a hand-expanded form is sketched below).

denominator = sqrt(dot_product(rot_c_vector-rot_q_vector, &
                               rot_c_vector-rot_q_vector))
This is a complex argument for dot_product, which I tried to clean up (no improvement).

Unfortunately, none of these changes produced significant improvements in my testing. I would expect that this is because FTN95 already recognises some of these cases.
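(For illustration, a hand-expanded form of the array-section dot_product, relying on rotate_coil being 3x3, would be along these lines:)
Code:

      real (kind = longreal), dimension(3,3), intent(in) :: rotate_coil    ! explicit shape instead of (:,:)
!
!       expand the array-section dot_product by hand
              rot_c_vector(1) = rotate_coil(1,1)*c_vector(1) + rotate_coil(1,2)*c_vector(2) + rotate_coil(1,3)*c_vector(3) + dx
              rot_c_vector(2) = rotate_coil(2,1)*c_vector(1) + rotate_coil(2,2)*c_vector(2) + rotate_coil(2,3)*c_vector(3) + dy
              rot_c_vector(3) = rotate_coil(3,1)*c_vector(1) + rotate_coil(3,2)*c_vector(2) + rotate_coil(3,3)*c_vector(3) + dz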


JohnCampbell
Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Wed Mar 18, 2015 4:30 am

The final change was to exploit the fact that “q_vector(3) = 0.0”, so that the calculation of “rot_c_vector-rot_q_vector” in the inner loop could be replaced with the explicit code:
Code:

                  rot_q_vector(1) = rot_c_vector(1) - (rotate_quad(1,1)*q_vector(1) + rotate_quad(1,2)*q_vector(2))
                  rot_q_vector(2) = rot_c_vector(2) - (rotate_quad(2,1)*q_vector(1) + rotate_quad(2,2)*q_vector(2))
                  rot_q_vector(3) = rot_c_vector(3) - (rotate_quad(3,1)*q_vector(1) + rotate_quad(3,2)*q_vector(2))

The net result was a 49% reduction in run time from changing 4 lines of code in 4 locations. Unfortunately this is still well above the run times of other compilers. I was hoping to demonstrate code syntax that was not suited to the FTN95 compiler, but this change is more related to inefficiency in the original code. I wonder if other compilers would identify this?

This loop calculation would be suitable for vector instructions and cache localisation, which are not available in FTN95 but assist other compilers.

I will look at some other tests and see if I can demonstrate other changes that assist FTN95.

John

JohnCampbell
Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Wed Mar 18, 2015 11:13 am

The next program I have looked at is mp_prop_design.f90.
This program is ONE routine; no subroutines! Before I realised this, I initially tried ftn95 /timing, only to find a single line of output in the .tmo file. I then tried ftn95 /profile (for the first time) and found it to work quite well in conjunction with SDBG.
This identified the DO loop with the highest line-usage count. The code in this loop is:
Code:
                     DO k = 1 , tip - 1
!
                        DO i = 1 , 1 + NINT(2.0D0*PI*trns/dphit) ,      &
     &                     NINT(ainc/(dphit*(180.0D0/PI)))
!
                           phit = phib + phie(k) + (REAL(i)-0.50D0)     &
     &                            *dphit
!
                           xt = re(k)*COS(phit)
!
                           yt = re(k)*SIN(phit)
!
                           zt = -(REAL(i)-0.50D0)*dphit*re(k)           &
     &                          *TAN(thetae(k)+alphie(k))
!
                           distx = x(j) - xt
!
                           disty = y(j) - yt
!
                           distz = -zt
!
                           dltphi = -re(k)*dphit
!
                           dltz = re(k)*dphit*TAN(thetae(k)+alphie(k))
!
                           dltx = -dltphi*SIN(phit)
!
                           dlty = dltphi*COS(phit)
!
                           dwtx = (dgame(k)/(4.0D0*PI))                 &
     &                            *((dlty*distz-dltz*disty)             &
     &                            /(SQRT(ABS(distx)**2.0D0+ABS(disty)   &
     &                            **2.0D0+ABS(distz)**2.0D0))**3.0D0)
!
                           dwty = (dgame(k)/(4.0D0*PI))                 &
     &                            *((dltz*distx-dltx*distz)             &
     &                            /(SQRT(ABS(distx)**2.0D0+ABS(disty)   &
     &                            **2.0D0+ABS(distz)**2.0D0))**3.0D0)
!
                           dwtphi = dwty*COS(phi(j)) - dwtx*SIN(phi(j))
!
                           dwtz = (dgame(k)/(4.0D0*PI))                 &
     &                            *((dltx*disty-dlty*distx)             &
     &                            /(SQRT(ABS(distx)**2.0D0+ABS(disty)   &
     &                            **2.0D0+ABS(distz)**2.0D0))**3.0D0)
!
                           wphi(j) = -ABS(wphi(j)+dwtphi)
!
                           wz(j) = wz(j) + dwtz
!
                        ENDDO
!
                     ENDDO


There are a number of problems in this code, the main one being repeated code and use of real exponents, such as in the calculation of dwtx, dwty and dwtz.
This can be changed to:
Code:
                           dist   = sqrt ( distx**2 + disty**2 + distz**2 )
                           dist3  = dist**3
!
                           tmp1   = dgame(k) /(4.0D0*PI * dist3)
                           dwtx   = tmp1 * (dlty*distz - dltz*disty)

Based on the Polyhedron test results for 64-bit Windows 7 on Intel Core i5 2500k, the results show:
Best : Intel 19.33 seconds
FTN95 test : 562.03 seconds
My FTN95 test : 689.36 seconds
My modified code test : 229.21 seconds on an i5-2300 CPU @ 2.80GHz, which is a reduction of 67%

JohnCampbell
Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Wed Mar 18, 2015 11:22 am

mp_prop_design.f90 ctd.

The main problem was:
Code:
                            dwtx = (dgame(k)/(4.0D0*PI))                 &
      &                            *((dlty*distz-dltz*disty)             &
      &                            /(SQRT(ABS(distx)**2.0D0+ABS(disty)   &
      &                            **2.0D0+ABS(distz)**2.0D0))**3.0D0)

My revised code for these loops is:
Code:
                     limit = NINT ( trns * pi*2.0D0   / dphit ) + 1
                     step  = NINT ( ainc * pi/180.0d0 / dphit )
                     cosj  = COS (phi(j))
                     sinj  = SIN (phi(j))
!
                     DO k = 1 , tip - 1                                         ! 2.815e6
!
                        tana   = TAN (thetae(k)+alphie(k))
                        dltphi = -re(k)*dphit
                        dltz   =  re(k)*dphit*tana
!
                        DO i = 1 , limit , step                                 ! 3.378e7
!
                           tmp1   = (REAL(i)-0.50D0) * dphit
                           phit   = phib + phie(k) + tmp1       ! 2.06e9
!
                           cosa   = COS (phit)
                           sina   = SIN (phit)
!
                           xt     =  re(k) * cosa
                           yt     =  re(k) * sina
                           zt     =  re(k) * tana * tmp1
!
                           distx  = x(j) - xt
                           disty  = y(j) - yt
                           distz  = zt
                           dist   = sqrt ( distx**2 + disty**2 + distz**2 )
                           dist3  = dist**3
!
                           dltx   = -dltphi*sina
                           dlty   =  dltphi*cosa
!
                           tmp1   = dgame(k) /(4.0D0*PI * dist3)
                           dwtx   = tmp1 * (dlty*distz - dltz*disty)
                           dwty   = tmp1 * (dltz*distx - dltx*distz)     
                           dwtz   = tmp1 * (dltx*disty - dlty*distx)
!
                           dwtphi = dwty*cosj - dwtx*sinj
!
                           wphi(j) = -ABS(wphi(j)+dwtphi)
                           wz(j)   = wz(j) + dwtz
!
                        ENDDO    ! i                                       2.06e9
!
                     ENDDO       ! k                                       3.37e7


The basic changes have been:
# change **2.0d0 to **2,
# provide some temporary variables to remove repeated code (e.g. dist3), and
# move some calculations out of the DO loop.
Note: the comment numbers past column 73 are line-use counts I recovered from /profile + SDBG.

I have reviewed this program to identify code that FTN95 does not handle well. The conclusion I am moving towards is: why should it? There are examples in this program of code that was developed without attention to run-time efficiency. In the few examples I have reviewed, there appears to be a bias in the test suite towards code that suits optimising compilers, while code that is developed with more attention to performance is not being considered.

Why should FTN95 be measured against code that did not require this efficiency in the first place?

I'll continue to update.

John

LitusSaxonicum
Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Wed Mar 18, 2015 12:37 pm

John,

Did you make any gains by doing repeated multiplication instead of raising to powers, e.g. dist*dist*dist instead of dist**3, or even by forming dist2 = dist*dist (since that is done earlier anyway) and then dist3 = dist2*dist?

I remember noting in a previous thread somewhere that raising to a REAL power, where the REAL is a constant that is clearly really an INTEGER, would be an easy optimisation for the compiler.
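
In code, the alternatives I have in mind are these (a sketch only):
Code:

      dist2 = dist*dist          ! already computed earlier
      dist3 = dist2*dist         ! repeated multiplication, re-using dist2
      dist3 = dist**3            ! integer power: a good compiler lowers this to multiplies
      dist3 = dist**3.0d0        ! real power: typically evaluated via exp/log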

Eddie

JohnCampbell
Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Wed Mar 18, 2015 1:06 pm

Eddie,

My view is that if d*d were faster than d**2, the compiler would probably make the change. However, d**2.0d0 is much slower than d**2, because
d**x becomes exp ( x * log(d) ) when x is not recognised as an integer value.
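(A quick, unscientific way to see this; a sketch only, not a proper benchmark:)
Code:

      program power_demo
      real*8 :: d
      d = 1.2345d0
      print *, d**2, d*d                        ! integer power: just a multiply
      print *, d**2.0d0, exp(2.0d0*log(d))      ! real power: typically evaluated via exp/log
      end program power_demo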

Using d**2.0d0 rather than d**2 is the programmer simply ignoring what the compiler must do with what it is given; bad programming in my opinion. I'm sure others will disagree.

John

LitusSaxonicum
Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Wed Mar 18, 2015 1:44 pm

John,

It isn't listed in the documentation as an optimisation.

While A*A is essentially faster than A**2 and substantially better than A**2.0, there comes a point where A*A*A* ... must run out of steam.

A lot of the optimisations listed in the documentation look like common subexpression removal in one form or another. As far as I'm aware, FTN95 does not examine A**X to see whether X is an integer expressed as a real.

Eddie

LitusSaxonicum
Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Wed Mar 18, 2015 6:26 pm

I had more thoughts.

(1) You get the biggest speedup on given hardware by choosing the best algorithm.
(2) You get a speedup with given hardware and compiler by writing the code properly.
(3) You get a speedup by having faster hardware (especially if you wait some years before trying again).
(4) You might get a speedup in execution time on given hardware by changing the compiler, but you'll never recover the time it took you to convert from one compiler to another.
(5) It probably doesn't matter anyway.

To illustrate (5): if you reduce a run time from 24 hours to 8, you still get the answer tomorrow. If you reduce the run time from 20 minutes to 2, you will waste those 2 minutes by waiting, whereas in the 20 minutes you could have gone and done something useful. This includes turning to a different computer, if you must.

(6) What is the point of a Windows interface on a code that runs anything other than nearly instantly? If a box pops up asking 'OK to continue?' a minute after you left the office, then you have wasted probably 16 hours anyway.

DanRRight
Joined: 10 Mar 2008
Posts: 2814
Location: South Pole, Antarctica

Posted: Fri Mar 20, 2015 6:23 pm

Eddie, of course speed matters. If, for example, life on Earth, made by trial and error actively developed over the last 10^9 years in a giant Monte Carlo simulation, ran just 10^-7 times faster, we would not need the concept of paradise, because we would already have found out how not to die!

There is one exception, used in the Quake III Arena game, which can speed up a**b calculations when b is exactly a half-integer: the fast inverse square root. We hit this just a few days back and are investigating how to implement it. Otherwise, adding some atomic data into the code, which uses a few regular a**b evaluations, makes our PIC code run times sky high, basically unmanageable, even though it all runs on supercomputers. I see on Wikipedia that SSE can do this even faster.
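(For reference, a single-precision Fortran sketch of that trick, assuming 32-bit default REAL and INTEGER; the magic constant and single Newton step follow the well-known Quake III code:)
Code:

      real function rsqrt_fast (x)
         real,    intent(in) :: x
         integer :: i
         real    :: y
         i = transfer ( x, i )               ! reinterpret the bits of x as an integer
         i = 1597463007 - ishft ( i, -1 )    ! magic constant 0x5F3759DF minus i/2
         y = transfer ( i, y )               ! first estimate of 1/sqrt(x)
         y = y * (1.5 - 0.5*x*y*y)           ! one Newton-Raphson refinement
         rsqrt_fast = y
      end function rsqrt_fast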

John, you are doing, and have done, a great job speeding up these unoptimized codes, but your idea that you can beat the compiler's optimization goes against the basic trend and is hardly realizable by the average user and, as you have shown, even by top-grade users. For that, someone needs to be a computer-science hacker who intimately knows compilers, assembler and processor architecture, on top of the language itself, plus top-level maths and more than 20 years of software optimization experience, all at the same time! You are like Garry Kasparov trying to beat a machine! The trend is exactly the opposite: the code can be written by total novices without any clue about optimization, and the compiler has to make it faster than anyone but the absolute top-level hackers could. And it seems IVF is so far doing its job pretty well: you optimized the codes and made them 2x and 3x faster, but IVF made them 30x faster! "Do not optimize your code, let the compiler do that for you." "Sometimes your hand optimizations will make it worse." These were the thoughts 20 years back, when the Polyhedron benchmarks were assembled.

Anyway, I thought that instead of fighting the almost unbeatable, you would just chop these codes into the smallest possible pieces with your timers, and select, simplify and bring to light the manageable grains of source that are responsible for FTN95 running slowly versus IVF. Ideally, you would also benchmark these snippets side by side with other compilers. Then I'd hope Silverfrost would quickly implement the optimizations in the compiler, because if this compiler were also as fast as the others it would be totally killer software.


Last edited by DanRRight on Sat Mar 21, 2015 6:02 am; edited 4 times in total

PaulLaidler
Site Admin
Joined: 21 Feb 2005
Posts: 7923
Location: Salford, UK

Posted: Fri Mar 20, 2015 8:43 pm

Future work on optimizing will focus on the 64 bit backend which will probably behave quite differently. The benchmark tests will be relevant but will probably have a different outcome.

John-Silver
Joined: 30 Jul 2013
Posts: 1520
Location: Aerospace Valley

Posted: Sat Mar 21, 2015 7:35 am

It'll be very interesting to see how JohnCampbell's excellent initiative in revisiting the benchmarks will be reflected in the equivalent x64 results, then. Hopefully all the differences will be positive ;-)
John's 'baseline' results and conclusions will be a good proving ground for showing the world how much better FTN95 is at that point.
Of course, keeping track of any optimisations already incorporated, or to be incorporated, during the development would make that comparison easier too.

By the way Paul, how is progress on the x64 version going? Any idea of an anticipated beta release date, maybe (provisional and without any commitment, of course)? I seem to remember that when you first talked about this last year the anticipation was for some time in 'early 2015', but of course things can arise which totally blow expectations off course.

PaulLaidler
Site Admin
Joined: 21 Feb 2005
Posts: 7923
Location: Salford, UK

Posted: Sat Mar 21, 2015 11:14 am

The development of the 64 bit backend is going very well, with a large majority of the basic work completed. We are now working through our test suite using the new backend, and it is difficult to predict how long this will take to complete. The initial release will not include optimisation.

LitusSaxonicum
Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Sat Mar 21, 2015 12:52 pm

While the 32 bit version of FTN77 is perfectly adequate for what I do (or would be if its ClearWin+ were today's), I welcome this development. There can hardly be an AMD/Intel CPU in regular use today that isn't capable of running Windows in 64 bit mode, and even Windows tablets have 2GB minimum.

Forget software optimisation: it may help, but you run the risk of going too far and getting wrong results. What is needed is to use modern opcodes. That may well be difficult, as the facilities of the x87 coprocessor stack were rarely exploited to the full.

Dan, you didn't read or fully understand what I wrote. I never said that speed never matters, just that there are times when it doesn't.

John C: if you make your changes to the propeller code, does it speed up comparably under iFort, or is it something that hits FTN95 particularly badly?

Eddie