Parallelization with FTN95
PaulLaidler (Site Admin)
Posted: Mon Jun 23, 2008 8:15 am

As I understand it, optimisation does not reorder Fortran statements as such, but it does optimise the way in which a given Fortran statement is represented in assembly code. Optimisations can include removing repeated expressions and holding certain intermediate values in registers rather than writing them back to memory, but only in ways that do not change or reorder the expressed intention of the programmer.
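For illustration, a minimal sketch (with invented variable names) of the kind of rewrite this covers: the source below evaluates (b + c) twice, but the compiler may compute it once and hold the result in a register, with no change to observable behaviour.

Code:
program cse_example
  implicit none
  real*8 :: b, c, d, e, x, y
  b = 1.0d0; c = 2.0d0; d = 3.0d0; e = 4.0d0
  x = (b + c) * d    ! first use of the repeated expression
  y = (b + c) / e    ! second use: may reuse the register copy
  print *, x, y      ! results are identical either way
end program cse_example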
JohnCampbell
Posted: Mon Jun 23, 2008 9:42 am

This thread includes something of interest to me.

While I am a strong supporter of FTN95, and acknowledge its strengths in CheckMate, debugging and ClearWin+, there are some aspects of run-time performance which could be improved, if run-time benchmarks are a true indicator.

I would like an option where array operations could be implemented using automatic optimisation. I don't like the way dot_product is implemented as in-line code, so that its performance can change depending on the compiler options.
I typically compile with /debug and avoid /opt, due to problems with the latter in many past compilers. My past experience is that general optimisation does not always work best, but neither do selective optimisation levels.
I am waiting for the results of the work on memory management for /3gb, and hope this addresses some of the performance problems with real*8 calculations.
As with some of DanRRight's comments, a lot of our bad impressions are based on past experience, which may no longer be correct for the current compiler.

I saw some of the results from the test procedures from equation.com for driving multiple processors. It would certainly be interesting if this approach could be applied to some basic (large) vector operations. Dan may be right that "parallelization is our unavoidable future". It's worth watching.

Regards, John
PaulLaidler (Site Admin)
Posted: Mon Jun 23, 2008 2:33 pm

There are 48 optimisations for which we have internal documentation.
I will investigate to see if this documentation might be released in some form.

/INHIBIT_OPTIMISATION <n>

inhibits a given optimisation, and number 41 is documented as "dot product detection".

Please note that many optimisations are applied even when /OPT does not appear on the command line.
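For example (the file name is invented, and this assumes the number follows the option as the <n> placeholder suggests), /OPT could be combined with suppression of the dot product optimisation like this:

Code:
ftn95 solver.f95 /opt /inhibit_optimisation 41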
Andrew
Posted: Tue Jun 24, 2008 12:00 am

Quote:
While I am a strong supporter of FTN95, and acknowledge its strengths in CheckMate, debugging and ClearWin+, there are some aspects of run-time performance which could be improved, if run-time benchmarks are a true indicator.

Indeed, while I do not doubt that many compilers produce faster runtime code, the difference in performance with real-world code may be rather different from what you find with benchmarks that have been around for a long time. Performance of individual codes is of course highly dependent on a range of factors.

The holy grail of compilers is hard to reach - the compilers producing the fastest binaries are mostly those with the weakest diagnostic capabilities. There are some that generally do well on both performance and diagnostics, but from past impressions, their compilation times can be extremely slow.

Some compilers fit particular requirements better than others, depending on where the focus of development lies. Horses for courses.
JohnCampbell
Posted: Wed Jun 25, 2008 5:07 am

I have, for a long time, been trying to identify how I can improve the calculation performance of the equation solver in my finite element program.
I checked my past emails to Salford, and a lot of the problems I identified were reported in 2002, so I can't confirm they are still the case.
There is a vague indication that other compilers have better performance in this area, but I don't have any definite proof.
Certainly in 2002, I was getting results where the run-time performance of "dot_product" could vary by a factor of 2, and at the time I assumed that real*8 arithmetic should account for a substantial part of the compute time. I was asking myself what was happening in this extra processing time, as the mathematical computation itself does not change. My conclusion was that it was associated either with unnecessary transfers of data to and from memory, or with the more confusing movement of data between memory, the "secondary cache" and the processor.
For the last few years I have not been able to run benchmarks that reliably indicate performance, or that show performance improvements relating to the programming strategies of the 70's and 80's. I put this down to the vagaries of Intel cache management.

The problem now gets more complicated with larger problem sizes. I have been trying to improve performance where the active matrix size is in the range of 1GB to 3GB. Any disk I/O now carries a huge performance penalty, which can be compounded by virtual memory mapping, even where there is adequate physical memory.

The equation solver I use is a skyline solver for large sets of (symmetric) linear simultaneous equations, which was a preferred direct solver from the 70's to the 90's. It has two basic array processes:
dot_product ( vector_A, vector_B ) and
vector_A = vector_A - beta * vector_B
These vectors are typically 0 to 20,000 elements long. (A sketch of both kernels is given below.)
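A minimal sketch of the two kernels, with invented names, assuming contiguous real*8 vectors:

Code:
subroutine solver_kernels ( a, b, beta, n, dot )
  implicit none
  integer, intent(in)    :: n
  real*8,  intent(in)    :: b(n), beta
  real*8,  intent(inout) :: a(n)
  real*8,  intent(out)   :: dot
  ! Kernel 1: the dot product of the two vectors.
  dot = dot_product ( a, b )
  ! Kernel 2: the DAXPY-style update, a = a - beta*b.
  a = a - beta * b
end subroutine solver_kernels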

My holy grail is a procedure for these two which optimises performance. The three problem areas I have identified are:
1) unnecessary variable shifts (as in 2002)
2) not utilising multiple processors
3) unnecessary disk transfers

To me the basic mission is simple: dot_product gets the starting address and byte step of 2 vectors in memory, then produces the a.b answer. What puzzles me is why it is so difficult to optimise.

I look forward to the improvements to memory management, especially in SLINK, when the /3gb switch is addressed.

Keep up the good work.

John

PS: I wonder what I would do next if this problem had a solution?
PaulLaidler (Site Admin)
Posted: Wed Jun 25, 2008 7:26 am

John

If you would like to post a sample calculation I would like to take a look at it when I can. I don't know when that will be, but if I had your code to hand I might be able to find a minute to look at it.
JohnCampbell
Posted: Thu Jun 26, 2008 2:58 am

Paul,

Thanks. I will review some of the emails I sent in 2002, see if I can summarise a later one that still identifies the problem, and email it in a cleaner form. It is useful to look at these emails after some time and see how (un)clearly I described the problem.

Typical of the problems with in-line expansion of dot_product is

x = dot_product ( a(i1:i1+n-1), b(j1:j1+n-1) )

which is a fairly good example of where compilation with /debug produces a poor result. Even replacing this by an intermediate call,

x = vec_sum ( a(i1:i1+n-1), b(j1:j1+n-1) )   (F95) or
x = vec_sum ( a(i1), b(j1), n )              (F77)

where vec_sum is only a call to dot_product, produces a much better result.
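A minimal sketch of such a wrapper (the F77-style interface, assuming contiguous real*8 vectors):

Code:
real*8 function vec_sum ( a, b, n )
  implicit none
  integer, intent(in) :: n
  real*8,  intent(in) :: a(n), b(n)
  ! Nothing but a forward to dot_product: the point is that this
  ! one routine can be compiled with /opt while its callers keep /debug.
  vec_sum = dot_product ( a, b )
end function vec_sum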

Also, I saw your comment on selective omission of optimisations. Lahey had something similar, but I never found it useful: it became difficult to use it selectively and to remember which parts of the code could cause which problems.
When my programs have many files, I do use different compilation options (in .bat files; a sketch is given below), being:
/check for data reading and reporting,
/debug for most code, and
/opt for routines that are stable and use a high proportion of the run time.
I do have vec_sum compiled with /opt in my library file.
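A hypothetical .bat fragment illustrating this split (file names invented):

Code:
rem different options for different parts of the program
ftn95 read_data.f95 /check
ftn95 main_code.f95 /debug
ftn95 vec_sum.f95   /opt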

I would like to see automatic implementation of /opt in "safe" code areas, such as:-
array functions like dot_product, and
DO loops where there are no unusual exits, such as calls to subroutines or non-pure procedures.
I suppose what I am saying is that I'm lazy, and I want you to put the effort into improving optimisation, rather than me trying to understand which optimisation approaches give me trouble.

John
PaulLaidler (Site Admin)
Posted: Thu Jun 26, 2008 8:04 am

There is a lot of optimisation that is carried out by default.
/opt provides extra optimisation that could be less safe in certain extreme circumstances.