forums.silverfrost.com

JohnCampbell · Joined: 16 Feb 2006 Posts: 2617 Location: Sydney

My limited experience with multi-core has not been good. A better solution is probably to use third party libraries. I don't have experience of linking them to FTN95, but Dan has had some success. He has mentioned equation.com and there are also other libraries that claim to have multi-processor capacity. I'm not sure on their general availability.
The communication and process control is a significant problem and run time overhead for someone new to the field. Existing solutions are available, but I am not sure how accessible.
This is an area I am wanting to better understand and sounds like it would help your problem.

John

DanRRight · Posted: Sat Apr 28, 2012 1:33 am Post subject:

Most of F2003 and F2008 additions to Fortran are cool options. They should attract new programmers and keep busy existing. Because we all like cool stuff.

Our Fortran "fathers" (excluding some lucky with supercomputers) did not have several things we all have today - we've got color graphics and GUI, multiprocessing and a lot of memory. So Fortran language to be "cool" has to address that in its Standard.

It did not or barely did so far so developers were adding new features at their own risk as non-standard options. Now Fortran community in general has got

- GUI
- multiprocessing
- and some started to add 64-bit addressable space

With FTN95
- we have GUI builder Clearwin+ with graphics libraries and OpenGL, we also can use (and other compilers can use too) Winteracter or GINO.
- we can use multithreading or third party parallel libraries for multiprocessing (though DO CONCURRENT and COARRAYS would be better from the point of view of the source being standard conforming, but professionally made libraries could be still faster - i for example get 8-10 times speedup with 4 cores)
- but we can not get through 4 GB limit even with any third party options while every entry laptop in the market has even more memory.

For me, adding 64bitness would be most usefull next coolness and its time came right of today. Down the road more and more people will jump over 4GB limit so this is exactly an important feature, and it lies exactly on the right trend.

As to other "coolnesses" we have a lot of it already with this compiler people yet do not use or even do not know. Fun is that some new F2003/F2008 standard features exist in this compiler from i think Fortran77 circa 1990. The not exact but by sense close analog of DO CONCURRENT multithreading for example. As well as interoperability with C. And it still has something more -- like partial interoperability with HTML, NET, ability to call system functions which even F2020 will not have.

Cool would be also to have more networking capabilities besides some initial which would be 4th major additions our fathers did not have. And object oriented features. Again this can still be partially done with outside software if its absence kills you. But unfortunately no way we can get 64bit space with any trickery or third party software. So if we'd vote, my wish would be 64bit first, adding parallel options of F200X second, optimization of execution speed to match leaders third.

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

In my experience, Fortran programs which take benefit from multi-processor and multi-core processors are most easy to implement using OpenMP compiler directives. Speed-up factors of 3-4 are achievable on quad-core computers, speed-up factors of 9-12 are achievable on 2 processor 6-core computers (i.e. 12 cores), but actual performance depends on the actual Fortran code.

A code written using OpenMP will compile and run under FTN95 but will just use one core/processor. This can therefore be used for development and debugging/testing, and a different compiler used to take advantage of the multiple processors.
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl

JohnCampbell · Joined: 16 Feb 2006 Posts: 2617 Location: Sydney

David,

My limited experience with OpenMP has not been as successful as you have presented. What numerical problem and algorithm have you been solving ?
I have been applying OpenMP to a direct solution of linear equations, using a skyline solver, but have had problems with the multi-processor overheads.
I'd expect itterative solvers are more suited to distributed processing and OpenMP.
To improve my attempt, I am told I would need to expand the scope of the OpenMP code to reduce the overhead but that doesn't appear possible for my existing code. Not as easy as I had hoped.
However, I have found that the vector instruction set has been more effective and easy to implement.
I'd be interested to hear more of your experiences with OpenMP.

John

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

Hi John,

Most of the time I use OpenMP in applications where there is little or no communication between the different threads, like Monte Carlo simulations where a lot of calculations need to be done independently. These codes are close to being "embarissingly parallel" and performance scales linearly with the number of cores (up to 12, I can't go any higher). However, I have had some success parallelising LR and Cholesky factorisations (factor of nearly 2 on a dual core laptop).

Often its easy to parallelise at the deepest level (innermost loop) but this won't give you good performance, since the work needs to be significant compared with the overhead of managing the threads. You have to parallelise at an outer level somehow. The innermost loops are where you may be able to get some extra benefit from vectorisation.

All this means is it depends on the algorithm and how easy it is to parallelise it. I haven't looked at skyline storage and solving such equations unfortunately.
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl

DanRRight · Posted: Wed May 09, 2012 5:14 am Post subject: Re:

JohnCampbell · Joined: 16 Feb 2006 Posts: 2617 Location: Sydney

Dan,

I have been doing finite element modelling since the late 70's. My main mesh generator is fortran, so my models are fairly regular and not based on 3d solids modelling, which generate much larger models. I have a number of recent 8-node solid models and/or 4-node shell models.
Typically the number of equations is between 150,000 and 200,000 plus one of 450,000 equations.
The average bandwidth is typically 1,000 to 1,500 (one of 5,000) and the maximum bandwidth typically 5,000 to 20,000 equations.

My analysis of the potential for parallel calculations in a direct solver goes something like:
There are basically 3 levels of looping in a direct solver.
1:loop (j) over equations
2:loop (i) over band width (connected equations), changing each coefficient a(i,j) in the column j
3:loop dot_product over common band width, calculating the change to a(i,j) due to columns a(:,i) and a(:,j)

By placing the inner loop in a parallel construct, there are 150,000 x 1,500 = 225 million initiated threads involving a dot product of size 1 to 1,500. This loop calculates the modification to a single element of column (:,j) : too many threads ! My attempt (1 year ago) was on the inner loop and did not work well since I did not have enough processors and there were too many threads.

If you place the inner two loops in a parallel construct, there are 150,000 initiated threads involving 1,500 x 1,500 multiplies. This would work much better, but the coefficients are not independent, as the coefficients of (:,j) are changing during the computation. To use the inner 2 loops, there would be 150,000 threads involving 1,500 x 1,500 /2 = 1.3 million calculations.
While this would reduce the number of threads by a factor of 1,000, the timing of the changes to the coefficients needs to be managed.
It is a big step up but as there are only 8 or 16 processors to share the load, it could possibly be done by doing groups of 8 to 16 equations at a time, rather than the 1,500, and tidy up the left overs.
I presume this is what equation.com have done, but it is a bit messy and you have to deal with the left over equations and number of processors.

Unfortunately there are too many other projects on the go, to go deep into this. It also appears that direct solvers have been replaced by itterative solvers in the major commercial packages. All too much to learn. It will have to wait until I need (must have) the solution.

John

DanRRight · Posted: Wed May 09, 2012 4:59 pm Post subject:

This compiler has high resolution clock and debugging switches to get the time spent in different parts of the code. Have you done this assessment? The key point here: is the preparation of matrix most time consuming or the matrix solution itself?
- if the first one is most time consuming and you can divide it into independent streams then all can be done just within FTN95. It can do multithreading with NET and does it with the charm (specifically with Clearwin, see the examples)
- if matrix solution is most time consuming (which is often true ) then use equation dot com libraries

JohnCampbell · Joined: 16 Feb 2006 Posts: 2617 Location: Sydney

Dan,

I wrote my first skyline solver code in 1976, based on a paper by Graham Powell. I would have spent years trying to find how to improve the speed of the inner loop. It would be 99% of the run time and is basically a Dot_Product. The code could be written as 1 line:

s = 0 ; do i = 1,n ; s = s + a(i)*b(i) ; end do

FTN95 converts it to instructions and does not use a library routine for Dot_Product.
n does vary from 1 to 10,000, which makes it hard to suit different approaches.
The way modern processors do this and optimises it's handling of the cache and all the other black arts of processor optimisation do not appear to be managed by FTN95.
The new vector instruction set can be very effective, but not available in FTN95.

The other function that is used for back substitution of the load vectors is basically:
do i = 1,n
a(i) = a(i) - const * column(i)
end do
This has been a real interesting one, as over the years I have had combinations of FTN95 and some processors where this can take 2x, 5x even 10x as long compared to other compilers. Again I think it has something to do with memory alignment or the processor optimisation (pre-fetch or pre-calculation which I actually know little about; I just assume it takes place. Very much like my understanding of dark matter) (how can it take 10 times as long as other compilers ?)

equation dot com is a very good suggestion and looks like an option that I should be looking into. Do you have any details of using this with FTN95 ?

John

DanRRight · Posted: Thu May 10, 2012 5:53 am Post subject:

John, i afraid if the dot product takes 99% of your CPU time then parallel linear algebra will not help you. Otherwise LAIPE parallel library use is pretty straightforward: you call its solver subroutine instead of yours and at link time add their library to your obj files, that's it. Then you can play changing the number of employed processors and see the speedup. For large matrix sizes which take more then 0.01-0.1 second to solve with regular non-parallel methods the speedup is essentially often just the linear function of amount of CPU cores.
You may also
1) try contacting the author, may be he will parallelize for you the dot product. But if a(i) and b(i) are known and the problem is just the size of dot product matrix so it takes so much time, then
2) do that yourself with multithreading in FTN95.
3) use IVF or Absoft wrapped in a DLL to do just the dot product and then link it to FTN95 because other compilers may do specifically that faster

How much time it usually takes for one typical variant to run?

JohnCampbell · Joined: 16 Feb 2006 Posts: 2617 Location: Sydney

Dan,

You prompted me to look at the way I call the inner loop of the equation reduction : Dot_Product.

calling REDCOL is effectively the outer loop, called 159,139 for each equation and each block
Loop : DO J = JB,JT is the next loop, passing 271,679,207 times while
the loop inside Dot_Product / VecSum is the inner loop

I compared two forms of the inner loop call, using dot_product or my interface routine and produced a 3 x run time.
287 seconds went to 953 seconds, by changing

A(A_ptr_0+J) = A(A_ptr_0+J) - VecSum (A(A_ptr_0+JEQ_bot), B(B_ptr_0+JEQ_bot), JBAND)
to
A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product ( A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) )

Given it takes 287 seconds to do all the real*8 multiplications, what did it do for the other 666 seconds ?
both were compiled with /opt

My F77 form of the call is:

JohnCampbell · Joined: 16 Feb 2006 Posts: 2617 Location: Sydney

Dan,

My F90 form of the call is:

JohnCampbell · Joined: 16 Feb 2006 Posts: 2617 Location: Sydney

Dan,

It gets more frustrating !
I tried to improve the F77 style code by removing the subroutine call. Thios would be more in keeping with transfer to OpenMP. The inner 2 loops are now:

LitusSaxonicum · Posted: Thu May 10, 2012 10:27 am Post subject:

Hi John,

Just a quickie from me. VecSum doesn't need to be EXTERNAL, as it isn't a parameter in a subroutine call, just a straightforward use as a function.

Won't make the timing difference though.

Eddie

DanRRight · Posted: Thu May 10, 2012 12:57 pm Post subject: Re: