
forums.silverfrost.com Welcome to the Silverfrost forums

View previous topic :: View next topic 
Author 
Message 
JohnCampbell
Joined: 16 Feb 2006 Posts: 2159 Location: Sydney

Posted: Fri Apr 27, 2012 3:18 am Post subject: 


My limited experience with multicore has not been good. A better solution is probably to use third party libraries. I don't have experience of linking them to FTN95, but Dan has had some success. He has mentioned equation.com and there are also other libraries that claim to have multiprocessor capacity. I'm not sure on their general availability.
The communication and process control is a significant problem and run time overhead for someone new to the field. Existing solutions are available, but I am not sure how accessible.
This is an area I am wanting to better understand and sounds like it would help your problem.
John 

Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 2150 Location: South Pole, Antarctica

Posted: Sat Apr 28, 2012 1:33 am Post subject: 


Most of F2003 and F2008 additions to Fortran are cool options. They should attract new programmers and keep busy existing. Because we all like cool stuff.
Our Fortran "fathers" (excluding some lucky with supercomputers) did not have several things we all have today  we've got color graphics and GUI, multiprocessing and a lot of memory. So Fortran language to be "cool" has to address that in its Standard.
It did not or barely did so far so developers were adding new features at their own risk as nonstandard options. Now Fortran community in general has got
 GUI
 multiprocessing
 and some started to add 64bit addressable space
With FTN95
 we have GUI builder Clearwin+ with graphics libraries and OpenGL, we also can use (and other compilers can use too) Winteracter or GINO.
 we can use multithreading or third party parallel libraries for multiprocessing (though DO CONCURRENT and COARRAYS would be better from the point of view of the source being standard conforming, but professionally made libraries could be still faster  i for example get 810 times speedup with 4 cores)
 but we can not get through 4 GB limit even with any third party options while every entry laptop in the market has even more memory.
For me, adding 64bitness would be most usefull next coolness and its time came right of today. Down the road more and more people will jump over 4GB limit so this is exactly an important feature, and it lies exactly on the right trend.
As to other "coolnesses" we have a lot of it already with this compiler people yet do not use or even do not know. Fun is that some new F2003/F2008 standard features exist in this compiler from i think Fortran77 circa 1990. The not exact but by sense close analog of DO CONCURRENT multithreading for example. As well as interoperability with C. And it still has something more  like partial interoperability with HTML, NET, ability to call system functions which even F2020 will not have.
Cool would be also to have more networking capabilities besides some initial which would be 4th major additions our fathers did not have. And object oriented features. Again this can still be partially done with outside software if its absence kills you. But unfortunately no way we can get 64bit space with any trickery or third party software. So if we'd vote, my wish would be 64bit first, adding parallel options of F200X second, optimization of execution speed to match leaders third.
Last edited by DanRRight on Mon Apr 30, 2012 6:48 am; edited 3 times in total 

Back to top 


davidb
Joined: 17 Jul 2009 Posts: 557 Location: UK

Posted: Sun Apr 29, 2012 11:51 am Post subject: 


In my experience, Fortran programs which take benefit from multiprocessor and multicore processors are most easy to implement using OpenMP compiler directives. Speedup factors of 34 are achievable on quadcore computers, speedup factors of 912 are achievable on 2 processor 6core computers (i.e. 12 cores), but actual performance depends on the actual Fortran code.
A code written using OpenMP will compile and run under FTN95 but will just use one core/processor. This can therefore be used for development and debugging/testing, and a different compiler used to take advantage of the multiple processors. _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 2159 Location: Sydney

Posted: Tue May 01, 2012 7:11 am Post subject: 


David,
My limited experience with OpenMP has not been as successful as you have presented. What numerical problem and algorithm have you been solving ?
I have been applying OpenMP to a direct solution of linear equations, using a skyline solver, but have had problems with the multiprocessor overheads.
I'd expect itterative solvers are more suited to distributed processing and OpenMP.
To improve my attempt, I am told I would need to expand the scope of the OpenMP code to reduce the overhead but that doesn't appear possible for my existing code. Not as easy as I had hoped.
However, I have found that the vector instruction set has been more effective and easy to implement.
I'd be interested to hear more of your experiences with OpenMP.
John 

Back to top 


davidb
Joined: 17 Jul 2009 Posts: 557 Location: UK

Posted: Thu May 03, 2012 5:57 pm Post subject: 


Hi John,
Most of the time I use OpenMP in applications where there is little or no communication between the different threads, like Monte Carlo simulations where a lot of calculations need to be done independently. These codes are close to being "embarissingly parallel" and performance scales linearly with the number of cores (up to 12, I can't go any higher). However, I have had some success parallelising LR and Cholesky factorisations (factor of nearly 2 on a dual core laptop).
Often its easy to parallelise at the deepest level (innermost loop) but this won't give you good performance, since the work needs to be significant compared with the overhead of managing the threads. You have to parallelise at an outer level somehow. The innermost loops are where you may be able to get some extra benefit from vectorisation.
All this means is it depends on the algorithm and how easy it is to parallelise it. I haven't looked at skyline storage and solving such equations unfortunately. _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl 

Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 2150 Location: South Pole, Antarctica

Posted: Wed May 09, 2012 5:14 am Post subject: Re: 


JohnCampbell wrote:  I have been applying OpenMP to a direct solution of linear equations, using a skyline solver, but have had problems with the multiprocessor overheads. 
What is the bandwidth of the matrix and the number of equations? How much time it takes to solve your set of equations? 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 2159 Location: Sydney

Posted: Wed May 09, 2012 1:56 pm Post subject: 


Dan,
I have been doing finite element modelling since the late 70's. My main mesh generator is fortran, so my models are fairly regular and not based on 3d solids modelling, which generate much larger models. I have a number of recent 8node solid models and/or 4node shell models.
Typically the number of equations is between 150,000 and 200,000 plus one of 450,000 equations.
The average bandwidth is typically 1,000 to 1,500 (one of 5,000) and the maximum bandwidth typically 5,000 to 20,000 equations.
My analysis of the potential for parallel calculations in a direct solver goes something like:
There are basically 3 levels of looping in a direct solver.
1:loop (j) over equations
2:loop (i) over band width (connected equations), changing each coefficient a(i,j) in the column j
3:loop dot_product over common band width, calculating the change to a(i,j) due to columns a(:,i) and a(:,j)
By placing the inner loop in a parallel construct, there are 150,000 x 1,500 = 225 million initiated threads involving a dot product of size 1 to 1,500. This loop calculates the modification to a single element of column (:,j) : too many threads ! My attempt (1 year ago) was on the inner loop and did not work well since I did not have enough processors and there were too many threads.
If you place the inner two loops in a parallel construct, there are 150,000 initiated threads involving 1,500 x 1,500 multiplies. This would work much better, but the coefficients are not independent, as the coefficients of (:,j) are changing during the computation. To use the inner 2 loops, there would be 150,000 threads involving 1,500 x 1,500 /2 = 1.3 million calculations.
While this would reduce the number of threads by a factor of 1,000, the timing of the changes to the coefficients needs to be managed.
It is a big step up but as there are only 8 or 16 processors to share the load, it could possibly be done by doing groups of 8 to 16 equations at a time, rather than the 1,500, and tidy up the left overs.
I presume this is what equation.com have done, but it is a bit messy and you have to deal with the left over equations and number of processors.
Unfortunately there are too many other projects on the go, to go deep into this. It also appears that direct solvers have been replaced by itterative solvers in the major commercial packages. All too much to learn. It will have to wait until I need (must have) the solution.
John 

Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 2150 Location: South Pole, Antarctica

Posted: Wed May 09, 2012 4:59 pm Post subject: 


This compiler has high resolution clock and debugging switches to get the time spent in different parts of the code. Have you done this assessment? The key point here: is the preparation of matrix most time consuming or the matrix solution itself?
 if the first one is most time consuming and you can divide it into independent streams then all can be done just within FTN95. It can do multithreading with NET and does it with the charm (specifically with Clearwin, see the examples)
 if matrix solution is most time consuming (which is often true ) then use equation dot com libraries 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 2159 Location: Sydney

Posted: Thu May 10, 2012 4:50 am Post subject: 


Dan,
I wrote my first skyline solver code in 1976, based on a paper by Graham Powell. I would have spent years trying to find how to improve the speed of the inner loop. It would be 99% of the run time and is basically a Dot_Product. The code could be written as 1 line:
s = 0 ; do i = 1,n ; s = s + a(i)*b(i) ; end do
FTN95 converts it to instructions and does not use a library routine for Dot_Product.
n does vary from 1 to 10,000, which makes it hard to suit different approaches.
The way modern processors do this and optimises it's handling of the cache and all the other black arts of processor optimisation do not appear to be managed by FTN95.
The new vector instruction set can be very effective, but not available in FTN95.
The other function that is used for back substitution of the load vectors is basically:
do i = 1,n
a(i) = a(i)  const * column(i)
end do
This has been a real interesting one, as over the years I have had combinations of FTN95 and some processors where this can take 2x, 5x even 10x as long compared to other compilers. Again I think it has something to do with memory alignment or the processor optimisation (prefetch or precalculation which I actually know little about; I just assume it takes place. Very much like my understanding of dark matter) (how can it take 10 times as long as other compilers ?)
equation dot com is a very good suggestion and looks like an option that I should be looking into. Do you have any details of using this with FTN95 ?
John 

Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 2150 Location: South Pole, Antarctica

Posted: Thu May 10, 2012 5:53 am Post subject: 


John, i afraid if the dot product takes 99% of your CPU time then parallel linear algebra will not help you. Otherwise LAIPE parallel library use is pretty straightforward: you call its solver subroutine instead of yours and at link time add their library to your obj files, that's it. Then you can play changing the number of employed processors and see the speedup. For large matrix sizes which take more then 0.010.1 second to solve with regular nonparallel methods the speedup is essentially often just the linear function of amount of CPU cores.
You may also
1) try contacting the author, may be he will parallelize for you the dot product. But if a(i) and b(i) are known and the problem is just the size of dot product matrix so it takes so much time, then
2) do that yourself with multithreading in FTN95.
3) use IVF or Absoft wrapped in a DLL to do just the dot product and then link it to FTN95 because other compilers may do specifically that faster
How much time it usually takes for one typical variant to run? 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 2159 Location: Sydney

Posted: Thu May 10, 2012 7:27 am Post subject: 


Dan,
You prompted me to look at the way I call the inner loop of the equation reduction : Dot_Product.
calling REDCOL is effectively the outer loop, called 159,139 for each equation and each block
Loop : DO J = JB,JT is the next loop, passing 271,679,207 times while
the loop inside Dot_Product / VecSum is the inner loop
I compared two forms of the inner loop call, using dot_product or my interface routine and produced a 3 x run time.
287 seconds went to 953 seconds, by changing
A(A_ptr_0+J) = A(A_ptr_0+J)  VecSum (A(A_ptr_0+JEQ_bot), B(B_ptr_0+JEQ_bot), JBAND)
to
A(A_ptr_0+J) = A(A_ptr_0+J)  Dot_Product ( A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) )
Given it takes 287 seconds to do all the real*8 multiplications, what did it do for the other 666 seconds ?
both were compiled with /opt
My F77 form of the call is:
Code: 
SUBROUTINE REDCOL_77 (A, B, NB, JB, JT, IEQ_bot)
!
! Reduces vector 'A' by block 'B'
! Allows for skyline matrix to be stored in multiple blocks
! Vector A might or might not be in block B
!
REAL*8 A(*), & ! column to be reduced provided as A(IEQ_bot:)
B(*) ! skyline block of columns
INTEGER*4 NB(*), & ! block structure of B
JB, & ! first equation to reduce (is in block B)
JT, & ! last equation to reduce (is in block B and < equation A)
IEQ_bot ! first active equation in column A
!
INTEGER*4 A_ptr_0, B_ptr_0, JEQ_bot, J, JBOT, JPOINT, JBAND
REAL*8 VecSum
EXTERNAL VecSum ! Dot_Product using F77 syntax
!
IF (JB > JT) RETURN
JBOT = NB(1) ! first equation column in Block B
A_ptr_0 = 1IEQ_bot ! virtual pointer to A(0)
!
DO J = JB,JT ! loop over range of equations
JPOINT = J  JBOT + 3 ! pointer to index for column J in block B
B_ptr_0 = NB(JPOINT)  J ! virtual pointer to B(0,j)
!
JEQ_bot = NB(JPOINT1)  B_ptr_0 + 1 ! first active equation of column J = J  jband + 1 where jband = NB(JPOINT)NB(JPOINT1)
IF (IEQ_bot > JEQ_bot) JEQ_bot = IEQ_bot
JBAND = J  JEQ_bot ! active bandwidth for equation J and equation A
IF (JBAND < 1) CYCLE
!
A(A_ptr_0+J) = A(A_ptr_0+J)  VecSum (A(A_ptr_0+JEQ_bot), B(B_ptr_0+JEQ_bot), JBAND) ! 99% of all the work is done here
!
END DO
!
RETURN
END
REAL*8 FUNCTION VECSUM (A, B, N)
!
! Performs a vector dot product VECSUM = [A] . [B]
! account is NOT taken of the zero terms in the vectors
!
integer*4, intent (in) :: n
real*8, dimension(n), intent (in) :: a
real*8, dimension(n), intent (in) :: b
!
vecsum = dot_product ( a, b )
RETURN
!
END
CPU TIMES FOR EQUATION REDUCTION
Description Num Cpu I/O Avg
1 READ BLOCK A 4 1.00 0.02 0.256
3 MOVE BLOCK B 3 0.42 0.02 0.146
4 REDUCE BLOCK A 4 287.26 1.68 72.235
5 APPLY BLOCK B 3 5.38 0.11 1.829
6 WRITE BLOCK A 4 1.17 0.06 0.308
7 DIAGONAL ANALYSIS 2 0.02 0.01 0.003
99 TOTAL REDUCTION 4 295.25 1.88 74.282
 [/code]
Last edited by JohnCampbell on Thu May 10, 2012 7:37 am; edited 2 times in total 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 2159 Location: Sydney

Posted: Thu May 10, 2012 7:29 am Post subject: 


Dan,
My F90 form of the call is:
Code: 
IF (JB > JT) RETURN
JBOT = NB(1) ! first equation column in Block B
A_ptr_0 = 1IEQ_bot ! virtual pointer to A(0)
!
DO J = JB,JT ! loop over range of equations
JPOINT = J  JBOT + 3 ! pointer to index for column J in block B
B_ptr_0 = NB(JPOINT)  J ! virtual pointer to B(0,j)
!
JEQ_bot = NB(JPOINT1)  B_ptr_0 + 1 ! first active equation of column J = J  jband + 1 where jband = NB(JPOINT)NB(JPOINT1)
IF (IEQ_bot > JEQ_bot) JEQ_bot = IEQ_bot
IF (JEQ_bot >= J) CYCLE ! active bandwidth for equation J and equation A
!
A_ptr_t = J  IEQ_bot
B_ptr_t = NB(JPOINT)  1
A_ptr_b = A_ptr_0 + JEQ_bot
B_ptr_b = B_ptr_0 + JEQ_bot
!
A(A_ptr_0+J) = A(A_ptr_0+J)  Dot_Product ( A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) ) ! 99% of all the work is done here
!
END DO
CPU TIMES FOR EQUATION REDUCTION
Description Num Cpu I/O Avg
1 READ BLOCK A 4 0.78 0.01 0.197
3 MOVE BLOCK B 3 0.42 0.01 0.142
4 REDUCE BLOCK A 4 953.29 6.96 240.063
5 APPLY BLOCK B 3 17.46 0.17 5.875
6 WRITE BLOCK A 4 1.03 0.01 0.254
7 DIAGONAL ANALYSIS 2 0.00 0.01 0.003
99 TOTAL REDUCTION 4 972.98 7.13 245.028
159,139 calls to redcol
181,398,857 IEQ_bot > JEQ_bot test
1,126 JEQ_bot >= J test
271,679,207 Dot_Product / VecSum calls
there are 4 blocks in the equation set.



Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 2159 Location: Sydney

Posted: Thu May 10, 2012 9:19 am Post subject: 


Dan,
It gets more frustrating !
I tried to improve the F77 style code by removing the subroutine call. Thios would be more in keeping with transfer to OpenMP. The inner 2 loops are now: Code:  DO J = JB,JT ! loop over range of equations
JPOINT = J  JBOT + 3 ! pointer to index for column J in block B
B_ptr_0 = NB(JPOINT)  J ! virtual pointer to B(0,j)
!
JEQ_bot = NB(JPOINT1)  B_ptr_0 + 1 ! first active equation of column J
IF (IEQ_bot > JEQ_bot) JEQ_bot = IEQ_bot
!
c = 0
do k = JEQ_bot,J1
c = c + A(A_ptr_0+k) * B(B_ptr_0+k)
end do
A(A_ptr_0+J) = A(A_ptr_0+J)  c
!
END DO

The run time blew out to from 295 to 950 seconds.
I thought the K loop would not have failed like that and there is only 1,126 of 271,000,000 of the K loop runs that are null.
Again compiled with /opt
All today's runs were on a Core i5 notebook.
How do you explain this performance ? 

Back to top 


LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2074 Location: Yateley, Hants, UK

Posted: Thu May 10, 2012 10:27 am Post subject: 


Hi John,
Just a quickie from me. VecSum doesn't need to be EXTERNAL, as it isn't a parameter in a subroutine call, just a straightforward use as a function.
Won't make the timing difference though.
Eddie 

Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 2150 Location: South Pole, Antarctica

Posted: Thu May 10, 2012 12:57 pm Post subject: Re: 


JohnCampbell wrote: 
Given it takes 287 seconds to do all the real*8 multiplications, what did it do for the other 666 seconds ?

The answer exists implicitly in your question  it's just the devilry
If seriously  here smells something wrong. I'd made a short demo code for developers to look at that.
As to other things  can you divide this into two blocks i.e. make two independent dot product loops out of your currently one? Then as a test we will place them into 2 independent threads to run in two (and then more) cores and see if this will speed up your code twice. I meantime will find the shortest FTN95 demo for multithreading.....Here it is (last code)
http://forums.silverfrost.com/viewtopic.php?t=1133&highlight=multithreading
Last edited by DanRRight on Thu May 10, 2012 1:02 pm; edited 3 times in total 

Back to top 




You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum

Powered by phpBB © 2001, 2005 phpBB Group
