Fortran 2003/2008
 
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Fri Apr 27, 2012 3:18 am

My limited experience with multi-core has not been good. A better solution is probably to use third-party libraries. I have no experience of linking them to FTN95, but Dan has had some success. He has mentioned equation.com, and there are other libraries that claim multi-processor capability; I'm not sure about their general availability.
The communication and process control are a significant problem, and a run-time overhead, for someone new to the field. Existing solutions are available, but I am not sure how accessible they are.
This is an area I want to understand better, and it sounds like it would help with your problem.

John
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

Posted: Sat Apr 28, 2012 1:33 am

Most of the F2003 and F2008 additions to Fortran are cool options. They should attract new programmers and keep existing ones busy, because we all like cool stuff.

Our Fortran "fathers" (excluding a lucky few with supercomputers) did not have several things we all have today: colour graphics and GUIs, multiprocessing, and a lot of memory. So for Fortran to be "cool", its Standard has to address these.

So far it has not, or barely has, so developers added new features at their own risk as non-standard options. By now the Fortran community in general has got:

- GUI
- multiprocessing
- and some have started to add 64-bit addressable space

With FTN95:
- we have the GUI builder ClearWin+ with graphics libraries and OpenGL, and we can also use (as other compilers can) Winteracter or GINO;
- we can use multithreading or third-party parallel libraries for multiprocessing (DO CONCURRENT and coarrays would be better from the point of view of standard-conforming source, but professionally made libraries can still be faster: I, for example, get an 8-10 times speedup with 4 cores);
- but we cannot get past the 4 GB limit with any third-party option, while every entry-level laptop on the market has even more memory than that.

For me, adding 64-bitness would be the most useful next coolness, and its time has come as of today. Down the road more and more people will run into the 4 GB limit, so this is exactly the important feature, and it lies exactly on the right trend.

As to other "coolnesses", we already have a lot of them in this compiler that people do not yet use, or do not even know about. The fun part is that some new F2003/F2008 standard features have existed in this compiler since, I think, Fortran 77 circa 1990: a not exact, but in spirit close, analogue of DO CONCURRENT multithreading, for example, as well as interoperability with C. And it has still more, like partial interoperability with HTML and .NET, and the ability to call system functions, which even F2020 will not have.

It would also be cool to have more networking capabilities beyond the initial ones we have; that would be the fourth major addition our fathers did not have. And object-oriented features. Again, these can still be partially had with outside software if their absence kills you. But there is no way at all to get 64-bit address space with any trickery or third-party software. So if we were to vote, my wish would be 64-bit first, the parallel options of F200x second, and optimisation of execution speed to match the leaders third.


Last edited by DanRRight on Mon Apr 30, 2012 6:48 am; edited 3 times in total
davidb



Joined: 17 Jul 2009
Posts: 560
Location: UK

Posted: Sun Apr 29, 2012 11:51 am

In my experience, Fortran programs that benefit from multi-processor and multi-core machines are easiest to implement using OpenMP compiler directives. Speed-up factors of 3-4 are achievable on quad-core computers, and factors of 9-12 on 2-processor, 6-cores-each computers (i.e. 12 cores), but actual performance depends on the actual Fortran code.

A code written using OpenMP will compile and run under FTN95 but will just use one core/processor. It can therefore be used for development, debugging and testing, with a different compiler used to take advantage of the multiple processors.
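
Purely as an illustration (the routine and array names are invented, not from any real application), a typical OpenMP work-sharing loop has this shape; a compiler without OpenMP support, such as FTN95, simply treats the !$OMP lines as comments:
Code:

!     Sketch only: an OpenMP work-sharing loop.
      SUBROUTINE SCALE_ADD (N, A, B, C)
      INTEGER,   INTENT (IN)    :: N
      REAL*8,    INTENT (IN)    :: A(N), B(N)
      REAL*8,    INTENT (OUT)   :: C(N)
      INTEGER :: I
!$OMP PARALLEL DO
      DO I = 1,N
         C(I) = A(I) + 2.0D0*B(I)         ! each iteration is independent
      END DO
!$OMP END PARALLEL DO
      END SUBROUTINE SCALE_ADD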
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Tue May 01, 2012 7:11 am

David,

My limited experience with OpenMP has not been as successful as you describe. What numerical problem and algorithm have you been solving?
I have been applying OpenMP to a direct solution of linear equations, using a skyline solver, but have had problems with the multi-processor overheads.
I'd expect iterative solvers are better suited to distributed processing and OpenMP.
To improve my attempt, I am told I would need to expand the scope of the OpenMP code to reduce the overhead, but that doesn't appear possible for my existing code. Not as easy as I had hoped.
However, I have found the vector instruction set more effective and easier to implement.
I'd be interested to hear more of your experiences with OpenMP.

John
davidb



Joined: 17 Jul 2009
Posts: 560
Location: UK

Posted: Thu May 03, 2012 5:57 pm

Hi John,

Most of the time I use OpenMP in applications where there is little or no communication between the different threads, like Monte Carlo simulations where a lot of calculations need to be done independently. These codes are close to being "embarrassingly parallel", and performance scales linearly with the number of cores (up to 12; I can't go any higher). However, I have had some success parallelising LR and Cholesky factorisations (a factor of nearly 2 on a dual-core laptop).

Often it's easy to parallelise at the deepest level (the innermost loop), but this won't give you good performance, since the work needs to be significant compared with the overhead of managing the threads. You have to parallelise at an outer level somehow (see the sketch below). The innermost loops are where you may be able to get some extra benefit from vectorisation.
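
As a contrived sketch of that point (invented array names and shapes, not from any real code), moving the directive to the outer loop amortises the thread-management overhead over far more work per thread:
Code:

!     Sketch only: parallelise the OUTER loop so each thread gets a
!     whole row of work, instead of paying thread overhead per call.
!$OMP PARALLEL DO PRIVATE (J)
      DO I = 1,M                       ! outer level: good granularity
         DO J = 1,N                    ! inner level: left serial
            C(I) = C(I) + A(I,J)*X(J)
         END DO
      END DO
!$OMP END PARALLEL DO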

All this means it depends on the algorithm and how easy it is to parallelise. I haven't looked at skyline storage or at solving such equations, unfortunately.
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

Posted: Wed May 09, 2012 5:14 am

JohnCampbell wrote:
I have been applying OpenMP to a direct solution of linear equations, using a skyline solver, but have had problems with the multi-processor overheads.


What are the bandwidth of the matrix and the number of equations? How long does it take to solve your set of equations?
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Wed May 09, 2012 1:56 pm

Dan,

I have been doing finite element modelling since the late 70s. My main mesh generator is Fortran, so my models are fairly regular and not based on 3D solids modelling, which generates much larger models. I have a number of recent 8-node solid models and/or 4-node shell models.
Typically the number of equations is between 150,000 and 200,000, plus one model with 450,000 equations.
The average bandwidth is typically 1,000 to 1,500 (one model at 5,000) and the maximum bandwidth is typically 5,000 to 20,000 equations.

My analysis of the potential for parallel calculations in a direct solver goes something like this. There are basically 3 levels of looping in a direct solver (sketched in code below):
1: loop (j) over equations
2: loop (i) over the bandwidth (connected equations), changing each coefficient a(i,j) in column j
3: loop (the dot_product) over the common bandwidth, calculating the change to a(i,j) due to columns a(:,i) and a(:,j)
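
As an illustration only (assuming a simple dense column profile with a hypothetical column-height array jb(), and omitting the diagonal scaling a real solver needs), the loop nest looks like:
Code:

!     Sketch only: the three-level loop nest of a column-based solver.
!     jb(j) = first active row of column j (hypothetical helper array).
      DO J = 2,N                          ! level 1: over equations
         DO I = JB(J),J-1                 ! level 2: over the band of column j
            K0 = MAX (JB(I), JB(J))       ! common band of columns i and j
            A(I,J) = A(I,J) - DOT_PRODUCT ( A(K0:I-1,I), A(K0:I-1,J) )   ! level 3
         END DO
      END DO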

By placing the inner loop in a parallel construct, there are 150,000 x 1,500 = 225 million initiated threads, each involving a dot product of length 1 to 1,500. This loop calculates the modification to a single element of column (:,j): far too many threads! My attempt (1 year ago) was on this inner loop and did not work well, since I did not have enough processors and there were too many threads.

If you place the inner two loops in a parallel construct, there are 150,000 initiated threads, each involving about 1,500 x 1,500 multiplies. This would work much better, but the coefficients are not independent, as the coefficients of (:,j) are changing during the computation. Using the inner 2 loops, there would be 150,000 threads, each involving about 1,500 x 1,500 / 2 = 1.1 million calculations.
While this would reduce the number of threads by a factor of about 1,500, the timing of the changes to the coefficients needs to be managed.
It is a big step up, but as there are only 8 or 16 processors to share the load, it could possibly be done by doing groups of 8 to 16 equations at a time, rather than the 1,500, and tidying up the left-overs.
I presume this is what equation.com have done, but it is a bit messy, and you have to deal with the left-over equations and the number of processors.

Unfortunately there are too many other projects on the go to dig deeply into this. It also appears that direct solvers have been replaced by iterative solvers in the major commercial packages. All too much to learn. It will have to wait until I need (must have) the solution.

John
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

Posted: Wed May 09, 2012 4:59 pm

This compiler has a high-resolution clock and debugging switches to get the time spent in different parts of the code. Have you done this assessment? The key question is whether the preparation of the matrix or the matrix solution itself is the most time consuming (a minimal timing sketch follows):
- if the preparation is the most time consuming, and you can divide it into independent streams, then it can all be done within FTN95, which does multithreading with .NET, and does it with charm (specifically with ClearWin+; see the examples);
- if the matrix solution is the most time consuming (which is often true), then use the equation.com libraries.
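
As a compiler-agnostic sketch (using only the standard SYSTEM_CLOCK intrinsic rather than the FTN95-specific switches mentioned above, and with hypothetical routine names), the two phases can be separated like this:
Code:

!     Sketch only: timing matrix preparation vs. solution.
      INTEGER :: T0, T1, T2, RATE
      CALL SYSTEM_CLOCK (T0, RATE)
      CALL ASSEMBLE_MATRIX ()              ! hypothetical preparation phase
      CALL SYSTEM_CLOCK (T1)
      CALL SOLVE_EQUATIONS ()              ! hypothetical solution phase
      CALL SYSTEM_CLOCK (T2)
      WRITE (*,*) 'preparation :', REAL(T1-T0)/RATE, ' seconds'
      WRITE (*,*) 'solution    :', REAL(T2-T1)/RATE, ' seconds'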
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Thu May 10, 2012 4:50 am

Dan,

I wrote my first skyline solver in 1976, based on a paper by Graham Powell. I have spent years since then trying to improve the speed of the inner loop. It is 99% of the run time and is basically a Dot_Product. The code could be written as 1 line:

s = 0 ; do i = 1,n ; s = s + a(i)*b(i) ; end do

FTN95 converts this to in-line instructions and does not use a library routine for Dot_Product.
n varies from 1 to 10,000, which makes it hard to suit the different approaches.
The ways modern processors execute this, optimising their handling of the cache and all the other black arts of processor optimisation, do not appear to be exploited by FTN95.
The new vector instruction set can be very effective, but is not available in FTN95.
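
For what it's worth, one conventional and compiler-agnostic trick (offered only as a sketch, not as a claim about what any particular compiler generates) is manual unrolling with independent partial sums, which gives the processor several accumulation chains to pipeline:
Code:

!     Sketch only: 4-way unrolled dot product with independent sums.
      M  = N - MOD (N,4)
      S0 = 0D0 ; S1 = 0D0 ; S2 = 0D0 ; S3 = 0D0
      DO I = 1,M,4
         S0 = S0 + A(I  )*B(I  )
         S1 = S1 + A(I+1)*B(I+1)
         S2 = S2 + A(I+2)*B(I+2)
         S3 = S3 + A(I+3)*B(I+3)
      END DO
      DO I = M+1,N                      ! remainder loop
         S0 = S0 + A(I)*B(I)
      END DO
      S = S0 + S1 + S2 + S3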

The other function, used for back substitution of the load vectors, is basically:
do i = 1,n
a(i) = a(i) - const * column(i)
end do
This has been a really interesting one: over the years I have had combinations of FTN95 and particular processors where this takes 2x, 5x, even 10x as long as with other compilers. Again I think it has something to do with memory alignment or with processor optimisation (pre-fetch or pre-calculation, which I actually know little about; I just assume it takes place, very much like my understanding of dark matter). How can it take 10 times as long as with other compilers?

equation.com is a very good suggestion and looks like an option I should be looking into. Do you have any details of using it with FTN95?

John
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

Posted: Thu May 10, 2012 5:53 am

John, I'm afraid that if the dot product takes 99% of your CPU time then parallel linear algebra will not help you. Otherwise, using the LAIPE parallel library is pretty straightforward: you call its solver subroutine instead of yours and, at link time, add their library to your obj files; that's it. Then you can play with the number of employed processors and see the speedup. For matrix sizes large enough to take more than 0.01-0.1 seconds to solve with regular non-parallel methods, the speedup is often essentially a linear function of the number of CPU cores.
You may also
1) try contacting the author; maybe he will parallelise the dot product for you. But if a(i) and b(i) are known, and the problem is just that the dot products are so large that they take all this time, then
2) do it yourself with multithreading in FTN95, or
3) use IVF or Absoft, wrapped in a DLL, to do just the dot product, and link that to FTN95, because other compilers may do specifically that faster (see the sketch below).
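
For options 2 or 3, and only as a sketch (shown with OpenMP, which FTN95 itself does not support, so this would live in the DLL built by the other compiler; whether it pays off depends on n, since thread start-up dominates for short vectors), a parallel dot product is a one-directive job:
Code:

!     Sketch only: OpenMP reduction for a dot product.
      REAL*8 FUNCTION PDOT (A, B, N)
      INTEGER,   INTENT (IN)    :: N
      REAL*8,    INTENT (IN)    :: A(N), B(N)
      INTEGER :: I
      PDOT = 0D0
!$OMP PARALLEL DO REDUCTION (+:PDOT)
      DO I = 1,N
         PDOT = PDOT + A(I)*B(I)
      END DO
      END FUNCTION PDOT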

How much time does one typical variant usually take to run?
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Thu May 10, 2012 7:27 am

Dan,

You prompted me to look at the way I call the inner loop of the equation reduction: Dot_Product.

Calling REDCOL is effectively the outer loop, called 159,139 times over all equations and blocks.
The loop DO J = JB,JT is the next level, passed through 271,679,207 times, while
the loop inside Dot_Product / VecSum is the inner loop.

I compared two forms of the inner-loop call, using dot_product or my interface routine, and produced a 3x run time:
287 seconds went to 953 seconds, by changing

A(A_ptr_0+J) = A(A_ptr_0+J) - VecSum (A(A_ptr_0+JEQ_bot), B(B_ptr_0+JEQ_bot), JBAND)
to
A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product ( A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) )

Given that it takes 287 seconds to do all the real*8 multiplications, what did it do for the other 666 seconds?
Both were compiled with /opt.

My F77 form of the call is:
Code:

      SUBROUTINE REDCOL_77 (A, B, NB, JB, JT, IEQ_bot)
!
!     Reduces vector 'A' by block 'B'
!     Allows for skyline matrix to be stored in multiple blocks
!     Vector A might or might not be in block B
!
      REAL*8    A(*),                &          ! column to be reduced provided as A(IEQ_bot:)
                B(*)                            ! skyline block of columns
      INTEGER*4 NB(*),               &          ! block structure of B
                JB,                  &          ! first equation to reduce (is in block B)
                JT,                  &          ! last equation to reduce (is in block B and < equation A)
                IEQ_bot                         ! first active equation in column A
!
      INTEGER*4 A_ptr_0, B_ptr_0, JEQ_bot, J, JBOT, JPOINT, JBAND
      REAL*8    VecSum
      EXTERNAL  VecSum                         ! Dot_Product using F77 syntax
!
      IF (JB > JT) RETURN
      JBOT    = NB(1)                           ! first equation column in Block B
      A_ptr_0 = 1-IEQ_bot                       ! virtual pointer to A(0)
!
      DO J = JB,JT                              ! loop over range of equations
         JPOINT  = J - JBOT + 3                 ! pointer to index for column J in block B
         B_ptr_0 = NB(JPOINT) - J               ! virtual pointer to B(0,j)
!
         JEQ_bot = NB(JPOINT-1) - B_ptr_0 + 1   ! first active equation of column J  = J - jband + 1 where jband = NB(JPOINT)-NB(JPOINT-1)
            IF (IEQ_bot > JEQ_bot) JEQ_bot = IEQ_bot   
         JBAND   = J - JEQ_bot                  ! active bandwidth for equation J and equation A
            IF (JBAND < 1) CYCLE
!
         A(A_ptr_0+J) = A(A_ptr_0+J) - VecSum (A(A_ptr_0+JEQ_bot), B(B_ptr_0+JEQ_bot), JBAND)   ! 99% of all the work is done here
!
      END DO
!
      RETURN
      END

      REAL*8 FUNCTION VECSUM (A, B, N)
!
!     Performs a vector dot product  VECSUM =  [A] . [B]
!     account is NOT taken of the zero terms in the vectors
!
      integer*4,                 intent (in)    :: n
      real*8,    dimension(n),   intent (in)    :: a
      real*8,    dimension(n),   intent (in)    :: b
!
      vecsum = dot_product ( a, b )
      RETURN
!
      END

  CPU TIMES FOR EQUATION REDUCTION
     Description            Num     Cpu     I/O     Avg
  1  READ BLOCK A             4    1.00    0.02   0.256
  3  MOVE BLOCK B             3    0.42    0.02   0.146
  4  REDUCE BLOCK A           4  287.26    1.68  72.235
  5  APPLY BLOCK B            3    5.38    0.11   1.829
  6  WRITE BLOCK A            4    1.17    0.06   0.308
  7  DIAGONAL ANALYSIS        2    0.02   -0.01   0.003
 99  TOTAL REDUCTION          4  295.25    1.88  74.282

Last edited by JohnCampbell on Thu May 10, 2012 7:37 am; edited 2 times in total
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Thu May 10, 2012 7:29 am

Dan,

My F90 form of the call is:
Code:

      IF (JB > JT) RETURN
      JBOT    = NB(1)                           ! first equation column in Block B
      A_ptr_0 = 1-IEQ_bot                       ! virtual pointer to A(0)
!
      DO J = JB,JT                              ! loop over range of equations
         JPOINT  = J - JBOT + 3                 ! pointer to index for column J in block B
         B_ptr_0 = NB(JPOINT) - J               ! virtual pointer to B(0,j)
!
         JEQ_bot = NB(JPOINT-1) - B_ptr_0 + 1   ! first active equation of column J  = J - jband + 1 where jband = NB(JPOINT)-NB(JPOINT-1)
            IF (IEQ_bot > JEQ_bot) JEQ_bot = IEQ_bot
         IF (JEQ_bot >= J) CYCLE                 ! active bandwidth for equation J and equation A
!
         A_ptr_t = J - IEQ_bot
         B_ptr_t = NB(JPOINT) - 1
         A_ptr_b = A_ptr_0 + JEQ_bot
         B_ptr_b = B_ptr_0 + JEQ_bot
!
         A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product ( A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) )   ! 99% of all the work is done here
!
      END DO

  CPU TIMES FOR EQUATION REDUCTION
     Description            Num     Cpu     I/O     Avg
  1  READ BLOCK A             4    0.78    0.01   0.197
  3  MOVE BLOCK B             3    0.42    0.01   0.142
  4  REDUCE BLOCK A           4  953.29    6.96 240.063
  5  APPLY BLOCK B            3   17.46    0.17   5.875
  6  WRITE BLOCK A            4    1.03   -0.01   0.254
  7  DIAGONAL ANALYSIS        2    0.00    0.01   0.003
 99  TOTAL REDUCTION          4  972.98    7.13 245.028

     159,139 calls to redcol
 181,398,857 IEQ_bot > JEQ_bot test
       1,126 JEQ_bot >= J test
 271,679,207 Dot_Product / VecSum calls
 there are 4 blocks in the equation set.
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Thu May 10, 2012 9:19 am

Dan,

It gets more frustrating!
I tried to improve the F77-style code by removing the subroutine call. This would be more in keeping with a transfer to OpenMP. The inner 2 loops are now:
Code:
      DO J = JB,JT                              ! loop over range of equations
         JPOINT  = J - JBOT + 3                 ! pointer to index for column J in block B
         B_ptr_0 = NB(JPOINT) - J               ! virtual pointer to B(0,j)
!
         JEQ_bot = NB(JPOINT-1) - B_ptr_0 + 1   ! first active equation of column J
          IF (IEQ_bot > JEQ_bot) JEQ_bot = IEQ_bot   
!
         c = 0
         do k = JEQ_bot,J-1
            c = c + A(A_ptr_0+k) * B(B_ptr_0+k)
         end do
         A(A_ptr_0+J) = A(A_ptr_0+J) - c
!
      END DO


The run time blew out from 295 to 950 seconds.
I did not expect the K loop to fail like that: only 1,126 of the 271,000,000 K-loop passes are null.
Again, this was compiled with /opt.
All of today's runs were on a Core i5 notebook.

How do you explain this performance?
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Thu May 10, 2012 10:27 am

Hi John,

Just a quickie from me. VecSum doesn't need to be declared EXTERNAL, as it isn't being passed as an argument in a subroutine call; it is just used directly as a function.

Won't make the timing difference though.

Eddie
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

Posted: Thu May 10, 2012 12:57 pm

JohnCampbell wrote:

Given that it takes 287 seconds to do all the real*8 multiplications, what did it do for the other 666 seconds?


The answer is implicit in your question - it's just devilry. :)

Seriously, though, something smells wrong here. I'd make a short demo code for the developers to look at.

As to other things: can you divide this into two blocks, i.e. make two independent dot-product loops out of your current single one? Then, as a test, we can place them in 2 independent threads to run on two (and then more) cores and see whether this speeds up your code by a factor of two (see the sketch below). Meanwhile I will find the shortest FTN95 demo for multithreading... Here it is (the last code in the thread):
http://forums.silverfrost.com/viewtopic.php?t=1133&highlight=multithreading
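
The idea, sketched here with OpenMP sections rather than FTN95's own multithreading routines (see the linked demo for those; the variable names are invented), is simply:
Code:

!     Sketch only: split one dot product into two independent halves
!     that can run on two cores, then combine the partial sums.
!$OMP PARALLEL SECTIONS
!$OMP SECTION
      S1 = DOT_PRODUCT ( A(1:N/2),   B(1:N/2)   )
!$OMP SECTION
      S2 = DOT_PRODUCT ( A(N/2+1:N), B(N/2+1:N) )
!$OMP END PARALLEL SECTIONS
      S = S1 + S2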


Last edited by DanRRight on Thu May 10, 2012 1:02 pm; edited 3 times in total