Silverfrost Forums

Welcome to our forums

Fortran 2003/2008

9 May 2012 4:14 #10106

Quoted from JohnCampbell I have been applying OpenMP to a direct solution of linear equations, using a skyline solver, but have had problems with the multi-processor overheads.

What is the bandwidth of the matrix and the number of equations? How much time it takes to solve your set of equations?

9 May 2012 12:56 #10113

Dan,

I have been doing finite element modelling since the late 70's. My main mesh generator is fortran, so my models are fairly regular and not based on 3d solids modelling, which generate much larger models. I have a number of recent 8-node solid models and/or 4-node shell models. Typically the number of equations is between 150,000 and 200,000 plus one of 450,000 equations. The average bandwidth is typically 1,000 to 1,500 (one of 5,000) and the maximum bandwidth typically 5,000 to 20,000 equations.

My analysis of the potential for parallel calculations in a direct solver goes something like: There are basically 3 levels of looping in a direct solver. 1:loop (j) over equations 2:loop (i) over band width (connected equations), changing each coefficient a(i,j) in the column j 3:loop dot_product over common band width, calculating the change to a(i,j) due to columns a(:,i) and a(:,j)

By placing the inner loop in a parallel construct, there are 150,000 x 1,500 = 225 million initiated threads involving a dot product of size 1 to 1,500. This loop calculates the modification to a single element of column (:,j) : too many threads ! My attempt (1 year ago) was on the inner loop and did not work well since I did not have enough processors and there were too many threads.

If you place the inner two loops in a parallel construct, there are 150,000 initiated threads involving 1,500 x 1,500 multiplies. This would work much better, but the coefficients are not independent, as the coefficients of (:,j) are changing during the computation. To use the inner 2 loops, there would be 150,000 threads involving 1,500 x 1,500 /2 = 1.3 million calculations. While this would reduce the number of threads by a factor of 1,000, the timing of the changes to the coefficients needs to be managed. It is a big step up but as there are only 8 or 16 processors to share the load, it could possibly be done by doing groups of 8 to 16 equations at a time, rather than the 1,500, and tidy up the left overs. I presume this is what equation.com have done, but it is a bit messy and you have to deal with the left over equations and number of processors.

Unfortunately there are too many other projects on the go, to go deep into this. It also appears that direct solvers have been replaced by itterative solvers in the major commercial packages. All too much to learn. It will have to wait until I need (must have) the solution.

John

9 May 2012 3:59 #10114

This compiler has high resolution clock and debugging switches to get the time spent in different parts of the code. Have you done this assessment? The key point here: is the preparation of matrix most time consuming or the matrix solution itself?

  • if the first one is most time consuming and you can divide it into independent streams then all can be done just within FTN95. It can do multithreading with NET and does it with the charm (specifically with Clearwin, see the examples)
  • if matrix solution is most time consuming (which is often true ) then use equation dot com libraries
10 May 2012 3:50 #10117

Dan,

I wrote my first skyline solver code in 1976, based on a paper by Graham Powell. I would have spent years trying to find how to improve the speed of the inner loop. It would be 99% of the run time and is basically a Dot_Product. The code could be written as 1 line:

s = 0 ; do i = 1,n ; s = s + a(i)*b(i) ; end do

FTN95 converts it to instructions and does not use a library routine for Dot_Product. n does vary from 1 to 10,000, which makes it hard to suit different approaches. The way modern processors do this and optimises it's handling of the cache and all the other black arts of processor optimisation do not appear to be managed by FTN95. The new vector instruction set can be very effective, but not available in FTN95.

The other function that is used for back substitution of the load vectors is basically: do i = 1,n a(i) = a(i) - const * column(i) end do This has been a real interesting one, as over the years I have had combinations of FTN95 and some processors where this can take 2x, 5x even 10x as long compared to other compilers. Again I think it has something to do with memory alignment or the processor optimisation (pre-fetch or pre-calculation which I actually know little about; I just assume it takes place. Very much like my understanding of dark matter) (how can it take 10 times as long as other compilers ?)

equation dot com is a very good suggestion and looks like an option that I should be looking into. Do you have any details of using this with FTN95 ?

John

10 May 2012 4:53 #10118

John, i afraid if the dot product takes 99% of your CPU time then parallel linear algebra will not help you. Otherwise LAIPE parallel library use is pretty straightforward: you call its solver subroutine instead of yours and at link time add their library to your obj files, that's it. Then you can play changing the number of employed processors and see the speedup. For large matrix sizes which take more then 0.01-0.1 second to solve with regular non-parallel methods the speedup is essentially often just the linear function of amount of CPU cores. You may also

  1. try contacting the author, may be he will parallelize for you the dot product. But if a(i) and b(i) are known and the problem is just the size of dot product matrix so it takes so much time, then
  2. do that yourself with multithreading in FTN95.
  3. use IVF or Absoft wrapped in a DLL to do just the dot product and then link it to FTN95 because other compilers may do specifically that faster

How much time it usually takes for one typical variant to run?

10 May 2012 6:27 (Edited: 10 May 2012 6:37) #10121

Dan,

You prompted me to look at the way I call the inner loop of the equation reduction : Dot_Product.

calling REDCOL is effectively the outer loop, called 159,139 for each equation and each block Loop : DO J = JB,JT is the next loop, passing 271,679,207 times while the loop inside Dot_Product / VecSum is the inner loop

I compared two forms of the inner loop call, using dot_product or my interface routine and produced a 3 x run time. 287 seconds went to 953 seconds, by changing

     A(A_ptr_0+J) = A(A_ptr_0+J) - VecSum (A(A_ptr_0+JEQ_bot), B(B_ptr_0+JEQ_bot), JBAND)

to A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product ( A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) )

Given it takes 287 seconds to do all the real*8 multiplications, what did it do for the other 666 seconds ? both were compiled with /opt

My F77 form of the call is:

      SUBROUTINE REDCOL_77 (A, B, NB, JB, JT, IEQ_bot)
!
!     Reduces vector 'A' by block 'B'
!     Allows for skyline matrix to be stored in multiple blocks
!     Vector A might or might not be in block B
!
      REAL*8    A(*),                &          ! column to be reduced provided as A(IEQ_bot:)
                B(*)                            ! skyline block of columns
      INTEGER*4 NB(*),               &          ! block structure of B
                JB,                  &          ! first equation to reduce (is in block B)
                JT,                  &          ! last equation to reduce (is in block B and < equation A)
                IEQ_bot                         ! first active equation in column A
!
      INTEGER*4 A_ptr_0, B_ptr_0, JEQ_bot, J, JBOT, JPOINT, JBAND
      REAL*8    VecSum
      EXTERNAL  VecSum                         ! Dot_Product using F77 syntax
!
      IF (JB > JT) RETURN
      JBOT    = NB(1)                           ! first equation column in Block B
      A_ptr_0 = 1-IEQ_bot                       ! virtual pointer to A(0)
!
      DO J = JB,JT                              ! loop over range of equations
         JPOINT  = J - JBOT + 3                 ! pointer to index for column J in block B
         B_ptr_0 = NB(JPOINT) - J               ! virtual pointer to B(0,j)
!
         JEQ_bot = NB(JPOINT-1) - B_ptr_0 + 1   ! first active equation of column J  = J - jband + 1 where jband = NB(JPOINT)-NB(JPOINT-1)
            IF (IEQ_bot > JEQ_bot) JEQ_bot = IEQ_bot   
         JBAND   = J - JEQ_bot                  ! active bandwidth for equation J and equation A
            IF (JBAND < 1) CYCLE
!
         A(A_ptr_0+J) = A(A_ptr_0+J) - VecSum (A(A_ptr_0+JEQ_bot), B(B_ptr_0+JEQ_bot), JBAND)   ! 99% of all the work is done here
!
      END DO
!
      RETURN
      END

      REAL*8 FUNCTION VECSUM (A, B, N)
!
!     Performs a vector dot product  VECSUM =  [A] . [B]
!     account is NOT taken of the zero terms in the vectors
!
      integer*4,                 intent (in)    :: n
      real*8,    dimension(n),   intent (in)    :: a
      real*8,    dimension(n),   intent (in)    :: b
!
      vecsum = dot_product ( a, b )
      RETURN
!
      END

  CPU TIMES FOR EQUATION REDUCTION
     Description            Num     Cpu     I/O     Avg
  1  READ BLOCK A             4    1.00    0.02   0.256
  3  MOVE BLOCK B             3    0.42    0.02   0.146
  4  REDUCE BLOCK A           4  287.26    1.68  72.235
  5  APPLY BLOCK B            3    5.38    0.11   1.829
  6  WRITE BLOCK A            4    1.17    0.06   0.308
  7  DIAGONAL ANALYSIS        2    0.02   -0.01   0.003
 99  TOTAL REDUCTION          4  295.25    1.88  74.282

[/code]

10 May 2012 6:29 #10122

Dan,

My F90 form of the call is:

      IF (JB > JT) RETURN
      JBOT    = NB(1)                           ! first equation column in Block B
      A_ptr_0 = 1-IEQ_bot                       ! virtual pointer to A(0)
!
      DO J = JB,JT                              ! loop over range of equations
         JPOINT  = J - JBOT + 3                 ! pointer to index for column J in block B
         B_ptr_0 = NB(JPOINT) - J               ! virtual pointer to B(0,j)
!
         JEQ_bot = NB(JPOINT-1) - B_ptr_0 + 1   ! first active equation of column J  = J - jband + 1 where jband = NB(JPOINT)-NB(JPOINT-1)
            IF (IEQ_bot > JEQ_bot) JEQ_bot = IEQ_bot
         IF (JEQ_bot >= J) CYCLE                 ! active bandwidth for equation J and equation A
!
         A_ptr_t = J - IEQ_bot
         B_ptr_t = NB(JPOINT) - 1
         A_ptr_b = A_ptr_0 + JEQ_bot
         B_ptr_b = B_ptr_0 + JEQ_bot
!
         A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product ( A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) )   ! 99% of all the work is done here
!
      END DO

  CPU TIMES FOR EQUATION REDUCTION
     Description            Num     Cpu     I/O     Avg
  1  READ BLOCK A             4    0.78    0.01   0.197
  3  MOVE BLOCK B             3    0.42    0.01   0.142
  4  REDUCE BLOCK A           4  953.29    6.96 240.063
  5  APPLY BLOCK B            3   17.46    0.17   5.875
  6  WRITE BLOCK A            4    1.03   -0.01   0.254
  7  DIAGONAL ANALYSIS        2    0.00    0.01   0.003
 99  TOTAL REDUCTION          4  972.98    7.13 245.028

     159,139 calls to redcol
 181,398,857 IEQ_bot > JEQ_bot test
       1,126 JEQ_bot >= J test
 271,679,207 Dot_Product / VecSum calls
 there are 4 blocks in the equation set.
10 May 2012 8:19 #10124

Dan,

It gets more frustrating ! I tried to improve the F77 style code by removing the subroutine call. Thios would be more in keeping with transfer to OpenMP. The inner 2 loops are now: DO J = JB,JT ! loop over range of equations JPOINT = J - JBOT + 3 ! pointer to index for column J in block B B_ptr_0 = NB(JPOINT) - J ! virtual pointer to B(0,j) ! JEQ_bot = NB(JPOINT-1) - B_ptr_0 + 1 ! first active equation of column J IF (IEQ_bot > JEQ_bot) JEQ_bot = IEQ_bot
! c = 0 do k = JEQ_bot,J-1 c = c + A(A_ptr_0+k) * B(B_ptr_0+k) end do A(A_ptr_0+J) = A(A_ptr_0+J) - c ! END DO

The run time blew out to from 295 to 950 seconds. I thought the K loop would not have failed like that and there is only 1,126 of 271,000,000 of the K loop runs that are null. Again compiled with /opt All today's runs were on a Core i5 notebook.

How do you explain this performance ?

10 May 2012 9:27 #10125

Hi John,

Just a quickie from me. VecSum doesn't need to be EXTERNAL, as it isn't a parameter in a subroutine call, just a straightforward use as a function.

Won't make the timing difference though.

Eddie

10 May 2012 11:57 (Edited: 10 May 2012 12:02) #10126

Quoted from JohnCampbell

Given it takes 287 seconds to do all the real*8 multiplications, what did it do for the other 666 seconds ?

The answer exists implicitly in your question - it's just the devilry 😃

If seriously - here smells something wrong. I'd made a short demo code for developers to look at that.

As to other things - can you divide this into two blocks i.e. make two independent dot product loops out of your currently one? Then as a test we will place them into 2 independent threads to run in two (and then more) cores and see if this will speed up your code twice. I meantime will find the shortest FTN95 demo for multithreading.....Here it is (last code) https://forums.silverfrost.com/Forum/Topic/882&highlight=multithreading

10 May 2012 1:11 #10127

Dan,

The answer must be to multi-thread the J loop, as this loop group is initiated 160,000 times. The K loop is initiated 270,000,000 times, which is what I previously tried to multi-thread. Too many threads, which is why I did not get a good result. I introduced the K loop so there would be no call inside the J loop. I'll look at your thread example.

Back in 70's, operations counts looked at the number of multiplies and divides, as these were considered the time consuming parts of the run. That is no longer the case !

For this example, I presented 3 alternative codes, 1 that takes 287 seconds and 2 that takes 950 seconds. We can assume the multiplies take only 287 seconds and something else takes 660 seconds. Thats 11 minutes of doing nothing, worse than when windows starts up. Considering : A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product ( A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) ) I thought this might be slow due to the array sections, but the K loop : do k = JEQ_bot,J-1 c = c + A(A_ptr_0+k) * B(B_ptr_0+k) end do This has similar poor performance. ( I thought the K index, common to both A and B would have been efficient ) There must be another reason why it processes so slowly, and it is difficult to understand what it is.

Eddie, I use EXTERNAL like /implicit_none, to document the functions being used. Using functions as an argument has never appealed. I hope you note that the F77 syntax is far better than F90 in the above example !

John

10 May 2012 6:56 #10128

Well, that does not sound good. Processor must spend at least 0.01 second (from my very rough estimation. Could be 10 times more) in the thread for overhead of creating the thread to be small. Think if you can change the way the code works.

But did i understood correctly that the run takes around 10 min to complete on laptop? I'd say that's fast already to think about modifications!

10 May 2012 7:17 #10129

John,

Surely the best way to indicate that something is a function is to include FUNCTION (or _FN) in its name. You don't have to for a subroutine, as it is CALLed.

For things that have to be EXTERNAL like Clearwin callbacks, I found that FTN95 is happy to be EXTERNAL without explicitly giving the INTEGER attribute, although I have begun to adopt the Fortran 90-ish version

INTEGER, EXTERNAL :: etc.

The habit of declaring routines as EXTERNAL when they aren't passed as parameters to a subprogram seems to have originated with named BLOCKDATA routines, as otherwise one could link without them and nobody would be any the wiser ...

Your timing problem looks to me like it is caused by differing numbers of cache misses between the various approaches.

Eddie

11 May 2012 6:13 (Edited: 11 May 2012 6:56) #10130

Quoted from JohnCampbell

By placing the inner loop in a parallel construct, there are 150,000 x 1,500 = 225 million initiated threads involving a dot product of size 1 to 1,500. This loop calculates the modification to a single element of column (:,j) : too many threads ! My attempt (1 year ago) was on the inner loop and did not work well since I did not have enough processors and there were too many threads. John

Forgive me, but this doesn't make sense. The number of threads and the number of iterations are different. You can have a loop doing 16 million iterations, shared across 8 threads with each thread doing 2 million iterations. The work should be shared out amongst the threads.

If you want to parallelise your inner loop (the dot product) you would do something like the following in OpenMP. Note, I haven't coded your offsets to the arrays.

Also I have vectorised contributions from successive 4 array elements. You might change this to 2 depending on the hardware and whether you are working in single or double precision. Keep it at 4 for double precision if your processor supports AVX, set it to 2 if it supports SSE2. (Obviously your compiler needs to support OpenMP and vectorization too!)

A mix of parallelising using threads and parallelising using vectorization within each thread should give good performance.

I have tried this and I get just less than a 4 X speed up with 4 cores when N is 1000000.

! N is number of iterations

NN = MOD(N, 4)
C = SUM(A(1:NN)*B(1:NN))
S = 0.0D0

!$OMP PARALLEL DO PRIVATE(J) REDUCTION(+:S)
DO J=NN+1,N,4
   S = S + SUM(A(J:J+3)*B(J:J+3))
END DO
C = C + S

The !$OMP directive creates the parallel region, and shares out the work of the DO loop amongst the threads. You can limit the number of threads by adding more clauses to the directive, but I have just assumed you want to use the maximum number (the number of cores usually).

11 May 2012 6:40 #10131

I would like to use parallel processing, and after reading this discussion, I try to find information about OpenMP. On the official page http://openmp.org/wp/openmp-compilers/ is a list of compilers supporting OpenMP - FTN95 is missing. Is it nevertheless possible to use OpenMP together with FTN95?

Regards - Wilfried

11 May 2012 7:01 #10132

Hi Wilfried,

Silverfrost's FTN95 doesn't support OpenMP. But it will compile OpenMP programs since it just treats the directives as comments. I use FTN95 to debug my programs and another compiler (Ifort, Gfortran etc) to create executables to run which enables the OpenMP features.

David.

11 May 2012 7:52 #10133

Thank you, David. What a pity... FTN95: No development for 64 bits, no support of parallel processing, no support of OpenMP... It's like a Ferrari with an 80-hp-engine. I really hope that ClearWin will be prepared to work together with other compilers like Intel, then I will start using that compiler for calculations and the very good ClearWin library only for the GUI.

Fortran is the language of choice for number-crunching software, and here we need access to really all of the machines's computing power. For other purposes, I wouldn't use Fortran.

Regards - Wilfried

11 May 2012 8:56 #10134

David,

Thanks for your comments. They have been very helpful.

To explain the calculation I am performing: There are 154,220 equations in a blocked skyline, using 4 blocks. Giving 159,139 column reductions (repeated for columns spanning blocks) These results in 271,679,207 Dot_Product calls, roughly the number of numbers in the matrix profile. The number of iterations in each Dot Product range from 1 to 9,861 0.5% < 10 2% < 34 10% < 100 66% in the range 400-1800 10% > 1,800 I hope this better describes the profile of my problem.

I was under the impression there would be new threads generated for each Dot_Product call if the process was parallelised using OpenMP. I thought the use of vector instructions or AVX is a different approach which could suit yours and my example. I certainly need to understand the difference between these approaches or if they are the same process. My impression is that vector instructions or AVX involve vector instructions in a single processor and is the solution I should be chasing.

Back to the FTN95 solution I have now produced 4 approaches to the 'Dot_Product' call.

  1. The old F77 code, using a DO loop: c = 0 do k = 0,jband-1 c = c + A(A_ptr_b+k) * B(B_ptr_b+k) end do A(A_ptr_0+J) = A(A_ptr_0+J) - c

  2. Using Dot_Product intrinsic with array sections: A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product (A(A_ptr_b:A_ptr_b+jband-1), B(B_ptr_b:B_ptr_b+jband-1) )

  3. A DO loop inside a F77 wrapper A(A_ptr_0+J) = A(A_ptr_0+J) - Vec_Sum (A(A_ptr_b), B(B_ptr_b), JBAND) where REAL8 FUNCTION VEC_SUM (A, B, N) ! integer4, intent (in) :: n real8, dimension(n), intent (in) :: a real8, dimension(n), intent (in) :: b ! c = 0 do k = 1,N c = c + A(k) * B(k) end do vec_sum = c end

  4. A Dot_Product inside a F77 wrapper where REAL8 FUNCTION VEC_SUM (A, B, N) ! integer4, intent (in) :: n real8, dimension(n), intent (in) :: a real8, dimension(n), intent (in) :: b ! vec_sum = dot_product ( a, b ) end

Both options 1 and 2 take about 950 seconds while both wrapper options 3 and 4 take 300 seconds on my core i5 notebook using FTN95 Ver. 6.10 with /opt I'm not sure if 4) becomes 3 or 3) becomes 4 when compiled ? I don't think FTN95 has an optimised Dot_Product library function.

I like Eddie's suggestion that it is due to 'cache misses' but I have no idea of what that could really mean or how to address it. Again, 11 minutes of doing nothing due to a one line syntax change is also difficult to understand.

John

11 May 2012 5:22 (Edited: 12 May 2012 7:54) #10135

Wilfried, Absence of 64bit is really worst miss, i agree. But i hope for the movement here. The OpenMP probably too but you have some choices - you can do some parallelization with FTN95 and do that NATIVELY what no other compiler can do (non-standard way, of course, same as non-standard is with the OpenMP). NVIDIA CUDA or AMD's own parallel compiler libraries support would be probably even better because our graphics cards are actually massively parallel vector supercomputers

John, It's tricky for outsiders to suggest something they do not know in all details. I can only suggest something very general which works if you satisfy some requirements.

Does your method allow division of the workload into Np independent threads - one per processor core - which will do their 1/Np portion of job or not ? In other words, can you fit your code into this parallel program demo where i do non-threaded do loop run and then divide it into two threads with 1/2 job and measure time for all these tasks? You should see speedup almost 100% with two cores (or 4x, 6x .... if you add more threads same way with multicore CPU)

 ! Multithreading example mult04.for 
 ! Dan R Right 2012 
 ! 
 ! Watch in Task Manager how two threads run1 and run2 
 ! grab two processor cores working in parallel 
 ! 
 ! Compilation: ftn95 mult04.FOR /clr /link /free /multi_threaded 
 ! 
 include <clearwin.ins> 
 EXTERNAL run1, run2 
 common /abc_/kEnded1, kended2 
 character*1 cha

!..... straight non-threaded run

 print*,' Run w/o threads started'
 call clock@ (time_start)
    d=2.0 
    do k=1,20 
    do i=1,10000000 
    d=alog(exp(d)) 
    enddo 
    enddo 
 call clock@ (time_finish)
 print*, 'Elapsed time without threads=', time_finish-time_start

!...multithreaded
 call clock@ (time_start)

   CALL CREATE_THREAD@(run1,22) 
   CALL CREATE_THREAD@(run2,23) 

!...wait till both threads end 
 do while (kEnded1==0.or.kEnded2==0)
   call sleep1@(0.1)
 enddo

 call clock@ (time_finish)

 print*, 'Elapsed time with threads=', time_finish-time_start
 print*, 'Enter any key to exit'
 read(*,*) cha
 END 
 !---------------- 
 subroutine run1() 
 include <clearwin.ins> 
 common /abc_/kEnded1, kended2  
    lock;   print*,'Thr.1 started'; end lock 
    kEnded1=0
    d=2.0 
    do k=1,10 
    do i=1,10000000 
    d=alog(exp(d)) 
    enddo 
    enddo 
    kEnded1=1
    lock;   print*,'Thr.1 ended' ; end lock 
 end 
 !---------------- 
 subroutine run2() 
 include <clearwin.ins> 
 common /abc_/kEnded1, kended2  

    lock;   print*,'Thr.2 started' ;end lock 
    kEnded2=0
    d=2.0 
    do k=1,10 
    do i=1,10000000 
    d=alog(exp(d)) 
    enddo 
    enddo     
    kEnded2=1
    lock;   print*,'Thr.2 ended' ;end lock 
 end
12 May 2012 12:05 #10136

Dan,

I've looked at this problem a lot and one issue that worries me is why the poor FTN95 performance in approach 3) compared to 1) Basically changing from:

!3) - good
   do k = 1,n 
      c = c + A(k) * B(k) 
   end do 
to:
! 1) - bad
   do k = 0,n-1 
      c = c + A(A_ptr_b+k) * B(B_ptr_b+k) 
   end do 

This results in a 3-fold increase in run time. I've tested this on both a Core i5 and Xeon processor and got similar results. I'm not sure if 3) has fluked a processor optimisation or 1) has a basic inefficiency, but the results are dramatic. It would be good to have an optimised/vectorised Dot_Product to apply to case 4) and see any possible change or if there is already some processor optimisation. I think the performance, as in case 1) is similar to that seen in the Polyhedron benchmark tests. I'm not sure if this example can be a generalised guide to DO loops with FTN95 or an isolated case.

I am not as enthusiastic about the OpenMP approach, as I think the new vector instructions have much to offer for this problem. Thanks for your comments, as I will consider multiple threading further.

John

Please login to reply.