Fortran 2003/2008 Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8, 9, 10  Next
Author Message
JohnCampbell

Joined: 16 Feb 2006
Posts: 2194
Location: Sydney Posted: Thu May 10, 2012 2:11 pm    Post subject: Dan, The answer must be to multi-thread the J loop, as this loop group is initiated 160,000 times. The K loop is initiated 270,000,000 times, which is what I previously tried to multi-thread. Too many threads, which is why I did not get a good result. I introduced the K loop so there would be no call inside the J loop. I'll look at your thread example. Back in 70's, operations counts looked at the number of multiplies and divides, as these were considered the time consuming parts of the run. That is no longer the case ! For this example, I presented 3 alternative codes, 1 that takes 287 seconds and 2 that takes 950 seconds. We can assume the multiplies take only 287 seconds and something else takes 660 seconds. Thats 11 minutes of doing nothing, worse than when windows starts up. Considering : A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product ( A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) ) I thought this might be slow due to the array sections, but the K loop : do k = JEQ_bot,J-1 c = c + A(A_ptr_0+k) * B(B_ptr_0+k) end do This has similar poor performance. ( I thought the K index, common to both A and B would have been efficient ) There must be another reason why it processes so slowly, and it is difficult to understand what it is. Eddie, I use EXTERNAL like /implicit_none, to document the functions being used. Using functions as an argument has never appealed. I hope you note that the F77 syntax is far better than F90 in the above example ! John   DanRRight Joined: 10 Mar 2008
Posts: 2206
Location: South Pole, Antarctica Posted: Thu May 10, 2012 7:56 pm    Post subject: Well, that does not sound good. Processor must spend at least 0.01 second (from my very rough estimation. Could be 10 times more) in the thread for overhead of creating the thread to be small. Think if you can change the way the code works. But did i understood correctly that the run takes around 10 min to complete on laptop? I'd say that's fast already to think about modifications!   LitusSaxonicum

Joined: 23 Aug 2005
Posts: 2146
Location: Yateley, Hants, UK Posted: Thu May 10, 2012 8:17 pm    Post subject: John, Surely the best way to indicate that something is a function is to include FUNCTION (or _FN) in its name. You don't have to for a subroutine, as it is CALLed. For things that have to be EXTERNAL like Clearwin callbacks, I found that FTN95 is happy to be EXTERNAL without explicitly giving the INTEGER attribute, although I have begun to adopt the Fortran 90-ish version INTEGER, EXTERNAL :: etc. The habit of declaring routines as EXTERNAL when they aren't passed as parameters to a subprogram seems to have originated with named BLOCKDATA routines, as otherwise one could link without them and nobody would be any the wiser ... Your timing problem looks to me like it is caused by differing numbers of cache misses between the various approaches. Eddie   davidb

Joined: 17 Jul 2009
Posts: 557
Location: UK Posted: Fri May 11, 2012 7:13 am    Post subject: Re: JohnCampbell wrote: By placing the inner loop in a parallel construct, there are 150,000 x 1,500 = 225 million initiated threads involving a dot product of size 1 to 1,500. This loop calculates the modification to a single element of column (:,j) : too many threads ! My attempt (1 year ago) was on the inner loop and did not work well since I did not have enough processors and there were too many threads. John

Forgive me, but this doesn't make sense. The number of threads and the number of iterations are different. You can have a loop doing 16 million iterations, shared across 8 threads with each thread doing 2 million iterations. The work should be shared out amongst the threads.

If you want to parallelise your inner loop (the dot product) you would do something like the following in OpenMP. Note, I haven't coded your offsets to the arrays.

Also I have vectorised contributions from successive 4 array elements. You might change this to 2 depending on the hardware and whether you are working in single or double precision. Keep it at 4 for double precision if your processor supports AVX, set it to 2 if it supports SSE2. (Obviously your compiler needs to support OpenMP and vectorization too!)

A mix of parallelising using threads and parallelising using vectorization within each thread should give good performance.

I have tried this and I get just less than a 4 X speed up with 4 cores when N is 1000000.

 Code: ! N is number of iterations NN = MOD(N, 4) C = SUM(A(1:NN)*B(1:NN)) S = 0.0D0 !\$OMP PARALLEL DO PRIVATE(J) REDUCTION(+:S) DO J=NN+1,N,4    S = S + SUM(A(J:J+3)*B(J:J+3)) END DO C = C + S

The !\$OMP directive creates the parallel region, and shares out the work of the DO loop amongst the threads. You can limit the number of threads by adding more clauses to the directive, but I have just assumed you want to use the maximum number (the number of cores usually).
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl

Last edited by davidb on Fri May 11, 2012 7:56 am; edited 2 times in total   Wilfried Linder

Joined: 14 Nov 2007
Posts: 314
Location: Düsseldorf, Germany Posted: Fri May 11, 2012 7:40 am    Post subject: I would like to use parallel processing, and after reading this discussion, I try to find information about OpenMP. On the official page http://openmp.org/wp/openmp-compilers/ is a list of compilers supporting OpenMP - FTN95 is missing. Is it nevertheless possible to use OpenMP together with FTN95? Regards - Wilfried   davidb

Joined: 17 Jul 2009
Posts: 557
Location: UK Posted: Fri May 11, 2012 8:01 am    Post subject: Hi Wilfried, Silverfrost's FTN95 doesn't support OpenMP. But it will compile OpenMP programs since it just treats the directives as comments. I use FTN95 to debug my programs and another compiler (Ifort, Gfortran etc) to create executables to run which enables the OpenMP features. David._________________Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl   Wilfried Linder

Joined: 14 Nov 2007
Posts: 314
Location: Düsseldorf, Germany Posted: Fri May 11, 2012 8:52 am    Post subject: Thank you, David. What a pity... FTN95: No development for 64 bits, no support of parallel processing, no support of OpenMP... It's like a Ferrari with an 80-hp-engine. I really hope that ClearWin will be prepared to work together with other compilers like Intel, then I will start using that compiler for calculations and the very good ClearWin library only for the GUI. Fortran is the language of choice for number-crunching software, and here we need access to really all of the machines's computing power. For other purposes, I wouldn't use Fortran. Regards - Wilfried   JohnCampbell

Joined: 16 Feb 2006
Posts: 2194
Location: Sydney Posted: Fri May 11, 2012 9:56 am    Post subject: David, Thanks for your comments. They have been very helpful. To explain the calculation I am performing: There are 154,220 equations in a blocked skyline, using 4 blocks. Giving 159,139 column reductions (repeated for columns spanning blocks) These results in 271,679,207 Dot_Product calls, roughly the number of numbers in the matrix profile. The number of iterations in each Dot Product range from 1 to 9,861 0.5% < 10 2% < 34 10% < 100 66% in the range 400-1800 10% > 1,800 I hope this better describes the profile of my problem. I was under the impression there would be new threads generated for each Dot_Product call if the process was parallelised using OpenMP. I thought the use of vector instructions or AVX is a different approach which could suit yours and my example. I certainly need to understand the difference between these approaches or if they are the same process. My impression is that vector instructions or AVX involve vector instructions in a single processor and is the solution I should be chasing. Back to the FTN95 solution I have now produced 4 approaches to the "Dot_Product" call. 1) The old F77 code, using a DO loop: c = 0 do k = 0,jband-1 c = c + A(A_ptr_b+k) * B(B_ptr_b+k) end do A(A_ptr_0+J) = A(A_ptr_0+J) - c 2) Using Dot_Product intrinsic with array sections: A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product (A(A_ptr_b:A_ptr_b+jband-1), B(B_ptr_b:B_ptr_b+jband-1) ) 3) A DO loop inside a F77 wrapper A(A_ptr_0+J) = A(A_ptr_0+J) - Vec_Sum (A(A_ptr_b), B(B_ptr_b), JBAND) where REAL*8 FUNCTION VEC_SUM (A, B, N) ! integer*4, intent (in) :: n real*8, dimension(n), intent (in) :: a real*8, dimension(n), intent (in) :: b ! c = 0 do k = 1,N c = c + A(k) * B(k) end do vec_sum = c end 4) A Dot_Product inside a F77 wrapper where REAL*8 FUNCTION VEC_SUM (A, B, N) ! integer*4, intent (in) :: n real*8, dimension(n), intent (in) :: a real*8, dimension(n), intent (in) :: b ! vec_sum = dot_product ( a, b ) end Both options 1 and 2 take about 950 seconds while both wrapper options 3 and 4 take 300 seconds on my core i5 notebook using FTN95 Ver. 6.10 with /opt I'm not sure if 4) becomes 3 or 3) becomes 4 when compiled ? I don't think FTN95 has an optimised Dot_Product library function. I like Eddie's suggestion that it is due to "cache misses" but I have no idea of what that could really mean or how to address it. Again, 11 minutes of doing nothing due to a one line syntax change is also difficult to understand. John   DanRRight Joined: 10 Mar 2008
Posts: 2206
Location: South Pole, Antarctica Posted: Fri May 11, 2012 6:22 pm    Post subject: Wilfried,
Absence of 64bit is really worst miss, i agree. But i hope for the movement here.
The OpenMP probably too but you have some choices - you can do some parallelization with FTN95 and do that NATIVELY what no other compiler can do (non-standard way, of course, same as non-standard is with the OpenMP). NVIDIA CUDA or AMD's own parallel compiler libraries support would be probably even better because our graphics cards are actually massively parallel vector supercomputers

John,
It's tricky for outsiders to suggest something they do not know in all details. I can only suggest something very general which works if you satisfy some requirements.

Does your method allow division of the workload into Np independent threads - one per processor core - which will do their 1/Np portion of job or not ? In other words, can you fit your code into this parallel program demo where i do non-threaded do loop run and then divide it into two threads with 1/2 job and measure time for all these tasks? You should see speedup almost 100% with two cores (or 4x, 6x .... if you add more threads same way with multicore CPU)

Last edited by DanRRight on Sat May 12, 2012 8:54 am; edited 3 times in total   JohnCampbell

Joined: 16 Feb 2006
Posts: 2194
Location: Sydney Posted: Sat May 12, 2012 1:05 am    Post subject: Dan,

I've looked at this problem a lot and one issue that worries me is why the poor FTN95 performance in approach 3) compared to 1)
Basically changing from:
 Code: !3) - good    do k = 1,n       c = c + A(k) * B(k)    end do to: ! 1) - bad    do k = 0,n-1       c = c + A(A_ptr_b+k) * B(B_ptr_b+k)    end do

This results in a 3-fold increase in run time.
I've tested this on both a Core i5 and Xeon processor and got similar results.
I'm not sure if 3) has fluked a processor optimisation or 1) has a basic inefficiency, but the results are dramatic. It would be good to have an optimised/vectorised Dot_Product to apply to case 4) and see any possible change or if there is already some processor optimisation.
I think the performance, as in case 1) is similar to that seen in the Polyhedron benchmark tests.
I'm not sure if this example can be a generalised guide to DO loops with FTN95 or an isolated case.

I am not as enthusiastic about the OpenMP approach, as I think the new vector instructions have much to offer for this problem.

John   PaulLaidler Joined: 21 Feb 2005
Posts: 6456
Location: Salford, UK Posted: Sat May 12, 2012 8:04 am    Post subject: John

I have noted this for investigation. What is the purpose of A_ptr_b and B_ptr_b? How do they vary?

An initial investigate would take a look at the /explist for each case and also possibly...

 Code: do k = 0,n-1       ia = A_ptr+k       ib = B_ptr+k       c = c + A(ia) * B(ib)    end do    DanRRight Joined: 10 Mar 2008
Posts: 2206
Location: South Pole, Antarctica Posted: Sat May 12, 2012 9:02 pm    Post subject: I'd like to remind all that small demo code would simplify understanding of the problem and speed up the solutions by the factor of .... 1000. Optimization is trickiest problem in any language and the main achievement and pride of Fortran. Besides its inherent simplicity and legacy of developed libraries, it's the primary reason Fortran still survives. My suggestion is to ask the hard questions in comp.lang.fortran where you may have a luck finding the ultimate fortran pro in your subject. 2-3 fortran compiler developers and fortran standard committee people were often there and hidden tricks of optimization were always the favorite topic there.   JohnCampbell

Joined: 16 Feb 2006
Posts: 2194
Location: Sydney Posted: Sun May 13, 2012 11:13 am    Post subject: Dan,

I've generated a simplified code, based on the direct reduction of a profile solver. I've included 6 inner loop options, the 4 I presented plus 2 variations of Paul's post.
It should be compiled with /lgo or /opt /lgo.
To change the size of the problem, I'd recommend changing "max_size". It can run up to 2gb on Win_7-64 or probably 1.3gb on XP_32.
The results show that the wrapper improves by a factor of 3 with FTN95 /opt.
I need to test it with other compilers.
253 lines of code, so see how much can be copied per post.

John
 Code: module big_arrays    integer*4, parameter :: mb_8     = 1024*1024/8       ! real*8 variables in 1 mb !    integer*4, parameter :: max_eqn  = 200000            ! maximum equations possible    integer*4, parameter :: max_band =   2800            ! maximum profile size    integer*4, parameter :: max_size = 100*mb_8          ! maximum size profile matrix  (CHANGE THIS) !    integer*4, allocatable, dimension(:) :: nb           ! profile storage    real*8,    allocatable, dimension(:) :: B            ! matrix storage !     integer*4  neq    integer*4  mstor    integer*4  avg_band ! end module big_arrays use big_arrays !    integer*4  ieq, iband, ipos, method, k    real*8     c_start(6), c_end(6), x, dt, ops_per_sec    integer*8  num_ops, num_call !    character  method_name(6)*30    data method_name / 'Original DO Loop', 'F90 syntax', 'F77 wrapper for DO Loop',         &                       'F77 wrapper for Dot_Product', 'Paul option', 'alternate Paul option' / ! !  generate profile in nb !     allocate ( nb(max_eqn) ) !     iband = 1     nb(1) = 1     nb(2) = 0     do ieq = 1, max_eqn        call random_number (x)        iband = (iband + x * max_band) / 2        nb(ieq+2) = nb(ieq+1) + min (iband,ieq)        if (nb(ieq+2) > max_size) exit     end do ! !  generate diagonal pointer     neq    = ieq-1     do ieq = 0,neq        nb(ieq+2) = nb(ieq+2) + neq+2     end do !     mstor    = nb(neq+2)     avg_band = mstor / neq     x        = real(mstor) / real(mb_8) !     write (*,2001) neq,      ' equations'     write (*,2001) avg_band, ' average profile'     write (*,2001) mstor,    ' coefficients'     write (*,2002) x ,       ' storage (mb)'     write (*,1001) ' ' 2001 format (i10,a) 2002 format (f10.2,a) ! !  initialise the matrix !     allocate ( B(mstor) )     do ieq = 1,neq        ipos  = ieq+2        iband = nb(ipos) - nb(ipos-1)        do k = nb(ipos-1)+1, nb(ipos)           b(k) = 1        end do        b(nb(ipos)) = iband     end do !    do method = 1,6 !      call cpu_time (c_start(method)) !      call redblk (B, NB, B, NB, method, num_ops, num_call) !      call cpu_time (c_end(method))      dt = c_end(method)-c_start(method)      ops_per_sec = dble (num_ops) / dt      write (*,1001) ' Method', method, '  CPU_time =',dt,'  ops/sec =',ops_per_sec, method_name(method)    end do 1001 format (a,i2,a,f9.3,a,en12.2,2x,a) !    write (*,*) num_call,' Number of dot_product calls'    write (*,*) num_ops, ' Number of itterations' ! end       SUBROUTINE REDBLK (A, NA, B, NB, method, num_ops, num_call) ! !   reduces block 'A' by block 'B' !       REAL*8    A(*), B(*)       INTEGER*4 NA(*), NB(*), method       integer*8 num_ops, num_call       INTRINSIC MIN, MAX !       INTEGER*4 I, IBOT, ITOP, IPOINT, A_ptr_0, A_ptr_I, IL,            &                 J, JBOT, JTOP, JB, JT, JI   JohnCampbell

Joined: 16 Feb 2006
Posts: 2194
Location: Sydney Posted: Sun May 13, 2012 11:15 am    Post subject: continued
 Code: !       IBOT = NA(1)                               ! FIRST EQUATION       ITOP = IBOT + NA(2) - 3                    ! LAST EQUATION !       JBOT = NB(1)       JTOP = JBOT + NB(2) - 3       num_ops  = 0       num_call = 0 !       DO I = IBOT,ITOP          IPOINT  = I - IBOT + 3          A_ptr_0 = NA(IPOINT) - I !          IL      = I - (NA(IPOINT) - NA(IPOINT-1)) + 1          JB      = MAX (IL+1,JBOT)          JT      = MIN (I-1 ,JTOP) ! !     patch to test the column reduction, returning stats !          call redcol (A(A_ptr_0+IL), B, NB, JB, JT, IL, method, num_ops, num_call)          IF (IBOT /= JBOT) CYCLE ! !      NOW REDUCE PIVOT COLUMN ON ITSELF (now resert values) !          A_ptr_I = A_ptr_0 + I          DO J = IL,JT             JI = A_ptr_0+J             A(JI) = 1          END DO          A(A_ptr_I) = NA(IPOINT) - NA(IPOINT-1) !       END DO !       RETURN       END       SUBROUTINE REDCOL (A, B, NB, JB, JT, IEQ_bot, method, num_ops, num_call) ! !   Reduces vector 'A' by block 'B' !   Allows for skyline matrix to be stored in multiple blocks !   Vector A might or might not be in block B !       REAL*8    A(*),                &          ! column to be reduced provided as A(IEQ_bot:)                 B(*)                            ! skyline block of columns       INTEGER*4 NB(*),               &          ! block structure of B                 JB,                  &          ! first equation to reduce (is in block B)                 JT,                  &          ! last equation to reduce (is in block B and < equation A)                 IEQ_bot,             &          ! first active equation in column A                 method       integer*8 num_ops, num_call !       INTEGER*4 J, JEQ_bot, JBOT, JPOINT, JBAND, k, ia, ib,              &                 A_ptr_0, B_ptr_0, A_ptr_b, B_ptr_b, A_ptr_t, B_ptr_t       REAL*8    Vec_Sum, Vec_Sum_d, c       EXTERNAL  Vec_Sum, Vec_Sum_d              ! Dot_Product using F77 syntax !       IF (JB > JT) RETURN       JBOT    = NB(1)                           ! first equation column in Block B       A_ptr_0 = 1-IEQ_bot                       ! virtual pointer to A(0) !       DO J = JB,JT                              ! loop over range of equations          JPOINT  = J - JBOT + 3                 ! pointer to index for column J in block B          B_ptr_0 = NB(JPOINT) - J               ! virtual pointer to B(0,j) !          JEQ_bot = NB(JPOINT-1) - B_ptr_0 + 1   ! first active equation of column J  = J - jband + 1 where jband = NB(JPOINT)-NB(JPOINT-1)           IF (IEQ_bot > JEQ_bot) JEQ_bot = IEQ_bot    !          JBAND   = J - JEQ_bot                  ! active bandwidth for equation J and equation A           IF (JBAND < 1) CYCLE !          A_ptr_b  = A_ptr_0  + JEQ_bot          B_ptr_b  = B_ptr_0  + JEQ_bot          num_ops  = num_ops  + JBAND          num_call = num_call + 1 !          select case (method) !           case (1) ! if (method == 1) then                    ! Original DO Loop             c = 0             do k = JEQ_bot,J-1                c = c + A(A_ptr_0+k) * B(B_ptr_0+k)             end do             A(A_ptr_0+J) = A(A_ptr_0+J) - c !           case (2) ! else if (method == 2) then               ! F90 syntax !              A_ptr_t = A_ptr_b + JBAND - 1             B_ptr_t = B_ptr_b + JBAND - 1             A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product (A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) ) !   JohnCampbell

Joined: 16 Feb 2006
Posts: 2194
Location: Sydney Posted: Sun May 13, 2012 11:18 am    Post subject: hopefully the end
 Code: case (3) ! else if (method == 3) then               ! F77 wrapper for DO loop !             A(A_ptr_0+J) = A(A_ptr_0+J) - Vec_Sum (A(A_ptr_b), B(B_ptr_b), JBAND) !           case (4) ! else if (method == 4) then               ! F77 wrapper for Dot_Product !             A(A_ptr_0+J) = A(A_ptr_0+J) - Vec_Sum_d (A(A_ptr_b), B(B_ptr_b), JBAND) !           case (5) ! else if (method == 5) then               ! Paul option, simplified subscripts             c = 0             do k = JEQ_bot,J-1                ia = A_ptr_0+k                ib = B_ptr_0+k                c = c + A(ia) * B(ib)             end do             A(A_ptr_0+J) = A(A_ptr_0+J) - c !           case (6) ! else if (method == 6) then               ! alternate Paul option (auto increment ?)             c = 0             do k = JEQ_bot,J-1                c = c + A(A_ptr_b) * B(B_ptr_b)                A_ptr_b = A_ptr_b+1                B_ptr_b = B_ptr_b+1             end do             A(A_ptr_b) = A(A_ptr_b) - c          end select !       END DO !       RETURN       END       REAL*8 FUNCTION VEC_SUM (A, B, N) ! !    Performs a vector dot product  VECSUM =  [A] . [B] !    account is NOT taken of the zero terms in the vectors !       real*8    a(*), b(*), c       integer*4 n, k !       c = 0       do k = 1,N          c = c + A(k) * B(k)       end do       vec_sum = c !       RETURN !       END       REAL*8 FUNCTION VEC_SUM_d (A, B, N) ! !   Performs a vector dot product  VECSUM =  [A] . [B] !       integer*4,               intent (in) :: n       real*8,    dimension(n), intent (in) :: a       real*8,    dimension(n), intent (in) :: b !       vec_sum_d = dot_product ( a, b ) !       RETURN !       END   Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First
 All times are GMT + 1 HourGoto page Previous  1, 2, 3, 4, 5, 6, 7, 8, 9, 10  Next Page 3 of 10