Silverfrost Forums

Fortran 2003/2008

12 May 2012 7:04 #10137

John

I have noted this for investigation. What is the purpose of A_ptr_b and B_ptr_b? How do they vary?

An initial investigation would take a look at the /explist output for each case, and also possibly...

   do k = 0,n-1
      ia = A_ptr+k
      ib = B_ptr+k 
      c = c + A(ia) * B(ib) 
   end do 
12 May 2012 8:02 #10140

I'd like to remind everyone that a small demo code would simplify understanding of the problem and speed up the solution by a factor of... 1000.

Optimization is the trickiest problem in any language and the main achievement and pride of Fortran. Besides its inherent simplicity and legacy of developed libraries, it's the primary reason Fortran still survives. My suggestion is to ask the hard questions in comp.lang.fortran, where you may have luck finding the ultimate Fortran pro in your subject. Two or three Fortran compiler developers and Fortran standard committee people were often there, and hidden tricks of optimization were always a favorite topic.

13 May 2012 10:13 #10142

Dan,

I've generated a simplified code, based on the direct reduction of a profile solver. It includes 6 inner loop options: the 4 I presented plus 2 variations from Paul's post. It should be compiled with /lgo or /opt /lgo. To change the size of the problem, I'd recommend changing 'max_size'. It can run up to 2gb on Win_7-64 or probably 1.3gb on XP_32. The results show that the wrapper improves by a factor of 3 with FTN95 /opt. I need to test it with other compilers. It is 253 lines of code, so we'll see how much can be copied per post.

John

module big_arrays
   integer*4, parameter :: mb_8     = 1024*1024/8       ! real*8 variables in 1 mb
!
   integer*4, parameter :: max_eqn  = 200000            ! maximum equations possible
   integer*4, parameter :: max_band =   2800            ! maximum profile size
   integer*4, parameter :: max_size = 100*mb_8          ! maximum size profile matrix  (CHANGE THIS)
!
   integer*4, allocatable, dimension(:) :: nb           ! profile storage
   real*8,    allocatable, dimension(:) :: B            ! matrix storage
!  
   integer*4  neq
   integer*4  mstor
   integer*4  avg_band
!
end module big_arrays

use big_arrays
!
   integer*4  ieq, iband, ipos, method, k
   real*8     c_start(6), c_end(6), x, dt, ops_per_sec
   integer*8  num_ops, num_call
!
   character  method_name(6)*30
   data method_name / 'Original DO Loop', 'F90 syntax', 'F77 wrapper for DO Loop',         &
                      'F77 wrapper for Dot_Product', 'Paul option', 'alternate Paul option' /
!
!  generate profile in nb
!
    allocate ( nb(max_eqn) )
! 
    iband = 1
    nb(1) = 1
    nb(2) = 0
    do ieq = 1, max_eqn
       call random_number (x)
       iband = (iband + x * max_band) / 2
       nb(ieq+2) = nb(ieq+1) + min (iband,ieq)
       if (nb(ieq+2) > max_size) exit
    end do
!
!  generate diagonal pointer
    neq    = ieq-1
    do ieq = 0,neq
       nb(ieq+2) = nb(ieq+2) + neq+2
    end do
!
    mstor    = nb(neq+2)
    avg_band = mstor / neq
    x        = real(mstor) / real(mb_8)
!
    write (*,2001) neq,      ' equations'
    write (*,2001) avg_band, ' average profile'
    write (*,2001) mstor,    ' coefficients'
    write (*,2002) x ,       ' storage (mb)'
    write (*,1001) ' '
2001 format (i10,a)
2002 format (f10.2,a)
!
!  initialise the matrix
!
    allocate ( B(mstor) )
    do ieq = 1,neq
       ipos  = ieq+2
       iband = nb(ipos) - nb(ipos-1)
       do k = nb(ipos-1)+1, nb(ipos)
          b(k) = 1
       end do
       b(nb(ipos)) = iband
    end do
!
   do method = 1,6
!
     call cpu_time (c_start(method))
!
     call redblk (B, NB, B, NB, method, num_ops, num_call)
!
     call cpu_time (c_end(method))
     dt = c_end(method)-c_start(method)
     ops_per_sec = dble (num_ops) / dt
     write (*,1001) ' Method', method, '  CPU_time =',dt,'  ops/sec =',ops_per_sec, method_name(method)
   end do
1001 format (a,i2,a,f9.3,a,en12.2,2x,a)
!
   write (*,*) num_call,' Number of dot_product calls'
   write (*,*) num_ops, ' Number of itterations'
!
end

      SUBROUTINE REDBLK (A, NA, B, NB, method, num_ops, num_call)
!
!   reduces block 'A' by block 'B'
!
      REAL*8    A(*), B(*)
      INTEGER*4 NA(*), NB(*), method
      integer*8 num_ops, num_call
      INTRINSIC MIN, MAX
!
      INTEGER*4 I, IBOT, ITOP, IPOINT, A_ptr_0, A_ptr_I, IL,            &
                J, JBOT, JTOP, JB, JT, JI
13 May 2012 10:15 #10143

continued:

!
      IBOT = NA(1)                              ! FIRST EQUATION
      ITOP = IBOT + NA(2) - 3                   ! LAST EQUATION
!
      JBOT = NB(1)
      JTOP = JBOT + NB(2) - 3
      num_ops  = 0
      num_call = 0
!
      DO I = IBOT,ITOP
         IPOINT  = I - IBOT + 3
         A_ptr_0 = NA(IPOINT) - I
!
         IL = I - (NA(IPOINT) - NA(IPOINT-1)) + 1
         JB = MAX (IL+1,JBOT)
         JT = MIN (I-1 ,JTOP)
!
!   patch to test the column reduction, returning stats
!
         call redcol (A(A_ptr_0+IL), B, NB, JB, JT, IL, method, num_ops, num_call)
         IF (IBOT /= JBOT) CYCLE
!
!   NOW REDUCE PIVOT COLUMN ON ITSELF (now reset values)
!
         A_ptr_I = A_ptr_0 + I
         DO J = IL,JT
            JI = A_ptr_0+J
            A(JI) = 1
         END DO
         A(A_ptr_I) = NA(IPOINT) - NA(IPOINT-1)
!
      END DO
!
      RETURN
      END

      SUBROUTINE REDCOL (A, B, NB, JB, JT, IEQ_bot, method, num_ops, num_call)
!
!   Reduces vector 'A' by block 'B'
!   Allows for skyline matrix to be stored in multiple blocks
!   Vector A might or might not be in block B
!
      REAL*8    A(*),                &          ! column to be reduced provided as A(IEQ_bot:)
                B(*)                            ! skyline block of columns
      INTEGER*4 NB(*),               &          ! block structure of B
                JB,                  &          ! first equation to reduce (is in block B)
                JT,                  &          ! last equation to reduce (is in block B and < equation A)
                IEQ_bot,             &          ! first active equation in column A
                method
      integer*8 num_ops, num_call
!
      INTEGER*4 J, JEQ_bot, JBOT, JPOINT, JBAND, k, ia, ib,              &
                A_ptr_0, B_ptr_0, A_ptr_b, B_ptr_b, A_ptr_t, B_ptr_t
      REAL*8    Vec_Sum, Vec_Sum_d, c
      EXTERNAL  Vec_Sum, Vec_Sum_d              ! Dot_Product using F77 syntax
!
      IF (JB > JT) RETURN
      JBOT    = NB(1)                           ! first equation column in Block B
      A_ptr_0 = 1-IEQ_bot                       ! virtual pointer to A(0)
!
      DO J = JB,JT                              ! loop over range of equations
         JPOINT  = J - JBOT + 3                 ! pointer to index for column J in block B
         B_ptr_0 = NB(JPOINT) - J               ! virtual pointer to B(0,j)
!
         JEQ_bot = NB(JPOINT-1) - B_ptr_0 + 1   ! first active equation of column J  = J - jband + 1 where jband = NB(JPOINT)-NB(JPOINT-1)
          IF (IEQ_bot > JEQ_bot) JEQ_bot = IEQ_bot   
!
         JBAND   = J - JEQ_bot                  ! active bandwidth for equation J and equation A
          IF (JBAND < 1) CYCLE
!
         A_ptr_b  = A_ptr_0  + JEQ_bot
         B_ptr_b  = B_ptr_0  + JEQ_bot
         num_ops  = num_ops  + JBAND
         num_call = num_call + 1
!
         select case (method)
!
          case (1) ! if (method == 1) then                    ! Original DO Loop 
            c = 0
            do k = JEQ_bot,J-1
               c = c + A(A_ptr_0+k) * B(B_ptr_0+k)
            end do
            A(A_ptr_0+J) = A(A_ptr_0+J) - c
!
          case (2) ! else if (method == 2) then               ! F90 syntax
!  
            A_ptr_t = A_ptr_b + JBAND - 1
            B_ptr_t = B_ptr_b + JBAND - 1
            A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product (A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) ) 
!
13 May 2012 10:18 #10144

hopefully the end:

          case (3) ! else if (method == 3) then               ! F77 wrapper for DO loop
!
            A(A_ptr_0+J) = A(A_ptr_0+J) - Vec_Sum (A(A_ptr_b), B(B_ptr_b), JBAND)
!
          case (4) ! else if (method == 4) then               ! F77 wrapper for Dot_Product
!
            A(A_ptr_0+J) = A(A_ptr_0+J) - Vec_Sum_d (A(A_ptr_b), B(B_ptr_b), JBAND)
!
          case (5) ! else if (method == 5) then               ! Paul option, simplified subscripts
            c = 0
            do k = JEQ_bot,J-1
               ia = A_ptr_0+k
               ib = B_ptr_0+k
               c = c + A(ia) * B(ib)
            end do
            A(A_ptr_0+J) = A(A_ptr_0+J) - c
!
          case (6) ! else if (method == 6) then               ! alternate Paul option (auto increment ?)
            c = 0
            do k = JEQ_bot,J-1
               c = c + A(A_ptr_b) * B(B_ptr_b)
               A_ptr_b = A_ptr_b+1
               B_ptr_b = B_ptr_b+1
            end do
            A(A_ptr_b) = A(A_ptr_b) - c
!
         end select
!
      END DO
!
      RETURN
      END

      REAL*8 FUNCTION VEC_SUM (A, B, N)
!
!    Performs a vector dot product  VECSUM =  [A] . [B]
!    account is NOT taken of the zero terms in the vectors
!
      real*8    a(*), b(*), c
      integer*4 n, k
!
      c = 0
      do k = 1,N
         c = c + A(k) * B(k)
      end do
      vec_sum = c
!
      RETURN
!
      END

      REAL*8 FUNCTION VEC_SUM_d (A, B, N)
!
!   Performs a vector dot product  VECSUM =  [A] . [B]
!
      integer*4,               intent (in) :: n
      real*8,    dimension(n), intent (in) :: a
      real*8,    dimension(n), intent (in) :: b
!
      vec_sum_d = dot_product ( a, b )
!
      RETURN
!
      END
13 May 2012 11:15 #10145

I have run them with FTN95 and an old Lahey compiler on my Core i5 to produce the following results:

Lahey/Fujitsu Fortran 95 Compiler Release 5.50j
Copyright (C) 1994-2000 Lahey Computer Systems. All rights reserved.
Copyright (C) 1998-2000 FUJITSU LIMITED. All rights reserved.

Options: -o1 -tpp -ntrace -nzero

    10241 equations
     1280 average profile
 13116540 coefficients
   100.07 storage (mb)

Method 1  CPU_time =   12.886  ops/sec =  680.39E+06  Original DO Loop              
Method 2  CPU_time =   29.999  ops/sec =  292.25E+06  F90 syntax                    
Method 3  CPU_time =   10.530  ops/sec =  832.59E+06  F77 wrapper for DO Loop       
Method 4  CPU_time =   10.592  ops/sec =  827.69E+06  F77 wrapper for Dot_Product   
Method 5  CPU_time =   12.636  ops/sec =  693.83E+06  Paul option                   
Method 6  CPU_time =   12.558  ops/sec =  698.14E+06  alternate Paul option         
13085816  Number of dot_product calls
8767279190  Number of itterations


[FTN95/Win32 Ver. 6.10.0 Copyright (c) Silverfrost Ltd 1993-2011]

 /opt /p6 /lgo

Program entered
     10146 equations
      1292 average profile
  13116399 coefficients
    100.07 storage (mb)

 Method 1  CPU_time =   29.469  ops/sec =  300.97E+06  Original DO Loop              
 Method 2  CPU_time =   29.687  ops/sec =  298.75E+06  F90 syntax                    
 Method 3  CPU_time =   10.561  ops/sec =  839.77E+06  F77 wrapper for DO Loop       
 Method 4  CPU_time =   10.561  ops/sec =  839.77E+06  F77 wrapper for Dot_Product   
 Method 5  CPU_time =   29.110  ops/sec =  304.68E+06  Paul option                   
 Method 6  CPU_time =   28.923  ops/sec =  306.65E+06  alternate Paul option         
              13085960 Number of dot_product calls
            8869050198 Number of itterations

The best that either does is 840 million iterations per second, for the F77 wrapper. Both are running on a single processor, without vector instructions. Lahey manages to improve the performance of the other DO loops, but both compilers fail with array sections. Based on this test only, Salford appears able to optimise only simple DO loops. Again, the difference between Method 4 and Method 2 for both compilers is difficult to understand, especially by the old rules of counting floating point operations. There is a lot of nothing going on with Method 2. I'll need to see what performance can be derived from vector or OpenMP instructions. (Any ideas davidb ?)

John

14 May 2012 5:41 #10147

Can anyone run this on other compilers?

14 May 2012 8:02 (Edited: 14 May 2012 8:04) #10149

My laptop is a bit slower than John's Core i5, so I re-ran with Silverfrost FTN95 to provide a reference case.

I used the same options for FTN95 as John. For the other compilers I compiled at -O2.

I haven't had time to investigate using OpenMP yet.

Silverfrost FTN95

     10146 equations
      1292 average profile
  13116399 coefficients
    100.07 storage (mb)

 Method 1  CPU_time =   41.527  ops/sec =  213.57E+06  Original DO Loop
 Method 2  CPU_time =   41.309  ops/sec =  214.70E+06  F90 syntax
 Method 3  CPU_time =   17.659  ops/sec =  502.23E+06  F77 wrapper for DO Loop
 Method 4  CPU_time =   18.112  ops/sec =  489.69E+06  F77 wrapper for Dot_Product
 Method 5  CPU_time =   41.371  ops/sec =  214.38E+06  Paul option
 Method 6  CPU_time =   40.919  ops/sec =  216.75E+06  alternate Paul option
           13085960 Number of dot_product calls
         8869050198 Number of itterations

NAG Fortran Compiler Windows

     10164 equations
      1290 average profile
  13117181 coefficients
    100.08 storage (mb)

 Method 1  CPU_time =   25.725  ops/sec =  344.00E+06  Original DO Loop
 Method 2  CPU_time =   19.141  ops/sec =  462.31E+06  F90 syntax
 Method 3  CPU_time =   19.017  ops/sec =  465.35E+06  F77 wrapper for DO Loop
 Method 4  CPU_time =   19.079  ops/sec =  463.83E+06  F77 wrapper for Dot_Product
 Method 5  CPU_time =   21.747  ops/sec =  406.93E+06  Paul option
 Method 6  CPU_time =   19.843  ops/sec =  445.96E+06  alternate Paul option
           13086688 Number of dot_product calls
         8849310852 Number of itterations

Intel ifort Linux

     10220 equations
      1283 average profile
  13116753 coefficients
    100.07 storage (mb)

 Method 1  CPU_time =   14.473  ops/sec =  607.65E+06  Original DO Loop
 Method 2  CPU_time =   14.485  ops/sec =  607.15E+06  F90 syntax
 Method 3  CPU_time =   14.301  ops/sec =  614.96E+06  F77 wrapper for DO Loop
 Method 4  CPU_time =   14.301  ops/sec =  614.96E+06  F77 wrapper for Dot_Product
 Method 5  CPU_time =   14.417  ops/sec =  610.01E+06  Paul option
 Method 6  CPU_time =   14.441  ops/sec =  609.00E+06  alternate Paul option
           13086093 Number of dot_product calls
         8794467678 Number of itterations

gfortran linux

     10107 equations
      1297 average profile
  13116884 coefficients
    100.07 storage (mb)

 Method 1  CPU_time =   16.669  ops/sec =  534.48E+06  Original DO Loop
 Method 2  CPU_time =   16.597  ops/sec =  536.80E+06  F90 syntax
 Method 3  CPU_time =   16.613  ops/sec =  536.29E+06  F77 wrapper for DO Loop
 Method 4  CPU_time =   16.613  ops/sec =  536.29E+06  F77 wrapper for Dot_Product
 Method 5  CPU_time =   16.605  ops/sec =  536.54E+06  Paul option
 Method 6  CPU_time =   16.637  ops/sec =  535.51E+06  alternate Paul option
           13086562 Number of dot_product calls
         8909344714 Number of itterations

open64 linux

     10187 equations
      1287 average profile
  13116561 coefficients
    100.07 storage (mb)

 Method 1  CPU_time =   16.649  ops/sec =  530.03E+06  Original DO Loop
 Method 2  CPU_time =   16.501  ops/sec =  534.79E+06  F90 syntax
 Method 3  CPU_time =   16.633  ops/sec =  530.54E+06  F77 wrapper for DO Loop
 Method 4  CPU_time =   16.613  ops/sec =  531.18E+06  F77 wrapper for Dot_Product
 Method 5  CPU_time =   16.537  ops/sec =  533.62E+06  Paul option
 Method 6  CPU_time =   17.929  ops/sec =  492.19E+06  alternate Paul option
           13085999 Number of dot_product calls
         8824525977 Number of itterations

14 May 2012 8:04 #10150

and these two:

oracle studio linux

     10101 equations
      1298 average profile
  13116850 coefficients
    100.07 storage (mb)

 Method 1  CPU_time =   41.175  ops/sec =  216.37E+06  Original DO Loop
 Method 2  CPU_time =   40.757  ops/sec =  218.58E+06  F90 syntax
 Method 3  CPU_time =   40.993  ops/sec =  217.33E+06  F77 wrapper for DO Loop
 Method 4  CPU_time =   40.874  ops/sec =  217.96E+06  F77 wrapper for Dot_Product
 Method 5  CPU_time =   41.270  ops/sec =  215.87E+06  Paul option
 Method 6  CPU_time =   41.051  ops/sec =  217.02E+06  alternate Paul option
           13086546 Number of dot_product calls
         8908901037 Number of itterations

Absoft Windows

     10248 equations
      1279 average profile
  13115873 coefficients
    100.07 storage (mb)

 Method 1  CPU_time =   28.205  ops/sec =  310.75E+06  Original DO Loop
 Method 2  CPU_time =   28.642  ops/sec =  306.01E+06  F90 syntax
 Method 3  CPU_time =   33.805  ops/sec =  259.27E+06  F77 wrapper for DO Loop
 Method 4  CPU_time =   33.509  ops/sec =  261.56E+06  F77 wrapper for Dot_Product
 Method 5  CPU_time =   35.740  ops/sec =  245.24E+06  Paul option
 Method 6  CPU_time =   37.612  ops/sec =  233.03E+06  alternate Paul option
           13085128 Number of dot_product calls
         8764684555 Number of itterations

15 May 2012 12:52 (Edited: 15 May 2012 1:11) #10151

David,

Thanks for your results. I have tested on a Xeon W3505, which is an old processor, using Salford, Lahey and Intel. Salford and Lahey are similar: some of the Salford optimised methods fail, and the non-optimised ones fail badly. However, for Intel I get a contrary result with the F77 wrapper to the DO loop. This is a worry, as using libraries of basic routines is my typical (80's) coding style.

I have also tested on a Core i5-2540M which supports AVX; however, in a skyline solver 50% of the arguments are not 16-byte aligned. The three options (/o1, /o2 and /QxAVX) show the benefit of vector and AVX instructions. The most reliable method appears to be 4, the wrapper to Dot_Product, as both Salford and Lahey fail on array sections.

Going back to the 80's approach to optimisation, which was to minimise the number of floating point operations, it is amazing that the different options produce such a large spread of run times in comparison to the time required for the floating point operations alone (assuming this to be about 13 seconds on the Xeon). The options that fail (> 20 seconds) are puzzling.
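
For reference, here is a minimal sketch (not part of the posted test program) of how the 16-byte alignment of each column start could be counted, using the nonstandard but widely supported LOC extension; it assumes the module arrays B and nb from the test code above:

   subroutine report_alignment
      use big_arrays
      integer*4  ieq, n_aligned
      integer*8  address
      integer*8, parameter :: align = 16
!
      n_aligned = 0
      do ieq = 1,neq
         address = loc ( B(nb(ieq+1)+1) )             ! first coefficient of column ieq
         if (mod (address, align) == 0) n_aligned = n_aligned + 1
      end do
      write (*,*) n_aligned, ' of', neq, ' column starts are 16-byte aligned'
   end subroutine report_alignment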

The following have been run on a Xeon W3505

  It is now Monday, 14 May 2012 at 11:00:50.973
[FTN95/Win32 Ver. 6.10.0 Copyright (c) Silverfrost Ltd 1993-2011]
ftn95 col_test.f95 /opt /lgo 
ftn95 col_test.f95 /opt /p6 /lgo 
ftn95 col_test.f95 /opt /p6 /SINGLE_Threaded /lgo 
ftn95 col_test.f95 /opt /pe /lgo 
     10146 equations
      1292 average profile
  13116399 coefficients
    100.07 storage (mb)

 Method 1  CPU_time =   37.391  ops/sec =  237.20E+06  Original DO Loop              
 Method 2  CPU_time =   37.219  ops/sec =  238.30E+06  F90 syntax                    
 Method 3  CPU_time =   13.125  ops/sec =  675.74E+06  F77 wrapper for DO Loop       
 Method 4  CPU_time =   13.141  ops/sec =  674.93E+06  F77 wrapper for Dot_Product   
 Method 5  CPU_time =   37.188  ops/sec =  238.50E+06  Paul option                   
 Method 6  CPU_time =   36.797  ops/sec =  241.03E+06  alternate Paul option         
              13085960 Number of dot_product calls
            8869050198 Number of itterations

ftn95 col_test.f95 /p6 /SINGLE_Threaded /lgo 
ftn95 col_test.f95 /p6 /lgo 
ftn95 col_test.f95 /lgo 
     10146 equations
      1292 average profile
  13116399 coefficients
    100.07 storage (mb)

 Method 1  CPU_time =   52.172  ops/sec =  170.00E+06  Original DO Loop              
 Method 2  CPU_time =   55.125  ops/sec =  160.89E+06  F90 syntax                    
 Method 3  CPU_time =   37.484  ops/sec =  236.61E+06  F77 wrapper for DO Loop       
 Method 4  CPU_time =   37.469  ops/sec =  236.71E+06  F77 wrapper for Dot_Product   
 Method 5  CPU_time =   52.641  ops/sec =  168.48E+06  Paul option                   
 Method 6  CPU_time =   44.688  ops/sec =  198.47E+06  alternate Paul option         
              13085960 Number of dot_product calls
            8869050198 Number of itterations

ifort 64 Ver 11.1 Build 20100414
ifort /source:col_test.f95 /free /o1 /Qvec- 
Microsoft (R) Incremental Linker Version 9.00.21022.08
     10220 equations
      1283 average profile
  13116753 coefficients
    100.07 storage (mb)
 
 Method 1  CPU_time =   13.031  ops/sec =  674.88E+06  Original DO Loop              
 Method 2  CPU_time =   12.984  ops/sec =  677.31E+06  F90 syntax                    
 Method 3  CPU_time =   12.781  ops/sec =  688.08E+06  F77 wrapper for DO Loop       
 Method 4  CPU_time =   12.797  ops/sec =  687.24E+06  F77 wrapper for Dot_Product   
 Method 5  CPU_time =   12.906  ops/sec =  681.41E+06  Paul option                   
 Method 6  CPU_time =   14.594  ops/sec =  602.62E+06  alternate Paul option         
              13086093  Number of dot_product calls
            8794467678  Number of itterations
15 May 2012 1:00 #10152

The other results:

ifort /source:col_test.f95 /free /o2 
     10220 equations
      1283 average profile
  13116753 coefficients
    100.07 storage (mb)

 Method 1  CPU_time =    8.781  ops/sec =    1.00E+09  Original DO Loop              
 Method 2  CPU_time =    8.578  ops/sec =    1.03E+09  F90 syntax                    
 Method 3  CPU_time =   12.812  ops/sec =  686.40E+06  F77 wrapper for DO Loop       
 Method 4  CPU_time =    8.547  ops/sec =    1.03E+09  F77 wrapper for Dot_Product   
 Method 5  CPU_time =    8.578  ops/sec =    1.03E+09  Paul option                   
 Method 6  CPU_time =    8.641  ops/sec =    1.02E+09  alternate Paul option         
              13086093  Number of dot_product calls
            8794467678  Number of itterations

lf95 -o1 -tpp -ntrace -nzero col_test.f95 
Lahey/Fujitsu Fortran 95 Compiler Release 5.50j   
    10241 equations
     1280 average profile
 13116540 coefficients
   100.07 storage (mb)

Method 1  CPU_time =   16.859  ops/sec =  520.02E+06  Original DO Loop              
Method 2  CPU_time =   38.328  ops/sec =  228.74E+06  F90 syntax                    
Method 3  CPU_time =   13.203  ops/sec =  664.03E+06  F77 wrapper for DO Loop       
Method 4  CPU_time =   13.219  ops/sec =  663.25E+06  F77 wrapper for Dot_Product   
Method 5  CPU_time =   16.672  ops/sec =  525.87E+06  Paul option                   
Method 6  CPU_time =   16.656  ops/sec =  526.37E+06  alternate Paul option         
13085816  Number of dot_product calls
8767279190  Number of itterations

Results on Core i5-2540M showing the benefit of vector instructions.

/o1 /Qvec-
     10220 equations
      1283 average profile
  13116753 coefficients
    100.07 storage (mb)
 
 Method 1  CPU_time =   10.655  ops/sec =  825.39E+06  Original DO Loop              
 Method 2  CPU_time =   10.436  ops/sec =  842.67E+06  F90 syntax                    
 Method 3  CPU_time =   10.483  ops/sec =  838.91E+06  F77 wrapper for DO Loop       
 Method 4  CPU_time =   10.405  ops/sec =  845.19E+06  F77 wrapper for Dot_Product   
 Method 5  CPU_time =   10.312  ops/sec =  852.87E+06  Paul option                   
 Method 6  CPU_time =   10.764  ops/sec =  817.02E+06  alternate Paul option         
              13086093  Number of dot_product calls
            8794467678  Number of itterations

/o2
     10220 equations
      1283 average profile
  13116753 coefficients
    100.07 storage (mb)
 
 Method 1  CPU_time =    6.412  ops/sec =    1.37E+09  Original DO Loop              
 Method 2  CPU_time =    6.380  ops/sec =    1.38E+09  F90 syntax                    
 Method 3  CPU_time =   10.343  ops/sec =  850.29E+06  F77 wrapper for DO Loop       
 Method 4  CPU_time =    6.380  ops/sec =    1.38E+09  F77 wrapper for Dot_Product   
 Method 5  CPU_time =    6.380  ops/sec =    1.38E+09  Paul option                   
 Method 6  CPU_time =    6.427  ops/sec =    1.37E+09  alternate Paul option         
              13086093  Number of dot_product calls
            8794467678  Number of itterations

/o2 /QxAVX
     10220 equations
      1283 average profile
  13116753 coefficients
    100.07 storage (mb)
 
 Method 1  CPU_time =    6.037  ops/sec =    1.46E+09  Original DO Loop              
 Method 2  CPU_time =    5.944  ops/sec =    1.48E+09  F90 syntax                    
 Method 3  CPU_time =   14.711  ops/sec =  597.82E+06  F77 wrapper for DO Loop       
 Method 4  CPU_time =    5.944  ops/sec =    1.48E+09  F77 wrapper for Dot_Product   
 Method 5  CPU_time =    5.881  ops/sec =    1.50E+09  Paul option                   
 Method 6  CPU_time =    6.068  ops/sec =    1.45E+09  alternate Paul option         
              13086093  Number of dot_product calls
            8794467678  Number of itterations
15 May 2012 4:21 #10153

It is interesting to look at the different trends between options and compilers. A couple that stand out: Intel's poor performance for Method 3 worries me, as this is my preferred style of programming, with libraries of simple routines; and Absoft's Method 2 vs Method 4 is unusual.

What the code is doing for '> 30 seconds' would be worth understanding. Is it a poor instruction set, or code just not suited to the processor 'optimisation'? FTN95 without /opt is a long way off the pace. Is this due to an old x86 instruction set, which hasn't changed since /P6?

I've been looking at this for years, but feel I still don't really understand it. I look forward to any other ideas.

John

15 May 2012 11:54 #10155

Last Friday, Dan wrote in this thread '... you can do some parallelization with FTN95 and do that NATIVELY what no other compiler can do'. Now I noticed something interesting:

I start the task manager and use the fourth tab ('Performance'? / in German 'Leistung'). Here I can see the usage of my 4 processor cores in percent. Before I start my FTN95 program, all of them show nearly zero %. Then I start my program, and immediately all cores are working.

Does this really mean that FTN95 does automatic parallelisation?

Regards - Wilfried

15 May 2012 3:01 (Edited: 15 May 2012 11:41) #10159

The general conclusion is that FTN95 is not optimised for some array operations, since several other compilers clearly optimise all 6 of John's variants and deliver consistently similar speed.

Further optimisation of FTN95, plus adding AVX instructions, may bring a factor of 2 at least. Meanwhile, multithreading may give a factor of 4 on 4 cores (John - think about how to divide the outer loop across 4).

David - was that an older Absoft compiler? Wilfried - you've probably made Silverfrost's day today - their dream came true 😃 You know, a large spontaneous mutation may happen, according to the theory of evolution.

15 May 2012 4:29 #10160

Dan,

Yes. It's a very old version (9.0) of the Absoft compiler. I am sure the newer releases are much faster.

Wilfried,

What you are seeing in Task Manager is the operating system spreading the task of your program across multiple cores and hardware threads. But you don't get more than the equivalent of 100% of one core altogether this way. With 2 cores, you may get 50% on one and 50% on the other.

With a parallel program it is possible to get 100% on each core, or 200% in total. (Actually, you never get exactly 200% for the full run time of the program, but it can be 199% depending on the program.)

General

I don't fully understand the reason for the difference in speed with FTN95 for methods which are essentially the same. The other compilers show more consistent performance. Paul is having a look at it so we should wait for him to do that.

For John's code, it seems that there is not enough work in the dot product to make parallelisation worthwhile. I agree that the best strategy would be to vectorise the dot product somehow. With FTN95 the only way to do this is to use the inline assembler and write SSE2 or AVX machine code. Or use a different compiler that can vectorise the Fortran code automatically. This should buy you a factor of 2 to 4.

It might be possible to parallelise the outer loop and get another factor of 4 (on a 4 core machine), but I have not studied the code enough to know if this is possible.

17 May 2012 1:20 #10172

Dan and David,

The outer loop is basically:

      A_ptr_0 = 1-IEQ_bot                       ! virtual pointer to A(0) 
! 
      DO J = JB,JT                              ! loop over range of equations 
         {calculate JBAND, A_ptr_b:A_ptr_t and B_ptr_b:B_ptr_t}
         c = Dot_Product (A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) )
         A(A_ptr_0+J) = A(A_ptr_0+J) - c
      END DO

As A varies with each iteration and the updated value of A is used in the next iteration, there is a limit to how far this can be multi-threaded. It might be possible by doing something like:

      DO J = JB,JT,4
         {perform 4 blocks simultaneously, then the lower 4x4 triangle last}
      END DO

This is similar to the approach for a multi-block storage requirement.
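
To make that strip idea concrete, here is a toy sketch (invented names and sizes, not the solver code) using a small dense forward substitution: within each strip of 4 equations, the dot products against the equations below the strip are independent of each other, and only the small 4x4 triangle inside the strip stays serial:

   program strip_demo
      implicit none
      integer*4, parameter :: n = 12, nw = 4       ! problem size and strip width
      real*8     L(n,n), rhs(n), y(n), yref(n), c
      integer*4  j, k, j0
!
      call random_number (L)
      call random_number (rhs)
!
!  reference: plain forward substitution with unit diagonal
      do j = 1,n
         c = 0
         do k = 1,j-1
            c = c + L(j,k) * yref(k)
         end do
         yref(j) = rhs(j) - c
      end do
!
!  strip version: process the equations in strips of nw
      y = rhs
      do j0 = 1,n,nw
!        contributions from equations below the strip: these nw dot products
!        are independent of each other (candidates for threads or SIMD)
         do j = j0, min (j0+nw-1, n)
            c = 0
            do k = 1,j0-1
               c = c + L(j,k) * y(k)
            end do
            y(j) = y(j) - c
         end do
!        small lower triangle inside the strip: still serial
         do j = j0, min (j0+nw-1, n)
            c = 0
            do k = j0,j-1
               c = c + L(j,k) * y(k)
            end do
            y(j) = y(j) - c
         end do
      end do
!
      write (*,*) 'max difference =', maxval (abs (y - yref))
   end program strip_demo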

The most interesting idea is from davidb: 'With FTN95 the only way to do this is to use the inline assembler (for Vec_Sum or Dot_Product) and write SSE2 or AVX machine code.' I am not capable of doing this, nor do I know how to test whether the approach is supported by the processor. I'm also trying to understand the issues associated with memory alignment and whether it can be better managed. This may also need to be addressed.

Thanks for the ideas. I hope we can all learn something from this exercise.

John

17 May 2012 1:37 #10173

John, are the threads completely independent in your example above? They do not have to access the same matrix element at the same time, do they?

For your task, with a bandwidth of around 1500, the most suitable hardware would be the latest NVIDIA Kepler (see today's announcement), which has almost exactly that number of CUDA cores. Your dot product will take less time than it takes you to press Enter 😃.

BTW, I hope Silverfrost will add SSE and AVX as an option. Or maybe CUDA too, to make a true killing machine?

17 May 2012 1:56 #10175

Or possibly a library interface of some basic array procedures as a first step. Dot_Product and Vector_A = Vector_A + const * Vector_B would get my vote.

Also, matrix multiplication [A] x [B] or [A]transpose x [B] would be a good start. And put severe limitations on the arguments to suit the optimisation, say limiting them to REAL*8 arguments and no 2D array sections.
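
As an illustration only (vec_axpy is a made-up name, not an existing FTN95 routine), the Vector_A = Vector_A + const * Vector_B kernel suggested above might look like this in the restricted F77 style being proposed:

      SUBROUTINE VEC_AXPY (Y, X, CONST, N)
!
!     Y = Y + CONST * X   for contiguous REAL*8 vectors only
!
      REAL*8    Y(*), X(*), CONST
      INTEGER*4 N, K
!
      DO K = 1,N
         Y(K) = Y(K) + CONST * X(K)
      END DO
!
      RETURN
      END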

Once one of these is working, the others might follow, say as a public domain library of compatible routines for SSE2 or AVX.

Even proving one function works with FTN95 would be a significant step.

John

17 May 2012 9:17 #10180

I have written dot product using the Silverfrost FTN95 inline assembler which uses sse2 instructions to vectorise the contribution from successive sets of four values from each array.

I am still learning and there are probably some optimisations that can be done.

Note that the code only works for arrays whose sizes are exact multiples of 4. This is an interim step. I will modify it soon, to account for more general array sizes.
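
Not the promised general fix, but one possible shape for it (dotprod_any is a hypothetical name): call the asm_dotprod routine shown below on the first 4*(n/4) elements and finish the remainder in plain Fortran:

   real function dotprod_any (v1, v2, n)
      integer n, n4, k
      real v1(*), v2(*), c
      real asm_dotprod
      external asm_dotprod
!
      n4 = (n/4)*4                       ! largest multiple of 4 not exceeding n
      c  = 0.0
      if (n4 > 0) c = asm_dotprod (v1, v2, n4)
      do k = n4+1, n                     ! at most 3 remaining terms
         c = c + v1(k)*v2(k)
      end do
      dotprod_any = c
   end function dotprod_any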

I don't know how fast the code is yet - I haven't timed it.

The code uses single precision and SSE2 but I can easily modify it for double precision later.

Unfortunately I don't have a machine which supports AVX.

I debugged this with FTN95's debugger. The only issue I had is that I can't view any of the SSE registers. (I had to 'poke' values into Fortran variables to see them 😉 )

Don't forget it needs to be compiled in Win32 mode (not .NET).

Anyway here is the code.

function asm_dotprod(v1,v2,n)
   integer n
   real v1(*), v2(*), v(4)
   real asm_dotprod

   integer m
   
   m = 4*n

   ! Assembly code is between code, edoc lines
   code
      xorps xmm0%,xmm0%     ; set xmm0 to zero
    
      mov eax%, =v1        ; address of v1 argument
      mov ecx%, =v2        ; address of v2 argument

      mov edx%, 0          ; Initialise loop counter
   
   10 movups xmm1%, [eax%+edx%]  ; move next 4 reals in v1 into xmm1
      movups xmm2%, [ecx%+edx%]  ; move next 4 reals in v2 into xmm2
      mulps xmm1%, xmm2%         ; multiply values and store in xmm1
      addps xmm0%, xmm1%         ; accumulate sum in xmm0
      add edx%, 16                ; increment counter
      
      cmp edx%, m
      jne $10                    ; conditional branch back.
    
      ! Can't get final reduction to work, so will do this in Fortran for now
      movups v, xmm0%            ; move xmm0 to v array
   edoc
   
   ! Final reduction, the result is the sum of the four values in v
   asm_dotprod = sum(v)

end function asm_dotprod


program anon

   ! Note. asm_dotprod only works with arrays of size exact multiples of 4.
   ! Will fix this soon.
   real a(12), b(12)
   
   a = (/ 2.0, 2.0, 3.0, 4.0, 1.0, 3.0, 2.0, 4.0, 1.0, 2.0, 3.0, 1.5 /)
   b = (/ 2.5, 1.5, 2.4, 3.2, 2.0, 1.5, 2.0, 3.0, 1.5, 2.0, 1.5, 3.0 /)

   print *, 'Here are the test vectors'
   print *, 'a = '
   print *, a
   print *, 'b = '
   print *, b

   print *
   print *, 'This is the correct value for the dot product, calculated using Fortran'
   print *, dot_product(a,b)

   ! Calculate dot product using assembler and sse2
   s = asm_dotprod(a,b,12)

   print *
   print *, 'Here is the value of the dot product calculated using assembly language and sse2'
   print *, s

end program anon

The output is:

Here are the test vectors
a =
    2.00000    2.00000    3.00000    4.00000    1.00000    3.00000    2.00000    4.00000    1.00000    2.00000    3.00000    1.50000
b =
    2.50000    1.50000    2.40000    3.20000    2.00000    1.50000    2.00000    3.00000    1.50000    2.00000    1.50000    3.00000

This is the correct value for the dot product, calculated using Fortran
    65.0000

Here is the value of the dot product calculated using assembly language and sse2
    65.0000
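
For timing, a minimal harness along these lines could be used (invented names and sizes; single precision, with n kept a multiple of 4), comparing the intrinsic against asm_dotprod using cpu_time as in the earlier tests:

   program time_dotprod
      implicit none
      integer, parameter :: n = 4096, repeats = 200000
      real       a(n), b(n), s, t0, t1, t2
      real       asm_dotprod
      external   asm_dotprod
      integer    i
!
      call random_number (a)
      call random_number (b)
      s = 0.0
!
      call cpu_time (t0)
      do i = 1,repeats
         s = s + dot_product (a, b)              ! intrinsic
      end do
      call cpu_time (t1)
      do i = 1,repeats
         s = s + asm_dotprod (a, b, n)           ! sse2 version
      end do
      call cpu_time (t2)
!
      write (*,*) 'checksum          ', s
      write (*,*) 'intrinsic   (s)   ', t1 - t0
      write (*,*) 'asm_dotprod (s)   ', t2 - t1
   end program time_dotprod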

18 May 2012 7:28 #10181

David,

Based on your earlier post, coding of the following form appears to work:

      real*8 function vec_sum_4 (a, b, n)
!
      real*8    a(*), b(*)
      integer*4 n
!
      integer*4 i, nn
      real*8    c
!
      nn = (n/4)*4
      c  = 0
      do i = 1,nn,4
         c = c + sum ( a(i:i+3) * b(i:i+3) )
      end do
!
      if (i==n) c = c + a(n)*b(n)                 ! one trailing term
      if (i<n)  c = c + sum ( a(i:n) * b(i:n) )   ! two or three trailing terms
!
      vec_sum_4 = c
      end

I've tested that it will run, but have not yet checked the results or the suitability for AVX.
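
For what it's worth, a small driver like this (invented, not part of the original post) can compare vec_sum_4 against the intrinsic, with an odd length so the tail handling is exercised:

   program check_vec_sum_4
      implicit none
      integer*4, parameter :: n = 13
      real*8     a(n), b(n)
      real*8     vec_sum_4
      external   vec_sum_4
!
      call random_number (a)
      call random_number (b)
      write (*,*) 'vec_sum_4 =', vec_sum_4 (a, b, n)
      write (*,*) 'intrinsic =', dot_product (a, b)
   end program check_vec_sum_4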

John
