Silverfrost Forums

Calculating the outer vector product

20 Jun 2018 6:38 #22233

For the outer vector product no intrinsic procedure is available. It can be calculated with two simple DO loops, with the SPREAD function, or perhaps by other techniques. The 'spread solution' is recommended in several references (for instance in Numerical Recipes). In the following test program both methods are tested:

     MODULE DataTypes
        INTEGER  , PARAMETER :: I4B    = SELECTED_INT_KIND (9)
        INTEGER  , PARAMETER :: I2B    = SELECTED_INT_KIND (4)
        INTEGER  , PARAMETER :: I1B    = SELECTED_INT_KIND (2)
        INTEGER  , PARAMETER :: SP     = KIND (1.0)
        INTEGER  , PARAMETER :: dp     = KIND (1.0D0)
        INTEGER  , PARAMETER :: LGT    = KIND (.true.)
      END MODULE DataTypes

    Program Test_OuterProduct

      Use DataTypes
      Implicit None

      Integer , Parameter :: m = 100, n = 100
      Integer             :: i, j
      Integer             :: seed = 123

      Real (dp) , Dimension (m)      :: a
      Real (dp) , Dimension (n)      :: b
      Real (dp) , Dimension (m,n)    :: c

      Real (dp)                      :: harvest
      Real (dp)                      :: time1, time2

!     ---------------------------------------------------------------------------------------------
!                                         Defining a and b      
      
      call random_seed (seed)

      Do i=1,m
          call random_number (harvest)
          a (i) = harvest
      End Do

      Do i=1,n
          call random_number (harvest)
          b (i) = harvest
      End Do

!     ---------------------------------------------------------------------------------------------
!                      Straightforward method for outer product by 2 do loops

      Call Cpu_Time ( time1 )

      Do i = 1, m
        Do j = 1, n
         c(i,j) = a(i) * b(j)
        End Do
      End Do

      Call Cpu_Time ( time2 )

      write (*,*) c(1,1), c(m,n)

      write (*,*)
      write (*,*) 'Processing time for straightforward method was', time2-time1, ' seconds'
      write (*,*)
!     ---------------------------------------------------------------------------------------------
!                          'Recommended' method via the spread procedure
     
      Call Cpu_Time ( time1 )

      c (1:m,1:n) = spread ( a(1:m),dim=2,ncopies=size(b))*spread(b(1:n),dim=1,ncopies=size(a) )

      Call Cpu_Time ( time2 )

      write (*,*) c(1,1), c(m,n)

      write (*,*)
      write (*,*) 'Processing time employing spread           was', time2-time1, ' seconds'
      write (*,*)

      read  (*,*)

    End Program Test_OuterProduct

With vectors a(m) and b(n), both of size 1000, the outer vector product can be calculated by the simple DO loops in a very short time (approx. 15 ms). Using the SPREAD intrinsic procedure of the FTN95 compiler results in drastically longer run times. Even for vector sizes well below 1000, FTN95 needs the following running times:

For m=n=100 approximately 5 s, for 150 approximately 24 s, and for 200 already 76(!) s.

The results of both methods are identical.

I would like to add that this is not the case when running this test with Code::Blocks and the GNU Fortran compiler. With that compiler, both methods calculate the outer product for vectors of size 1000 in approx. 15 ms.

Of course, one could question the method using the spread procedure as being too complicated for the offered advantages (parallel computation possible). Anyhow, I report this for eventual checking. Klaus

20 Jun 2018 8:09 #22234

The key is that for FTN95 you need an explicit call to MATMUL. Presumably gFortran is clever enough to work out that a matrix product is intended.

My guess is that FTN95 makes a literal interpretation and ends up doing the SPREAD for each element in turn.

You will need to check the details but for FTN95 you will need something like

c = matmul(spread(a,dim=2,ncopies=n),spread(b,dim=1,ncopies=m))/n

(Do we divide by n or m?)
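
(For reference only and not tested with FTN95: the two SPREAD results above are both m x n, so that MATMUL only conforms when m = n and each element then equals n*a(i)*b(j), hence the division. A form that conforms for any m and n, sketched below, spreads each vector with ncopies=1 so that MATMUL multiplies an m x 1 matrix by a 1 x n matrix and no division is needed.)

       ! Sketch only: spread a into an m x 1 matrix and b into a 1 x n matrix,
       ! so MATMUL conforms for any m and n and gives c(i,j) = a(i)*b(j).
       c = matmul ( spread (a, dim=2, ncopies=1), spread (b, dim=1, ncopies=1) )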

20 Jun 2018 8:44 #22235

Thank you very much, Paul.

Unfortunately, the last sentence of my post was cut off:

'Of course, one could question the method using the spread procedure as being too complicated for the offered advantages (parallel computation possible). Anyhow, I report this for eventual checking.'

I will have a closer look at your proposal and give feedback. But very likely I will not use the 'spread solution'. My problem was that it was included in another solution, and it took some time to track down the source of the problem.

Klaus

21 Jun 2018 5:34 #22238

Klaus,

I would replace CPU_Time with Elapse_Time, which calls System_Clock or RDTSC_VAL@(). You won't get a meaningful result with CPU_Time for your first test.

I am trying m = n = 1000, which certainly shows that the FTN95 implementation of SPREAD does not work well. (Still hasn't finished! I was expecting a stack overflow error.)

Alternatives could be:

      forall (i=1:m, j=1:n) c(i,j) = a(i)*b(j)

or

      Do j = 1, n
        c(:,j) = a(:) * b(j)
      End Do

You want 'i' to be the inner loop (Fortran stores arrays in column-major order, so c(:,j) is contiguous in memory), which my second option guarantees. With FORALL, the order is uncertain.

Given the uncertainty of the stack requirement for SPREAD, how could this be a recommended approach ?

John

21 Jun 2018 6:03 #22239

Please note that there is no implied problem with SPREAD. In this context, FTN95 does require an explicit call to MATMUL and this seems right to me.

21 Jun 2018 6:57 #22240

Quoted from PaulLaidler:

'FTN95 does require an explicit call to MATMUL and this seems right to me.'

Paul, wouldn't this be a different calculation?

Oops! The following adaptation still has a delay:

       Call elapse_Time ( time1 ) 

!       c (1:m,1:n) = spread ( a(1:m),dim=2,ncopies=size(b) )   &
!                   * spread ( b(1:n),dim=1,ncopies=size(a) ) 
       allocate ( aa(m,n) )
       allocate ( bb(m,n) )
       aa = spread ( a(1:m),dim=2,ncopies=size(b) ) 
       bb = spread ( b(1:n),dim=1,ncopies=size(a) ) 
       c = aa * bb

       Call elapse_Time ( time2 ) 

       write (*,*) c(1,1), c(m,n) 

       write (*,*) 
       write (*,11) time2-time1, ' seconds : Processing time employing spread method : ',n
       write (*,*) 

See the linked test program that shows SPREAD is very slow.

https://www.dropbox.com/s/4r5kzn5onefvkka/kl2.f90?dl=0

21 Jun 2018 9:23 #22241

Dear Paul, dear John,

thank you for your remarks. I am aware of the deficiency of my timing method. I just wanted to demonstrate the difference in computational cost between the two methods: two (or even more) orders of magnitude!

The problem arises when using standard programs from Numerical Recipes. A quick look reveals that 13 procedures make use of the utility function outerprod, which uses the 'spread solution'. Among these are important solvers such as gaussj, ludcmp and svdcmp. Leaving outerprod unmodified ends in disaster for larger systems when using FTN95. Whether one uses DO loops or FORALL loops in various variations instead does not really matter. How would your proposal, Paul, look in detail, also taking different m and n into account? Or is your position that the 'spread solution' is an unnecessary, excessive use of Fortran's language features?
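
For reference, here is a sketch of what that utility looks like (modelled on the Numerical Recipes nrutil routine of the same name; the kind parameter is borrowed from the DataTypes module above, whereas the real nrutil uses its own kinds):

      FUNCTION outerprod (a, b)
        ! Outer product of two vectors via SPREAD, in the style of the
        ! Numerical Recipes utility of the same name (sketch only).
        USE DataTypes, ONLY : dp
        IMPLICIT NONE
        REAL (dp), DIMENSION (:), INTENT (IN)     :: a, b
        REAL (dp), DIMENSION (SIZE (a), SIZE (b)) :: outerprod
        outerprod = SPREAD (a, dim=2, ncopies=SIZE (b)) * &
                    SPREAD (b, dim=1, ncopies=SIZE (a))
      END FUNCTION outerprod

Replacing the body of this one routine with a DO loop (or with the array-syntax form discussed below) would leave the 13 calling procedures untouched.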

I would like to add that any discussion of the formal requirements that intrinsic Fortran procedures must fulfil far exceeds my knowledge. Klaus

21 Jun 2018 9:46 #22242

John

My point is that the program supplied by Klaus does not do what he intends in the calls to SPREAD.

The form

c = SPREAD*SPREAD

does not do a MATMUL on the right and then an element-by-element copy to the left.

A MATMUL is required to get what is intended. I am only guessing, but this seems to me to be the most likely explanation.

21 Jun 2018 10:48 #22243

The computation of the outer product of an m-vector and an n-vector (an m x n matrix result) involves m.n multiplications plus 3 m.n load/store operations.

There is no need for matrix multiplication here. Firstly, the two SPREAD-generated matrices A and B are both m x n, so they cannot be passed to MATMUL unless m = n. Secondly, the number of arithmetic operations needed for a matrix multiplication is an order of magnitude (or more) higher than for forming the outer product. For these reasons, I do not see why MATMUL made its way into the discussion.
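
(As a rough count for m = n = 200: the outer product needs 200 x 200 = 40,000 multiplications, whereas a MATMUL of two 200 x 200 matrices needs about 200**3 = 8,000,000 multiply-add operations.)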

I find that with Gfortran the calculation with the 'straightforward method' can be made about 10 times faster if the nested DO loops are replaced by

      Do j = 1, n 
         c(1:m,j) = a(1:m) * b(j) 
      End Do 

Perhaps this modification (i.e., using vector operations) could be tried with FTN95 as well.

21 Jun 2018 11:01 #22244

The 'spread solution' is not my idea. It has been taken from Numerical Recipes in Fortran 90, Second Edition, where it is discussed in detail (Chapters 22 and 23). Klaus

21 Jun 2018 11:05 #22245

The link I provided demonstrates a number of loop alternatives, including array syntax instead of the inner loop, which was the fastest option with FTN95. As Klaus reported, it also shows that the FTN95 implementation of SPREAD is extremely slow. This needs to be reviewed. I used both Win32 and /64 in the tests.

21 Jun 2018 11:44 #22246

John

As I understand it, the initial program on this thread does not demonstrate that the FTN95 SPREAD function is slow. If you can generate a direct and simple use of SPREAD that shows its slowness then please let me know.

The program as shown calls SPREAD 20000 times (for m = n = 100, that is 10000 elements of c times two SPREAD references each), whilst the alternative with an explicit call to MATMUL calls SPREAD twice.

22 Jun 2018 1:09 #22251

Paul,

As I posted yesterday: See the linked test program that shows SPREAD is very slow.

https://www.dropbox.com/s/4r5kzn5onefvkka/kl2.f90?dl=0

Download it and run this test. I put write statements after each use of SPREAD. Can't you see very clearly how slow SPREAD is for Klaus's modified example ?

Where is the 20,000 times ? It takes ages to do 1 iteration of the following loop. ( 2.6 seconds for a single SPREAD ! )

       allocate ( aa(m,n) )
       allocate ( bb(m,n) )
       Call elapse_Time ( time1 ) 

!       c (1:m,1:n) = spread ( a(1:m),dim=2,ncopies=size(b) )   &
!                   * spread ( b(1:n),dim=1,ncopies=size(a) ) 
           x = del_sec ()
       do k = 1,pass
         aa = spread ( a(1:m),dim=2,ncopies=size(b) ) 
           x = del_sec ()
           write (*,*) x, ' spread aa'
         bb = spread ( b(1:n),dim=1,ncopies=size(a) ) 
           x = del_sec ()
           write (*,*) x, ' spread bb'
         c = aa * bb
           x = del_sec ()
           write (*,*) x, ' multiply c'
       end do

       Call elapse_Time ( time2 ) 

I can run this example in Plato, selecting either FTN95 or gFortran with 'Release x64' (try gFortran first )

Let me know if it cannot be reproduced.

22 Jun 2018 6:11 #22252

John

The way that I tested this was to put a print statement within the code for the SPREAD function. I have done the same for your test program with similar results and this was after changing line 148 to reduce the number of passes in this DO loop to one.

There may be a problem within FTN95 in that it may be treating SPREAD as 'elemental' when it is not. Either way, I have no evidence at the moment that there is a problem with the code for SPREAD.

22 Jun 2018 11:23 (Edited: 22 Jun 2018 2:17) #22254

Here is a simplified version of John Campbell's test program, with only the SPREAD-based version of the outer-product computation included. I hope that this version succeeds in convincing readers that the FTN95 implementation of SPREAD needs improvement.

      MODULE DataTypes 
         INTEGER  , PARAMETER :: I4B    = SELECTED_INT_KIND (9) 
         INTEGER  , PARAMETER :: I2B    = SELECTED_INT_KIND (4) 
         INTEGER  , PARAMETER :: I1B    = SELECTED_INT_KIND (2) 
         INTEGER  , PARAMETER :: SP     = KIND (1.0) 
         INTEGER  , PARAMETER :: dp     = KIND (1.0D0) 
         INTEGER  , PARAMETER :: LGT    = KIND (.true.) 
       END MODULE DataTypes 

  program test

       call Test_OuterProduct (  10, 10)
       call Test_OuterProduct (  50, 50)
       call Test_OuterProduct ( 100,100)
       call Test_OuterProduct ( 200,200)

  end program

     subroutine Test_OuterProduct (m,n)

       Use DataTypes 
       Implicit None 

       Integer :: m,n

       Real (dp) , allocatable, Dimension (:)   :: a (:), b(:), c(:,:)
       Integer :: seed = 123 
       Real (dp) :: time1, time2

       allocate ( a(m), b(n), c(m,n) )
!
       call random_seed (seed) 
       call random_number (a) 
       call random_number (b)

       Call elapse_Time ( time1 ) 

       c = spread ( a,dim=2,ncopies=n ) * spread ( b,dim=1,ncopies=m ) 

       Call elapse_Time ( time2 ) 

       write (*,11) n,time2-time1
    11 format ( I4,f12.6)
     End subroutine Test_OuterProduct 

     Subroutine elapse_time ( sec )
       real*8 sec
       integer*8 tick, rate
       integer*4 :: kk = 0

       call system_clock ( tick, rate )
       if (kk == 0 ) then
         write (*,'(A,T15,I12,//,A,/)') 'System_clock rate =',rate,'   n    time (s)'
         kk = rate
       end if
       sec = dble(tick)/dble(rate)
     end subroutine elapse_time

Here are the outputs from FTN95 8.30 with /opt /64:

System_clock rate =  10000

   n    time (s)

  10    0.004300
  50    0.091000
 100    1.417600
 200   25.589800

and Gfortran 7.2, 64 bit:

System_clock rate =    3579545

   n    time (s)

  10    0.006306
  50    0.000151
 100    0.000203
 200    0.001016

The results were from running the program on a desktop PC with an AMD Athlon 64-4200+, running Windows 10-64.

For n = 200, the run time of the EXE generated by FTN95 is ~25,000 times that of the EXE generated by Gfortran.

22 Jun 2018 11:42 #22255

Paul,

Your suggestion that FTN95 may be treating SPREAD as 'elemental' is certainly consistent with the timing results. I hope it can be fixed.
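
(As a rough check against the timings above: going from n = 100 to n = 200 takes the FTN95 time from about 1.42 s to about 25.6 s, a factor of roughly 18, which is close to 2**4 = 16. That is the (m.n)**2 growth one would expect if the full m x n SPREAD result were being rebuilt for every one of the m.n elements of c.)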

Imagine if Dot_Product had the same problem; imagine the improvement we could get.

John

22 Jun 2018 6:05 #22256

I may be wrong but as far as I know the operator '*' is not overloaded to mean matrix multiplication in this context. If matrix multiplication is intended then FTN95 requires a call to MATMUL. This then results in two calls to SPREAD. Otherwise FTN95 provides a vast number of calls to SPREAD presumably doing the whole calculation repeatedly for each element in turn.

There is no implied problem with SPREAD but FTN95 is not doing what the programmer intended.

23 Jun 2018 12:28 #22257

The Fortran 95 standard documents the properties of the arguments and result value of intrinsic procedures in section 13.14. For each such procedure, the second attribute listed is 'Class'. For functions such as ABS, SQRT, etc., the listed class is 'Elemental Function'. For SPREAD, MATMUL, etc., the listed class is 'Transformational Function'.
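
As a small illustration of the distinction (a fragment only, with the behaviour noted in comments):

      ! Elemental, e.g. ABS: applied independently to each array element,
      ! so the next line prints the array 1.0 2.0 3.0.
      print *, abs ( (/ -1.0, 2.0, -3.0 /) )
      ! Transformational, e.g. SPREAD: the result is constructed from the whole
      ! argument; the next line builds a 3 x 2 array whose every row is 1.0 2.0.
      print *, spread ( (/ 1.0, 2.0 /), dim=1, ncopies=3 )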

From this I conclude that SPREAD is not an elemental function and I think that the implementation of SPREAD in FTN95 should not be forced to be elemental, given the severe performance penalty that has been documented in this thread.

Unless use cases can be presented for which the 'Elemental' attribute is beneficial, that attribute should, perhaps, not be added by a particular compiler vendor.

23 Jun 2018 6:19 #22258

I understand that SPREAD is not elemental and that FTN95 is not doing what the programmer intended. At the moment I am not clear whether FTN95 can do anything to correct this. It may not even be able to provide an error or warning.

I will aim to look at it when I get a moment.
