
forums.silverfrost.com Welcome to the Silverfrost forums

View previous topic :: View next topic 
Author 
Message 
KL
Joined: 16 Nov 2009 Posts: 141

Posted: Wed Jun 20, 2018 7:38 am Post subject: Calculating the outer vector product 


For the outer vector product an intrinsic procedure is not available. It can be calculated by two simple do loops, the spread function or may be by other techniques. The "spread solution" is recommended in several references (for instance in Numerical Recipes). In the following test program both methods are tested:
Code: 
MODULE DataTypes
INTEGER , PARAMETER :: I4B = SELECTED_INT_KIND (9)
INTEGER , PARAMETER :: I2B = SELECTED_INT_KIND (4)
INTEGER , PARAMETER :: I1B = SELECTED_INT_KIND (2)
INTEGER , PARAMETER :: SP = KIND (1.0)
INTEGER , PARAMETER :: dp = KIND (1.0D0)
INTEGER , PARAMETER :: LGT = KIND (.true.)
END MODULE DataTypes

Code: 
Program Test_OuterProduct
Use DataTypes
Implicit None
Integer , Parameter :: m = 100, n = 100
Integer :: i, j
Integer :: seed = 123
Real (dp) , Dimension (m) :: a
Real (dp) , Dimension (n) :: b
Real (dp) , Dimension (m,n) :: c
Real (dp) :: harvest
Real (dp) :: time1, time2
! 
! Defining a and b
call random_seed (seed)
Do i=1,m
call random_number (harvest)
a (i) = harvest
End Do
Do i=1,n
call random_number (harvest)
b (i) = harvest
End Do
! 
! Straightforward method for outer product by 2 do loops
Call Cpu_Time ( time1 )
Do i = 1, m
Do j = 1, n
c(i,j) = a(i) * b(j)
End Do
End Do
Call Cpu_Time ( time2 )
write (*,*) c(1,1), c(m,n)
write (*,*)
write (*,*) 'Processing time for straightforward method was', time2time1, ' seconds'
write (*,*)
! 
! "Recommended" method via the spread procedure
Call Cpu_Time ( time1 )
c (1:m,1:n) = spread ( a(1:m),dim=2,ncopies=size(b))*spread(b(1:n),dim=1,ncopies=size(a) )
Call Cpu_Time ( time2 )
write (*,*) c(1,1), c(m,n)
write (*,*)
write (*,*) 'Processing time employing spread was', time2time1, ' seconds'
write (*,*)
read (*,*)
End Program Test_OuterProduct

With vectors a(m) and b(n) of size 1000 (both), the outer vector product can be calculated by the simple do loops in very short time (app. 15 ms). Using the spread intrinsic procedure of the FTN95 compiler results in extensively increasing computer times. For vector sizes much below 1000, FTN95 needs the following running times:
For m=n=100 approximately 5 s, for 150 app. 24 s and for 200 already 76(!) s.
The results of both methods are identical.
I would like to add that this is not the case when running this test case with Code:Blocks and the GNU Fortran compiler. With this compiler, both methods calculate the vector product of size 1000 in app. 15 ms[/b:969912 

Back to top 


PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 6161 Location: Salford, UK

Posted: Wed Jun 20, 2018 9:09 am Post subject: 


The key is that for FTN95 you need an explicit call to MATMUL. Presumably gFortran is clever enough to work out that a matrix product is intended.
My guess is that FTN95 makes a literal interpretation and ends up doing the SPREAD for each element in turn.
You will need to check the details but for FTN95 you will need something like
Code:  c = matmul(spread(a,dim=2,ncopies=n),spread(b,dim=1,ncopies=m))/n 
(Do we divide by n or m?) 

Back to top 


KL
Joined: 16 Nov 2009 Posts: 141

Posted: Wed Jun 20, 2018 9:44 am Post subject: 


Thank you very much, Paul.
Unfortunately, my last sentence of my mail was cutoff:
"Of course, one could question the method using the spread procedure as being too complicated for the offered advantages (parallel computation possible). Anyhow, I report this for eventual checking."
I will have a closer look to your proposal and give feedback. But very likely I will not use the "spread solution". My problem was that this was included in another solution and it took some time to find out the source of the problem.
Klaus 

Back to top 


KL
Joined: 16 Nov 2009 Posts: 141

Posted: Wed Jun 20, 2018 9:45 am Post subject: 


Thank you very much, Paul.
Unfortunately, my last sentence of my mail was cutoff:
"Of course, one could question the method using the spread procedure as being too complicated for the offered advantages (parallel computation possible). Anyhow, I report this for eventual checking."
I will have a closer look to your proposal and give feedback. But very likely I will not use the "spread solution". My problem was that this was included in another solution and it took some time to find out the source of the problem.
Klaus 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 2157 Location: Sydney

Posted: Thu Jun 21, 2018 6:34 am Post subject: 


Klaus,
I would replace CPU_Time with Elapse_Time, which calls System_Clock or RDTSC_VAL@(). You won't get a meaningful result with CPU_Time for your first test.
I am trying m=n=1000, which certainly shows FTN95 implementation of SPREAD does not work well. ( Still hasn't finished ! I was expecting a stack overflow error )
Alternatives could be:
forall (i=1:m, j=1:n) c(i,j) = a(i)*b(j)
or
Do j = 1, n
c(:,j) = a(:) * b(j)
End Do
You want "I" to be the inner loop, which my second option guarantees. With FORALL, the order is uncertain.
Given the uncertainty of the stack requirement for SPREAD, how could this be a recommended approach ?
John 

Back to top 


PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 6161 Location: Salford, UK

Posted: Thu Jun 21, 2018 7:03 am Post subject: 


Please note that there is no implied problem with SPREAD. In this context, FTN95 does require an explicit call to MATMUL and this seems right to me. 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 2157 Location: Sydney

Posted: Thu Jun 21, 2018 7:57 am Post subject: Re: 


PaulLaidler wrote:  FTN95 does require an explicit call to MATMUL and this seems right to me.  Paul, wouldn't this be a different calculation ?
oops ! The following adaptation still has a delay:
Code:  Call elapse_Time ( time1 )
! c (1:m,1:n) = spread ( a(1:m),dim=2,ncopies=size(b) ) &
! * spread ( b(1:n),dim=1,ncopies=size(a) )
allocate ( aa(m,n) )
allocate ( bb(m,n) )
aa = spread ( a(1:m),dim=2,ncopies=size(b) )
bb = spread ( b(1:n),dim=1,ncopies=size(a) )
c = aa * bb
Call elapse_Time ( time2 )
write (*,*) c(1,1), c(m,n)
write (*,*)
write (*,11) time2time1, ' seconds : Processing time employing spread method : ',n
write (*,*)

See the linked test program that shows SPREAD is very slow.
https://www.dropbox.com/s/4r5kzn5onefvkka/kl2.f90?dl=0 

Back to top 


KL
Joined: 16 Nov 2009 Posts: 141

Posted: Thu Jun 21, 2018 10:23 am Post subject: 


Dear Paul, dear John,
thank you for your remarks. I am aware of the deficiency of my time evaluation. I just wanted to demonstrate the difference of computational costs between both methods: 2 (or even more) orders of magnitude!
The problem arises when using standard programs from Numerical Recipes. A quick look reveals that 13 procedures make use of the utility program outerprod, which uses the “spread solution”. Amongst these programs are important solvers such as gaussj, ludcmp and svdcmp. Not modifying the procedure outerprod ends in a disaster for larger systems when using ftn95. Whether one uses do loops, forall loops in various variations instead does not really matter. How would your proposal, Paul, look in detail taking also different m and n into account? Or is your position that the “spread solution” is an unnecessary, excessive use of Fortran rules?
I would like to add that any discussion what formal requirements intrinsic Fortran procedures must fulfill by far exceeds my knowledge.
Klaus 

Back to top 


PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 6161 Location: Salford, UK

Posted: Thu Jun 21, 2018 10:46 am Post subject: 


John
My point is that the program supplied by Klaus does not do what he intends in the calls to SPREAD.
The form
c = SPREAD*SPREAD
does not do a MATMUL on the right and then an elementbyelement copy to the left.
A MATMUL is required to get what is intended. I am only guessing but this seems to me to be the mostly likely explanation. 

Back to top 


mecej4
Joined: 31 Oct 2006 Posts: 1234

Posted: Thu Jun 21, 2018 11:48 am Post subject: 


The computation of the outer product of two m X n matrices involves m.n multiplications plus 3 m.n load/store operations.
There is no need to do matrix multiplication in this. Firstly, the two input matrices A and B cannot be multiplied unless m = n. Secondly, the number of arithmetic operations needed for doing matrix multiplication is an order of magnitude higher than for forming the outer product. For these reasons, I do not see why MATMUL made its way into the discussion.
I find that with Gfortran the calculation with the "straightforward method" can be made about 10 times faster if the nested DO loops are replaced by
Code:  Do j = 1, n
c(1:m,j) = a(1:m) * b(j)
End Do 
Perhaps this modification (i.e., using vector operations) could be tried with FTN95 as well. 

Back to top 


KL
Joined: 16 Nov 2009 Posts: 141

Posted: Thu Jun 21, 2018 12:01 pm Post subject: 


The "spread solution" is not my idea. It has been taken from Numerical Recipes in Fortran 90, Second edition, where it is discussed in detail (Chapter 22 and 23). Klaus 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 2157 Location: Sydney

Posted: Thu Jun 21, 2018 12:05 pm Post subject: 


The link I provided demonstrates a number of loop alternatives, including array syntax instead of the inner loop, which was the fastest option with FTN95.
As Klaus reported, it also shows that FTN95 implementation of SPREAD is extremely slow. This needs to be reviewed. I used both win32 and /64 in tests. 

Back to top 


PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 6161 Location: Salford, UK

Posted: Thu Jun 21, 2018 12:44 pm Post subject: 


John
As I understand it, the initial program on this thread does not demonstrate that the FTN95 SPREAD function is slow. If you can generate a direct and simple use of SPREAD that shows its slowness then please let me know.
The program as shown calls SPREAD 20000 times whilst the alternative with an explicit call to MATMUL calls SPREAD twice. 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 2157 Location: Sydney

Posted: Fri Jun 22, 2018 2:09 am Post subject: 


Paul,
As I posted yesterday: See the linked test program that shows SPREAD is very slow.
https://www.dropbox.com/s/4r5kzn5onefvkka/kl2.f90?dl=0
Download it and run this test. I put write statements after each use of SPREAD.
Can't you see very clearly how slow SPREAD is for Klaus's modified example ?
Where is the 20,000 times ? It takes ages to do 1 iteration of the following loop. ( 2.6 seconds for a single SPREAD ! )
Code:  allocate ( aa(m,n) )
allocate ( bb(m,n) )
Call elapse_Time ( time1 )
! c (1:m,1:n) = spread ( a(1:m),dim=2,ncopies=size(b) ) &
! * spread ( b(1:n),dim=1,ncopies=size(a) )
x = del_sec ()
do k = 1,pass
aa = spread ( a(1:m),dim=2,ncopies=size(b) )
x = del_sec ()
write (*,*) x, ' spread aa'
bb = spread ( b(1:n),dim=1,ncopies=size(a) )
x = del_sec ()
write (*,*) x, ' spread bb'
c = aa * bb
x = del_sec ()
write (*,*) x, ' multiply c'
end do
Call elapse_Time ( time2 )

I can run this example in Plato, selecting either FTN95 or gFortran with "Release x64" (try gFortran first )
Let me know if it can not be reproduced. 

Back to top 


PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 6161 Location: Salford, UK

Posted: Fri Jun 22, 2018 7:11 am Post subject: 


John
The way that I tested this was to put a print statement within the code for the SPREAD function. I have done the same for your test program with similar results and this was after changing line 148 to reduce the number of passes in this DO loop to one.
There may be a problem within FTN95 in that it may be treating SPREAD as "elemental" when it is not. Either way, I have no evidence at the moment that there is a problem with the code for SPREAD. 

Back to top 




You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum

Powered by phpBB © 2001, 2005 phpBB Group
