Silverfrost Forums

Welcome to our forums

Calculating the outer vector product

24 Jun 2018 11:16 #22259

Quoted from PaulLaidler I understand that SPREAD is not elemental and that FTN95 is not doing what the programmer intended. At the moment I am not clear whether FTN95 can do anything to correct this. It may not even be able to provide an error or warning.

Paul, I hope that this problem can be addressed because, otherwise, I regret to say, SPREAD is unusable in FTN95. To see this, please consider the following test program, which contains nothing more than the most basic use of SPREAD -- to construct a square matrix each of whose columns contains a copy of a given vector.

program v2m
! construct matrix A(n,n) with elements A(i,j) = i for i = 1:n, j=1:n
implicit none
integer :: i,k,n
real, allocatable :: v(:),A(:,:)
real :: t1,t2
!
n=50
do k=1,3
   allocate(v(n),A(n,n))
   v = (/ (i, i=1,n) /)
   call cpu_time(t1)
   A = spread(v, DIM=2, NCOPIES = n)
   call cpu_time(t2)
   deallocate(v,A)
   write(*,'(I4,2x,F6.2,1x,A)')n,t2-t1,'s'
   n=n*2
end do
end program

For n = 200, this code took 63 seconds with 32-bit FTN95 and 17 seconds with 64-bit FTN95, both compiled with /opt specified.

24 Jun 2018 2:19 #22260

I never even heard of the SPREAD function before reading this thread, so I have learnt something. As it suffers the same(-ish) fault in the 32-bit compiler version, I wonder just how widely it is used.

While I am an enthusiast for things working as they should (and in this context, standard-conforming must also include working efficiently as well as properly), in practical terms it is a matter of supreme indifference to me, as someone who is rather old-fashioned in programming, whether it works or not. I suspect that I am not entirely alone in this view.

I wondered what people did before such functions were part of Fortran.

I therefore had a go at writing a routine to do pretty much what mecej4 described with his code example. Here it is:

      PROGRAM V2M_GERIATRIC
      DIMENSION A(500,500)
      DO 30 K = 50, 500, 50
      CALL CPU_TIME (T1)
      DO 20 J = 1,K
      DO 10 I = 1,K
      A(I,J)  = I
  10  CONTINUE
  20  CONTINUE
      CALL CPU_TIME (T2)
      WRITE(*,'(I4,F15.8,1X,A)') K, T2-T1, 's'
  30  CONTINUE
      END

Rather interestingly, the whole thing takes, as closely as I can judge, 6 seconds, which is the length of the nag screen introduced into the PE when one uses /LGO in version 8.30! (32 bit only).

I am reminded of the opening line from L. P. Hartley's 'The Go Between': 'The past is a foreign country, they do things differently there.'

Emigrating to the present is not always an advance.

The book is a good read too ...

Eddie

25 Jun 2018 5:53 #22261

Yes there is certainly a problem in FTN95 when calling SPREAD. It works correctly when used within MATMUL but Mecj4's program demonstrates a bug that has only just come to light after more than 20 years.

I have noted that this needs fixing.

25 Jun 2018 8:09 #22262

Quoted from John-Silver Can I cast some doubt on the timings for gfortran aas given at bottom of mecej4's 22 jun comment at the top of this, p. 2 of the post .....

I note that for N=10, ftn95 is 75% of the time of gfortran. Faster !!!!

Now, between N=10 and N=50, gfortran achieves the impossible ..... it REDUCES the time by a factor of 40 from 0.00606 to 0.000151

How can this be so ???

I did notice the inconsistent timings, but investigating them struck me as something that would be definitely off-topic here. Were there a forum for Gfortran, other than Usenet:comp.lang.fortran, this sub-problem could go there.

25 Jun 2018 9:32 #22263

I think that it has taken 20 years to identify the problem with SPREAD indicates how little this function is used. I would prefer to write the code using a DO loop approach, rather than check the documentation for SPREAD, probably each time I reviewed the code. It is good to identify the problem in case it has a more general effect (assumption for what are elemental functions ?) I am not suggesting it is a high priority fix, although we all have different priorities.

25 Jun 2018 11:22 #22264

This bug has now been fixed for the next release of FTN95. It turns out that FTN95 already works correctly for fixed size arrays but not (as here) with ALLOCATE.

With this fix elapsed times are no longer significant.

30 Jun 2018 1:32 #22288

Please see...

https://forums.silverfrost.com/Forum/Topic/3403

1 Jul 2018 7:20 (Edited: 3 Jul 2018 12:11) #22291

The updated FTN95 8.3 EXE and DLLs in the Dropbox download fix the SPREAD problem that was exhibited with my test program v2m (see above).

Thanks to Silverfrost and Paul for the fast response and effective remedy.

I hope that Klaus L. and John Campbell will be able to confirm that the same fixes also work for their test programs as well as Klaus's real application in which matrix outer products are computed.

3 Jul 2018 8:19 #22304

I have copied the files of the beta version into their corresponding directories, but I see no improvement. Neither with Visual Studio 2015 nor with Plato. I will further look for a potential error on my side but the updated files are on their correct place. Could someone who stated that the problem has been fixed just run my little program? That would help me a lot.

Many thanks, Klaus

3 Jul 2018 10:00 #22306

Klaus

There has been no attempt to 'fix' FTN95 with respect the initial program that you posted on this thread.

FTN95 does not process the following line as you would like...

c = spread(a,dim=2,ncopies=n) * spread(b,dim=1,ncopies=m )

If you want to multiply two matrices together then FTN95 expects you to call MATMUL.

c = MATMUL(spread(a,dim=2,ncopies=n), spread(b,dim=1,ncopies=m))
3 Jul 2018 11:15 #22307

Thank you very much, Paul. I misunderstood what was meant by 'the problem had been fixed'.

Your proposal works well for m = n (if divided by m). But it is not the fastest method: the method mentioned also in this thread (to eliminate the inner do loop) is faster by a factor of 2-3. However, for m /= n the two array shapes are non-conformant.

I have rerun the case with Code::Blocks and the GNU compiler. With this compiler, the 'spread solution' is by far the fastest method. No idea why, but obviously both intrinsic spread functions (ftn95/GNU) differ in their conception. As mentioned earlier, I have no insight to get any further.

Klaus

4 Jul 2018 12:09 #22316

There is/was no evidence of a fault in SPREAD. So SPREAD has not been fixed.

5 Jul 2018 3:51 #22327

Paul,

Run mecej4's test program using FTN95 and gFortran. The following adaptation gives an indication that gFortran is running ! I ran it in PLATO selecting Release Win32 then Release x64 : Tools>Options>'Use gFortran/gcc for x64' program v2m ! construct matrix A(n,n) with elements A(i,j) = i for i = 1:n, j=1:n implicit none integer :: i,k,n real, allocatable :: v(:),A(:,:) integer8 :: t1,t2,rate real4 :: sec ! n = 25 do k=1,4 allocate(v(n),A(n,n)) v = (/ (i, i=1,n) /) call system_clock (t1,rate) A = spread(v, DIM=2, NCOPIES = n) call system_clock (t2,rate) deallocate(v,A) sec = real(t2-t1)/real(rate) ; write (,) sec write(,'(I4,2x,F8.4,1x,A)')n,sec,'s' n=n2 end do ! end program

There may not be a fault, but there is definitely a performance problem.

5 Jul 2018 5:30 #22328

John

There was a fault in FTN95 relating to the way it called SPREAD in certain contexts and this was demonstrated in Mecej4's sample program. This has been fixed in the latest beta download. If there is still a performance hit compared with gFortran then it should be negligible.

5 Jul 2018 11:38 #22330

Quoted from PaulLaidler John

There was a fault in FTN95 relating to the way it called SPREAD in certain contexts and this was demonstrated in Mecej4's sample program. This has been fixed in the latest beta download. If there is still a performance hit compared with gFortran then it should be negligible.

Paul, I agree that there is no justification for blaming SPREAD itself. The problem is that FTN95 compiles some Fortran expressions containing SPREAD in such a way that the resulting program is extremely inefficient and slow, because a large number of calls to SPREAD are made where just a single call would suffice.

Let us note at the outset that no matrix multiplication is involved in the following (or in the example codes that were posted earlier). My v2m example was constructed to show the existence of the inefficiency in the simplest way that I could think of. The v8.30.279 beta release fixes that.

Unfortunately, the inefficiency is still a major problem if the expression containing SPREAD involves anything beyond a simple reference to SPREAD. Klaus and John C. provided example codes where the expression was the product of two references to SPREAD. Below I give timings from an adaptation of John's example with that expression split into two statements. In place of

c = spread ( a,dim=2,ncopies=n ) * spread ( b,dim=1,ncopies=m )

write

c = spread ( a,dim=2,ncopies=n )
c = c * spread ( b,dim=1,ncopies=m ) ! This is NOT a matrix multiply operation

Examination of the generated code using /64 /opt /explist shows the problem clearly:

(i) Calculation of each element of the result matrix C involves the MULSD instruction, which occurs only once in the listing; (ii) The MULSD is in a loop, and a call to SPREAD is located in the same loop. As a result, calculating the final result C involves making m X n calls to SPREAD, which is extremely expensive. (iii) A single call to SPREAD should suffice for the second Fortran statement above, since C has already been allocated and initialised in the Fortran statement preceding it. Here are the timing results (2.1 GHz Intel T4300, Windows 10 X64, FTN95 8.30.279)

System_clock rate =  10000

   n    time (s)

  10    0.004900
  50    0.047500
 100    0.697100
 200   11.048800

And, from Gfortran 7.3 with -O2 :

System_clock rate =  1000000000

   n    time (s)

  10    0.001192
  50    0.000013
 100    0.000057
 200    0.000585
5 Jul 2018 4:58 #22331

mecej4

My understanding of this issue differs from yours. I will take another look at it when I get a moment.

Please login to reply.