While revamping and modernizing the Netlib Odepack package, we ran into an astounding slowdown when the code was compiled with FTN95 8.91. The code contains two blocks of code in which the SPREAD intrinsic is used. The first instance is
#ifdef SPREAD
allocate (d1v(l-1))
do j = 1, l - 1
r = r*rh
d1v(j) = r
end do
yh(:n,2:l) = yh(:n,2:l)*spread(d1v,dim = 1,ncopies = n)
deallocate (d1v)
#else
do j = 2,l
r = r*rh
do i = 1,n
yh(i,j) = yh(i,j)*r
end do
end do
#endif
and it should be noted that the second alternative (following the '#else') is the F77 way, using double loops. This instance of SPREAD did not cause any noticeable slowdowns when the code was compiled with FTN95.
The second instance involves the product of two SPREADs:
#ifdef SPREAD
yh(:n,:l) = yh(:n,:l) + spread(el(:l),dim = 1,ncopies = n) &
* spread(acor(:n),dim = 2,ncopies = l)
#else
do j = 1,l
do i = 1,n
yh(i,j) = yh(i,j) + el(j)*acor(i)
end do
end do
#endif
Here are some timing results.
Internal timing Shell timing
WITH WITHOUT WITH WITHOUT (-D SPREAD)
COMPILER
LF 0.062 0.047 0.105 0.096
GF 0.046 0.047 0.116 0.142
SF64 8.469 0.094 8.606 0.221
SF32 41.047 0.109 41.134 0.177
ABS64 0.063 0.047 0.124 0.131
ABS32 0.114 0.063 0.129 0.137
LF: Lahey 7.1; GF: Gfortran 11.3; SF64: FTN95 /64; SF32: FTN95
Note the tremendous slowdown shown for SF64 and SF32 in the second column; the use of SPREAD slowed down the code by a factor of 400!
There was an extensive discussion in 2018 of the FTN95 implementation of SPREAD https://forums.silverfrost.com/Forum/Topic/3399 in this forum, at the end of which Paul wrote, 'My understanding of this issue differs from yours. I will take another look at it when I get a moment.'
The source code and batch files to build and reproduce the problem are available at https://www.dropbox.com/s/y0n4r22ioyr7y96/lsodpk.zip?dl=0 .
It should be noted that neither instance of SPREAD involves any matrix multiplication. Readers unused to SPREAD may look at the F77 versions of the code blocks, and note that for the yh array of size M X N there is a need for just M X N multiplications in the first block. In the second block, the need is for M X N multiplications and M X N additions. In neither instance does SPREAD occur inside any loop, so each run through the subroutine containing the code should require only three evaluations of SPREAD.
I am curious to know, in addition: does anyone use SPREAD frequently? I do not, myself, and it was code polishing software that inserted it. It did take some investigation to track the slowdown to SPREAD.
