Quoted from PaulLaidler
John
There was a fault in FTN95 relating to the way it called SPREAD in certain contexts and this was demonstrated in Mecej4's sample program. This has been fixed in the latest beta download. If there is still a performance hit compared with gFortran then it should be negligible.
Paul, I agree that there is no justification for blaming SPREAD itself. The problem is that FTN95 compiles some Fortran expressions containing SPREAD in such a way that the resulting program is extremely inefficient and slow, because a large number of calls to SPREAD are made where just a single call would suffice.
Let us note at the outset that no matrix multiplication is involved in the following (or in the example codes that were posted earlier). My v2m example was constructed to show the existence of the inefficiency in the simplest way that I could think of. The v8.30.279 beta release fixes that.
Unfortunately, the inefficiency is still a major problem if the expression containing SPREAD involves anything beyond a simple reference to SPREAD. Klaus and John C. provided example codes where the expression was the product of two references to SPREAD. Below I give timings from an adaptation of John's example with that expression split into two statements. In place of
c = spread ( a,dim=2,ncopies=n ) * spread ( b,dim=1,ncopies=m )
write
c = spread ( a,dim=2,ncopies=n )
c = c * spread ( b,dim=1,ncopies=m ) ! This is NOT a matrix multiply operation
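To make the comment above concrete, here is a minimal sketch (not John's original program) that checks that the one-statement and two-statement forms both produce c(i,j) = a(i)*b(j), i.e. an element-wise product of two SPREAD results and not a matrix multiply. The program name, the sizes m = 3, n = 4, and the test values are assumptions chosen only for illustration.

```fortran
program spread_split_check
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: m = 3, n = 4        ! illustrative sizes (assumed)
  real(dp) :: a(m), b(n), c1(m,n), c2(m,n)
  integer :: i, j

  ! Small integer-valued data so that the comparisons below are exact
  a = (/ (real(i, dp), i = 1, m) /)
  b = (/ (10.0_dp * real(j, dp), j = 1, n) /)

  ! One-statement form: product of two SPREAD references
  c1 = spread(a, dim=2, ncopies=n) * spread(b, dim=1, ncopies=m)

  ! Two-statement form from the post
  c2 = spread(a, dim=2, ncopies=n)
  c2 = c2 * spread(b, dim=1, ncopies=m)

  ! Both forms should give c(i,j) = a(i)*b(j)
  do j = 1, n
    do i = 1, m
      if (c1(i,j) /= a(i)*b(j) .or. c2(i,j) /= a(i)*b(j)) then
        print *, 'MISMATCH at', i, j
        stop 1
      end if
    end do
  end do
  print *, 'OK: both forms give c(i,j) = a(i)*b(j)'
end program spread_split_check
```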
Examination of the generated code using /64 /opt /explist shows the problem clearly:
(i) Calculation of each element of the result matrix C involves the MULSD instruction, which occurs only once in the listing;
(ii) The MULSD is in a loop, and a call to SPREAD is located in the same loop. As a result, calculating the final result C involves making m × n calls to SPREAD, which is extremely expensive.
(iii) A single call to SPREAD should suffice for the second Fortran statement above, since C has already been allocated and initialised in the Fortran statement preceding it.
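Until the compiler hoists the call itself, a user can do the hoisting by hand. The sketch below shows two possible workarounds under stated assumptions: (a) assign the second SPREAD result to a temporary array, so that each statement contains only a simple reference to SPREAD, and (b) drop SPREAD altogether and use explicit DO loops. The temporary `t`, the array `c_loop`, and the sizes are hypothetical names and values introduced for illustration. Whether form (a) actually compiles to a single call under FTN95 8.30.279 has not been checked here and would need to be confirmed with /explist.

```fortran
program spread_hoisted
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: m = 3, n = 4        ! illustrative sizes (assumed)
  real(dp) :: a(m), b(n), c(m,n), t(m,n), c_loop(m,n)
  integer :: i, j

  a = (/ (real(i, dp), i = 1, m) /)
  b = (/ (real(2*j, dp), j = 1, n) /)

  ! (a) One simple SPREAD reference per statement, then a plain
  !     element-wise array multiply that contains no SPREAD at all.
  c = spread(a, dim=2, ncopies=n)
  t = spread(b, dim=1, ncopies=m)           ! hypothetical temporary
  c = c * t

  ! (b) Explicit loops, no SPREAD: same result, no temporaries.
  do j = 1, n
    do i = 1, m
      c_loop(i,j) = a(i) * b(j)
    end do
  end do

  ! The two workarounds should agree exactly for this integer-valued data
  if (any(c /= c_loop)) then
    print *, 'MISMATCH between SPREAD and loop versions'
    stop 1
  end if
  print *, 'OK: hoisted SPREAD and explicit loops agree'
end program spread_hoisted
```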
Here are the timing results (2.1 GHz Intel T4300, Windows 10 x64, FTN95 8.30.279):
System_clock rate = 10000
n      time (s)
10      0.004900
50      0.047500
100     0.697100
200    11.048800
And, from gfortran 7.3 with -O2:
System_clock rate = 1000000000
n      time (s)
10      0.001192
50      0.000013
100     0.000057
200     0.000585
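For anyone who wants to repeat the measurement, here is a minimal sketch of a timing harness for the two-statement form. It is not the program that produced the numbers above: it assumes square matrices (m = n), a single evaluation per size, and made-up test data, and it passes 64-bit integers to SYSTEM_CLOCK, which a particular compiler may or may not accept (switch to default integers if it does not). The checksum is printed only so that the computation cannot be optimised away.

```fortran
program spread_timing
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)   ! 64-bit clock counts (assumed OK)
  integer, parameter :: sizes(4) = (/ 10, 50, 100, 200 /)
  real(dp), allocatable :: a(:), b(:), c(:,:)
  integer(i8) :: t0, t1, rate
  integer :: k, n, m, i
  real(dp) :: checksum

  call system_clock(count_rate=rate)
  print '(a,i0)', ' System_clock rate = ', rate
  print '(a)', '    n    time (s)      checksum'

  do k = 1, size(sizes)
    n = sizes(k)
    m = n                                   ! assumption: square case
    allocate(a(m), b(n), c(m,n))
    a = (/ (real(i, dp), i = 1, m) /)
    b = (/ (1.0_dp / real(i, dp), i = 1, n) /)

    call system_clock(t0)
    c = spread(a, dim=2, ncopies=n)
    c = c * spread(b, dim=1, ncopies=m)     ! element-wise, not a matrix multiply
    call system_clock(t1)

    checksum = sum(c)                       ! use the result so it is not discarded
    print '(i5,f12.6,es14.6)', n, real(t1 - t0, dp) / real(rate, dp), checksum
    deallocate(a, b, c)
  end do
end program spread_timing
```

As a rough consistency check on the FTN95 figures above: doubling n from 100 to 200 raises the time from 0.697 s to 11.05 s, a factor of about 16, which is what one would expect if m × n calls are made and each call produces an m × n result.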