Paul and others,
I have continued to test different alternatives to array syntax, this time with dot_product.
I have developed small test programs to better show the effects.
The first test compared 1D vectors and 2D arrays with different array syntax:
1 array(ieq+j0) = array(ieq+j0) - dot_product (array(k1:ku+k0), array(j1:ku+j0)) * factor
2 array(ieq+j0) = array(ieq+j0) - vec_sum (array(k1), array(j1), kb) * factor
3 array(ieq+j0) = array(ieq+j0) - vecsum (array(k1), array(j1), kb) * factor
4 a2(ieq,jeq) = a2(ieq,jeq) - dot_product (a2(kl:ku,k), a2(kl:ku,jeq) ) * factor
5 a2(ieq,jeq) = a2(ieq,jeq) - vec_sum (a2(kl,k), a2(kl,jeq), kb) * factor
6 a2(ieq,jeq) = a2(ieq,jeq) - vecsum (a2(kl,k), a2(kl,jeq), kb) * factor
I compiled these with and without /opt.
Without /opt, option 4 was 20% slower than the rest, with options 1 & 2 nearly identical.
With /opt, all but option 1 improved substantially, with elapsed time falling to about 33% of the unoptimised figure.
In these, VECSUM is simply a do loop, while VEC_SUM is only a dot_product call
(although the help for /opt implies that the do loop may be replaced by dot_product).
I looked at the assembler for option 1, and it appears that dot_product is replaced by expanded
inline instructions, rather than by a call to a maths library function.
I am certainly surprised that the convenient array syntax has such an overhead for vectors.
It is also surprising that option 4 (2d addressing) improved with /opt, but option 1 did not.
There was also no significant difference between VECSUM and VEC_SUM.
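For anyone wanting to reproduce the comparison, a driver of roughly this shape shows the effect. The array size and repeat count here are illustrative, not the values from my tests, and VECSUM is repeated inside so the example is self-contained:

```fortran
program time_dot
  implicit none
  integer*4, parameter :: n = 1000, reps = 1000   ! illustrative sizes; increase reps for measurable times
  real*8 :: a(n), b(n), s1, s2
  real*8, external :: vecsum
  integer*4 :: i, t0, t1, rate
  call random_number (a)
  call random_number (b)
  call system_clock (t0, rate)
  s1 = 0.0d0
  do i = 1, reps
     s1 = s1 + dot_product (a, b)           ! compiler-expanded array syntax
  end do
  call system_clock (t1)
  print *, 'dot_product :', dble(t1-t0)/rate, ' seconds'
  call system_clock (t0)
  s2 = 0.0d0
  do i = 1, reps
     s2 = s2 + vecsum (a, b, n)             ! explicit do-loop function
  end do
  call system_clock (t1)
  print *, 'vecsum      :', dble(t1-t0)/rate, ' seconds'
  if (abs(s1-s2) > 1.0d-6*abs(s1)) stop 1   ! both forms must give the same sum
end program

real*8 function vecsum (a, b, n)
  integer*4, intent (in) :: n
  real*8, dimension(n), intent (in) :: a, b
  real*8 c
  integer*4 i
  c = 0.0d0
  do i = 1, n
     c = c + a(i)*b(i)
  end do
  vecsum = c
end function
```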
I next ran a test with different vector syntax. Options 1 and 2 differ only in how
the array extent is expressed, to see if that helps the optimiser:
1 A(J+I0) = A(J+I0) - dot_product (A(I1:I2), B(J1:J2))
2 A(J+I0) = A(J+I0) - dot_product (A(I1:I1+jband), B(J1:J1+jband))
3 A(J+I0) = A(J+I0) - VEC_SUM (A(I1), B(J1), JBAND)
4 A(J+I0) = A(J+I0) - VECSUM (A(I1), B(J1), JBAND)
The results show no difference between 1 & 2 and also none between 3 & 4, but a significant
difference between array syntax and function calls.
On my 3.2 GHz desktop, with /opt, there is a consistent 3:1 difference between the expanded dot_product
code and the VEC... functions. What is the program doing? I assume it is calculating array addresses?
The old programming approach of counting multiplies does not appear to apply here.
On my 1.86 GHz notebook, the ratio started at 2:1 and then declined to 1.3:1 as JBAND ranged from 500 to 1000.
I presume this could be due to the larger cache, and perhaps to a different cost of the addressing instructions
relative to multiply and add.
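My assumption (not confirmed from the assembler) is that the overhead lies in how the arguments are passed: an array section such as A(I1:I2) obliges the compiler to describe the section, while passing A(I1) to an explicit-shape dummy hands over just the base address, F77 style. A small self-contained sketch of the two calling styles, with COLSUM a hypothetical stand-in for the VEC... functions:

```fortran
program section_vs_base
  implicit none
  integer*4, parameter :: n = 6
  real*8 :: a2(n,n), s1, s2
  real*8, external :: colsum
  integer*4 :: i, j
  do j = 1, n
     do i = 1, n
        a2(i,j) = dble(i + n*j)
     end do
  end do
  s1 = dot_product (a2(1:n,2), a2(1:n,3))   ! array-section syntax
  s2 = colsum (a2(1,2), a2(1,3), n)         ! sequence association: pass the column base address
  print *, s1, s2
  if (abs(s1-s2) > 1.0d-9*abs(s1)) stop 1   ! both must give the same dot product
end program

real*8 function colsum (a, b, n)
  integer*4, intent (in) :: n
  real*8, dimension(n), intent (in) :: a, b
  integer*4 i
  colsum = 0.0d0
  do i = 1, n
     colsum = colsum + a(i)*b(i)
  end do
end function
```

Both forms address the same contiguous column, so any timing difference comes from the calling convention rather than the arithmetic.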
I tested /pentium, /p6 and /sse2 as extra options, but none showed any significant effect on either machine.
It appears that 1D vector sub-arrays do not work well and do not respond to /opt.
The compiler's expansion of dot_product in this case carries up to a 3:1 performance penalty, compared with
calling a dot_product function.
My previous post showed a similar effect for vector multiplication and subtraction (VECSUB).
Do you have any comments on these results?
I would be pleased to send you the full code for both tests.
REAL*8 FUNCTION VECSUM (A, B, N)
!
! Performs a vector dot product VECSUM = [A] . [B]
! account is NOT taken of the leading zero terms in the vectors
!
integer*4, intent (in) :: n
real*8, dimension(n), intent (in) :: a
real*8, dimension(n), intent (in) :: b
!
real*8 c
integer*4 i
!
c = 0.0d0
do i = 1,n
   c = c + a(i)*b(i)
end do
!
vecsum = c
return
!
end
REAL*8 FUNCTION VEC_SUM (A, B, N)
!
! Performs a vector dot product VEC_SUM = [A] . [B]
! account is NOT taken of the leading zero terms in the vectors
!
integer*4, intent (in) :: n
real*8, dimension(n), intent (in) :: a
real*8, dimension(n), intent (in) :: b
!
vec_sum = dot_product (a,b)
return
!
end