Array syntax inefficiency

 
JohnCampbell
Posted: Tue Sep 26, 2006 5:30 pm

Paul,

From time to time, I am puzzled by the inefficiency of some array syntax. Good examples are vector subtraction and dot product. The following example performs vector subtraction using either array syntax or a subroutine call with Fortran 77-style syntax. In this cut-out of code, the subroutine call is faster than the array syntax by about 30%. The values of JBAND range up to about 5,000, so the real*8 computation is dominant, not the integer address calculation. i.e. the array code:
D(I0:I1) = D(I0:I1) - B(J0:J1) * D(J)
is much slower than :
CALL VECSUB (D(I0), B(J0), D(J), JBAND)

I thought that the compiler may be creating copies of D(I0:I1) and B(J0:J1) on the stack. Could this explain the problem? Memory grabbing also appears to affect the performance of VECSUB, but I'm not certain of any of the inefficiency causes. I have included the style of array definitions that I have used, in case that's an issue. I even use /opt.

Do you have any suggestions as to why there is this apparent inefficiency?

SUBROUTINE Back_Sub (D, B, NB)
!
! Backsubstitutes for load vector D with block B
!
   REAL*8    D(*), B(*)
   INTEGER*4 NB(*)
   EXTERNAL  VECSUB
!
   INTEGER*4 I0, I1, J, J0, J1, JBAND
   ....
! Slow Array syntax
   if (array_code) then
      D(I0:I1) = D(I0:I1) - B(J0:J1) * D(J)
!
! Fast execution code
   else
      CALL VECSUB (D(I0), B(J0), D(J), JBAND)
   end if

SUBROUTINE VECSUB (A, B, FAC, N)
!
! Performs the vector operation [A] = [A] - [B] * FAC
!
   integer*4 n, i
   real*8    a(*), b(*), fac, c
!
   c = -fac
   do i = n,1,-1      ! the reverse loop performs faster than a forward loop !!
      a(i) = a(i) + b(i) * c
   end do
   return
!
END
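
For what it's worth, a minimal harness along the following lines shows the effect. This is only a sketch, not my actual test program: the vector length, repeat count and the SYSTEM_CLOCK timing are illustrative only, and it assumes the VECSUB routine above is linked in.

PROGRAM TIME_VECSUB
! Sketch only: times the array-syntax update against the equivalent VECSUB call
   IMPLICIT NONE
   INTEGER*4, PARAMETER :: N = 5000, REPEATS = 20000
   REAL*8    D(N), B(N), DJ
   INTEGER*4 I0, I1, K, C0, C1, RATE
   EXTERNAL  VECSUB
!
   CALL RANDOM_NUMBER (D)
   CALL RANDOM_NUMBER (B)
   DJ = 0.5D0
   I0 = 1                                    ! run-time bounds, as in the solver
   I1 = N
!
   CALL SYSTEM_CLOCK (C0, RATE)
   DO K = 1, REPEATS
      D(I0:I1) = D(I0:I1) - B(I0:I1) * DJ    ! array syntax form
   END DO
   CALL SYSTEM_CLOCK (C1)
   WRITE (*,*) 'array syntax :', DBLE(C1-C0)/DBLE(RATE), ' seconds'
!
   CALL SYSTEM_CLOCK (C0)
   DO K = 1, REPEATS
      CALL VECSUB (D(I0), B(I0), DJ, N)      ! subroutine form
   END DO
   CALL SYSTEM_CLOCK (C1)
   WRITE (*,*) 'VECSUB call  :', DBLE(C1-C0)/DBLE(RATE), ' seconds'
END PROGRAM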

JohnCampbell
Posted: Mon Oct 02, 2006 9:57 pm

Paul and others,

I have continued to test different alternatives to array syntax, this time with dot_product.
I have developed small test programs to better show the effects.

The first test compared 1D (vector) and 2D (array) addressing with different array syntax:

1 array(ieq+j0) = array(ieq+j0) - dot_product (array(k1:ku+k0), array(j1:ku+j0)) * factor
2 array(ieq+j0) = array(ieq+j0) - vec_sum (array(k1), array(j1), kb) * factor
3 array(ieq+j0) = array(ieq+j0) - vecsum (array(k1), array(j1), kb) * factor
4 a2(ieq,jeq) = a2(ieq,jeq) - dot_product (a2(kl:ku,k), a2(kl:ku,jeq) ) * factor
5 a2(ieq,jeq) = a2(ieq,jeq) - vec_sum (a2(kl,k), a2(kl,jeq), kb) * factor
6 a2(ieq,jeq) = a2(ieq,jeq) - vecsum (a2(kl,k), a2(kl,jeq), kb) * factor

I compiled these with and without /opt.
Without /opt, option 4 was 20% slower than the rest, with options 1 & 2 nearly identical.
With /opt, all but option 1 improved substantially, the elapsed time dropping to about 33% of the unoptimised figure.
In these tests, VECSUM is simply a do loop, while VEC_SUM is just a call to dot_product
(although the help for /opt implies that the do loop may be replaced by dot_product).

I looked at the assembler for option 1, and it appears that dot_product is expanded into in-line
instructions rather than calling a maths library function.
I am certainly surprised that the convenient array syntax carries such an overhead for vectors.
It is also surprising that option 4 (2D addressing) improved with /opt, but option 1 did not.
There was also no significant difference between VECSUM and VEC_SUM.

I next did a test with different vector syntax. Options 1 and 2 differ in the expression of
the array size, to see if that helps the optimiser.

1 A(J+I0) = A(J+I0) - dot_product (A(I1:I2), B(J1:J2))
2 A(J+I0) = A(J+I0) - dot_product (A(I1:I1+jband), B(J1:J1+jband))
3 A(J+I0) = A(J+I0) - VEC_SUM (A(I1), B(J1), JBAND)
4 A(J+I0) = A(J+I0) - VECSUM (A(I1), B(J1), JBAND)

The results show no difference between 1 & 2, and none between 3 & 4, but a significant
difference between array syntax and function calls.
On my 3.2 GHz desktop, with /opt, there is a consistent 3:1 difference between the expanded dot_product
code and the VEC... function. What is the program doing? I assume it's calculating array addresses?
The old programming approach of counting multiplies does not appear to apply here.
On my 1.86 GHz notebook, the ratio started at 2:1 and declined to 1.3:1 as JBAND ranged from 500 to 1000.
I presume this could be due to the larger cache and possibly different instruction timings relative to the
multiply and add.
I tested /pentium, /p6 and /sse2 as extra options, but none showed any significant effect on either machine.
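
My guess at what is happening - and I stress this is only a guess from a partial reading of the assembler listing, with the routine name and temporary arrays below purely illustrative - is that the expanded dot_product behaves roughly like the hand-written version below, copying each section to a temporary before the reduction loop runs, whereas the VEC... function simply receives the two base addresses A(I1) and B(J1):

SUBROUTINE GUESSED_EXPANSION (A, B, J, I0, I1, I2, J1)
! Illustrative only: what  A(J+I0) = A(J+I0) - dot_product (A(I1:I2), B(J1:J2))
! might amount to if the compiler copies each section to a temporary first
   REAL*8    A(*), B(*)
   INTEGER*4 J, I0, I1, I2, J1
   INTEGER*4 K, N
   REAL*8    S
   REAL*8, ALLOCATABLE :: T1(:), T2(:)
!
   N = I2 - I1 + 1
   ALLOCATE (T1(N), T2(N))
   DO K = 1, N
      T1(K) = A(I1+K-1)                  ! copy-in of the first section
      T2(K) = B(J1+K-1)                  ! copy-in of the second section
   END DO
   S = 0.0D0
   DO K = 1, N
      S = S + T1(K)*T2(K)                ! the actual dot product
   END DO
   A(J+I0) = A(J+I0) - S
   DEALLOCATE (T1, T2)
END SUBROUTINE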


It appears that 1D vector sub-arrays neither work well nor respond to /opt.
The compiler's in-line expansion of dot_product in this case can be up to 3 times slower than
calling a dot_product function.
My previous post showed a similar effect for vector multiplication and subtraction (VECSUB).

Do you have any comments on these results?
I would be pleased to send you the full code for both tests.


REAL*8 FUNCTION VECSUM (A, B, N)
!
! Performs a vector dot product: VECSUM = [A] . [B]
! account is NOT taken of the leading zero terms in the vectors
!
   integer*4, intent (in) :: n
   real*8, dimension(n), intent (in) :: a
   real*8, dimension(n), intent (in) :: b
!
   real*8 c
   integer*4 i
!
   c = 0.0
   do i = 1,n
      c = c + a(i)*b(i)
   end do
!
   vecsum = c
   return
!
end

REAL*8 FUNCTION VEC_SUM (A, B, N)
!
! Performs a vector dot product: VEC_SUM = [A] . [B]
! account is NOT taken of the leading zero terms in the vectors
!
   integer*4, intent (in) :: n
   real*8, dimension(n), intent (in) :: a
   real*8, dimension(n), intent (in) :: b
!
   vec_sum = dot_product (a, b)
   return
!
end

JohnCampbell
Posted: Mon Oct 09, 2006 12:12 am

Paul,

I don't appear to be getting any interest in this question, so I will try to make it shorter.

Why does using the dot_product intrinsic inside a larger DO loop run 2 to 3 times slower than calling the function VEC_SUM, when all VEC_SUM does is call dot_product?
1 A(J+I0) = A(J+I0) - dot_product (A(I1:I2), B(J1:J2))
3 A(J+I0) = A(J+I0) - VEC_SUM (A(I1), B(J1), JBAND)

I have tested it on 4 recent P4 PCs with FTN95 versions 4.8 and 4.91. There is some variation in the relative performance between desktop and notebook.
I certainly don't understand what the type 1 statement is doing to take 3 times as long.
In the VEC_SUM version, the loop averages about 10 processor cycles per multiply and add, while the in-line dot_product version takes about 30.
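(The cycles-per-operation figure is simply clock speed [Hz] x elapsed time [s] / number of multiply-adds; as a purely illustrative example, 1.0E9 multiply-adds taking 3.3 seconds at 3.2 GHz works out at 3.2E9 x 3.3 / 1.0E9, or about 10.6 cycles per multiply-add.)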

Any ideas where the problem may be?

As dot_product appears to be expanded in line in the assembler listing, rather than being called as a library function, I am wondering whether all uses of sub-arrays of the form A(I:J) should be replaced by function or subroutine calls wherever possible.

John

PaulLaidler (Site Admin)
Posted: Mon Oct 09, 2006 12:53 am

John

I may be able to help you with this in due course but time does not permit at the moment.

Regards

Paul

JohnCampbell
Posted: Mon Oct 09, 2006 11:59 pm

Thanks,

Let me know when you can and I'll send you the test examples I have developed.
It would be good to understand the reason for the significant performance variation in the different REAL*8 array examples.

Regards
John

JohnCampbell
Posted: Fri Jul 06, 2007 6:09 am

Paul,

Any time available yet?

If I replace array syntax in my equation solver, such as:
A(J+I0) = A(J+I0) - dot_product (A(I1:I1+n-1), B(J1:J1+n-1))
or
A(J+I0) = A(J+I0) - dot_product (A(I1:I2), B(J1:J2))

with
A(J+I0) = A(J+I0) - VEC_SUM (A(I1), B(J1), N)
where VEC_SUM is a one-line function:
REAL*8 FUNCTION VEC_SUM (A, B, N)
vec_sum = dot_product (a,b)

The performance change is significant for non-trivial values of N (around 600). Replacing dot_product with a do loop inside VEC_SUM gives only a minimal further improvement.
My test program goes from 39.4 seconds down to 14.1 seconds for the full test loop, i.e. from 107 up to 297 multiplies per microsecond, on a 3 GHz P4.
That is, it takes nearly 3 times as long with array-section addressing as with simple vector addressing. I always thought multiplication took longer than array address calculation!

These are pretty significant delays for any compute-intensive application that might use sub-array syntax, which does not look like a complex construct.

The full stand-alone test program is waiting for an email address.

Regards John

PaulLaidler (Site Admin)
Posted: Fri Jul 06, 2007 7:30 am

I can think of two general points which may help. A detailed response would require a significant investment of time and may need some financial support.

1. You can see how the compiler handles the optimisation by looking at an /EXPLIST of the code. You do not need to have a full understanding of assembly code. With a little effort you can see how the do loops are handled and basically how values are being copied etc.

2. When passing arrays (and in particular array sections) to subprograms the compiler may have to copy the array on entry and on exit. Hence it can improve the efficiency if you can use INTENT(IN) or INTENT(OUT).
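
For example - only a sketch, with the names taken from John's code rather than anything prescribed - putting the routine in a module with INTENT(IN) dummy arguments makes the interface explicit at the call site and tells the compiler that nothing needs to be copied back:

MODULE VEC_OPS
CONTAINS
   REAL*8 FUNCTION VEC_SUM (A, B, N)
   ! INTENT(IN) tells the compiler A and B are never modified,
   ! so at most a "copy in" is needed when a section is passed
   INTEGER*4, INTENT(IN) :: N
   REAL*8,    INTENT(IN) :: A(N), B(N)
   VEC_SUM = DOT_PRODUCT (A, B)
   END FUNCTION VEC_SUM
END MODULE VEC_OPS

The caller then just needs USE VEC_OPS and can still pass A(I1), B(J1), JBAND as before.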

PaulLaidler (Site Admin)
Posted: Fri Jul 06, 2007 9:43 am

I have had a quick look at the bit about calling vec_sum rather than dot_product.

The difference is simply that the call to dot_product uses an array section and so requires a "copy in" or "copy out" process, or both (maybe "requires" is too strong - I assume that the general case requires copying and FTN95 does not handle the specific cases that do not require it).

On the other hand vec_sum does not use an array section and the base address is passed without the need to take a copy.

Thinking about this as I write, I guess that if the subprogram is an internal or module procedure, or has an explicit interface, then the compiler will know the INTENT of the arguments and can limit the amount of copying that needs to be done.

Anyway, the solution may simply be to avoid passing array sections wherever possible.