The code for single and double precision dot products using SSE/SSE2.
This new version allows arrays of any size.
I don't know how to align data on 16 byte boundary's with FTN95. If I can find out how to do this, the code can be made faster by changing movups to movaps in the single precision version, and movupd to movapd in the double precision version.
[u:72eb36481a]Edit[/u:72eb36481a]. I found out that the correct alignment can be specified using an align(128) attribute. But it doesn't seem to work. I have logged this on the support page. Until this is fixed, the codes below will have to make do with using the slower movups, movupd codes.
Also not all of the opcodes are implemented, like shuffling with shufps (grrrr).
Maybe Paul could comment on why some op codes don't work? And is it possible to define my own op codes (since they are just data really)?
!** SINGLE PRECISION VERSION
function asm_dotprod(v1,v2,n)
integer n
real v1(*), v2(*)
real asm_dotprod
integer nn, nnb, m
real :: v(4)
! Number of elements in sets of 4
nn = 4*(n/4)
! Number of bytes in sets of 4
nnb = 4*nn
! Number of bytes
m = 4*n
! Assembly code is between code, edoc lines
code
xorps xmm0%, xmm0% ; set xmm0 to zero
mov eax%, =v1 ; address of v1 argument
mov ecx%, =v2 ; address of v2 argument
mov edx%, 0 ; Initialise loop counter
mov edi%, nnb ; counter when loop is exited
cmp edi%, 0
jz $20 ; conditional skip loop, no terms
! Loop to accumulate dot product of elements 1 to nn
10 movups xmm1%, [eax%+edx%] ; move next 4 reals in v1 into xmm1
movups xmm2%, [ecx%+edx%] ; move next 4 reals in v2 into xmm2
mulps xmm1%, xmm2% ; multiply values and store in xmm1
addps xmm0%, xmm1% ; accumulate sum in xmm0
add edx%, 16 ; increment counter
cmp edx%, edi%
jl $10 ; conditional branch back
20 mov edi%, m
cmp edx%, edi%
jge $40 ; skip loop if n = nn
! Loop to accumulate dot product of last nn+1 to n elements in low 32 bits of xmm0
30 movss xmm1%, [eax%+edx%]
mulss xmm1%, [ecx%+edx%]
addss xmm0%, xmm1%
add edx%, 4
cmp edx%, edi%
jl $30
! Can't get final reduction to work, so will do this in Fortran for now
40 movups v, xmm0% ; move xmm0 to v array
edoc
! Final reduction, the result is the sum of the four values in v
asm_dotprod = sum(v)
end function asm_dotprod