Silverfrost Forums


Fortran 2003/2008

18 May 2012 7:49 (Edited: 19 May 2012 8:14) #10182

The code for single and double precision dot products using SSE/SSE2.

This new version allows arrays of any size.

I don't know how to align data on 16-byte boundaries with FTN95. If I can find out how to do this, the code can be made faster by changing movups to movaps in the single precision version, and movupd to movapd in the double precision version.

Edit: I found out that the correct alignment can be specified using an align(128) attribute, but it doesn't seem to work. I have logged this on the support page. Until this is fixed, the code below will have to make do with the slower movups and movupd instructions.

Also, not all of the opcodes are implemented, such as shuffling with shufps (grrrr).

Maybe Paul could comment on why some opcodes don't work? And is it possible to define my own opcodes (since they are just data, really)?

!** SINGLE PRECISION VERSION
function asm_dotprod(v1,v2,n)
   integer n
   real v1(*), v2(*)
   real asm_dotprod

   integer nn, nnb, m
   real :: v(4)
      
   ! Number of elements in sets of 4
   nn = 4*(n/4)

   ! Number of bytes in sets of 4
   nnb = 4*nn

   ! Number of bytes
   m = 4*n

   ! Assembly code is between code, edoc lines
   code
      xorps xmm0%, xmm0%         ; set xmm0 to zero

      mov eax%, =v1              ; address of v1 argument
      mov ecx%, =v2              ; address of v2 argument      

      mov edx%, 0                ; Initialise loop counter
      mov edi%, nnb              ; counter when loop is exited
   
      cmp edi%, 0
      jz $20                     ; conditional skip loop, no terms

      ! Loop to accumulate dot product of elements 1 to nn
   10 movups xmm1%, [eax%+edx%]  ; move next 4 reals in v1 into xmm1
      movups xmm2%, [ecx%+edx%]  ; move next 4 reals in v2 into xmm2
      mulps xmm1%, xmm2%         ; multiply values and store in xmm1
      addps xmm0%, xmm1%         ; accumulate sum in xmm0
      add edx%, 16               ; increment counter
      cmp edx%, edi%
      jl $10                     ; conditional branch back

   20 mov edi%, m
      cmp edx%, edi%
      jge $40                    ; skip loop if n = nn

      ! Loop to accumulate dot product of elements nn+1 to n in the low 32 bits of xmm0
   30 movss xmm1%, [eax%+edx%]
      mulss xmm1%, [ecx%+edx%]
      addss xmm0%, xmm1%
      add edx%, 4
      cmp edx%, edi%
      jl $30

      ! Can't get final reduction to work, so will do this in Fortran for now
   40 movups v, xmm0%            ; move xmm0 to v array
   edoc
   
   ! Final reduction, the result is the sum of the four values in v
   asm_dotprod = sum(v)

end function asm_dotprod
18 May 2012 7:54 (Edited: 19 May 2012 7:59) #10183
!** DOUBLE PRECISION VERSION
function asm_ddotprod(v1,v2,n)
   integer n
   real*8 v1(*), v2(*)
   real*8 asm_ddotprod

   integer nn, nnb, m
   real*8 :: v(2)
      
   ! Number of elements in sets of 2
   nn = 2*(n/2)

   ! Number of bytes in sets of 2
   nnb = 8*nn

   ! Number of bytes
   m = 8*n

   ! Assembly code is between code, edoc lines
   code
      xorps xmm0%, xmm0%         ; set xmm0 to zero

      mov eax%, =v1              ; address of v1 argument
      mov ecx%, =v2              ; address of v2 argument      

      mov edx%, 0                ; Initialise loop counter
      mov edi%, nnb              ; counter when loop is exited
   
      cmp edi%, 0
      jz $20                     ; conditional skip loop, no terms

      ! Loop to accumulate dot product of elements 1 to nn
   10 movupd xmm1%, [eax%+edx%]  ; move next 2 doubles in v1 into xmm1
      movupd xmm2%, [ecx%+edx%]  ; move next 2 doubles in v2 into xmm2
      mulpd xmm1%, xmm2%         ; multiply values and store in xmm1
      addpd xmm0%, xmm1%         ; accumulate sum in xmm0
      add edx%, 16               ; increment counter
      cmp edx%, edi%
      jl $10                     ; conditional branch back

   20 mov edi%, m
      cmp edx%, edi%
      jge $40                    ; skip loop if n = nn

      ! Loop to accumulate dot product of elements nn+1 to n in the low 64 bits of xmm0
   30 movsd xmm1%, [eax%+edx%]
      mulsd xmm1%, [ecx%+edx%]
      addsd xmm0%, xmm1%
      add edx%, 8
      cmp edx%, edi%
      jl $30

      ! Can't get final reduction to work, so will do this in Fortran for now
   40 movupd v, xmm0%            ; move xmm0 to v array
   edoc
   
   ! Final reduction, the result is the sum of the two values in v
   asm_ddotprod = sum(v)

end function asm_ddotprod
18 May 2012 8:04 #10187

Quoted from JohnCampbell

I am not capable of doing this [SSE], nor do I know how to test whether the approach is supported by the processor.

FTN95 has functions HAS_MMX@ and HAS_SSE@ which can be used to check support for MMX and SSE.

To check if AVX is supported there is a function @CPUID; it reports AVX support in bit 28 of the ECX register.

So you can write code which selects the most appropriate strategy depending on the support in the CPU.
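For example, a run-time dispatch could look something like this (an untested sketch: it assumes HAS_SSE@ takes no arguments, returns a LOGICAL, and covers the SSE2 instructions used by asm_ddotprod - check the FTN95 library documentation):

function best_ddotprod(v1,v2,n)
   integer n
   real*8 v1(*), v2(*)
   real*8 best_ddotprod
   logical HAS_SSE@               ! FTN95 library function (assumed LOGICAL)
   real*8 asm_ddotprod            ! the SSE2 routine posted above

   if (HAS_SSE@()) then
      ! CPU has the SIMD instructions: use the hand-coded routine
      best_ddotprod = asm_ddotprod(v1, v2, n)
   else
      ! Portable fallback: let the compiler generate scalar code
      best_ddotprod = dot_product(v1(1:n), v2(1:n))
   end if
end function best_ddotprod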

20 May 2012 9:57 #10195

David,

It's really generous of you to create these routines and put them on the forum.

In the last bit of code (around statement 30), might it be better to simply fill the unused space with zeroes so that you still use the SIMD instruction? Then (if there was only one pair to multiply) you would be doing

XY + 00 + 00 + 00

... although I can't imagine the difference (even if my suggestion isn't worse!) would be measurable. On average, across multiple invocations, you would be doing 1.5 pairs of zero multiplies (3, 2, 1 or 0), not always the worst case of 3, and you can't assume the final loop always gets the best case of 1 - on average it must also be 1.5 (0, 1, 2 or 3) [I think!]

As well as the dot product of two long vectors, another case arises in transforming from one coordinate system to another. In 3D this involves 3x3 transformation matrices, and in order to exploit the SIMD SSE instructions, presumably the 4th pair must always be 0. In that case the routine would be very short but invoked a huge number of times. Do you think the overhead of the function call might outweigh the saving from the SIMD instructions (and more so for 2D transformations)? Or is this just a bad guess on my part?

Eddie

20 May 2012 3:37 (Edited: 20 May 2012 9:50) #10197

Eddie,

I can't pack the 1, 2 or 3 remaining values into a register without using the shufps (single precision) or shufpd (double precision) instructions to move the values to the correct bits of the xmm(n) registers. Unfortunately, these instructions aren't supported in FTN95, so I just did the remaining values in a small non-SIMD loop. For the double precision code, the 'loop' count is either 0 or 1, and the final compare/jump back is redundant (I left it in so the code pattern matches the single precision version).

The non-SIMD loop won't really make much difference when the largest part of the work is done using the SIMD instructions.

At the moment, these are just fully functional demonstration routines to prove that SIMD is possible in FTN95. I should really be decrementing my counters (not incrementing them), vectorising more than 2 doubles at a time, and interleaving some of the instructions to get some benefit from pipelining.

All these optimisations will have to wait, though, because I am waiting for Paul to confirm there is an issue with 128-bit alignment (see the support page of the forum). As they stand, these routines provide a modest improvement in run speed compared to a simple DO loop compiled to non-SIMD code. Once I can make sure the data is aligned and have done the optimisations, the next versions of the routines should be much faster.

I'm not an assembly language programmer, so please all chip in if you think there's a better way to do it.

I'm not sure about your 3D transformation example. If you write the routine to do lots of transformations per call, then with the correct optimisations and aligned data there could be some gains in performance. If the routine does only one transformation per call, the overhead could be too large to make it worthwhile.

20 May 2012 7:16 #10198

Hi David,

Thanks for the explanations. I did think about decrementing, but then forgot. I did a lot of ASM programming, but was never much good at it. CODE EDOC is much easier than MASM, as you don't need to write the entry and exit parts of a routine. I agree that the final loop can't possibly make much difference, and that the real benefit will be seen when the vector length is big.

Now I understand the shufp* bit - I already understood the alignment bit. FTN77/DBOS, if my memory serves me, always used to keep common blocks etc. aligned. This has probably been relaxed to save a few bytes here and there!

In the case of the 3D transformation, you might be operating on thousands of points. Saving time on each transformation would be useful, but it really needs to be done with inline code, not a subroutine call.

I do hope that Silverfrost see the benefits of enhancing the Assembler to include the additional opcodes and fix the alignment problem.

Recently I have been mulling over big programs that run for hours taking over the whole resources of a massively-configured Windows computer. It seems to me that Windows is the wrong environment somehow! But any cheap increase in processing speed gets my vote.

Eddie

20 May 2012 9:46 #10199

I think FTN95 does some alignment, but for SSE/SSE2 it needs to align an array so that the address of the first element is exactly divisible by 16. This is supposed to be possible using:

real, align(128)   :: X(100)    ! address of X(1) is a multiple of 16
real*8, align(128) :: Y(100)    ! address of Y(1) is a multiple of 16

It is this that doesn't seem to work.

But alignment like the above only helps; it isn't the full solution.

The assembly code will need to check for alignment, since it won't always be called with the first element of the dummy arrays matching the first (aligned) element of the actual array arguments. Also, cases like the following, where the arrays are skewed, can't be aligned automatically, and the programmer has to put in 'pad' variables.

real, align(128) :: X(100)
real, align(128) :: Y(101)
s = asm_dotprod(X(1), Y(2), 100)

This is almost certainly the case with John's code where the arrays are 'not easily predictable' sections from a bigger matrix and could start anywhere in memory.

The assembly code needs to detect these cases and act accordingly, using instructions which assume alignment only when appropriate, and ones which don't at other times. I have worked out how to do this.
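The detection can be prototyped in Fortran before committing it to assembly. A small sketch using LOC to obtain the address as an integer (it is the same test as 'and eax%, 15' in the assembler):

! Sketch: .true. when x sits on a 16 byte boundary
logical function aligned16(x)
   real*8 x
   aligned16 = mod(loc(x), 16) == 0
end function aligned16

! e.g. take the movapd path only when both arguments pass:
!    if (aligned16(v1(1)) .and. aligned16(v2(1))) ...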

With AVX, alignment needs to be to 32 bytes (256 bits), I think, and this is a problem since align(256) isn't allowed in FTN95 (well, not yet 😉).

21 May 2012 7:58 #10200

David,

I have been trying to understand the alignment problem. This is a property of the two arguments. It can be considered in 2 ways.

  1. The absolute alignment of each argument with respect to a 16-byte or 32-byte boundary, and
  2. The relative alignment of the two arguments.

If the two arguments are equally misaligned, then this can be recovered to an aligned situation. However, I think the general case of the two arguments not having the same alignment means that there are problems with any recovery strategy: at best one argument can be treated as aligned, but the other cannot. In the case of a skyline equation solver, achieving alignment for each Dot_Product call is difficult. I think that by creating two copies of the active column, the alignment of both arguments can be achieved. This might not mean that both arguments start on a 16-byte boundary, but they would have the same alignment condition. A solution may be to have multiple loops in Dot_Product, depending on the argument alignment condition: a preferred case where both arguments have the same alignment, and an alternative where they do not. I think the best that can be done in the aligned case is to:

  1. include an iteration, if required, to achieve alignment,
  2. use a slow loop when only one argument is aligned, or a fast loop when both arguments are now aligned, and
  3. include an iteration to complete the loop.

There may be issues for 16-byte (SSE) or 32-byte (AVX) alignment, but I do not have any definite knowledge of this.

Certainly, demonstrating the working of SSE/SSE2, and possibly AVX, is a significant achievement. Thanks for doing this.

John

23 May 2012 2:42 #10212
  1. I think that there may be some possibility for you to manually optimise your code, either by using LOC and FCORE4 (or DCORE8) to calculate addresses of array elements, or by coding the critical parts in a C function (a sketch of the first idea follows below).

  2. There may be some scope for us to extend the range of CODE/EDOC instructions that FTN95 can handle if this would help.
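Something along these lines for item 1 (an untested sketch: it assumes LOC returns the address as an integer and that DCORE8(addr) reads the REAL*8 stored at address addr - check the documentation for the exact forms):

! Hypothetical: step through both arrays by explicit address arithmetic
function addr_ddotprod(v1,v2,n)
   integer n, i, p1, p2
   real*8 v1(*), v2(*)
   real*8 addr_ddotprod, s, dcore8
   p1 = loc(v1)                   ! address of v1(1)
   p2 = loc(v2)                   ! address of v2(1)
   s = 0.0d0
   do i = 1, n
      s = s + dcore8(p1)*dcore8(p2)
      p1 = p1 + 8                 ! advance one REAL*8 (8 bytes)
      p2 = p2 + 8
   end do
   addr_ddotprod = s
end function addr_ddotprod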

23 May 2012 5:37 #10214

Hi Paul,

I can optimise the code easily enough to find alignment opportunities. I have already done something on this; I just haven't posted it yet. The issues are:

(1) It would be nice if the ALIGN(128) attribute worked. See my post/code on the support page of the forum. This seems to be a bug.

(2) Support for further SSE/SSE2 instructions in the assembler would be very helpful. Without these it is difficult to optimise things.

Regards David.

24 May 2012 6:29 #10216

David

I should be grateful if you would post a short program that illustrates the ALIGN(128) problem.

24 May 2012 4:43 #10219

Paul,

I already posted code illustrating the ALIGN(128) problem in the support section of the Forum.

The post is at this link:

ALIGN attribute is not working

Regards David

30 May 2012 2:08 #10234

David,

You made the comment:

At the moment, these are just fully-functional demonstration routines to prove that SIMD is possible in FTN95.

What do you think is required to get a version of Dot_Product that incorporates these instructions working in FTN95?

What you have done is demonstrate that access to the SSE instructions can be available in FTN95, which I think is a significant advance.

John

PS: I have taken the real*8 code you listed in the forum and installed it in my FEA program. For the example problem I have been discussing recently, it reduces the equation reduction time from 301 seconds to 211 seconds. The error check has changed from 1.2146E-10 to 1.2764E-10, confirming that for this example the routine works correctly. This appears to be a significant improvement over the default Dot_Product when compiled with /OPT.

30 May 2012 6:51 (Edited: 31 May 2012 11:11) #10235

John,

In the longer term it would be nice if some of the SSE/SSE2 and AVX instructions were generated by the compiler for array operations like the dot_product intrinsic. There would have to be a compiler option to select this optimisation. I don't think it would be too difficult to phase some of this into the compiler, starting with dot_product, but it's entirely up to Silverfrost whether they want to do this.

Since FTN95 has the inline assembler, a good starting point would be a set of routines that users could incorporate into their own code. But we will need Silverfrost to fix the alignment issue and add support for some of the missing SSE/SSE2 instructions (please 😃). I don't think any AVX instructions are supported at present.

I have a faster version of the code I posted above, which I will post at the weekend. In the double precision version above, 2 multiplies and additions are carried out in parallel in the inner loop. The new version unrolls the inner loop so that 4 multiplies and additions are done per pass: two pairs are each done in parallel using SSE2 instructions, and the products and sums of the two pairs are interleaved to get additional benefit from out-of-order execution and instruction-level parallelism. This should improve the performance further.

30 May 2012 11:19 #10237

I was hoping that adding more assembler instructions would be a simple matter, but apparently this is not the case.

Similarly the alignment problem is easy to locate but there is a base address somewhere that also needs fixing. Still looking for this.

I will let you know when I have more information on both these issues.

31 May 2012 10:36 (Edited: 31 May 2012 11:04) #10245

Here is the fast version of the dot product for double precision arrays.

Features: (1) finds alignments and uses the fast movapd and mulpd instructions when possible, (2) decrements loop counters, (3) unrolls the vectorised loop, (4) interleaves instructions for out-of-order execution.

!** DOUBLE PRECISION VERSION
function fast_asm_ddotprod(v1,v2,n)
   integer n
   real*8 v1(*), v2(*)
   real*8 fast_asm_ddotprod

   integer nn, m
   real*8 :: v(2)
     
   ! Number of sets of 4
   nn = n/4

   ! Number of residuals
   m = n - 4*nn

   ! Assembly code is between code, edoc lines
   code
      xorps xmm0%, xmm0%         ; set xmm0 to zero
      xorps xmm1%, xmm1%         ; set xmm1 to zero

      mov eax%, =v1              ; address of v1
      mov ecx%, =v2              ; address of v2

      mov edx%, nn               ; initialise loop counter for vectorised loop
      mov edi%, m                ; initialise loop counter for scalar loop
      cmp edx%, 0
      jz $60                     ; no terms in vectorized loop

      ! Check for alignment of v1(1) and v2(1) on 16 byte boundaries
      and eax%, 15               ; low nibble of v1 address
      and ecx%, 15               ; low nibble of v2 address
      cmp eax%, 0
      jnz $10                    ; v1(1) is unaligned, check for alignment of v1(2)
      cmp ecx%, 0
      jnz $10                    ; v2(1) is unaligned, check for alignment of v2(2)

      ! v1(1) and v2(1) are aligned on 16 byte boundaries
      mov eax%, =v1              ; address of v1
      mov ecx%, =v2              ; address of v2
      jmp $40                    ; branch to aligned vectorized loop

      ! Check for alignment of v1(2) and v2(2) on 16 byte boundaries
  10  xor eax%, 8
      jnz $20                    ; branch to unaligned vectorized loop
      xor ecx%, 8
      jnz $20                    ; branch to unaligned vectorized loop
      
      ! v1(2) and v2(2) are aligned on 16 byte boundaries
      mov eax%, =v1              ; address of v1 argument
      mov ecx%, =v2              ; address of v2 argument
      movsd xmm0%, [eax%]
      mulsd xmm0%, [ecx%]        ; initialise with product v1(1)*v2(1)
      lea eax%, [eax%+8]         ; address of v1(2)
      lea ecx%, [ecx%+8]         ; address of v2(2)
      dec edi%                   ; decrease loop counter for scalar loop
      jge $40                    ; branch to aligned vectorized loop
      mov edi%, 3                ; adjust loop counter for scalar loop
      dec edx%                   ; adjust loop counter for vectorised loop
      jz $60                     ; no terms in vectorized loop
      jmp $40                    ; branch to aligned vectorized loop
31 May 2012 10:41 #10246

continued ....

      ! Initialise unaligned vectorised loop
  20  mov eax%, =v1              ; address of v1 argument
      mov ecx%, =v2              ; address of v2 argument
   
      ! Unaligned vectorised loop to accumulate the dot product, 4 doubles per pass
  30  movupd xmm2%, [eax%]       ; move next 2 doubles in v1 into xmm2
      movupd xmm4%, [eax%+16]    ; move next 2 doubles in v1 into xmm4
      movupd xmm3%, [ecx%]       ; move next 2 doubles in v2 into xmm3
      movupd xmm5%, [ecx%+16]    ; move next 2 doubles in v2 into xmm5
      mulpd xmm2%, xmm3%         ; multiply values and store in xmm2
      mulpd xmm4%, xmm5%         ; multiply values and store in xmm4
      addpd xmm0%, xmm2%         ; accumulate sum in xmm0
      addpd xmm1%, xmm4%         ; accumulate sum in xmm1
      lea eax%, [eax%+32]
      lea ecx%, [ecx%+32]
      dec edx%                   ; decrement counter
      jnz $30                    ; conditional branch back
      jmp $50

      ! Aligned vectorised loop to accumulate the dot product, 4 doubles per pass
  40  movapd xmm2%, [eax%]       ; move next 2 doubles in v1 into xmm2
      movapd xmm4%, [eax%+16]    ; move next 2 doubles in v1 into xmm4   
      mulpd xmm2%, [ecx%]        ; multiply values and store in xmm2
      mulpd xmm4%, [ecx%+16]     ; multiply values and store in xmm4
      addpd xmm0%, xmm2%         ; accumulate sum in xmm0
      addpd xmm1%, xmm4%         ; accumulate sum in xmm1
      lea eax%, [eax%+32]
      lea ecx%, [ecx%+32]
      dec edx%                   ; decrement counter
      jnz $40                    ; conditional branch back

  50  addpd xmm0%, xmm1%         ; add xmm0 and xmm1

      ! Check for entries in the scalar loop
  60  cmp edi%, 0
      jz $80                     ; conditional skip loop, no terms in scalar loop

      ! Scalar loop to accumulate the remaining (edi%) products in the low 64 bits of xmm0
  70  movsd xmm1%, [eax%]
      mulsd xmm1%, [ecx%]
      addsd xmm0%, xmm1%
      lea eax%, [eax%+8]
      lea ecx%, [ecx%+8]
      dec edi%
      jnz $70

      ! Can't get final reduction to work, so will do this in Fortran for now
  80  movupd v, xmm0%            ; move xmm0 to v array
   edoc
   
   ! Final reduction, the result is the sum of the two values in v
   fast_asm_ddotprod = sum(v)

end function fast_asm_ddotprod
1 Jun 2012 7:26 #10247

David,

I have tested your revised example and confirmed that it has better performance. I have now tested 3 of the options you proposed. I am including the run tests for 4 different approaches, using /opt compilation:

  1. Original Vec_Sum function, using a DO loop
  2. Modified loop, using your 4-group suggestion
  3. asm_ddotprod
  4. fast_asm_ddotprod

!  4 Group coding
         nn = mod (n, 4)
         if (nn <  1) c = 0.
         if (nn == 1) c = a(1)*b(1)
         if (nn >  1) c = sum (a(1:nn)*b(1:nn))
!
         do k = nn+1,n,4
            c = c + sum ( a(K:K+3) * b(K:K+3) )
         end do

I have run all 4 in my FEA program, using the problem I have been reporting. The run times are:

option 1  Vec_sum loop
  CPU TIMES FOR EQUATION REDUCTION
     Description            Num     Cpu     I/O     Avg
  1  READ BLOCK A             4    0.97    0.04   0.252
  3  MOVE BLOCK B             3    0.44   -0.01   0.142
  4  REDUCE BLOCK A           4  296.87    0.69  74.389
  5  APPLY BLOCK B            3    5.63    0.05   1.892
  6  WRITE BLOCK A            4    1.23    0.00   0.309
 99  TOTAL REDUCTION          4  305.14    0.77  76.477

option 2  dotprod_4
  CPU TIMES FOR EQUATION REDUCTION
     Description            Num     Cpu     I/O     Avg
  1  READ BLOCK A             4    1.01    0.00   0.254
  3  MOVE BLOCK B             3    0.47   -0.02   0.148
  4  REDUCE BLOCK A           4  330.86    2.16  83.255
  5  APPLY BLOCK B            3    6.32    0.12   2.146
  6  WRITE BLOCK A            4    1.19    0.03   0.303
 99  TOTAL REDUCTION          4  339.86    2.27  85.533

option 3  asm_ddotprod
  CPU TIMES FOR EQUATION REDUCTION
     Description            Num     Cpu     I/O     Avg
  1  READ BLOCK A             4    0.76    0.01   0.193
  3  MOVE BLOCK B             3    0.42    0.01   0.142
  4  REDUCE BLOCK A           4  204.36    0.89  51.314
  5  APPLY BLOCK B            3    3.82    0.01   1.278
  6  WRITE BLOCK A            4    1.05    0.00   0.260
 99  TOTAL REDUCTION          4  210.41    0.92  52.833

option 4  fast_asm_ddotprod
  CPU TIMES FOR EQUATION REDUCTION
     Description            Num     Cpu     I/O     Avg
  1  READ BLOCK A             4    0.76    0.01   0.194
  3  MOVE BLOCK B             3    0.42    0.00   0.140
  4  REDUCE BLOCK A           4  188.43    0.50  47.233
  5  APPLY BLOCK B            3    3.56   -0.01   1.183
  6  WRITE BLOCK A            4    0.98    0.00   0.247
 99  TOTAL REDUCTION          4  194.17    0.49  48.667

where:

  Num is the number of blocks used
  Cpu is found from GetProcessTimes
  I/O is the elapsed-time difference between QueryPerform and Cpu

These results demonstrate that your approach delivers a significant performance improvement and that SSE vector instructions can be used effectively in FTN95.

John

ps: the other main loop of the equation solver is in the back substitution, where:

    SUBROUTINE VECSUB (A, B, FAC, N)
!
!   Performs the vector operation  [A] = [A] - [B] * FAC
!
    integer*4 n, i
    real*8    a(*), b(*), fac, c
!
    c = -fac
    do i = n,1,-1
       a(i) = a(i) + b(i) * c
    end do
    return
!
    END
1 Jun 2012 3:11 #10252

I will post some SSE2 code to do the back substitution at the weekend. If anything, it is easier than the dot product.

Don't forget that you probably aren't getting the full benefit of data alignment; you might get some more performance once ALIGN(128) is fixed, but that fix wouldn't be released until towards the end of the year. In the short term, you might experiment with ALIGN(64) in the declarations in your big array module. This should give alignment on 16-byte boundaries half of the time.
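A quick one-off check of what alignment you actually got (using LOC for the address):

   real*8, align(64) :: X(100000)
   print *, 'offset from a 16 byte boundary =', mod(loc(X), 16)   ! 0 means aligned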

Then, longer term, you need to think about how to use all of the cores on your computer, because I still think there's another factor of 2-4 in performance to be had by parallelising the outer loop somehow. 😃

DanRRight has shown one way this might be possible if you're sticking with FTN95. I will have a little play with this method at the weekend to see if I can make it work with your problem.

1 Jun 2012 3:31 #10253

Quoted from JohnCampbell

!  4 Group coding
         nn = mod (n, 4)
         if (nn <  1) c = 0.
         if (nn == 1) c = a(1)*b(1)
         if (nn >  1) c = sum (a(1:nn)*b(1:nn))
!
         do k = nn+1,n,4
            c = c + sum ( a(K:K+3) * b(K:K+3) )
         end do

You can save the expense of the three tests at the start, because sum() works when there are 1 or 0 terms in the array:

!  4 Group coding
         nn = mod (n, 4)
         ! sum intrinsic returns 0.0 when array is empty
         c = sum (a(1:nn)*b(1:nn))
         do k = nn+1,n,4
            c = c + sum ( a(K:K+3) * b(K:K+3) )
         end do

In Fortran 90, empty arrays are always defined, and it is legal to multiply two together to create another empty array. And the sum of the elements in an empty array is zero.
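This is easy to verify:

   real :: a(0), b(0)         ! zero-sized arrays are legal
   print *, size(a*b)         ! prints 0   - the product is another empty array
   print *, sum(a*b)          ! prints 0.0 - the sum over an empty array is zero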

It's academic anyway, since this approach isn't very fast. The unrolling is basically what is done in fast_asm_ddotprod, but there I only unroll 2 times. I tried 4 but ran out of registers.

Please login to reply.