forums.silverfrost.com Forum Index forums.silverfrost.com
Welcome to the Silverfrost forums
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Fortran 2003/2008
Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8, 9, 10  Next
 
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> General
View previous topic :: View next topic  
Author Message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Fri May 18, 2012 8:49 am    Post subject: Reply with quote

The code for single and double precision dot products using SSE/SSE2.

This new version allows arrays of any size.

I don't know how to align data on 16 byte boundary's with FTN95. If I can find out how to do this, the code can be made faster by changing movups to movaps in the single precision version, and movupd to movapd in the double precision version.

Edit. I found out that the correct alignment can be specified using an align(128) attribute. But it doesn't seem to work. I have logged this on the support page. Until this is fixed, the codes below will have to make do with using the slower movups, movupd codes.

Also not all of the opcodes are implemented, like shuffling with shufps (grrrr).

Maybe Paul could comment on why some op codes don't work? And is it possible to define my own op codes (since they are just data really)?

Code:

!** SINGLE PRECISION VERSION
function asm_dotprod(v1,v2,n)
   integer n
   real v1(*), v2(*)
   real asm_dotprod

   integer nn, nnb, m
   real :: v(4)
     
   ! Number of elements in sets of 4
   nn = 4*(n/4)

   ! Number of bytes in sets of 4
   nnb = 4*nn

   ! Number of bytes
   m = 4*n

   ! Assembly code is between code, edoc lines
   code
      xorps xmm0%, xmm0%         ; set xmm0 to zero

      mov eax%, =v1              ; address of v1 argument
      mov ecx%, =v2              ; address of v2 argument     

      mov edx%, 0                ; Initialise loop counter
      mov edi%, nnb              ; counter when loop is exited
   
      cmp edi%, 0
      jz $20                     ; conditional skip loop, no terms

      ! Loop to accumulate dot product of elements 1 to nn
   10 movups xmm1%, [eax%+edx%]  ; move next 4 reals in v1 into xmm1
      movups xmm2%, [ecx%+edx%]  ; move next 4 reals in v2 into xmm2
      mulps xmm1%, xmm2%         ; multiply values and store in xmm1
      addps xmm0%, xmm1%         ; accumulate sum in xmm0
      add edx%, 16               ; increment counter
      cmp edx%, edi%
      jl $10                     ; conditional branch back

   20 mov edi%, m
      cmp edx%, edi%
      jge $40                    ; skip loop if n = nn

      ! Loop to accumulate dot product of last nn+1 to n elements in low 32 bits of xmm0
   30 movss xmm1%, [eax%+edx%]
      mulss xmm1%, [ecx%+edx%]
      addss xmm0%, xmm1%
      add edx%, 4
      cmp edx%, edi%
      jl $30

      ! Can't get final reduction to work, so will do this in Fortran for now
  40  movups v, xmm0%            ; move xmm0 to v array
   edoc
   
   ! Final reduction, the result is the sum of the four values in v
   asm_dotprod = sum(v)

end function asm_dotprod

_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl


Last edited by davidb on Sat May 19, 2012 9:14 am; edited 9 times in total
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Fri May 18, 2012 8:54 am    Post subject: Reply with quote

Code:

!** DOUBLE PRECISION VERSION
function asm_ddotprod(v1,v2,n)
   integer n
   real*8 v1(*), v2(*)
   real*8 asm_ddotprod

   integer nn, nnb, m
   real*8 :: v(2)
     
   ! Number of elements in sets of 2
   nn = 2*(n/2)

   ! Number of bytes in sets of 2
   nnb = 8*nn

   ! Number of bytes
   m = 8*n

   ! Assembly code is between code, edoc lines
   code
      xorps xmm0%, xmm0%         ; set xmm0 to zero

      mov eax%, =v1              ; address of v1 argument
      mov ecx%, =v2              ; address of v2 argument     

      mov edx%, 0                ; Initialise loop counter
      mov edi%, nnb              ; counter when loop is exited
   
      cmp edi%, 0
      jz $20                     ; conditional skip loop, no terms

      ! Loop to accumulate dot product of elements 1 to nn
   10 movupd xmm1%, [eax%+edx%]  ; move next 2 doubles in v1 into xmm1
      movupd xmm2%, [ecx%+edx%]  ; move next 2 doubles in v2 into xmm2
      mulpd xmm1%, xmm2%         ; multiply values and store in xmm1
      addpd xmm0%, xmm1%         ; accumulate sum in xmm0
      add edx%, 16               ; increment counter
      cmp edx%, edi%
      jl $10                     ; conditional branch back

   20 mov edi%, m
      cmp edx%, edi%
      jge $40                    ; skip loop if n = nn

      ! Loop to accumulate dot product of last nn+1 to n elements in low 64 bits of xmm0
   30 movsd xmm1%, [eax%+edx%]
      mulsd xmm1%, [ecx%+edx%]
      addsd xmm0%, xmm1%
      add edx%, 8
      cmp edx%, edi%
      jl $30

      ! Can't get final reduction to work, so will do this in Fortran for now
  40  movupd v, xmm0%            ; move xmm0 to v array
   edoc
   
   ! Final reduction, the result is the sum of the two values in v
   asm_ddotprod = sum(v)

end function asm_ddotprod

_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl


Last edited by davidb on Sat May 19, 2012 8:59 am; edited 4 times in total
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Fri May 18, 2012 9:04 pm    Post subject: Re: Reply with quote

JohnCampbell wrote:

I am not capable of doing this [SSE] or know the limitations as to how to test if the approach is supported by the processor ?


FTN95 has functions HAS_MMX@ and HAS_SSE@ which can be used to check support for MMX and SSE.

To check if AVX is supported there is a function @CPUID which can be used; it returns whether AVX is supported in bit 28 of the ECX register.

So you can write code which selects the most appropriate strategy depending on the support in the CPU.
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
Back to top
View user's profile Send private message
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2055
Location: Yateley, Hants, UK

PostPosted: Sun May 20, 2012 10:57 am    Post subject: Reply with quote

David,

It's really generous of you to create these routines and put them on the forum.

In the last bit of code (around statement 30), might it be better to simply fill the unused space with zeroes, so that you still use the SIMD instruction, so that (if there was only one pair to multiply) you were doing

X*Y + 0*0 + 0*0 + 0*0

... although I can't imagine the difference (even if my suggestion isn't worse!) would be measurable. On average across multiple invocations you would be doing 1.5 pairs of zero multiplies (3, 2, 1 or 0), not always the worst case of 3, and you can't always assume that the last loop is the best case of 1 - on average it must also be 1.5 (0, 1, 2 or 3) [I think!]

As well as the dot product with two long vectors, another case arises in transforming from one coordinate system to another. If this is in 3D, then it involves transformation matrices with a 3x3 size, and in order to exploit the SIMD SSE instructions, presumably the 4th pair must always be 0. In that case, the routine would be very short, but invoked a huge number of times. Do you think the overhead in the function call might not be worse than the saving by using the SIMD instruction (and worse in 2D transformations)? Or is this just a bad guess by me?

Eddie
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Sun May 20, 2012 4:37 pm    Post subject: Reply with quote

Eddie,

I can't pack the 1, 2 or 3 remaining values in a register without using the shufps (single precision) or shufpd (double precision) instructions to move the values to the correct bits in the xmm(n) registers. Unfortunately, these operations aren't supported in FTN95. So I just did the remaining values using a small non-SIMD loop. For double precision code, the "loop" count is either 0 or 1 and the final compare/jump back is redundant (I left it in so the code pattern is the same as the single precision version).

The non-SIMD loop won't really make much difference when the largest part of the loop is done using the SIMD instructions.

At the moment, these are just fully-functional demonstration routines to prove that SIMD is possible in FTN95. I should really be decreasing my counters (not increasing them) and should be vectorising more than 2 doubles at a time and interleaving some of the instructions to get some benefit from pipelining.

All these optimizations have to wait though because I am waiting for Paul to confirm there is an issue with 128 bit alignment (see the support page of the forum). As they are, these routines provide a modest improvement in run speed compared to a simple DO loop which compiles to non-SIMD code. Once I can make sure the data is aligned and have done the optimizations, the next versions of the routines should be much faster.

I'm not an assembly language programmer, so please all chip in if you think there's a better way to do it.

I'm not sure about your 3D transformation example. If you write the routine to do lots of transformations, with the correct optimizations and aligned data, then there could be some gains in performance. If all the routine does is one transformation, then the overhead could be too large to make it worthwhile.
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl


Last edited by davidb on Sun May 20, 2012 10:50 pm; edited 1 time in total
Back to top
View user's profile Send private message
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2055
Location: Yateley, Hants, UK

PostPosted: Sun May 20, 2012 8:16 pm    Post subject: Reply with quote

Hi David,

Thanks for the explanations. I did think about decrementing, but then forgot. I did a lot of ASM programming, but wasn't much good at it. CODE EDOC is much easier than MASM, as you don't need to write the entry and exit bit of a routine. I agree that the final loop bit can't possibly make much difference, and that the real benefit will be seen when the vector length is big.

Now I understand the shufp* bit - I already understood the alignment bit. FTN77/DBOS, if my memory serves me, always used to make common blocks etc aligned. This has been relaxed probably to save a few bytes here and there!

In the case of the 3D transformation bit, you might be operating on thousands of points. Saving time on each transformation would be useful, but it really needs to be done with inline code, not a subroutine call.

I do hope that Silverfrost see the benefits of enhancing the Assembler to include the additional opcodes and fix the alignment problem.

Recently I have been mulling over big programs that run for hours taking over the whole resources of a massively-configured Windows computer. It seems to me that Windows is the wrong environment somehow! But any cheap increase in processing speed gets my vote.

Eddie
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Sun May 20, 2012 10:46 pm    Post subject: Reply with quote

I think FTN95 does some alignment. But it needs to align an array so that the address of the first element is exactly divisible by 16 for SSE/SSE2. This is supposed to be possible using:

Code:

real, align(128) :: X(100)      ! address of X(1) is a multiple of 16
real*8, align(128) :: Y(100) ! address of Y(1) is a multiple of 16


It is this that doesn't seem to work.

But alignment like the above only helps, it isn't the full solution.

The assembly code will need to check for alignment since it won't always be called when the first element of the dummy arrays matching the first (aligned) element of the actual array arguments. Also cases like the following where the arrays are skewed can't be aligned automatically and the programmer has to put in "pad" variables.

Code:

real, align(128) :: X(100)
real, align(128) :: Y(101)
s = asm_dotprod(X(1), Y(2), 100)


This is almost certainly the case with John's code where the arrays are "not easily predictable" sections from a bigger matrix and could start anywhere in memory.

The assembly code needs to detect these cases and act accordingly, using instructions which assume alignment only when appropriate, and ones which don't at other times. I have worked out how to do this.

With AVX alignment needs to be to 32 bytes (256 bits) I think and this is a problem since align(256) isn't allowed in FTN95 (well not yet . Wink )
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2145
Location: Sydney

PostPosted: Mon May 21, 2012 8:58 am    Post subject: Reply with quote

David,

I have been trying to understand the alignment problem.
This is a property of the two arguments. It can be considered in 2 ways.
1) How the arguments are aligned wrt 16-byte or 32-byte, and also
2) The relative alignment of the two arguments.
If the two arguments are both equally non-aligned, then this can be recovered to an aligned situation.
However, I think the general case of the two arguments not having the same alignment means that there are problems with any recovery strategy. At best one argument can be treated as aligned, but the other can not.
In the case of a skyline equation solver, to achieve alignment for each Dot_Product call is difficult. I think that by creating two versions of the active column, then the alignment of both arguments can be achieved. This might not mean that both arguments start on 16-byte alignment, but have the same alignment condition.
A solution may be to have multiple loops in Dot_Product, depending on the argument alignment condition, having a preferred case where both arguments have the same alignment and an alternative where the arguments are not aligned.
I think the best that can be done in the aligned cased is to :
a) include an itteration if required to achieve alignment
b) use a bad loop for one argument as aligned or a good loop for both arguments now aligned.
c) include an itteration to complete the loop

There may be issues for 16-byte (SSE) or 32-byte alignment (AVX) but I do not have any definate knowledge of this.

Certainly demonstrating the working of SSE/SSE2 and possibly AVX is a significant achievment. Thanks for doing this.

John
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 6092
Location: Salford, UK

PostPosted: Wed May 23, 2012 3:42 pm    Post subject: Reply with quote

1) I think that there may be some possibility for you to manually optimise your code by either using LOC and FCORE4 (or DCORE8) to calculate addresses of array elements or by coding the critical parts in a C function.

2) There may be some scope for us to extend the range of CODE/EDOC instructions that FTN95 can handle if this would help.
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Wed May 23, 2012 6:37 pm    Post subject: Reply with quote

Hi Paul,

I can optimise the code easy enough to find alignment opportunities. I have already done something on this; I just haven't posted it yet. The issues are:

(1) It would be nice if the ALIGN(128) attribute worked. See my post/code on the support page of the forum. This seems to be a bug.

(2) Support for further SSE/SSE2 instructions in the assembler would be very helpful. Without these it is difficult to optimize things.

Regards
David.
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 6092
Location: Salford, UK

PostPosted: Thu May 24, 2012 7:29 am    Post subject: Reply with quote

David

I should be grateful if you would post a short program that illustrates the ALIGN(128) problem.
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Thu May 24, 2012 5:43 pm    Post subject: Reply with quote

Paul,

I already posted code illustrating the ALIGN(128) problem in the support section of the Forum.

The post is at this link:

ALIGN attribute is not working

Regards
David
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2145
Location: Sydney

PostPosted: Wed May 30, 2012 3:08 am    Post subject: Reply with quote

David,

You made the comment:
Quote:
At the moment, these are just fully-functional demonstration routines to prove that SIMD is possible in FTN95.

What do you think is required to get a version of Dot_Product, that incorporates these instructions, working in FTN95 ?

What you have done is demonstrate that access to the SSE instructions can be available in FTN95, which I think is a significant advance.

John

PS: I have taken the real*8 code you have listed in the forum and installed it in my FEA program.
For the example problem I have been discussing recently, it reduces the equation reduction time from 301 seconds to 211 seconds.
The error check has changed from 1.2146E-10 to 1.2764E-10, confirming that for this example the routine works perfectly.
This appear to be a significant improvement over the default Dot_Product, when compiled with /OPT
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Wed May 30, 2012 7:51 am    Post subject: Reply with quote

John,

In the longer term it would be nice if some of the SSE/SSE2 and AVX instructions were generated by the compiler for array processing like the dot_product intrinsic. There would have to be a compiler option to select this optimization. I don't think it would be too difficult to phase some of this into the compiler, starting with dot_product, but its entirely up to Silverfrost whether they want to do this.

Since FTN95 has the inline assembler, a good starting point would be a set of routines that users could incorporate into there own codes. But we will need Silverfrost to fix the alignment issue and add support for some of the missing SSE/SSE2 instructions (Please Smile ). I don't think any AVX instructions are supported at present.

I have a faster version of the code I posted above, which I will post at the weekend. In the double precision version above, 2 multiplies and additions are carried out in parallel in the inner loop. The new version "unrolls" the inner loop so that 4 multiples and additions are done. Two pairs of these are each done in parallel using SSE2 instructions, and the products and sums of the 2 pairs are interleaved to get additional benefit from out-of-order execution instruction level parallelisation. This should improve the performance further.
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl


Last edited by davidb on Fri Jun 01, 2012 12:11 am; edited 2 times in total
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 6092
Location: Salford, UK

PostPosted: Wed May 30, 2012 12:19 pm    Post subject: Reply with quote

I was hoping that adding more assembler instructions would be a simple matter but apparently this is not the case.

Similarly the alignment problem is easy to locate but there is a base address somewhere that also needs fixing. Still looking for this.

I will let you know when I have more information on both these issues.
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> General All times are GMT + 1 Hour
Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8, 9, 10  Next
Page 5 of 10

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group