forums.silverfrost.com

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

The code for single and double precision dot products using SSE/SSE2.

This new version allows arrays of any size.

I don't know how to align data on 16-byte boundaries with FTN95. If I can find out how to do this, the code can be made faster by changing movups to movaps in the single precision version, and movupd to movapd in the double precision version.

Edit: I found out that the alignment is supposed to be specified using an ALIGN(128) attribute, but it doesn't seem to work. I have logged this on the support page. Until this is fixed, the code below will have to make do with the slower movups and movupd instructions.
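For readers without FTN95 to hand, the same idea can be sketched with C intrinsics. This is only an illustration of the approach described above, not the CODE/EDOC listing: the function name sdot_sse is made up, and the unaligned load _mm_loadu_ps stands in for movups. A scalar loop picks up whatever is left when the length is not a multiple of 4, and the whole thing falls back to that scalar loop if SSE is not available.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical name; a sketch of a single precision SSE dot product. */
float sdot_sse(const float *x, const float *y, size_t n)
{
    size_t i = 0;
    float sum = 0.0f;
#if defined(__SSE__)
    float lanes[4];
    __m128 acc = _mm_setzero_ps();

    /* Four products per pass, accumulated in four separate lanes. */
    for (; i + 4 <= n; i += 4) {
        __m128 a = _mm_loadu_ps(x + i);          /* unaligned load (movups) */
        __m128 b = _mm_loadu_ps(y + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(a, b)); /* mulps then addps */
    }

    /* Add the four lanes together through memory, so no shuffle is needed. */
    _mm_storeu_ps(lanes, acc);
    sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif
    /* Scalar loop for the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

If the arrays were known to start on 16-byte boundaries, _mm_loadu_ps could be swapped for _mm_load_ps, which corresponds to the movups to movaps change mentioned above.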

Also not all of the opcodes are implemented, like shuffling with shufps (grrrr).

Maybe Paul could comment on why some opcodes don't work? And is it possible to define my own opcodes (since they are just data really)?

LitusSaxonicum · Posted: Sun May 20, 2012 10:57 am Post subject:

David,

It's really generous of you to create these routines and put them on the forum.

In the last bit of code (around statement 30), might it be better to simply fill the unused space with zeroes, so that you still use the SIMD instruction? Then, if there was only one pair left to multiply, you would be doing

X*Y + 0*0 + 0*0 + 0*0

... although I can't imagine the difference (even if my suggestion isn't worse!) would be measurable. On average across multiple invocations you would be doing 1.5 pairs of zero multiplies (3, 2, 1 or 0), not always the worst case of 3, and you can't always assume that the last loop is the best case of 1 - on average it must also be 1.5 (0, 1, 2 or 3) [I think!]
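The zero-fill idea can be sketched in C intrinsics as below. This is an assumption-laden illustration, not anyone's posted routine: the name sdot_zero_padded is made up, and the leftover values are packed by copying them through a small scratch array in memory, which may or may not be practical in CODE/EDOC.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical name; pads the last partial group with zeroes. */
float sdot_zero_padded(const float *x, const float *y, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;
#if defined(__SSE__)
    float xt[4] = {0.0f, 0.0f, 0.0f, 0.0f};   /* scratch, pre-filled with zeroes */
    float yt[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    float lanes[4];
    size_t k;
    __m128 acc = _mm_setzero_ps();

    /* Full groups of four. */
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(x + i),
                                         _mm_loadu_ps(y + i)));

    /* Copy the 0 to 3 leftover pairs; unused lanes stay 0*0. */
    for (k = 0; i < n; ++i, ++k) {
        xt[k] = x[i];
        yt[k] = y[i];
    }
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(xt), _mm_loadu_ps(yt)));

    _mm_storeu_ps(lanes, acc);
    sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#else
    /* Plain loop when SSE is not available. */
    for (; i < n; ++i)
        sum += x[i] * y[i];
#endif
    return sum;
}
```

The extra copy costs about as much as the scalar loop it replaces, so this is a matter of taste rather than speed.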

As well as the dot product with two long vectors, another case arises in transforming from one coordinate system to another. If this is in 3D, then it involves transformation matrices with a 3x3 size, and in order to exploit the SIMD SSE instructions, presumably the 4th pair must always be 0. In that case, the routine would be very short, but invoked a huge number of times. Do you think the overhead of the function call might outweigh the saving from using the SIMD instruction (even more so for 2D transformations)? Or is this just a bad guess by me?

Eddie

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

Eddie,

I can't pack the 1, 2 or 3 remaining values in a register without using the shufps (single precision) or shufpd (double precision) instructions to move the values to the correct bits in the xmm(n) registers. Unfortunately, these operations aren't supported in FTN95. So I just did the remaining values using a small non-SIMD loop. For double precision code, the "loop" count is either 0 or 1 and the final compare/jump back is redundant (I left it in so the code pattern is the same as the single precision version).
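In C intrinsics, the double precision pattern looks roughly like the sketch below (the name ddot_sse2 is made up, and this is not the FTN95 listing). Two doubles fit in one xmm register, so on the SSE2 path the trailing loop runs either zero times or once.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical name; a sketch of a double precision SSE2 dot product. */
double ddot_sse2(const double *x, const double *y, size_t n)
{
    size_t i = 0;
    double sum = 0.0;
#if defined(__SSE2__)
    double lanes[2];
    __m128d acc = _mm_setzero_pd();

    /* Two products per pass. */
    for (; i + 2 <= n; i += 2) {
        __m128d a = _mm_loadu_pd(x + i);          /* unaligned load (movupd) */
        __m128d b = _mm_loadu_pd(y + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(a, b));  /* mulpd then addpd */
    }

    /* Sum the two lanes through memory rather than with a shuffle. */
    _mm_storeu_pd(lanes, acc);
    sum = lanes[0] + lanes[1];
#endif
    /* Written as a loop to match the single precision pattern;
       after the SSE2 loop it executes at most once. */
    for (; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```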

The non-SIMD loop won't really make much difference when the largest part of the loop is done using the SIMD instructions.

At the moment, these are just fully-functional demonstration routines to prove that SIMD is possible in FTN95. I should really be decreasing my counters (not increasing them) and should be vectorising more than 2 doubles at a time and interleaving some of the instructions to get some benefit from pipelining.

All these optimizations have to wait though because I am waiting for Paul to confirm there is an issue with 128 bit alignment (see the support page of the forum). As they are, these routines provide a modest improvement in run speed compared to a simple DO loop which compiles to non-SIMD code. Once I can make sure the data is aligned and have done the optimizations, the next versions of the routines should be much faster.

I'm not an assembly language programmer, so please all chip in if you think there's a better way to do it.

I'm not sure about your 3D transformation example. If you write the routine to do lots of transformations, with the correct optimizations and aligned data, then there could be some gains in performance. If all the routine does is one transformation, then the overhead could be too large to make it worthwhile.
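To make the "lots of transformations per call" point concrete, here is a plain C sketch. Everything about the interface is an assumption for illustration: the name transform_points, the row-major 3x3 matrix, and the choice to keep the x, y and z coordinates in separate arrays. No SIMD instructions are written out; the point is only that one call handles all n points, so the call overhead is paid once.

```c
#include <stddef.h>

/* Hypothetical interface: m is a row-major 3x3 matrix; the points are
   held as three separate coordinate arrays of length n. */
void transform_points(const float m[9],
                      const float *x, const float *y, const float *z,
                      float *xo, float *yo, float *zo, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        /* Read the point first so the output may overwrite the input. */
        float px = x[i], py = y[i], pz = z[i];
        xo[i] = m[0] * px + m[1] * py + m[2] * pz;
        yo[i] = m[3] * px + m[4] * py + m[5] * pz;
        zo[i] = m[6] * px + m[7] * py + m[8] * pz;
    }
}
```

With the coordinates stored this way, four points could in principle be processed per SSE instruction without needing a zero in the 4th lane, though that variant is not shown here.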
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl

LitusSaxonicum · Posted: Sun May 20, 2012 8:16 pm Post subject:

Hi David,

Thanks for the explanations. I did think about decrementing, but then forgot. I did a lot of ASM programming, but wasn't much good at it. CODE EDOC is much easier than MASM, as you don't need to write the entry and exit bit of a routine. I agree that the final loop bit can't possibly make much difference, and that the real benefit will be seen when the vector length is big.

Now I understand the shufp* bit - I already understood the alignment bit. FTN77/DBOS, if my memory serves me, always used to make common blocks etc aligned. This has been relaxed probably to save a few bytes here and there!

In the case of the 3D transformation bit, you might be operating on thousands of points. Saving time on each transformation would be useful, but it really needs to be done with inline code, not a subroutine call.

I do hope that Silverfrost see the benefits of enhancing the Assembler to include the additional opcodes and fix the alignment problem.

Recently I have been mulling over big programs that run for hours taking over the whole resources of a massively-configured Windows computer. It seems to me that Windows is the wrong environment somehow! But any cheap increase in processing speed gets my vote.

Eddie

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

I think FTN95 does some alignment. But it needs to align an array so that the address of the first element is exactly divisible by 16 for SSE/SSE2. This is supposed to be possible using:

JohnCampbell · Joined: 16 Feb 2006 Posts: 2615 Location: Sydney

David,

I have been trying to understand the alignment problem.
This is a property of the two arguments. It can be considered in 2 ways.
1) How the arguments are aligned wrt 16-byte or 32-byte, and also
2) The relative alignment of the two arguments.
If the two arguments are both equally non-aligned, then this can be recovered to an aligned situation.
However, I think the general case of the two arguments not having the same alignment means that there are problems with any recovery strategy. At best one argument can be treated as aligned, but the other can not.
In the case of a skyline equation solver, to achieve alignment for each Dot_Product call is difficult. I think that by creating two versions of the active column, then the alignment of both arguments can be achieved. This might not mean that both arguments start on 16-byte alignment, but have the same alignment condition.
A solution may be to have multiple loops in Dot_Product, depending on the argument alignment condition, having a preferred case where both arguments have the same alignment and an alternative where the arguments are not aligned.
I think the best that can be done in the aligned case is to:
a) include an iteration, if required, to achieve alignment
b) use either a slower loop in which only one argument is aligned, or a faster loop in which both arguments are now aligned
c) include an iteration to complete the loop
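Steps (a) to (c) can be sketched in C intrinsics as follows. This is only one reading of the scheme, with a made-up name (ddot_peeled), and it covers the 16-byte SSE2 case only. _mm_load_pd is the aligned load (movapd) and _mm_loadu_pd the unaligned one (movupd).

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical name; peels one element to align x, then picks a loop. */
double ddot_peeled(const double *x, const double *y, size_t n)
{
    size_t i = 0;
    double sum = 0.0;
#if defined(__SSE2__)
    double lanes[2];
    int x_ok, y_ok;
    __m128d acc = _mm_setzero_pd();

    /* (a) x is 8 bytes past a 16-byte boundary: do one element by hand. */
    if (n > 0 && ((uintptr_t)x % 16) == 8) {
        sum = x[0] * y[0];
        i = 1;
    }
    x_ok = ((uintptr_t)(x + i) % 16) == 0;
    y_ok = ((uintptr_t)(y + i) % 16) == 0;

    /* (b) choose the loop from the alignment now reached. */
    if (x_ok && y_ok) {          /* same alignment: aligned loads for both */
        for (; i + 2 <= n; i += 2)
            acc = _mm_add_pd(acc, _mm_mul_pd(_mm_load_pd(x + i),
                                             _mm_load_pd(y + i)));
    } else if (x_ok) {           /* only x aligned: y needs unaligned loads */
        for (; i + 2 <= n; i += 2)
            acc = _mm_add_pd(acc, _mm_mul_pd(_mm_load_pd(x + i),
                                             _mm_loadu_pd(y + i)));
    } else {                     /* nothing known: unaligned loads for both */
        for (; i + 2 <= n; i += 2)
            acc = _mm_add_pd(acc, _mm_mul_pd(_mm_loadu_pd(x + i),
                                             _mm_loadu_pd(y + i)));
    }
    _mm_storeu_pd(lanes, acc);
    sum += lanes[0] + lanes[1];
#endif
    /* (c) finish any element left over. */
    for (; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

When the two arguments differ in alignment, peeling can only ever fix one of them, which is the "at best one argument can be treated as aligned" situation described above.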

There may be issues for 16-byte (SSE) or 32-byte alignment (AVX) but I do not have any definite knowledge of this.

Certainly demonstrating the working of SSE/SSE2 and possibly AVX is a significant achievement. Thanks for doing this.

John

PaulLaidler · Posted: Wed May 23, 2012 3:42 pm Post subject:

1) I think that there may be some possibility for you to manually optimise your code by either using LOC and FCORE4 (or DCORE8) to calculate addresses of array elements or by coding the critical parts in a C function.

2) There may be some scope for us to extend the range of CODE/EDOC instructions that FTN95 can handle if this would help.

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

Hi Paul,

I can optimise the code easily enough to find alignment opportunities. I have already done something on this; I just haven't posted it yet. The issues are:

(1) It would be nice if the ALIGN(128) attribute worked. See my post/code on the support page of the forum. This seems to be a bug.

(2) Support for further SSE/SSE2 instructions in the assembler would be very helpful. Without these it is difficult to optimize things.

Regards
David.

PaulLaidler · Posted: Thu May 24, 2012 7:29 am Post subject:

David

I should be grateful if you would post a short program that illustrates the ALIGN(128) problem.

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

Paul,

I already posted code illustrating the ALIGN(128) problem in the support section of the Forum.

The post is at this link:

ALIGN attribute is not working

Regards
David

JohnCampbell · Joined: 16 Feb 2006 Posts: 2615 Location: Sydney

David,

You made the comment:

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

John,

In the longer term it would be nice if some of the SSE/SSE2 and AVX instructions were generated by the compiler for array processing like the dot_product intrinsic. There would have to be a compiler option to select this optimization. I don't think it would be too difficult to phase some of this into the compiler, starting with dot_product, but it's entirely up to Silverfrost whether they want to do this.

Since FTN95 has the inline assembler, a good starting point would be a set of routines that users could incorporate into their own codes. But we will need Silverfrost to fix the alignment issue and add support for some of the missing SSE/SSE2 instructions (please!). I don't think any AVX instructions are supported at present.

I have a faster version of the code I posted above, which I will post at the weekend. In the double precision version above, 2 multiplies and additions are carried out in parallel in the inner loop. The new version "unrolls" the inner loop so that 4 multiplies and additions are done. Two pairs of these are each done in parallel using SSE2 instructions, and the products and sums of the 2 pairs are interleaved to get additional benefit from out-of-order execution and instruction-level parallelism. This should improve the performance further.
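As a rough picture of what such unrolling looks like, here is a C intrinsics sketch. It is not the version promised above; the name ddot_sse2_unrolled and the exact instruction ordering are assumptions. Two independent accumulators are used so that the second multiply and add do not have to wait for the first pair to finish.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical name; inner loop unrolled to 4 doubles per pass. */
double ddot_sse2_unrolled(const double *x, const double *y, size_t n)
{
    size_t i = 0;
    double sum = 0.0;
#if defined(__SSE2__)
    double lanes[2];
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();

    for (; i + 4 <= n; i += 4) {
        /* Loads for both pairs are issued first ... */
        __m128d a0 = _mm_loadu_pd(x + i);
        __m128d a1 = _mm_loadu_pd(x + i + 2);
        __m128d b0 = _mm_loadu_pd(y + i);
        __m128d b1 = _mm_loadu_pd(y + i + 2);
        /* ... then the two multiplies, then the two adds, interleaved
           so that neither chain depends on the other. */
        a0 = _mm_mul_pd(a0, b0);
        a1 = _mm_mul_pd(a1, b1);
        acc0 = _mm_add_pd(acc0, a0);
        acc1 = _mm_add_pd(acc1, a1);
    }

    /* Combine the two accumulators, then the two lanes. */
    acc0 = _mm_add_pd(acc0, acc1);
    _mm_storeu_pd(lanes, acc0);
    sum = lanes[0] + lanes[1];
#endif
    /* Scalar loop for the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Because the additions happen in a different order from a simple DO loop, the result can differ in the last few bits for general floating point data.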

PaulLaidler · Posted: Wed May 30, 2012 12:19 pm Post subject:

I was hoping that adding more assembler instructions would be a simple matter but apparently this is not the case.

Similarly the alignment problem is easy to locate but there is a base address somewhere that also needs fixing. Still looking for this.

I will let you know when I have more information on both these issues.