forums.silverfrost.com

JohnCampbell · Joined: 16 Feb 2006 Posts: 2621 Location: Sydney

This has been a rewarding collaborative effort, as I think we are getting some useful results, thanks to David's coding efforts.
Thanks to Dan's prompting, I have now written 4 different test programs to test Dot_Product performance and the alignment impact. I would be pleased to forward these to anyone who wants them, if they provide me with email addresses, via messages.
The simplest is the MATMUL test, which I have now run using both Salford Ver 6.10 and Intel Ver 11.1 on my old Xeon processor. The performance times for 5/6 options I have tested are

JohnCampbell · Joined: 16 Feb 2006 Posts: 2621 Location: Sydney

Building on the example of Davidb, would it be possible to add some support for SSE2, SSE4 or AVX instructions in FTN95 via a math library ?
By placing this in a library, this could limit the changes required to FTN95 to making the instructions available in a CODE / EDOC block.

I have identified two main vector calculations in past posts which I would like supported and it would be good to have these available as they could significantly improve run times when used.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2621 Location: Sydney

Paul,

I also note there is a compiler option /SSE2 listed from FTN95 /?.
This is not listed in FTN95.CHM.
Is this a new or obsolete option ?

John

PaulLaidler · Posted: Fri Oct 11, 2013 8:03 am Post subject:

It is old and does not appear to do anything so I have now removed it from the list.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2621 Location: Sydney

Paul,

I sent an email with an example of part of a real*8 math library, which uses the SSE routines that Davidb provided. I am not sure if they are SSE2 or SSE4 instructions.
I would like to see this library developed and either provided with FTN95 or in the coding examples.
This would provide us users of FTN95 with some access to faster computing, by way of vector instructions for a few basic types of array calculations, using CODE EDOC structures.
If it is possible to document these routines, this could provide a vehicle for enlarging the set.
My understanding of the change required to FTN95 is that more of the SSE4, AVX and possibly FMA3 instructions need to be supported by CODE/EDOC. Is this the case ?

I am at present using the 2 routines that David provided for
[a] . [b] (dot_product) and
[a] = [a] + const * [b]
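
For readers who have not seen the earlier posts, here is a minimal sketch of what these two kernels compute, written as plain Fortran 95 loops with no SSE. The names vec_dot and vec_axpy are hypothetical and chosen only for illustration; they are not the routines David provided, which use CODE/EDOC.

```fortran
! Plain Fortran 95 reference versions of the two kernels (no SSE).
! vec_dot and vec_axpy are hypothetical names used only for this sketch.
module vec_kernels
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Returns the sum of a(i)*b(i) for i = 1..n
  function vec_dot(a, b, n) result(s)
    integer, intent(in) :: n
    real(dp), intent(in) :: a(n), b(n)
    real(dp) :: s
    integer :: i
    s = 0.0_dp
    do i = 1, n
      s = s + a(i)*b(i)
    end do
  end function vec_dot

  ! Updates a(i) = a(i) + const*b(i) for i = 1..n
  subroutine vec_axpy(a, b, const, n)
    integer, intent(in) :: n
    real(dp), intent(inout) :: a(n)
    real(dp), intent(in) :: b(n), const
    integer :: i
    do i = 1, n
      a(i) = a(i) + const*b(i)
    end do
  end subroutine vec_axpy
end module vec_kernels

program demo_kernels
  use vec_kernels
  implicit none
  real(dp) :: a(4), b(4)
  a = (/ 1.0_dp, 2.0_dp, 3.0_dp, 4.0_dp /)
  b = (/ 4.0_dp, 3.0_dp, 2.0_dp, 1.0_dp /)
  print *, 'dot  =', vec_dot(a, b, 4)   ! 4 + 6 + 6 + 4 = 20
  call vec_axpy(a, b, 2.0_dp, 4)
  print *, 'axpy =', a                  ! 9, 8, 7, 6
end program demo_kernels
```

Both loops walk memory sequentially, which is what makes them natural candidates for vector instructions.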

There are two main areas of computing performance change that are leaving FTN95 behind: multiple processors and larger vector registers.
My experience from testing both vector instructions and OpenMP is that vector instructions can be a much easier way of achieving 200% to 300% run time improvement, by targeting key inner loops of computation.

I'd be interested in your comments and those from others who might find this of use.

John

PaulLaidler · Posted: Wed Oct 16, 2013 7:55 am Post subject:

John

Thanks for the enquiry and information.
I have logged it for investigation.

Paul

DanRRight · Posted: Wed Oct 16, 2013 9:30 pm Post subject:

John,
Please post the ***simplest possible*** benchmarks here, demonstrating the effect on several common tasks. Dot product or matrix inversion are OK, but it would be great to have at least a few applications such as LU factorization, Gauss or band matrix solvers for AX=B equations. In the bigger picture, once SSE support is added, and with multithreading now working (buy the latest 16-core Intel server chips, admittedly very expensive, and you may see tremendous speedups), this compiler will mostly be missing only 64-bit support.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2621 Location: Sydney

Dan,

I have written a sample program and will try to save it to dropbox.com, so that it can be downloaded.

I have run a number of tests of matrix multiplication, being:
1) using FTN95 MATMUL for C(l,n) = A(l,m) x B(m,n)
2) using vector addition using Vec_Add_SSE written by Davidb
3) using dot_product using Vec_Sum_SSE, by providing A'
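
The three formulations can be sketched in standard Fortran as below. This is only an illustration of the loop structure, assuming small made-up dimensions; the array-section statements mark where a vector routine such as Vec_Add_SSE or Vec_Sum_SSE would be called, and the intrinsic dot_product stands in for the SSE version.

```fortran
! Three ways to form C(l,n) = A(l,m) x B(m,n); sizes are arbitrary.
program matmul_variants
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: l = 3, m = 4, n = 2
  real(dp) :: a(l,m), b(m,n), at(m,l)
  real(dp) :: c1(l,n), c2(l,n), c3(l,n)
  integer :: i, j, k

  call random_number(a)
  call random_number(b)

  ! 1) intrinsic MATMUL
  c1 = matmul(a, b)

  ! 2) vector addition: column j of C accumulates scaled columns of A
  !    (the inner statement has the form [a] = [a] + const * [b])
  c2 = 0.0_dp
  do j = 1, n
    do k = 1, m
      c2(:,j) = c2(:,j) + b(k,j)*a(:,k)
    end do
  end do

  ! 3) dot product using A', so both operands are contiguous columns
  at = transpose(a)
  do j = 1, n
    do i = 1, l
      c3(i,j) = dot_product(at(:,i), b(:,j))
    end do
  end do

  ! Both differences should be at rounding level
  print *, 'max |c2-c1| =', maxval(abs(c2 - c1))
  print *, 'max |c3-c1| =', maxval(abs(c3 - c1))
end program matmul_variants
```

Option 3 needs the transpose because Fortran stores arrays by column, so a row of A is not contiguous in memory while a column of A' is.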

The test results are in seconds

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

It would be good if the alignment issue in FTN95 could be fixed. Then we could at least write some user contributed routines.
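
As a rough illustration of the alignment question, the sketch below prints how far an array element sits from a 16-byte boundary, which is what the aligned SSE load and store instructions require. It assumes a compiler with Fortran 2003 ISO_C_BINDING support (for example gfortran); whether FTN95 accepts it is not something this sketch assumes.

```fortran
! Report the offset of array elements from a 16-byte boundary.
! Assumes Fortran 2003 ISO_C_BINDING is available.
program check_alignment
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), target :: a(8)
  integer(c_intptr_t) :: addr

  a = 0.0_dp

  ! Address of the first element, as an integer
  addr = transfer(c_loc(a(1)), addr)
  print *, 'a(1) offset from 16-byte boundary:', mod(addr, 16_c_intptr_t)

  ! The next real*8 element is 8 bytes further on, so a section that
  ! starts at a(2) cannot have the same alignment as one starting at a(1)
  addr = transfer(c_loc(a(2)), addr)
  print *, 'a(2) offset from 16-byte boundary:', mod(addr, 16_c_intptr_t)
end program check_alignment
```

An offset of 0 means aligned loads are safe for a vector starting at that element; any other value means the routine has to use unaligned loads or handle the leading element separately.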

In the meantime, my preference has been just to use FTN95 for development, testing and debugging, and compile with gfortran (or ifort) when I want efficient release code. This is easier now that Clearwin+ is supported on other compilers (though I don't use it much personally).

This way I can take advantage of OpenMP, SSE, AVX etc.

DanRRight · Posted: Tue Oct 22, 2013 2:35 am Post subject:

John, davidb,

The speedups I see do not tell me much yet, because it is not clear to me whether SSE will actually apply to the bottlenecks of my tasks. We need more tests on general-purpose cases. It would also be good to ask the authors of my parallel algebra library whether they use SSE and, if not, to ask them to try it on the matrix equation solvers they have in there.

In the meantime, do you want an exercise in implementing SSE in a solver for the A*X = B system of linear equations? Here is the simplest Gauss elimination program, modified to solve the usual square dense matrix system of equations as well as block matrix ones. The text explains it all. Play with it first (it is intentionally written a bit verbosely for others who have not yet started using Clearwin+), and then try to modify the short 18-line Gauss solver, adding the SSE case to the last subroutine, which I left empty. Run the comparison. That way we will see the difference on at least some of the most commonly used general-purpose programs.

DanRRight · Posted: Tue Oct 22, 2013 2:37 am Post subject:

JohnCampbell · Joined: 16 Feb 2006 Posts: 2621 Location: Sydney

Dan,

I have not tested the code, but basically the changes I made to your code were:
* change A to A transpose to allow sequential memory access, and
* Include the vector routines in the inner loops.

This should show the benefit of the new instructions.
Note that the 2 routines in the library are required for this solver.
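
To show the idea without the library, here is a minimal sketch of Gauss elimination on A' storage, assuming no pivoting and a made-up diagonally dominant test matrix. It is not the code from this thread: the name gauss_solve_t is hypothetical, and array sections and the intrinsic dot_product stand in for the two vector routines.

```fortran
! Gauss elimination with the matrix stored transposed: at(j,i) = A(i,j),
! so each row of A is a contiguous column of at. No pivoting (assumption).
module gauss_at
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  subroutine gauss_solve_t(at, b, x, n)
    integer, intent(in) :: n
    real(dp), intent(inout) :: at(n,n), b(n)   ! overwritten during reduction
    real(dp), intent(out) :: x(n)
    integer :: i, k
    real(dp) :: factor
    ! Forward reduction: row i = row i - factor * row k
    do k = 1, n-1
      do i = k+1, n
        factor = at(k,i)/at(k,k)
        ! inner loop of the form [a] = [a] + const * [b]
        at(k+1:n,i) = at(k+1:n,i) - factor*at(k+1:n,k)
        b(i) = b(i) - factor*b(k)
      end do
    end do
    ! Back substitution: inner loop is a dot product on a contiguous column
    do i = n, 1, -1
      x(i) = (b(i) - dot_product(at(i+1:n,i), x(i+1:n)))/at(i,i)
    end do
  end subroutine gauss_solve_t
end module gauss_at

program demo_gauss_at
  use gauss_at
  implicit none
  integer, parameter :: n = 4
  real(dp) :: a(n,n), at(n,n), b(n), x(n), xtrue(n)
  integer :: i
  call random_number(a)
  do i = 1, n
    a(i,i) = a(i,i) + real(n, dp)   ! make the matrix diagonally dominant
    xtrue(i) = real(i, dp)
  end do
  b = matmul(a, xtrue)
  at = transpose(a)
  call gauss_solve_t(at, b, x, n)
  print *, 'max error =', maxval(abs(x - xtrue))
end program demo_gauss_at
```

With A stored normally, the row update would step through memory with stride n; the transpose turns both inner loops into sequential accesses.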

John

ps: I will test the update and copy to dropbox link shortly

The change is:

JohnCampbell · Joined: 16 Feb 2006 Posts: 2621 Location: Sydney

Dan,

I tested on my i5 notebook.
For 1,000 equations : gauss 4.103 seconds, SSE 0.312 seconds
For 1,500 equations : gauss 17.113 seconds, SSE 1.076 seconds
For 2,000 equations : gauss 44.819 seconds, SSE 2.574 seconds
For 3,000 equations : gauss 175.938 seconds, SSE 8.564 seconds
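
For convenience, the ratios implied by these four timings can be tabulated with the short sketch below; it only divides the figures quoted above. The ratio grows from roughly 13x at 1,000 equations to roughly 21x at 3,000, and it combines the effect of A' storage with the effect of the SSE routines.

```fortran
! Tabulate gauss/SSE time ratios from the timings quoted in this post.
program speedup_table
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: neq(4) = (/ 1000, 1500, 2000, 3000 /)
  real(dp), parameter :: t_gauss(4) = &
       (/ 4.103_dp, 17.113_dp, 44.819_dp, 175.938_dp /)
  real(dp), parameter :: t_sse(4) = &
       (/ 0.312_dp, 1.076_dp, 2.574_dp, 8.564_dp /)
  integer :: i

  write (*,'(a)') '   neq     gauss       SSE   ratio'
  do i = 1, 4
    write (*,'(i6,2f10.3,f8.1)') neq(i), t_gauss(i), t_sse(i), &
         t_gauss(i)/t_sse(i)
  end do
end program speedup_table
```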

I would suggest that if you applied A' storage to your Gauss routine, then it would run faster, but I would still expect a saving of 2 to 4x with SSE.

I'll post the code tonight. You could add a 3rd option with A' storage.
Also you could test the solution ?

This demonstrates that there is the potential for significant performance improvement for FTN95 if these SSE (I think SSE4) instructions are made available. AVX could provide a further improvement (unless the SSE4 instructions are implemented in the AVX registers on the i5 ??)

John

DanRRight · Posted: Wed Oct 23, 2013 9:14 am Post subject:

DanRRight · Posted: Wed Oct 23, 2013 11:59 am Post subject:

The solution looks fine. You've done a good job, and incredibly fast, John.
Instead of the transposed block matrix case, I included the parallel library case. Please do the transpose case if you have time; my brain is not working now, at 3am. Anyway, SSE is still very good, specifically for smaller matrices where parallelization has too much overhead. I got the following timings