forums.silverfrost.com

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

Here is the fast version of the dot product for double precision arrays.

Features: (1) finds the alignment of the arrays and uses the fast movapd and mulpd instructions when possible, (2) decrements loop counters, (3) unrolls the vectorised loops, (4) exploits out-of-order execution.

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

continued ....

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

David,

I have tested your revised example and confirmed that it performs better. I have now tested 3 of the options you proposed. I am including the test runs for 4 different approaches, all compiled with /opt:
1: Original Vec_Sum function, using a DO loop
2: Modified loop, using your 4 group suggestion
3: asm_ddotprod
4: fast_asm_ddotprod
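
For readers without the earlier posts to hand, here is a minimal sketch of what approaches 1 and 2 might look like in plain Fortran. The module and function names are hypothetical, and the "4 group" loop is assumed to mean four independent partial sums per pass; this is not the code that was timed in this thread.

```fortran
! Hypothetical sketch: names and grouping are assumptions for illustration.
module vec_sum_sketch
   implicit none
   integer, parameter :: dp = kind(1.0d0)
contains

   ! Approach 1: plain DO loop with a single accumulator.
   function vec_sum_loop(a, b, n) result(s)
      integer,  intent(in) :: n
      real(dp), intent(in) :: a(n), b(n)
      real(dp) :: s
      integer :: i
      s = 0.0_dp
      do i = 1, n
         s = s + a(i)*b(i)
      end do
   end function vec_sum_loop

   ! Approach 2 (assumed form): four independent partial sums per pass,
   ! followed by a short clean-up loop for the last mod(n,4) elements.
   function vec_sum_4group(a, b, n) result(s)
      integer,  intent(in) :: n
      real(dp), intent(in) :: a(n), b(n)
      real(dp) :: s, s1, s2, s3, s4
      integer :: i, n4
      s1 = 0.0_dp
      s2 = 0.0_dp
      s3 = 0.0_dp
      s4 = 0.0_dp
      n4 = n - mod(n, 4)
      do i = 1, n4, 4
         s1 = s1 + a(i)  *b(i)
         s2 = s2 + a(i+1)*b(i+1)
         s3 = s3 + a(i+2)*b(i+2)
         s4 = s4 + a(i+3)*b(i+3)
      end do
      s = (s1 + s2) + (s3 + s4)
      do i = n4 + 1, n
         s = s + a(i)*b(i)
      end do
   end function vec_sum_4group

end module vec_sum_sketch
```

Because the grouped version adds the products in a different order, its result can differ from the plain loop in the last few bits.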

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

I will post some SSE2 code to do the back substitution part at the weekend. If anything, it is easier than the dot product.

Don't forget that you probably aren't getting the full benefit from data alignment. You might get some more performance once ALIGN(128) is fixed, but that fix wouldn't be released until towards the end of the year. In the short term, you might experiment with ALIGN(64) in the declaration in your big array module. This should give alignment on 16-byte boundaries half of the time.

Then longer term, you need to think about how to use all of the cores on your computer, because I still think there's another factor of 2-4 improved performance to obtain by parallelising the outer loop somehow.

DanRRight has shown one way this might be possible if you're sticking with FTN95. I will have a little play with this method at the weekend to see if I can make it work with your problem.

LitusSaxonicum · Posted: Fri Jun 01, 2012 5:05 pm

Once again, a huge thank you to David for his efforts. This isn't just a working routine, it is a clear demonstration of the speed gains possible with SSE. However, this is a very obvious application. Are there others?

I note that the additional SSE4 registers are only available in 64 bit mode, and with those available, operations such as this could become even faster ...

Eddie

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

Eddie,
I don’t understand why SSE4 registers are not available. Aren’t they in the processor? I would have thought they were available to 32-bit applications running on a 64-bit capable processor, such as my Core i5 and similar.
I should point out that this test example is a problem I have been testing since 2007 (5 years!).
The stiffness matrix is 2.3 GB, and in 2007 it took 25 minutes to reduce it, with a significant proportion of that time (7 minutes) spent on disk I/O. The following are the reduction statistics from a run at that time. (I am not sure of the hardware, but it probably had only 2 GB of memory.)

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

I looked at another application of dot_product and wrote a simple test example, which I compiled with default and /opt compilation modes.
The differences in relative performance between the options on my Core i5 are dramatic.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

That's a 17x improvement using David's new dot_product!

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

David,
You said,

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

John,

Here is a SAXPY double precision algorithm using SSE2. The subroutine overwrites Y with Y + A*X where X and Y are two arrays and A is a scalar.

To do backsubstitution, instead of doing:

CALL VECSUB (A, B, FAC, N)

You should do the following with the minus sign in the FAC argument:

CALL fast_asm_dsaxpy(A, B, N, -FAC)
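
To make the sign convention concrete, here is a minimal reference sketch in plain Fortran. It rests on two assumptions that the thread does not confirm: that VECSUB (A, B, FAC, N) computes A = A - FAC*B, and that the first array argument of the SSE2 routine is the one that gets overwritten. The names ref_daxpy and ref_vecsub are hypothetical stand-ins, not the routines posted here. (In BLAS terms, the double precision form of this update is DAXPY.)

```fortran
! Hypothetical reference versions, for checking the sign convention only.
module axpy_sketch
   implicit none
   integer, parameter :: dp = kind(1.0d0)
contains

   ! Reference update y := y + alpha*x (argument order is an assumption).
   subroutine ref_daxpy(y, x, n, alpha)
      integer,  intent(in)    :: n
      real(dp), intent(inout) :: y(n)
      real(dp), intent(in)    :: x(n)
      real(dp), intent(in)    :: alpha
      integer :: i
      do i = 1, n
         y(i) = y(i) + alpha*x(i)
      end do
   end subroutine ref_daxpy

   ! Assumed behaviour of VECSUB: a := a - fac*b.
   subroutine ref_vecsub(a, b, fac, n)
      integer,  intent(in)    :: n
      real(dp), intent(inout) :: a(n)
      real(dp), intent(in)    :: b(n)
      real(dp), intent(in)    :: fac
      integer :: i
      do i = 1, n
         a(i) = a(i) - fac*b(i)
      end do
   end subroutine ref_vecsub

end module axpy_sketch
```

Under those assumptions, CALL ref_vecsub(A, B, FAC, N) and CALL ref_daxpy(A, B, N, -FAC) leave the same values in A, which is why the scale factor is passed with the minus sign.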

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

continued ...

DanRRight · Posted: Sat Jun 02, 2012 9:41 pm

Glad to see users and developers looking closely at the options for speeding up and optimizing FTN95. In the latest versions I have started to see a substantial speedup, of the order of 1.5, when I use /opt. Before, /opt just crashed my code. But there are still a lot of amazing possibilities, as David has just shown.

BTW, I have a newer parallel library for linear algebra tasks, created with IVF, which unfortunately exists only in LIB form. I have only the older CVF, which does not make DLLs. With FTN95 it works for only some routines, such as those for dense matrices, but not for the block matrix routines. (I am not sure why: the author, who tried to make a native LIB for FTN95, failed for reasons only he and the FTN95 developers know, and the library complains about missing system routines.) It is 1.6 times faster than the older Microsoft library. Is anyone here interested in converting it to a DLL for use with any compiler? PM me. Intel explains how to do that:

>Can IVF convert existing lib files into dll ?

Yes, you can, with some limitations. If the library consists of routines only,
and there is no shared data such as COMMON blocks that are expected to be used
by the "end user", then you can do this.

Making a DLL is the best approach I can suggest. In general, objects/libraries from one Fortran compiler are not linkable with another Fortran compiler. The run-time libraries are different and sometimes the calling ABI is different. The issue you are seeing is that the IVF-compiled code has calls to the Intel run-time libraries which are not present with the other compiler. You could add the Intel libraries to the link, but that may create conflicts.

Wrapping your code in a DLL isolates the run-time library issues, though you need to make sure that you don't try to do things such as I/O or ALLOCATE/DEALLOCATE across compilers. There should not be a performance impact of using the DLL approach.

You will need to create a ".DEF"
file which lists all the routines to be exported from the DLL. It is a text
file that has a line for each routine to be exported like this:

EXPORT routine1
EXPORT routine2
EXPORT routine3

You may have to experiment with the names - the case (upper/lower/mixed) must
match and there may be prefixes or suffixes that you have to consider.

Name this file with a .DEF (or .def) file type. This is an input to the
linker, so you can, from the command line, say:

link /dll mylib.lib mydef.def

This will require that the run-time libraries of the other compiler be
available and you may have to add them to the link command.

--
Steve Lionel
Developer Products Division
Intel Corporation
Nashua, NH

On the other hand, I am interested to know whether anyone has used BLAS for parallel linear algebra. I have had no time to look at that. Specifically, I need only one subroutine, which solves block matrix equations (with the blocks residing near the diagonal, like Substitute_VAG_S or Substitute_VAG_D in my library).

LitusSaxonicum · Posted: Sun Jun 03, 2012 6:15 pm

For John,

I was only parroting the entry in Wikipedia:

"SSE originally added eight new 128-bit registers known as XMM0 through XMM7. The AMD64 extensions from AMD (originally called x86-64) added a further eight registers XMM8 through XMM15, and this extension is duplicated in the Intel 64 architecture. There is also a new 32-bit control/status register, MXCSR. The registers XMM8 through XMM15 are accessible only in 64-bit operating mode."

If true, then a 64-bit OS opens up the opportunity to do more operations per cycle.

For David,

My question, "are there any more useful applications of this?" or words to that effect, was far from complete, and has the simple answer "Yes". One might divide the answers into two:

Applications where a high level programmer might readily detect that there was an opportunity to simply call a more efficient subroutine like David's dot product, and

Applications where it is by no means obvious as above, but after a certain amount of syntactical analysis in the compiler, it becomes evident that there is a faster approach to generating the executable code.

In the former case, the solution is a library (which could easily be supported in whole or part by the user community or by 3rd party software suppliers); in the latter case, it is part of the optimisation done by the compiler, and the user community has no role. Just reading the Polyhedron benchmarks makes it seem likely that some of the other compilers do precisely this.

A third approach might be to make it clear which language constructs in Fortran and FTN95 compile to faster code than others. For example suppose A has dimensions 1000x1000, then is