Topic: SSE Instructions in Suggestions

JohnCampbell

Posts: 2526 Sydney

24 Mar 2014 12:40 #13887

I have recently tested SSE/AVX instructions and OpenMP with other compilers on a range of (cheaper) hardware. The results show that for large vector calculations memory access speed is the bottleneck, and that FTN95 with David's SSE routines compares very favourably with those alternatives. Provided as a library, these routines for real*8 dot product and vector addition would be a valuable addition to FTN95's performance, and they could possibly be extended to a few other basic routines of similar structure. Perhaps a /SSE switch could incorporate them. For general use, these SSE instructions can be applied at the inner DO loop. The complexity in their use lies in managing the alignment of variables, which could be eased by reviewing how array alignment is managed.
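As a rough illustration of what an SSE inner loop for a real*8 dot product involves, here is a minimal C sketch using SSE2 intrinsics. It is not David's routine and not FTN95 code; the names `dot_sse2` and `is_aligned16` are hypothetical, and the sketch assumes an x86 target with SSE2. It sidesteps the alignment problem by using unaligned loads, which is one possible approach rather than the only one.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics (x86 only) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p sits on a 16-byte boundary,
   which aligned SSE loads such as _mm_load_pd require. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Hypothetical sketch: dot product of two double (real*8) vectors,
   processing two elements per SSE2 step. */
static double dot_sse2(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);

    __m128d acc = _mm_setzero_pd();
    size_t i = 0;

    /* Unaligned loads accept any address. Swapping in _mm_load_pd
       would only be safe if both x + i and y + i pass is_aligned16. */
    for (; i + 1 < n; i += 2) {
        __m128d a = _mm_loadu_pd(x + i);
        __m128d b = _mm_loadu_pd(y + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(a, b));
    }

    /* Combine the two partial sums held in the SSE register. */
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    double sum = lanes[0] + lanes[1];

    /* Handle the last element when n is odd. */
    for (; i < n; ++i)
        sum += x[i] * y[i];

    return sum;
}
```

Because the two lanes accumulate separately, the result can differ from a plain sequential loop in the last bits for general floating-point data.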

John

LitusSaxonicum

Posts: 2284 Yateley, Hants, UK

24 Mar 2014 2:22 #13888

Don’t I remember that the old DBOS version of FTN77 used to need specific memory alignments for some purposes? My guess is that without CHARACTER, and with all other data types of the same length as in Fortran 66, memory alignment was more or less guaranteed then.

Using SSE3/AVX etc. automatically is an extension to the /P6 option. I doubt that anyone uses that Pentium CPU any more, and it is hard to believe that any hardware in regular use doesn’t support most SSE/AVX options, so perhaps the defaults need changing.

Eddie

JohnCampbell

Posts: 2526 Sydney

24 Mar 2014 11:02 #13890

Eddie,

Through experimentation I have learnt some interesting things about AVX and SSE instructions. I certainly don't understand why the alignment issue exists; they (Intel) should have designed an instruction set that could cope with 8 bytes spanning memory 'segments'.

What I have found is that I can't get AVX instructions to be very effective in comparison with SSE for large memory problems. They only work well if the data is already in the cache, and you would think that memory alignment could have been dealt with better in the cache. Pushing performance via SSE, AVX or OpenMP requires that the variables you are working on are cached, so for large calculations the key processor measure is memory access speed rather than CPU clock rate. (Large calculations are those where the inner loop accesses more memory than the cache can hold, which is about 10 to 20 MB.)

In the tests I performed, Davidb’s SSE routines with FTN95 show significant utilisation of the cache, while FTN95 by itself does not appear to utilise it. It is also interesting that relying on experimentation to claim you know how a computer works is an uncertain approach, as it does not take long to be proven wrong. Without a range of hardware on which to test your hypothesis, it is easy to reach the wrong conclusion. Anyway, memory access speed appears to be the performance limiter at the moment, that is, for compilers that support SSE, AVX or OpenMP instructions.
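The definition of a "large calculation" above can be put in numbers with a short C sketch. The function names are hypothetical, and the 10 MB cache size used in the example is only the lower end of the range mentioned in the post, not a measured value for any particular processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes read by one pass of a dot product over
   two real*8 (double) vectors of length n. */
static size_t dot_working_set_bytes(size_t n)
{
    /* Guard against overflow of the size computation. */
    assert(n <= SIZE_MAX / (2 * sizeof(double)));
    return 2 * n * sizeof(double);
}

/* Hypothetical helper: returns 1 when the working set is larger than
   the cache, so the inner loop has to stream data from main memory. */
static int exceeds_cache(size_t n, size_t cache_bytes)
{
    return dot_working_set_bytes(n) > cache_bytes;
}
```

With 8-byte doubles, two vectors of 1,000,000 elements occupy 16,000,000 bytes, which is more than a 10 MB cache, whereas two vectors of 100,000 elements (1,600,000 bytes) fit comfortably. On this rough model, only the second case could benefit from cached data on repeated passes.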

John