DanRRight
Joined: 10 Mar 2008 Posts: 2816 Location: South Pole, Antarctica
|
Posted: Tue Feb 10, 2015 8:28 am Post subject: Tablet almost beating overclocked desktop CPU |
|
|
The latest Intel processors developed for tablets and large phones have a TDP of 4.5 W and still almost beat 150 W (when overclocked) desktop processors. Here are the results of the same linear algebra tests we wrote about here a year ago.
The first table shows the times for the matrix algebra tests you have seen here before, run on what was almost the fastest processor on the planet a year ago. It is probably still near the top in single-core performance because it is overclocked to 4.5 GHz (the fastest today reach roughly 4.9-5.0 GHz with water cooling).
Code: |
i7 4770k 4.5GHz overclocked, 4 cores/8 threads
matrix size --> 1000 2000 3000 4000
--------------------------------------------
Dense/Block 2.22 30.4 127. 297.
Dense/Block Tr. 0.20 2.06 7.36 17.5
SSE 0.12 1.81 6.70 16.2
LAIPE 0.09 0.75 2.44 5.90 |
And here is the Lenovo Yoga 3 Pro, the thinnest and lightest convertible 13.3" tablet-laptop:
Code: |
i7 5Y70 1.1-2.6 GHz (turbo) 2 cores/4 threads, Lenovo Yoga3 Pro tablet
matrix size --> 1000 2000 3000 4000 5000 6000
--------------------------------------------------------------
Dense/Block 1.9 23.5 112 335. xxxx xxxx
Dense/Block Tr. 0.94 7.5 26.7 65.7 128. xxxx
SSE 0.27 2.9 8.8 20.7 42.9 73.7
LAIPE 0.2 2.1 7.0 22.1 50.1 90.4
|
And look at that, DavidB's SSE method is beating parallel LAIPE! Possibly thermal throttling is the reason the two cores cannot work in parallel for long. CPU-Z shows that the multiplier gradually drops from 26 to 20 in the SSE case (i.e. from 2.6 GHz to 2.0 GHz), while in the LAIPE case it drops from 26 to 16.
By the way, Intel planned to add more instructions to its extended sets, such as fused multiply-add. |
davidb
Joined: 17 Jul 2009 Posts: 560 Location: UK
|
Posted: Fri Feb 13, 2015 7:22 pm Post subject: |
|
|
Some interesting results there. I think that fused multiply-add is already implemented in Intel's Haswell processors. Of course, whether it is used depends on the compiler. _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl |
DanRRight
Joined: 10 Mar 2008 Posts: 2816 Location: South Pole, Antarctica
|
Posted: Sat Feb 14, 2015 3:21 am Post subject: |
|
|
David, can you evaluate whether adding FMA to your assembler library would benefit linear algebra? My first impression is that it might further improve the efficiency of the code.
And could Silverfrost comment on whether this will be added to the compiler?
SSE is really a big thing, as you can see from these tests. The LAIPE developer was working on this too, as he informed me a year ago, but I have not yet checked whether it was all done.
John Campbell was also researching AVX. Any progress, John? |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Sat Feb 14, 2015 5:03 am Post subject: |
|
|
Dan,
I thought that fused multiply-add was introduced with SSE2. (DOT_PRODUCT is basically multiply and accumulate.)
We should change the test program's reporting to also report Gflops, as that is a useful comparison across different problem sizes. (We just need to define what counts as a floating-point "op".)
I have recently been trying to use AVX instructions combined with !$OMP. I now have a good parallel skyline solver running on i5 and i7 CPUs.
I have not been very successful in showing a significant improvement of AVX over SSE2.
My latest tests are identifying the significance of cache size and trying to minimise the frequency with which the cache is updated. If the vectors are not in the cache, then AVX does not appear to work well. There are lots of possible reasons for this; I just need to be able to differentiate between the causes and the associations.
It would certainly be good if some of these instructions were available in FTN95, even if only in a restricted syntax: DOT_PRODUCT and a few other basic vector calculation functions, an FTN95 vector library! This could give us the flexibility to improve our FTN95 performance while enjoying the power of FTN95 error checking.
John |
DanRRight
Joined: 10 Mar 2008 Posts: 2816 Location: South Pole, Antarctica
|
Posted: Sun Feb 15, 2015 4:58 am Post subject: |
|
|
Yes, AMD has had it for longer, but on the Intel side only the latest Haswell and Broadwell 22 nm and 14 nm chips have had it, starting in mid-2013. Some server chips were only planned to get FMA last year. On AnandTech, Ian Cutress discussed FMA3/FMA4 last year and wanted to test the acceleration, but I have not seen anything since then.
It would be good to find the original Linpack routine that was used to measure flops in the past, just for comparison. It would be interesting to see how many 8-core Cray-2s our laptops, tablets and cellphones are equivalent to.
Why haven't you tried parallel LAIPE? It had an anomalous speed boost on AMD chips with skyline solvers in particular. You can download the developer's test for multiple compilers; it uses skyline.
Last edited by DanRRight on Mon Feb 16, 2015 5:20 am; edited 2 times in total |
davidb
Joined: 17 Jul 2009 Posts: 560 Location: UK
|
Posted: Sun Feb 15, 2015 7:18 pm Post subject: Re: |
|
|
DanRRight wrote: | David, can you evaluate if it is needed to add FMA to your assembler library to benefit linear algebra? |
Yes, this would be helpful. But it would be difficult for Silverfrost to keep the assembler up to date with such new instructions; it is already quite a bit behind what is possible with current chips. The last time Paul looked, it seemed like quite a bit of work to add new instructions.
I don't even know if the assembler facility will be included in the new compiler (32-bit/64-bit) when it comes out. We will have to wait and see. _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl |
DanRRight
Joined: 10 Mar 2008 Posts: 2816 Location: South Pole, Antarctica
|
Posted: Sat Mar 19, 2016 2:37 am Post subject: |
|
|
A small update.
I installed more DDR3 RAM in the desktop computer and upgraded the RAM from 1600 to 2400 MHz (timings from 9-9-9-27 to 11-13-13-31). That had almost no effect on any programs besides the ones using SSE, which now run faster: on 4000 equations the run previously took 16.2 seconds, now 12.97 s.
And for some reason the new versions of the code affected the transposed matrix solver, which started to run much slower: it was 17.5 s before, now 32.58 s. Changing the RAM did not affect it. The same happened on the Lenovo tablet, where the transposed matrix solver was also much slower with the new software. We don't use this solver anyway; we use the parallel solver LAIPE, and it is insensitive to the RAM.
Code: |
i7 4770k 4.5GHz overclocked, 4 cores/8 threads, 2400MHz SDRAM
matrix size --> 1000 2000 3000 4000 5000 6000
------------------------------------------------------------
Gauss Regular 2.23 30.09 126.11 294.60 xxxxx xxxxxx
Gauss Transp 0.51 4.11 13.78 32.58 63.32 109.29
Gauss SSE 0.11 1.41 5.28 12.97 25.29 43.46
LAIPE 0.09 0.73 2.37 5.85 11.03 19.44
|
Does anyone have 3200 MHz or faster DDR4 RAM on a similar type of processor? |