DanRRight
Joined: 10 Mar 2008 Posts: 2816 Location: South Pole, Antarctica
|
Posted: Tue Feb 10, 2015 8:28 am Post subject: Tablet almost beating overclocked desktop CPU |
|
|
The latest Intel processors developed for tablets and large phones have a TDP of 4.5 W and still almost beat 150 W (when overclocked) desktop processors. Here are the results of the same linear algebra tests we wrote about here a year ago.
The first table shows the times for the matrix algebra tests you have seen here before, run on what was almost the fastest processor on the planet a year ago. It is probably still near the top in single-core performance because it is overclocked to 4.5 GHz (the fastest today reach roughly 4.9-5.0 GHz with water cooling).
Code: |
i7 4770k 4.5GHz overclocked, 4 cores/8 threads
matrix size --> 1000 2000 3000 4000
--------------------------------------------
Dense/Block 2.22 30.4 127. 297.
Dense/Block Tr. 0.20 2.06 7.36 17.5
SSE 0.12 1.81 6.70 16.2
LAIPE 0.09 0.75 2.44 5.90 |
And here is the Lenovo Yoga 3 Pro, the thinnest and lightest convertible 13.3" tablet-laptop:
Code: |
i7 5Y70 1.1-2.6 GHz (turbo) 2 cores/4 threads, Lenovo Yoga3 Pro tablet
matrix size --> 1000 2000 3000 4000 5000 6000
--------------------------------------------------------------
Dense/Block 1.9 23.5 112 335. xxxx xxxx
Dense/Block Tr. 0.94 7.5 26.7 65.7 128. xxxx
SSE 0.27 2.9 8.8 20.7 42.9 73.7
LAIPE 0.2 2.1 7.0 22.1 50.1 90.4
|
And look at that, DavidB's SSE method is beating parallel LAIPE! Possibly thermal throttling is the reason the two cores cannot work in parallel for long. CPU-Z shows that the multiplier gradually drops from 26 to 20 in the SSE case (i.e. from 2.6 GHz to 2.0 GHz), while in the LAIPE case it drops from 26 to 16.
By the way, Intel planned to add more instructions to its extended sets, such as fused multiply-add. |
davidb
Joined: 17 Jul 2009 Posts: 560 Location: UK
|
Posted: Fri Feb 13, 2015 7:22 pm Post subject: |
|
|
Some interesting results there. I think that fused multiply-add is already implemented in Intel's Haswell processors. Of course, whether it is used depends on the compiler. _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl |
DanRRight
Joined: 10 Mar 2008 Posts: 2816 Location: South Pole, Antarctica
|
Posted: Sat Feb 14, 2015 3:21 am Post subject: |
|
|
David, can you evaluate whether adding FMA to your assembler library would benefit linear algebra? My first impression is that it might further improve the efficiency of the code.
And could Silverfrost comment on whether this will be added to the compiler?
SSE is really a big thing, as you can see from these tests. The LAIPE developer was working on this too, as he informed me a year ago, but I have not yet checked whether it was all done.
John Campbell was also researching AVX. Any progress, John? |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Sat Feb 14, 2015 5:03 am Post subject: |
|
|
Dan,
I thought that fused multiply-add was introduced with SSE2. (DOT_PRODUCT is basically multiply and accumulate.)
We should change the test program's reporting to also report Gflops, as that is a useful comparison across different problem sizes. (We just need to define what counts as a floating-point "op".)
I have recently been trying to use AVX instructions combined with !$OMP. I now have a good parallel skyline solver running on i5 and i7 CPUs.
I have not been very successful in showing a significant improvement of AVX over SSE2.
My latest tests are identifying the significance of cache size and trying to minimise the frequency with which the cache is updated. If the vectors are not in the cache, then AVX does not appear to work well. There are lots of possible reasons for this; I just need to be able to differentiate between the causes and the associations.
It would certainly be good if some of these instructions were available in FTN95, even if only in a restricted syntax: DOT_PRODUCT and a few other basic vector calculation functions, an FTN95 vector library! This could give us the flexibility to improve our FTN95 performance while enjoying the power of FTN95 error checking.
John |
DanRRight
Joined: 10 Mar 2008 Posts: 2816 Location: South Pole, Antarctica
|
Posted: Sun Feb 15, 2015 4:58 am Post subject: |
|
|
Yes, AMD has had it for longer, but on the Intel side only the latest Haswell and Broadwell 22 nm and 14 nm chips have had it, starting in mid-2013. Some server chips were only planned to get FMA last year. On AnandTech, Ian Cutress discussed FMA3/FMA4 last year and wanted to test the acceleration, but I have not seen anything since then.
It would be good to find the original Linpack routine that was used to measure flops in the past, just for comparison. It would be interesting to see how many 8-core Cray-2s our laptops, tablets and cellphones are equivalent to.
Why haven't you tried parallel LAIPE? It had an anomalous speed boost on AMD chips with skyline solvers in particular. You can download the developer's test for multiple compilers; it uses skyline.
Last edited by DanRRight on Mon Feb 16, 2015 5:20 am; edited 2 times in total |
davidb
Joined: 17 Jul 2009 Posts: 560 Location: UK
|
Posted: Sun Feb 15, 2015 7:18 pm Post subject: Re: |
|
|
DanRRight wrote: | David, can you evaluate if it is needed to add FMA to your assembler library to benefit linear algebra? |
Yes, this would be helpful. But it would be difficult for Silverfrost to keep the assembler up to date with such new instructions; it is already quite a bit behind what is possible with current chips. The last time Paul looked, it seemed like quite a bit of work to add new instructions.
I don't even know if the assembler facility will be included in the new compiler (32-bit/64-bit) when it comes out. We will have to wait and see. _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl |
DanRRight
Joined: 10 Mar 2008 Posts: 2816 Location: South Pole, Antarctica
|
Posted: Sat Mar 19, 2016 2:37 am Post subject: |
|
|
A small update.
I installed more DDR3 RAM in the desktop computer and upgraded the RAM from 1600 to 2400 MHz (timings from 9-9-9-27 to 11-13-13-31). That had almost no effect on any programs besides the ones using SSE, which now run faster: on 4000 equations the run previously took 16.2 seconds, now 12.97 s.
And for some reason the new versions of the code affected the transposed matrix solver, which started to run much slower: it was 17.5 s before, now 32.58 s. Changing the RAM did not affect it. The same happened on the Lenovo tablet, where the transposed matrix solver was also much slower with the new software. We don't use this solver anyway; we use the parallel solver LAIPE, and it is insensitive to the RAM.
Code: |
i7 4770k 4.5GHz overclocked, 4 cores/8 threads, 2400MHz SDRAM
matrix size --> 1000 2000 3000 4000 5000 6000
------------------------------------------------------------
Gauss Regular 2.23 30.09 126.11 294.60 xxxxx xxxxxx
Gauss Transp 0.51 4.11 13.78 32.58 63.32 109.29
Gauss SSE 0.11 1.41 5.28 12.97 25.29 43.46
LAIPE 0.09 0.73 2.37 5.85 11.03 19.44
|
Does anyone have 3200 MHz or faster DDR4 RAM on a similar type of processor? |