forums.silverfrost.com Welcome to the Silverfrost forums
DanRRight
Joined: 10 Mar 2008 Posts: 2852 Location: South Pole, Antarctica
Posted: Sun Nov 10, 2019 9:05 pm Post subject: AVX512 and Linear Algebra
Has anyone tried the new version of the MKL library, which claims improvements and AVX512 support? Below is the code to try; mecej4 and I tested it with FTN95 a couple of years back. I'd like to get an Intel-compiled, AVX512-optimized version of the EXE for this code, to test the latest processors from AMD and Intel and decide which one is better. AMD does not support AVX512 yet (though it has 256-bit AVX2). What is good about AMD is that it is cheaper and has a huge L3 cache, roughly 3-4x larger than Intel's, which may help keep a large piece of the matrix inside. Intel is better in having AVX512 (though it is not clear whether that has any effect in our case; in some other cases the speedup can be 20%, and in some specific cases 300%) and in running all cores at 5 GHz.
Also, lately you can find a lot of cheap but somewhat older workstations with supercomputer-grade Xeon processors, a large number of cores and large memory (e.g. an HP Z820 workstation with a 16-core 2.6 GHz Intel Xeon, 128 GB RAM and a 500 GB solid-state drive). They abruptly became obsolete after AMD made 7 nm server Epyc and workstation Ryzen processors with up to 64 cores, as capable as the Intel ones but with a much smaller price tag. New workstation processors from AMD with 16 and 32 cores will be available in a couple of weeks. Monopolist Intel is also slashing prices, but it still tries to work in the opposite direction too, sometimes charging $1000 per core for its latest server chips, while Asia has already shown $1 per core with some ARM mobile processors.
Code:

program MKLtest
   implicit none
   integer :: neq, nrhs = 1, lda, ldb, info
   real*8, allocatable :: A(:,:), b(:)
   integer, allocatable :: piv(:)
   integer :: count_0, count_1, count_rate, count_max

   do neq = 2000, 20000, 2000
      lda = neq;  ldb = neq
      allocate (A(neq,neq), b(neq), piv(neq))
      call random_number(A)
      call random_number(b)
      call system_clock(count_0, count_rate, count_max)
      call dgesv (neq, nrhs, A, lda, piv, b, ldb, info)   ! LAPACK solver from MKL
      call system_clock(count_1, count_rate, count_max)
      write (*, '(1x,A,i6,A,2x,F8.3,A)') 'nEqu = ', neq, ' ', &
            dble(count_1-count_0)/count_rate, ' s'
      deallocate (A, b, piv)
   end do
   pause
end program
Intel MKL is free to try for a year or two.
JohnCampbell
Joined: 16 Feb 2006 Posts: 2578 Location: Sydney
Posted: Tue Nov 12, 2019 5:12 am Post subject:
Dan,
One of the key parameters on these workstations could be memory speed and memory bandwidth, especially if arrays are larger than cache or you are using lots of threads. I would be careful with old, disposed-of processors, as they would probably not have the memory bandwidth to support many threads working on large arrays.
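A rough way to see this effect (a minimal sketch, not a calibrated STREAM-style benchmark; the array size and the GB/s estimate are illustrative assumptions) is to time a simple triad loop over arrays far larger than cache; the achieved transfer rate drops sharply once the working set no longer fits in the last-level cache:

```fortran
! Rough memory-bandwidth sketch (illustrative, not a calibrated benchmark):
! times c = a + s*b over arrays far larger than any cache.
program bandwidth_sketch
   implicit none
   integer, parameter :: n = 50000000       ! ~1.2 GB working set for 3 arrays
   real*8, allocatable :: a(:), b(:), c(:)
   real*8  :: s, seconds, gbytes
   integer :: i, count_0, count_1, count_rate

   allocate (a(n), b(n), c(n))
   a = 1.0d0;  b = 2.0d0;  s = 3.0d0
   call system_clock(count_0, count_rate)
   do i = 1, n
      c(i) = a(i) + s*b(i)
   end do
   call system_clock(count_1)
   seconds = dble(count_1 - count_0) / count_rate
   gbytes  = 3.0d0 * 8.0d0 * n / 1.0d9     ! two reads + one write, 8 bytes each
   write (*,'(A,F8.3,A,F8.2,A)') ' triad: ', seconds, ' s, ~', gbytes/seconds, ' GB/s'
   write (*,*) c(n)                         ! use the result so the loop is not optimized away
end program
```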
I used Xeon processors 5-10 years ago and found them to be very slow, although I probably did not know how to use them; I much prefer the i7.
The i7s I use (4 or 6 cores) do not support AVX512. They are Coffee Lake!
AVX512 is on Xeon Phi, which is different from Xeon. It is also in a lot of other very recent 'Lakes (I get totally confused with the Intel processor names).
With the large-array problems I have (2 GB - 16 GB) it is difficult to understand which architecture combination is best. I have found that if the arrays are not in cache, then AVX performance does not live up to the claims.
Another factor is that "many threads" often require new algorithms. I struggle with load balancing between threads for my type of calculation (a skyline solver for large linear equations); other types of problems could be very different.
DanRRight
Joined: 10 Mar 2008 Posts: 2852 Location: South Pole, Antarctica
Posted: Tue Nov 12, 2019 11:56 pm Post subject:
Server processors may have multichannel RAM chipsets; currently I have seen 8 channels. So RAM might not be a problem there, whereas desktop Intel and AMD processors are mostly dual channel, and only recently has AMD started using more. So it will be interesting to see how these new AMD processors do. The 16-core Ryzen 3950X is dual channel, though. A duopoly hoarding money from people. Even some recent mobile processors have 8 memory channels, by the way.
JohnCampbell
Joined: 16 Feb 2006 Posts: 2578 Location: Sydney
Posted: Wed Nov 13, 2019 4:56 am Post subject:
Dan,
My knowledge on this topic is always only as good as my last project.
My latest project was to use a 16 GB skyline matrix and solve many time steps for 10 options (threads). I tried both an i7-4790K (8 threads) and an i7-8700K (12 threads). With the 4790K there was a severe memory bottleneck for 2 passes of 5 threads, taking 9+ seconds per time step (then 2 passes), while the 8700K took about 3 to 4 seconds per time step. On the 8700K I then introduced two !$OMP BARRIERs, at the start and middle of each time step, which better aligned the memory usage between the threads in use, resulting in an average of 2.5 seconds per time step, which is about 10x faster than the 4790K.
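The barrier placement described above can be sketched as follows (a schematic only: the two solver-phase routines are hypothetical stubs standing in for the real halves of a time step, and one thread per option is an assumption):

```fortran
! Schematic of per-time-step barriers that keep threads' memory traffic in
! phase. first_half/second_half are hypothetical stubs for the solver phases.
subroutine time_step_sketch (noptions, nsteps)
   use omp_lib
   implicit none
   integer, intent(in) :: noptions, nsteps
   integer :: istep, iopt

!$OMP PARALLEL PRIVATE(istep, iopt) NUM_THREADS(noptions)
   iopt = omp_get_thread_num() + 1      ! one option (load case) per thread
   do istep = 1, nsteps
!$OMP BARRIER
      call first_half (iopt, istep)     ! all threads start the step together
!$OMP BARRIER
      call second_half (iopt, istep)    ! re-align at the middle of the step
   end do
!$OMP END PARALLEL

contains
   subroutine first_half (iopt, istep)
      integer, intent(in) :: iopt, istep   ! placeholder for real work
   end subroutine
   subroutine second_half (iopt, istep)
      integer, intent(in) :: iopt, istep   ! placeholder for real work
   end subroutine
end subroutine
```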
In this example I was really surprised at the difference between the gen-4 and the gen-8 i7, which I understand is mainly due to memory <> cache transfer rate and capacity.
In other past testing of AVX instructions, I have found that if the information (arrays) is not in cache, then the AVX advantage can be minimal. This can be addressed by arranging the computation so there is an increased probability that the data is in the cache (which modifies the numerical algorithm).
Both these examples have shown me that the use of SIMD (AVX) needs to be tuned to the numerical problem and to the other performance limitations of the processor, not just to the existence of AVX512. This can be done by adjusting the solution algorithm (e.g. cache blocking of the calculations, or other adjustment of the OpenMP code).
My examples use large arrays (2 GB - 16 GB+), so more intense calculations on smaller arrays may be different and have different bottlenecks to overcome to approach the quoted AVX rates.
There is another interesting example of MATMUL for large matrices in gFortran Ver 7 (e.g. real*8, dimension(8000,8000) :: a, b, c; c = MATMUL(a, b)), where the solution involved partitioning the MATMUL into sub-matrices aa(4,4) and bb(4,4) which fit into L1 cache. This produced about a 10x speed improvement, using AVX2 instructions, over the previous compiler version.
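A cache-blocked matrix multiply of the kind described can be sketched like this (a simplified illustration, not gFortran's actual implementation; the block size is an assumption to be tuned per cache level, and real library code also packs blocks and vectorizes the inner kernel):

```fortran
! Simplified cache-blocked matrix multiply: the inner kernel works on
! nb x nb blocks small enough to stay in cache while they are reused.
subroutine blocked_matmul (a, b, c, n)
   implicit none
   integer, intent(in) :: n
   real*8, intent(in)  :: a(n,n), b(n,n)
   real*8, intent(out) :: c(n,n)
   integer, parameter  :: nb = 64        ! block size: tune to the cache size
   integer :: ii, jj, kk, i, j, k

   c = 0.0d0
   do jj = 1, n, nb                      ! loop over blocks of c and b columns
      do kk = 1, n, nb                   ! loop over blocks of the k dimension
         do ii = 1, n, nb                ! loop over blocks of c and a rows
            do j = jj, min(jj+nb-1, n)   ! kernel: dense multiply of one block
               do k = kk, min(kk+nb-1, n)
                  do i = ii, min(ii+nb-1, n)
                     c(i,j) = c(i,j) + a(i,k) * b(k,j)
                  end do
               end do
            end do
         end do
      end do
   end do
end subroutine
```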
Getting AVX or AVX512 to produce the claimed performance depends heavily on getting the arrays into cache at the rate required. Identifying how to do this can be difficult, and it does depend on the processor and memory mix being used. Unfortunately, for me, it is a learning experience with each new type of project. (With an increased number of threads, the shared-memory transfer rates also need to be increased.)
Old processors with older, slower memory look to be a very unlikely win for my type of calculations.
DanRRight
Joined: 10 Mar 2008 Posts: 2852 Location: South Pole, Antarctica
Posted: Thu May 13, 2021 6:43 pm Post subject:
Downloaded the latest Intel MKL (it's free, btw) and compiled the code above with it
Code:

ftn95 tlapack.f90 /64 /err /no_truncate /zeroise >a_FTN95___
slink64 tlapack.obj "c:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.1.143\windows\redist\intel64\mkl\mkl_rt.1.dll" /file:tlapack.exe >a_link___
and it fails in some Intel DLL. Can anyone compile anything with MKL?
mecej4
Joined: 31 Oct 2006 Posts: 1896
Posted: Thu May 27, 2021 1:11 pm Post subject:
I ran the Lapack example code given above, and it ran fine.
I used the latest versions (as of May 2021) of FTN95, XSLINK64 (this is the 64-bit SLINK64 Version 3.02, which I renamed in order to keep the older SLINK64 as a backup) and Intel MKL.
Code:

ftn95 /64 tlapack.f90
xslink64 tlapack.obj c:mkl_rt.1.dll
path %path%;c:\LANG\OneAPI\mkl\2021.2.0\redist\intel64

T:\lang\mkl>tlapack
 nEqu =   2000      4.089 s
 nEqu =   4000      0.222 s
 nEqu =   6000      0.674 s
 nEqu =   8000      1.564 s
 nEqu =  10000      2.836 s
 nEqu =  12000      5.169 s
 nEqu =  14000      8.281 s
 nEqu =  16000     13.103 s
 nEqu =  18000     21.511 s
 nEqu =  20000     26.495 s
**** PAUSE:
Press ENTER to continue:
A probable explanation for the high CPU-time figure for nEqu = 2000 is that it includes the time taken to load the big MKL DLL on first use.
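One way to test that explanation (a sketch under the assumption that DLL loading is indeed the cost; the subroutine name is mine) is to make one tiny throwaway dgesv call before the timed loop, so the first timed size is not penalized:

```fortran
! Warm-up sketch: one tiny dgesv call before the timed loop forces the MKL
! DLL to load and initialize, so the nEqu = 2000 timing is not penalized.
subroutine mkl_warmup
   implicit none
   real*8  :: a2(2,2), b2(2)
   integer :: piv2(2), info
   a2 = reshape([2.0d0, 0.0d0, 0.0d0, 2.0d0], [2,2])   ! trivial 2x2 system
   b2 = 1.0d0
   call dgesv (2, 1, a2, 2, piv2, b2, 2, info)          ! result is discarded
end subroutine
```

Calling this once before the `do neq = 2000, 20000, 2000` loop should make the nEqu = 2000 time comparable to the trend of the larger sizes, if the DLL-load explanation is right.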
DanRRight
Joined: 10 Mar 2008 Posts: 2852 Location: South Pole, Antarctica
Posted: Fri May 28, 2021 6:57 am Post subject:
Thanks, mecej4, I will look again at what's wrong with my setup.
DanRRight
Joined: 10 Mar 2008 Posts: 2852 Location: South Pole, Antarctica
Posted: Fri May 28, 2021 10:13 pm Post subject:
Still having problems. Maybe MKL does not like a competitor's AMD processor?
First it complained about a path not found, then it started to crash. It crashes even if I place the EXE into the MKL directory with its DLLs:
Code:

Unknown exception (c06d007e) at address 7ffa3c6d4b89
Within file KERNELBASE.dll
In RaiseException at address 69
In mkl_serv_getenv at address 5C8
Within file mkl_intel_thread.1.dll
In mkl_serv_getenv at address 59AF
In mkl_serv_mkl_get_max_threads at address 22C
In mkl_lapack_dgetrf at address 2F1
In mkl_lapack_dgesv at address CE
Within file mkl_core.1.dll
In dgesv at address 375
Within file MKL_RT.1.DLL
Within file tlapack.exe
in TLAPACK at address 27c
RAX = 00007ffa3ec347b1 RBX = 0000000000000000 RCX = 00000002fffe3130 RDX = 00000000025e0000
RBP = 00000002fffe3799 RSI = 00007ff9dbdc5700 RDI = 0000000000000000 RSP = 00000002fffe3650
R8 = 00007ffa3ec6f4d7 R9 = 00000002fffe2fe8 R10 = 00000000025e5d8e R11 = 00000002fffe3100
R12 = 00007ff9df096438 R13 = 00007ff9df1440d8 R14 = 000000000000001b R15 = 00007ff9de895b7c
7ffa3c6d4b89) db 0f,1f,44,00,00
The previous version also stopped working a year ago, without me touching anything. I think it complained at one point about an expiring Intel license, but a license for MKL is not needed, as I understand it. And now I have problems on a fresh installation of Windows and a new type of processor.
Can anyone install this piece of work? With non-zero probability you will need MKL at some point in your life. It is fast, and it specifically gains from multicore processors. Even if it saves you 3 seconds per day, you will save 24 hours over an entire life. That's 3 working days, or around a thousand bucks wasted if you ignore it.
mecej4
Joined: 31 Oct 2006 Posts: 1896
Posted: Sat May 29, 2021 12:37 pm Post subject:
It has been over 15 years since I acquired a PC with an AMD CPU, so I cannot try out your program and MKL on such a machine. Errors of the type that you report can be caused by trying to run Intel-specific instructions on an AMD CPU that does not support them. I remember that each time I installed a new version of the Intel compiler on that PC, I had to spend some time ascertaining which compiler options to use in order to obtain an EXE that ran properly and fast on that AMD CPU (an Athlon X2 4200+).
I suggest that you try the example after compiling the source with the Intel Fortran compiler. If you still get the error traceback after the exception is taken, you can file a bug report on Intel's MKL forum, giving details of the compiler version and compiler options you used, and the version of MKL.
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2393 Location: Yateley, Hants, UK
Posted: Sat May 29, 2021 3:28 pm Post subject:
Look at the Wikipedia article on the Math Kernel Library ( https://en.wikipedia.org/wiki/Math_Kernel_Library ), in particular the section 'Performance and vendor lock-in'. It may help.
I have used AMD CPUs in desktop computers I have built myself since Pentium days. For what I do (and I know that others may have a different experience) I get far superior performance compared to the Intel-based machines I was cursed with at the university where I worked. Perhaps at the very top end, using the most cutting-edge software, it may be the other way round.
It wouldn't surprise me (it didn't, when I read that section of the Wikipedia article) if Intel cripples its software on AMD processors. That's in the dirty-tricks arena, but is it any surprise?
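For what it's worth, the workaround widely reported in this context was the undocumented MKL_DEBUG_CPU_TYPE environment variable, said to force MKL onto its AVX2 code path on AMD CPUs. It was never an official Intel feature, and it was reportedly removed in MKL 2020 Update 1 and later, so with current MKL versions it may do nothing:

```shell
rem Undocumented workaround, reportedly removed from newer MKL releases:
rem force MKL's AVX2 code path before starting the test program.
set MKL_DEBUG_CPU_TYPE=5
tlapack.exe
```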
Eddie
DanRRight
Joined: 10 Mar 2008 Posts: 2852 Location: South Pole, Antarctica
Posted: Sat May 29, 2021 9:13 pm Post subject:
I planned to check whether faster memory influences parallel linear algebra in the Intel case. Interestingly, another example, the LAIPE parallel linear algebra library made by the gFortran guy, does not depend on RAM speed, RAM MHz or latency at all. Faster RAM may change RAM-disk speed, for example, by 1/3.
JohnCampbell
Joined: 16 Feb 2006 Posts: 2578 Location: Sydney
Posted: Sun May 30, 2021 2:53 am Post subject:
Dan,
I have a Ryzen 5900X, so I will try to test your program. This processor does not support AVX512 instructions. I am not familiar with the Lapack and MKL libraries, but they might not be familiar with (i.e. identify) the latest AMD processors.
mecej4 has provided details of the test he carried out using xslink64.
I am still using FTN95 Ver 8.64.0, so I might need to upgrade.
Eddie, there have been denials from Intel that AMD processors are restricted. I too was forced to use low-spec Intel Xeon W3520 processors by a previous employer, who only believed HP salesmen. It was a revelation when I could choose alternatives.
Interestingly, I did a test of a gFortran EXE generated for an i7-8700. It ran slower on the Ryzen, but when recompiled on the Ryzen using -march=native, it ran much faster than on the 8700.
gFortran can detect the instruction set available on Ryzen processors (even ones it did not know about), but perhaps Intel is not as aggressive for non-Intel processors; I am not sure about that.
I am amazed at gFortran's support for different instruction sets, which would be a huge ask for FTN95 to contemplate. FTN95 does report some available instructions, as does CPUID.
mecej4
Joined: 31 Oct 2006 Posts: 1896
Posted: Sun May 30, 2021 11:33 am Post subject:
John Campbell wrote: "mecej4 has provided details on the test he carried out using xslink64. I am still using FTN95 Ver 8.64.0 so might need to upgrade."
Let me clarify.
For the test example, tlapack.f90, I have no reason to believe that the older 8.64 compiler or the 32-bit slink64 will not suffice (I have not tried, since I have replaced them with the newer versions).
When you use the MKL libraries with a main program compiled with a non-Intel compiler, some of the speed features, such as instruction-set detection, OpenMP and other threading, etc., will not be available unless one writes code to enable those features and an API for doing so is available.
I use the 64-bit slink64, which I renamed to "xslink64", in order to give it a good workout and report bugs that I find (I found one, reported it, and that bug has already been fixed).
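MKL does export service routines for some of this; a minimal sketch (assuming the single-DLL mkl_rt interface and that the routine names are resolved at link time, as in the link commands above) of setting the MKL thread count from a Fortran main program:

```fortran
! Minimal sketch of controlling MKL threading from a non-Intel compiler,
! using MKL's documented service routines (resolved from mkl_rt at link time).
program mkl_threads_sketch
   implicit none
   integer  :: mkl_get_max_threads        ! MKL service function
   external :: mkl_set_num_threads        ! MKL service subroutine
   call mkl_set_num_threads (8)           ! request up to 8 MKL threads
   write (*,*) 'MKL max threads now: ', mkl_get_max_threads()
end program
```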
DanRRight
Joined: 10 Mar 2008 Posts: 2852 Location: South Pole, Antarctica
Posted: Thu Jul 08, 2021 10:08 pm Post subject:
John, have you succeeded with the installation of the Intel parallel MKL library?
By the way, is your new AMD-based computer rock stable? My AMD one crashes every two days, even though I do not overclock it (it does not allow it; almost no room is left for that, AMD has already squeezed everything out of it to look good vs Intel) and it has water cooling. I am starting to regret that I took it. This is the third time I have bought AMD and then regretted it or returned it. Unfortunately, Intel is too late with newer chips and motherboards.
mecej4
Joined: 31 Oct 2006 Posts: 1896
Posted: Fri Jul 09, 2021 11:16 am Post subject:
Dan, how old is that computer with the AMD CPU? Please elaborate on what you mean by "crash". Is it your program or Windows that crashes? Have you looked into the details using tools such as Event Viewer (eventvwr.exe)?
The traceback that you posted on May 28 shows, at the end,
Code:

7ffa3c6d4b89) db 0f,1f,44,00,00
which implies that the runtime package was unable to disassemble the instruction that caused the exception -- an instruction that the companion compiler generated. That sort of thing should not happen!