forums.silverfrost.com Forum Index forums.silverfrost.com
Welcome to the Silverfrost forums
 

AVX512 and Linear Algebra
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

Posted: Sun Nov 10, 2019 9:05 pm    Post subject: AVX512 and Linear Algebra

Has anyone tried the new version of the MKL library, which claims improvements and AVX512 support? Here is the code to try; mecej4 and I tested it with FTN95 a couple of years back. I'd like to get an Intel-compiled, AVX512-optimized EXE of this code to test the latest processors from AMD and Intel and decide which one is better. AMD does not support AVX512 yet (though it has 256-bit AVX2). The good thing about AMD is that it is cheaper and has a huge Level 3 cache, about 3-4x larger than Intel's, which may help to keep a large piece of the matrix inside. But Intel is better in having AVX512 (it is not clear, though, whether it has any effect in our case; in some other cases the speedup could be 20%, or in some specific cases 300%) and can run all cores at 5 GHz.

Also, lately you can find a lot of cheap but somewhat older workstations with supercomputer-grade Xeon processors, many cores and large memory (e.g. HP Z820 Workstation, Intel Xeon 16 Core 2.6GHz, 128GB RAM, 500GB Solid State Drive). They abruptly became obsolete after AMD made its 7nm EPYC server and Ryzen workstation processors with up to 64 cores, a much smaller price tag, and the same capability as the Intel ones. New workstation processors from AMD with 16 and 32 cores will be available in a couple of weeks. Monopolist Intel is also slashing prices, but is still trying to work in the opposite direction too, sometimes charging $1000 per core for its latest server chips, while Asia has already shown $1 per core with some ARM mobile processors.

Code:
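program MKLtest
 ! Times the LAPACK solver dgesv (here taken from Intel MKL) for growing system sizes
 implicit none
 integer :: neq, nrhs=1, lda, ldb, info
 real*8, allocatable :: A(:,:), b(:)
 integer, allocatable :: piv(:)
 integer :: count_0, count_1, count_rate, count_max

 do neq=2000,20000,2000
    lda=neq; ldb=neq
    allocate(A(neq,neq),b(neq),piv(neq))
    ! Random dense matrix and right-hand side
    call random_number(A)
    call random_number(b)
    ! Time only the solve (LU factorization plus substitution)
    call system_clock(count_0, count_rate, count_max)
    call dgesv (neq,nrhs,A,lda,piv,b,ldb,info)
    call system_clock(count_1, count_rate, count_max)
    ! info /= 0 means an illegal argument or a singular matrix
    if (info /= 0) write (*,*) 'dgesv returned info = ', info
    write (*, '(1x,A,i6,A,2x,F8.3,A)') 'nEqu = ',neq,' ', &
         dble(count_1-count_0)/count_rate, ' s'

    deallocate(A,b,piv)
 end do
 pause
end program MKLtest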


Intel MKL is free to try for a year or two.
JohnCampbell

Joined: 16 Feb 2006
Posts: 2551
Location: Sydney

Posted: Tue Nov 12, 2019 5:12 am

Dan,

One of the key parameters on these workstations could be memory speed and memory bandwidth, especially if arrays are larger than cache or you are using lots of threads. I would be careful with old disposed-of processors, as they would probably not have the memory bandwidth to support many threads for large arrays.

I used Xeon processors 5-10 years ago and found them to be very slow, although I probably did not know how to use them; I much prefer i7.
The i7 I use (4 or 6 cores) does not support AVX512. It is a Coffee Lake!

AVX512 is on Xeon Phi, which is different from Xeon. It is also in a lot of other very recent 'Lakes (I get totally confused with the Intel processor names).

With the large array problems I have (2 GB - 16 GB) it is difficult to understand what architecture combination is best. I have found that if the arrays are not in cache, the claimed AVX performance does not materialize.
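The cache effect described here can be checked on a given machine with a small experiment. The following is only a minimal sketch, not code from this thread: it applies the same vector update to arrays of increasing size while holding the total number of element updates constant, so any growth in run time points to the data no longer fitting in cache. The program name, the array sizes and the constant 1.0e-6 are illustrative assumptions.

```fortran
program cache_vs_memory_sketch
  ! Hypothetical sketch: same arithmetic work, different working-set sizes.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  ! Total element updates per array size (assumed value, keeps run times short)
  integer(i8), parameter :: total_updates = 2_i8**27
  integer :: p, n, rep, nrep
  integer(i8) :: c0, c1, rate
  real(dp), allocatable :: x(:), y(:)
  real(dp) :: secs

  do p = 12, 24, 4                 ! n = 4096, 65536, 1048576, 16777216 elements
     n = 2**p
     nrep = int(total_updates / n) ! fewer sweeps over larger arrays
     allocate(x(n), y(n))
     x = 1.0_dp
     y = 0.0_dp

     call system_clock(c0, rate)
     do rep = 1, nrep
        y = y + 1.0e-6_dp * x      ! simple SIMD-friendly update
     end do
     call system_clock(c1)

     secs = real(c1 - c0, dp) / real(rate, dp)
     ! The checksum is printed so the compiler cannot discard the loop
     print '(A,I10,A,F8.3,A,ES12.4)', ' n = ', n, '  time = ', secs, &
           ' s   checksum = ', sum(y)
     deallocate(x, y)
  end do
end program cache_vs_memory_sketch
```

The two arrays together occupy 16*n bytes, so the smallest case (64 KB) should sit in cache on most current processors, while the largest (256 MB) has to stream from main memory.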

Another factor is that "many threads" often require new algorithms. I struggle with load balancing between threads for my type of calculation (a skyline solver for large sets of linear equations). Other types of problems could be very different.
DanRRight

Posted: Tue Nov 12, 2019 11:56 pm

Server processors may have multichannel RAM chipsets; currently I've seen 8 channels. So RAM might not be a problem there, whereas desktop Intel and AMD processors are mostly dual channel, and only recently has AMD started using more. So it will be interesting to see how these new AMD processors go. The 16-core Ryzen 3950X is dual channel, though. A duopoly to hoard money from people. Even some recent mobile processors have 8 memory channels, by the way.
JohnCampbell

Posted: Wed Nov 13, 2019 4:56 am

Dan,

My knowledge on this topic is always only as good as my last project.
My latest project used a 16 GB skyline matrix, solved for many time steps on 10 options (threads). I tried both an i7-4790K (8 threads) and an i7-8700K (12 threads). On the 4790K there was a severe memory bottleneck: it needed 2 passes of 5 threads, each taking 9+ seconds per time step, while the 8700K took about 3 to 4 seconds per time step. On the 8700K I then introduced two !$OMP BARRIER directives, at the start and middle of each time step, which better aligned the memory usage between the threads in use and gave an average of 2.5 seconds per time step, about 10x faster than the 4790K.
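The barrier placement described above can be sketched roughly as follows. This is not the skyline solver itself, only a hypothetical stand-in: the array `state`, the two dummy half-steps and all sizes are assumptions made for illustration. Every thread executes the same number of barriers, which OpenMP requires.

```fortran
program barrier_sketch
  ! Hypothetical sketch: two barriers per time step keep all threads
  ! in the same phase of the step at the same time.
  !$ use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n_options = 10, n_steps = 3, n = 200000
  real(dp), allocatable :: state(:,:)
  real(dp) :: total(n_options)
  integer :: istep, iopt, id, nth

  allocate(state(n, n_options))
  state = 1.0_dp
  total = 0.0_dp

  !$omp parallel default(shared) private(istep, iopt, id, nth)
  id = 0                            ! serial fallback when OpenMP is disabled
  nth = 1
  !$ id = omp_get_thread_num()
  !$ nth = omp_get_num_threads()
  do istep = 1, n_steps
     ! Barrier 1: all threads start the time step together
     !$omp barrier
     do iopt = id + 1, n_options, nth
        state(:, iopt) = 0.5_dp*state(:, iopt) + 1.0_dp   ! first half of the step
     end do
     ! Barrier 2: all threads reach the middle of the step before continuing
     !$omp barrier
     do iopt = id + 1, n_options, nth
        total(iopt) = total(iopt) + sum(state(:, iopt))   ! second half of the step
     end do
  end do
  !$omp end parallel

  print '(A,ES14.6)', ' total for option 1 = ', total(1)
end program barrier_sketch
```

Compiled without OpenMP support, the directives are treated as comments and the program runs serially, which makes it easy to compare the two behaviours.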

In this example I was really surprised at the difference between the gen 4 and the gen 8 i7, which I understand is mainly due to the memory-to-cache transfer rate and capacity.

In other past testing of AVX instructions, I have found if the information (arrays) is not in cache, then the AVX advantage can be minimal. This can be addressed by arranging the computation so there is an increased probability the data is in the cache (modifies numerical algorithm).

Both these examples have shown me that the use of SIMD (AVX) needs to be tuned to the numerical problem and to the other performance limitations of the processor; the mere existence of AVX512 is not enough. The tuning can be done by adjusting the solution algorithm (e.g. cache blocking of the calculations, or other adjustments to the OpenMP code).

My examples use large arrays ( 2GB - 16GB + ) so more intense calcs on smaller arrays may be different and have different bottlenecks to overcome to approach the quoted AVX rates.

There is another interesting example: MATMUL for large matrices in gFortran Ver 7 (e.g. Real*8, dimension(8000,8000) :: a,b,c ; c = MATMUL (a,b)), where the solution involved partitioning the MATMUL into sub-matrices aa(4,4) and bb(4,4), which fit into L1 cache. This produces about a 10x speed improvement using AVX2 instructions over the previous compiler version.
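The partitioning idea can be illustrated with a plain cache-blocked triple loop. This is a minimal sketch only and not gFortran's actual MATMUL implementation; the block size `bs = 32` and the matrix size `n = 256` are arbitrary assumptions, not the 4x4 tiles mentioned above.

```fortran
program blocked_matmul_sketch
  ! Hypothetical sketch of cache blocking: C = A*B computed tile by tile
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 256      ! matrix size (assumed, kept small)
  integer, parameter :: bs = 32      ! tile size (assumed, tune per cache)
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: ii, jj, kk, i, j, k

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0_dp

  ! Outer loops walk over tiles; inner loops stay inside one tile of A, B and C
  do jj = 1, n, bs
     do kk = 1, n, bs
        do ii = 1, n, bs
           do j = jj, min(jj + bs - 1, n)
              do k = kk, min(kk + bs - 1, n)
                 ! Innermost loop runs down a column: contiguous in Fortran
                 do i = ii, min(ii + bs - 1, n)
                    c(i,j) = c(i,j) + a(i,k)*b(k,j)
                 end do
              end do
           end do
        end do
     end do
  end do

  ! Compare against the intrinsic as a correctness check
  print '(A,ES10.3)', ' max abs difference vs MATMUL: ', &
        maxval(abs(c - matmul(a, b)))
end program blocked_matmul_sketch
```

The `min(...)` bounds let the loops handle sizes that are not a multiple of the tile size. Which tile size is best depends on the cache sizes of the processor in use.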

Getting AVX or AVX512 to produce the claimed performance heavily depends on getting the arrays into cache, at the rate required. Identifying how to do this can be difficult and does depend on the processor and memory mix being used. Unfortunately, for me, it is a learning experience with each new type of project. (With increased number of threads, the shared memory transfer rates also need to be increased.)

Old processors with older, slower memory look to be a very unlikely win for my type of calculations.
DanRRight

Posted: Thu May 13, 2021 6:43 pm

I downloaded the latest Intel MKL (it's free, btw) and compiled the code above with it:
Code:
ftn95 tlapack.f90 /64  /err /no_truncate /zeroise  >a_FTN95___
slink64  tlapack.obj "c:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.1.143\windows\redist\intel64\mkl\mkl_rt.1.dll" /file:tlapack.exe  >a_link___


and it fails in some Intel DLL. Can anyone compile anything with MKL?
mecej4

Joined: 31 Oct 2006
Posts: 1884

Posted: Thu May 27, 2021 1:11 pm

I ran the Lapack example code given above, and it ran fine.

I used the latest versions (as of May 2021) of FTN95, XSLINK64 (this is the 64-bit SLINK64 Version 3.02, which I renamed in order to keep the older SLINK64 as a backup) and Intel MKL.

Code:
ftn95 /64 tlapack.f90
xslink64 tlapack.obj c:mkl_rt.1.dll
path %path%;c:\LANG\OneAPI\mkl\2021.2.0\redist\intel64

T:\lang\mkl>tlapack
 nEqu =   2000      4.089 s
 nEqu =   4000      0.222 s
 nEqu =   6000      0.674 s
 nEqu =   8000      1.564 s
 nEqu =  10000      2.836 s
 nEqu =  12000      5.169 s
 nEqu =  14000      8.281 s
 nEqu =  16000     13.103 s
 nEqu =  18000     21.511 s
 nEqu =  20000     26.495 s
**** PAUSE:
Press ENTER to continue:


A probable explanation for the high consumption of CPU time for nEqu = 2000 is that it includes the time taken to load the big MKL DLL on first use.
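If that explanation is right, a small untimed call before the timed loop should absorb most of the one-time loading cost. The following is a minimal sketch under that assumption; the warm-up size `nw = 100` is arbitrary, and the program must be linked against a LAPACK implementation such as MKL, as in the commands above.

```fortran
program warmup_sketch
  ! Hypothetical sketch: untimed warm-up call before the timed dgesv
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nw = 100     ! warm-up size (assumed)
  integer, parameter :: neq = 2000, nrhs = 1
  real(dp), allocatable :: a(:,:), b(:)
  integer, allocatable :: piv(:)
  integer :: info, count_0, count_1, count_rate
  external :: dgesv                  ! supplied by the LAPACK/MKL library

  ! Untimed warm-up: forces the library to be loaded and initialized
  allocate(a(nw,nw), b(nw), piv(nw))
  call random_number(a)
  call random_number(b)
  call dgesv(nw, nrhs, a, nw, piv, b, nw, info)
  deallocate(a, b, piv)

  ! Timed solve, same measurement as in the original test
  allocate(a(neq,neq), b(neq), piv(neq))
  call random_number(a)
  call random_number(b)
  call system_clock(count_0, count_rate)
  call dgesv(neq, nrhs, a, neq, piv, b, neq, info)
  call system_clock(count_1)
  write (*, '(1x,A,i6,A,2x,F8.3,A)') 'nEqu = ', neq, ' ', &
        dble(count_1 - count_0)/count_rate, ' s'
  if (info /= 0) write (*,*) 'dgesv returned info = ', info
  deallocate(a, b, piv)
end program warmup_sketch
```

If the time for nEqu = 2000 then drops to a value in line with the larger sizes, the DLL loading explanation is supported.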
DanRRight

Posted: Fri May 28, 2021 6:57 am

Thanks, mecej4. I will look again at what the problem is with my setup.
DanRRight

Posted: Fri May 28, 2021 10:13 pm

Still having problems. Maybe MKL does not like a competitor's AMD processor?
First it complained about a path not found, then it started to crash. It crashes even if I place the EXE into the MKL dir with its DLLs.

Code:

Unknown exception (c06d007e) at address 7ffa3c6d4b89

Within file KERNELBASE.dll
In  RaiseException at address 69
In  mkl_serv_getenv at address 5C8
Within file mkl_intel_thread.1.dll
In  mkl_serv_getenv at address 59AF
In  mkl_serv_mkl_get_max_threads at address 22C
In  mkl_lapack_dgetrf at address 2F1
In  mkl_lapack_dgesv at address CE
Within file mkl_core.1.dll
In  dgesv at address 375
Within file MKL_RT.1.DLL
Within file tlapack.exe
in TLAPACK at address 27c


RAX = 00007ffa3ec347b1   RBX = 0000000000000000   RCX = 00000002fffe3130   RDX = 00000000025e0000
RBP = 00000002fffe3799   RSI = 00007ff9dbdc5700   RDI = 0000000000000000   RSP = 00000002fffe3650
R8  = 00007ffa3ec6f4d7   R9  = 00000002fffe2fe8   R10 = 00000000025e5d8e   R11 = 00000002fffe3100
R12 = 00007ff9df096438   R13 = 00007ff9df1440d8   R14 = 000000000000001b   R15 = 00007ff9de895b7c

7ffa3c6d4b89) db     0f,1f,44,00,00

The previous version also stopped working a year ago without me touching anything. I think it complained at one point about an expiring Intel license, but as I understand it no license is needed for MKL. And now I have problems on a fresh installation of Windows and a new type of processor.

Can anyone install this piece of work? With non-zero probability you will need MKL at some point in your life. It is fast and specifically gains from multicore processors. Even if it saves you only 3 seconds per day, you will save 24 hours over an entire life. That's 3 working days, or around a thousand bucks wasted if you ignore it :)
mecej4

Posted: Sat May 29, 2021 12:37 pm

It has been over 15 years since I acquired a PC with an AMD CPU, so I cannot try out your program and MKL on such a machine. Errors of the type that you report can be caused by trying to run Intel-specific instructions on an AMD CPU that does not support them. I remember that each time that I installed a new version of the Intel compiler on that PC I had to spend some time to ascertain which compiler options to use and still obtain an EXE that ran properly and fast on that AMD CPU (Athlon X2 - 4200+).

I suggest that you try the example after compiling the source using the Intel Fortran compiler. If you still receive the error traceback after the exception is taken, you can file a bug report on Intel's MKL forum, giving details on what compiler version and compiler options you used, and the version of MKL used.
LitusSaxonicum

Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Sat May 29, 2021 3:28 pm

Look at Wikipedia ( https://en.wikipedia.org/wiki/Math_Kernel_Library ) for 'Performance and Vendor Lock-in'. It may help.

I have used AMD cpus in desktop computers I have built myself since Pentium days. For what I do (and I know that others may have a different experience) I get far superior performance compared to the Intel-based machines I was cursed with at the Uni where I worked. Perhaps at the very top-end, using the most cutting-edge software, it may be the other way round.

It wouldn't surprise me (it didn't when I read the section in the Wikipedia article) that Intel cripples its software on AMD processors. That's in the dirty tricks arena, but is it any surprise?

Eddie
DanRRight

Posted: Sat May 29, 2021 9:13 pm

I planned to check whether faster memory influences parallel linear algebra in the Intel case. Interestingly, another example, the LAIPE parallel linear algebra library made by the gFortran guy, does not depend on RAM speed (MHz or latency) at all. Faster RAM may change RAM disk speed, for example, by 1/3.
JohnCampbell

Posted: Sun May 30, 2021 2:53 am

Dan,

I have a Ryzen 5900X so I will try to test your program. This processor does not support AVX512 instructions. I am not familiar with Lapack and MKL libraries but they might not be familiar with (identify) the latest AMD processors.

mecej4 has provided details on the test he carried out using xslink64.
I am still using FTN95 Ver 8.64.0 so might need to upgrade.

Eddie, there have been denials from Intel that AMD processors are restricted. I too was forced to use low spec Intel Xeon W3520 processors by a previous employer, who only believed HP salesmen. It was a revelation when I could choose alternatives.

Interestingly, I did a test of a gFortran .exe generated for an i7-8700. It ran slower on the Ryzen, but when recompiled on the Ryzen using -march=native, it ran much faster than on the 8700.
gFortran can detect the instruction set available on Ryzen processors (even ones it didn't know about), but perhaps Intel is not as thorough for non-Intel processors. I am not sure whether this is right.
I am amazed at gFortran's support for different instruction sets, which would be a huge ask for FTN95 to contemplate. FTN95 does report some available instructions, as does CPUID.
mecej4

Posted: Sun May 30, 2021 11:33 am

John Campbell wrote: "mecej4 has provided details on the test he carried out using xslink64. I am still using FTN95 Ver 8.64.0 so might need to upgrade."

Let me clarify.

For the test example, tlapack.f90, I have no reason to believe that the older 8.64 compiler or the 32-bit slink64 will not suffice (I have not tried, since I have replaced them with the newer versions).

When you use the MKL libraries with a main program compiled with a non-Intel compiler, some of the speed features such as instruction set detection, OpenMP and other threading, etc., will not be available unless one writes code to enable those features and an API for doing the same is available.

I use the 64-bit slink64, which I renamed to "xslink64", in order to give it a good workout and report bugs that I find (I found one, reported it, and that bug has already been fixed).
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

PostPosted: Thu Jul 08, 2021 10:08 pm    Post subject: Reply with quote

John, have you succeeded with the installation of the Intel parallel MKL library?

By the way, is your new AMD-based computer rock stable? My AMD one crashes every two days, even though I do not overclock it (it does not allow that; almost no room is left, as AMD has already squeezed everything out of it to look good vs Intel) and it has water cooling. I am starting to regret that I took it. This is the third time I have bought AMD and then regretted it or returned it. Unfortunately, Intel is too late with its newer chips and motherboards.
mecej4

Posted: Fri Jul 09, 2021 11:16 am

Dan, how old is that computer with the AMD CPU? Please elaborate on what you mean by "crash". Is it your program or Windows that crashes? Have you looked into the details, using tools such as Event Viewer (eventvwr.exe)?

The traceback that you posted on May 28 shows, at the end,

Code:
7ffa3c6d4b89) db     0f,1f,44,00,00


which implies that the runtime package was unable to disassemble the instruction that caused the exception -- an instruction that the companion compiler generated. That sort of thing should not happen!