Forum Index
Welcome to the Silverfrost forums
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

AVX512 and Linear Algebra

Post new topic   Reply to topic Forum Index -> General
View previous topic :: View next topic  
Author Message

Joined: 10 Mar 2008
Posts: 2302
Location: South Pole, Antarctica

PostPosted: Sun Nov 10, 2019 9:05 pm    Post subject: AVX512 and Linear Algebra Reply with quote

Anyone has tried new version of MKL library which claims improvements and AVX512 support? Here is the code to try we tested with FTN95 with mecej4 couple years back. I'd like to get Intel compiled and optimized for AVX512 version of EXE for this code to test latest processors from AMD and INTEL to decide which one is better. AMD does not support AVX512 yet (though has SSE256). Good about AMD is that it is cheaper and has huge ~3-4x larger then Intel Level3 cache which may help to keep large piece of matrix inside. But Intel is better by having AVX512 (not clear though if it has any effect on in our case. In some other cases speedup could be 20% or in some specific cases 300%) and can run all cores at 5 GHz.

Also lately you can find a lot of cheap but somewhat older workstations with supercomputer grade Xeon processors with large number of cores and large memory (HP Z820 Workstation Intel Xeon 16 Core 2.6GHz 128GB RAM 500GB Solid State Drive). They abruptly became obsolete after AMD made 7nm server Epic and workstation Ryzen processors with up to 64 cores with much smaller price tag and as capable as Intel ones. New workstation processors from AMD with 16 and 32 cores will be available in couple weeks. Monopolist Intel also slashing prices but still trying to work also in opposite direction charging sometimes $1000 per core for latest server chips while Asia already showed $1 per core with some ARM mobile processors.

program MKLtest
 implicit none
 integer :: i,j,neq,nrhs=1,lda,ldb, info
 real*8,allocatable :: A(:,:),b(:)
 integer, allocatable :: piv(:)
 Integer count_0, count_1, count_rate, count_max

 do neq=2000,20000,2000
    lda=neq; ldb=neq
    call random_number(A)
    call random_number(b)
    Call system_clock(count_0, count_rate, count_max)
    CALL dgesv (nEq,nrhs,A,ldA,piv, b, ldb, info)
    Call system_clock(count_1, count_rate, count_max)
    Write (*, '(1x,A,i6,A,2x,F8.3,A)') 'nEqu = ',nEq,' ', &
         dble(count_1-count_0)/count_rate, ' s'

 end do
 end program 

Intel MKL is free for a year or two to try
Back to top
View user's profile Send private message

Joined: 16 Feb 2006
Posts: 2285
Location: Sydney

PostPosted: Tue Nov 12, 2019 5:12 am    Post subject: Reply with quote


One of the key parameters on these workstations could be memory speed and memory bandwidth, especially if arrays are larger than cache or you are using lots of threads. I would be careful with old disposed-of processors, as they would probably not have the memory bandwidth to support many threads for large arrays.

I used Xeon processors 5-10 years ago and found them to be very slow, although I probably did not know how to use them; much prefer i7.
The i7 I use (4 or 6 cores) do not support AVX512. It is a Coffee Lake !

AVX512 is on Xeon Phi, which is different from Xeon. It is also in a lot of other very recent 'Lakes ( I get totally confused with the Intel processor names )

With the large array problems I have (2gb - 16gb) it is difficult to understand what architecture combination is best. I have found if the arrays are not in cache, then AVX performance doesn't happen like claims.

Another factor is "many threads" often require new algorithms. I struggle with load balancing between threads for my type of calculation (skyline solver for large linear equations) Other types of problems could be very different.
Back to top
View user's profile Send private message

Joined: 10 Mar 2008
Posts: 2302
Location: South Pole, Antarctica

PostPosted: Tue Nov 12, 2019 11:56 pm    Post subject: Reply with quote

Server processors may have multichannel RAM chipsets. Currently i've seen 8 channels. So the RAM might not be a problem as desktop Intel and AMD processors are mostly dual channels, and only recently AMD started using more. So it is interesting how these new AMD processors will go. The 16 core Ryzen 3950x is dual channel though. Duopoly to hoard money from people. Even some recent mobile processors have 8 memory channels by the way.
Back to top
View user's profile Send private message

Joined: 16 Feb 2006
Posts: 2285
Location: Sydney

PostPosted: Wed Nov 13, 2019 4:56 am    Post subject: Reply with quote


My knowledge on this topic is always only as good as my last project.
My latest project was to use a 16gb skyline matrix and solve for many time steps on 10 options (threads). I tried both an i7-4790K (8 thread) and i7-8700K (12 threads). With the 4790, there was a severe memory bottleneck for 2 passes of 5 threads, taking 9+ seconds per time step ( then 2 x passes), while the 8700 took about 3 to 4 seconds per time step. On the 8700, I then introduced 2 x !$OMP BARRIER for the start and middle of each time step, which better aligned the memory usage between threads in use resulting in average 2.5 seconds per time step, which is about 10x faster than the 4790.

In this example I was really surprised at the difference between the gen 4 and the gen 8 i7, which I understood is mainly due to memory <> cache transfer rate and capacity.

In other past testing of AVX instructions, I have found if the information (arrays) is not in cache, then the AVX advantage can be minimal. This can be addressed by arranging the computation so there is an increased probability the data is in the cache (modifies numerical algorithm).

Both these examples have shown me that use of SIMD (AVX) needs to be tuned to the numerical problem and the other performance limitations of the processor, not just the existence of AVX512. This can be by adjusting the solution algorithm ( eg cache blocking of calcs or other adjusting of OpenMP)

My examples use large arrays ( 2GB - 16GB + ) so more intense calcs on smaller arrays may be different and have different bottlenecks to overcome to approach the quoted AVX rates.

There is another interesting example of MATMUL for large matrices at gFortran Ver 7 ( eg Real*8, dimension(8000,8000) :: a,b,c ; c = MATMUL (a,b) ), where the solution involved partitioning the MATMUL to sub-matrices aa(4,4) and bb(4,4) which fit into L1 cache. This produces about a 10x speed improvement using AVX2 instructions over the previous compiler version.

Getting AVX or AVX512 to produce the claimed performance heavily depends on getting the arrays into cache, at the rate required. Identifying how to do this can be difficult and does depend on the processor and memory mix being used. Unfortunately, for me, it is a learning experience with each new type of project. (With increased number of threads, the shared memory transfer rates also need to be increased.)

Old processors with older, slower memory looks to be a very unlikely win for my type of calculations.
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic Forum Index -> General All times are GMT + 1 Hour
Page 1 of 1

Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum

Powered by phpBB © 2001, 2005 phpBB Group