Topic: AVX512 and Linear Algebra in General

DanRRight

Posts: 2877 South Pole, Antarctica

10 Nov 2019 8:05 #24649

Has anyone tried the new version of the MKL library, which claims improvements and AVX512 support? Here is the code that mecej4 and I tested with FTN95 a couple of years back. I'd like to get an EXE of this code compiled by the Intel compiler and optimized for AVX512, to test the latest processors from AMD and Intel and decide which one is better. AMD does not support AVX512 yet (though it has 256-bit AVX2). The good thing about AMD is that it is cheaper and has a huge Level 3 cache, ~3-4x larger than Intel's, which may help keep a large piece of the matrix inside. But Intel is better in having AVX512 (though it is not clear whether that has any effect in our case; in some other cases the speedup could be 20%, or in some specific cases 300%) and can run all cores at 5 GHz.

Also, lately you can find a lot of cheap but somewhat older workstations with supercomputer-grade Xeon processors with a large number of cores and large memory (e.g. HP Z820 Workstation, Intel Xeon 16 Core 2.6GHz, 128GB RAM, 500GB Solid State Drive). They abruptly became obsolete after AMD made its 7nm server EPYC and workstation Ryzen processors with up to 64 cores, with a much smaller price tag and as capable as the Intel ones. New workstation processors from AMD with 16 and 32 cores will be available in a couple of weeks. Monopolist Intel is also slashing prices, but is still trying to work in the opposite direction too, sometimes charging $1000 per core for the latest server chips, while Asia has already shown $1 per core with some ARM mobile processors.

program MKLtest
 ! Time LAPACK dgesv (LU factorisation + solve) for increasing matrix sizes
 implicit none
 integer :: neq,nrhs=1,lda,ldb, info
 real*8,allocatable :: A(:,:),b(:)
 integer, allocatable :: piv(:)
 integer count_0, count_1, count_rate, count_max

 do neq=2000,20000,2000
    lda=neq; ldb=neq
    allocate(A(neq,neq),b(neq),piv(neq))
    call random_number(A)     ! dense random test matrix
    call random_number(b)     ! random right-hand side
    call system_clock(count_0, count_rate, count_max)
    ! Solve A*x = b; on return A holds the LU factors and b holds x
    call dgesv (neq,nrhs,A,lda,piv, b, ldb, info)
    call system_clock(count_1, count_rate, count_max)
    if (info /= 0) write (*,*) 'dgesv returned info = ', info
    write (*, '(1x,A,i6,A,2x,F8.3,A)') 'nEqu = ',neq,' ', &
         dble(count_1-count_0)/count_rate, ' s'

    deallocate(A,b,piv)
 end do
 pause
 end program

Intel MKL is free to try for a year or two.

JohnCampbell

Posts: 2526 Sydney

12 Nov 2019 4:12 #24654

Dan,

One of the key parameters on these workstations could be memory speed and memory bandwidth, especially if arrays are larger than cache or you are using lots of threads. I would be careful with old disposed-of processors, as they would probably not have the memory bandwidth to support many threads for large arrays.

I used Xeon processors 5-10 years ago and found them to be very slow, although I probably did not know how to use them; I much prefer the i7. The i7s I use (4 or 6 cores) do not support AVX512 (it is a Coffee Lake!).

AVX512 is on Xeon Phi, which is different from Xeon. It is also in a lot of other very recent 'Lakes' (I get totally confused by the Intel processor names).

With the large array problems I have (2 GB - 16 GB) it is difficult to understand which architecture combination is best. I have found that if the arrays are not in cache, then AVX performance does not match the claims.

Another factor is that 'many threads' often require new algorithms. I struggle with load balancing between threads for my type of calculation (a skyline solver for large sets of linear equations). Other types of problems could be very different.

DanRRight

Posts: 2877 South Pole, Antarctica

12 Nov 2019 10:56 #24658

Server processors may have multichannel RAM chipsets. Currently I've seen 8 channels. So RAM might not be a problem there, whereas desktop Intel and AMD processors are mostly dual channel, and only recently has AMD started using more. So it is interesting how these new AMD processors will go. The 16 core Ryzen 3950X is dual channel though. A duopoly to hoard money from people. Even some recent mobile processors have 8 memory channels, by the way.

JohnCampbell

Posts: 2526 Sydney

13 Nov 2019 3:56 #24659

Dan,

My knowledge on this topic is always only as good as my last project. My latest project was to use a 16 GB skyline matrix and solve for many time steps on 10 options (threads). I tried both an i7-4790K (8 threads) and an i7-8700K (12 threads). With the 4790K, there was a severe memory bottleneck running 2 passes of 5 threads, taking 9+ seconds per time step (times 2 passes), while the 8700K took about 3 to 4 seconds per time step. On the 8700K, I then introduced 2 x !$OMP BARRIER, at the start and middle of each time step, which better aligned the memory usage between the threads in use, resulting in an average of 2.5 seconds per time step, which is about 10x faster than the 4790K.
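A minimal sketch of the barrier placement described above, assuming gFortran with -fopenmp. The array shared_matrix, its size and the two "phases" are hypothetical stand-ins for the real skyline matrix and solver passes; only the position of the two barriers per time step is taken from the description.

```fortran
program barrier_sketch
  ! Illustration only: each thread handles its own "option" but sweeps the
  ! same shared data; barriers keep all threads in the same half of the
  ! matrix at the same time, so they tend to share what is already in cache.
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2000000, nsteps = 4   ! hypothetical sizes
  real(dp), allocatable :: shared_matrix(:)
  real(dp) :: acc
  integer :: step, id

  allocate(shared_matrix(n))
  call random_number(shared_matrix)

  !$omp parallel private(step, acc, id)
  id  = omp_get_thread_num()
  acc = 0.0_dp
  do step = 1, nsteps
     !$omp barrier
     acc = acc + sum(shared_matrix(1:n/2))      ! phase 1: first half
     !$omp barrier
     acc = acc + sum(shared_matrix(n/2+1:n))    ! phase 2: second half
  end do
  !$omp critical
  write (*,'(1x,A,I3,A,F14.3)') 'thread ', id, '  checksum ', acc
  !$omp end critical
  !$omp end parallel
end program barrier_sketch
```

Every thread runs the same number of time steps, so each barrier is reached by the whole team; that is a requirement of !$OMP BARRIER, not a tuning choice.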

In this example I was really surprised at the difference between the gen 4 and the gen 8 i7, which I understood is mainly due to the memory-to-cache transfer rate and capacity.

In other past testing of AVX instructions, I have found that if the information (arrays) is not in cache, then the AVX advantage can be minimal. This can be addressed by arranging the computation so there is an increased probability the data is in the cache (which modifies the numerical algorithm).

Both these examples have shown me that use of SIMD (AVX) needs to be tuned to the numerical problem and the other performance limitations of the processor, not just the existence of AVX512. This can be done by adjusting the solution algorithm (e.g. cache blocking of calcs or other adjusting of OpenMP).

My examples use large arrays ( 2GB - 16GB + ) so more intense calcs on smaller arrays may be different and have different bottlenecks to overcome to approach the quoted AVX rates.

There is another interesting example of MATMUL for large matrices in gFortran Ver 7 (e.g. Real*8, dimension(8000,8000) :: a,b,c ; c = MATMUL (a,b)), where the solution involved partitioning the MATMUL into sub-matrices aa(4,4) and bb(4,4) which fit into L1 cache. This produces about a 10x speed improvement using AVX2 instructions over the previous compiler version.
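The partitioning idea can be sketched as a tiled matrix multiply. This is not gFortran's actual MATMUL code; the tile size nb = 64 and matrix size n = 512 are illustrative assumptions chosen so the example runs quickly, and the result is compared with the intrinsic.

```fortran
program blocked_matmul_sketch
  ! Sketch of cache blocking: work on nb x nb tiles so the data touched in
  ! the inner loops is reused while it is still in cache.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 512, nb = 64     ! assumed sizes, not tuned
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: i0, j0, k0, i1, j1, k1, j, k

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0_dp

  do j0 = 1, n, nb
     j1 = min(j0+nb-1, n)
     do k0 = 1, n, nb
        k1 = min(k0+nb-1, n)
        do i0 = 1, n, nb
           i1 = min(i0+nb-1, n)
           ! Accumulate the contribution of tile a(i0:i1,k0:k1) * b(k0:k1,j0:j1).
           ! The innermost statement runs down a column, which is contiguous
           ! in Fortran and straightforward for the compiler to vectorise.
           do j = j0, j1
              do k = k0, k1
                 c(i0:i1,j) = c(i0:i1,j) + a(i0:i1,k)*b(k,j)
              end do
           end do
        end do
     end do
  end do

  ! Check the tiled result against the intrinsic
  write (*,'(A,ES10.3)') ' max |c - matmul(a,b)| = ', maxval(abs(c - matmul(a,b)))
end program blocked_matmul_sketch
```

The best tile size depends on the cache sizes of the processor in use, so nb would need to be tuned per machine.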

Getting AVX or AVX512 to produce the claimed performance heavily depends on getting the arrays into cache, at the rate required. Identifying how to do this can be difficult and does depend on the processor and memory mix being used. Unfortunately, for me, it is a learning experience with each new type of project. (With increased number of threads, the shared memory transfer rates also need to be increased.)

Old processors with older, slower memory look to be a very unlikely win for my type of calculations.

DanRRight

Posts: 2877 South Pole, Antarctica

13 May 2021 5:43 #27754

Downloaded the latest Intel MKL (it's free, btw) and compiled the code above with it:

ftn95 tlapack.f90 /64  /err /no_truncate /zeroise  >a_FTN95___
slink64  tlapack.obj 'c:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.1.143\windows\redist\intel64\mkl\mkl_rt.1.dll' /file:tlapack.exe  >a_link___

and it fails in some Intel DLL. Can anyone compile anything with MKL?

mecej4

Posts: 1911

27 May 2021 12:11 #27868

I ran the Lapack example code given above, and it ran fine.

I used the latest versions (as of May 2021) of FTN95, XSLINK64 (this is the 64-bit SLINK64 Version 3.02, which I renamed in order to keep the older SLINK64 as a backup) and Intel MKL.

ftn95 /64 tlapack.f90
xslink64 tlapack.obj c:mkl_rt.1.dll
path %path%;c:\LANG\OneAPI\mkl\2021.2.0\redist\intel64

T:\lang\mkl>tlapack
 nEqu =   2000      4.089 s
 nEqu =   4000      0.222 s
 nEqu =   6000      0.674 s
 nEqu =   8000      1.564 s
 nEqu =  10000      2.836 s
 nEqu =  12000      5.169 s
 nEqu =  14000      8.281 s
 nEqu =  16000     13.103 s
 nEqu =  18000     21.511 s
 nEqu =  20000     26.495 s
**** PAUSE:
Press ENTER to continue:

A probable explanation for the high consumption of CPU time for nEqu = 2000 is that it includes the time taken to load the big MKL DLL on first use.
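To put these timings in perspective they can be converted to a floating-point rate. The sketch below assumes the standard operation count for LU factorisation plus one solve, roughly 2n^3/3 + 2n^2; the times are copied from the run above, leaving out nEqu = 2000 because of the DLL load time.

```fortran
program dgesv_rate
  ! Convert the posted dgesv wall times to an approximate GFLOP/s figure.
  ! Assumption: flops ~ 2/3*n**3 (LU) + 2*n**2 (forward/back substitution).
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), parameter :: t(9) = [ 0.222_dp, 0.674_dp, 1.564_dp,  2.836_dp, &
                                  5.169_dp, 8.281_dp, 13.103_dp, 21.511_dp, &
                                  26.495_dp ]
  real(dp) :: n, flops
  integer  :: i

  do i = 1, 9
     n = 2000.0_dp*(i+1)                 ! nEqu = 4000, 6000, ..., 20000
     flops = 2.0_dp*n**3/3.0_dp + 2.0_dp*n**2
     write (*,'(1x,A,I6,2x,F8.1,A)') 'nEqu = ', nint(n), &
          flops/t(i)*1.0e-9_dp, ' GFLOP/s'
  end do
end program dgesv_rate
```

Comparing this rate across processors is a fairer basis than raw seconds when different machines are run at different problem sizes.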

DanRRight

Posts: 2877 South Pole, Antarctica

28 May 2021 5:57 #27871

Thanks mecej4, I will look again at what the problem is with my setup.

DanRRight

Posts: 2877 South Pole, Antarctica

28 May 2021 9:13 #27873

Still having problems. Maybe MKL does not like a competitor's AMD processor? First it complained about a path not found, then it started to crash. It crashes even if I place the EXE into the MKL dir with its DLLs.

Unknown exception (c06d007e) at address 7ffa3c6d4b89

Within file KERNELBASE.dll
In  RaiseException at address 69
In  mkl_serv_getenv at address 5C8
Within file mkl_intel_thread.1.dll
In  mkl_serv_getenv at address 59AF
In  mkl_serv_mkl_get_max_threads at address 22C
In  mkl_lapack_dgetrf at address 2F1
In  mkl_lapack_dgesv at address CE
Within file mkl_core.1.dll
In  dgesv at address 375
Within file MKL_RT.1.DLL
Within file tlapack.exe
in TLAPACK at address 27c


RAX = 00007ffa3ec347b1   RBX = 0000000000000000   RCX = 00000002fffe3130   RDX = 00000000025e0000
RBP = 00000002fffe3799   RSI = 00007ff9dbdc5700   RDI = 0000000000000000   RSP = 00000002fffe3650
R8  = 00007ffa3ec6f4d7   R9  = 00000002fffe2fe8   R10 = 00000000025e5d8e   R11 = 00000002fffe3100
R12 = 00007ff9df096438   R13 = 00007ff9df1440d8   R14 = 000000000000001b   R15 = 00007ff9de895b7c

7ffa3c6d4b89) db     0f,1f,44,00,00

The previous version also stopped working a year ago without me touching anything. I think it complained at one point about an expiring Intel license, but a license for MKL is not needed as I understand it. And now I have problems on a fresh installation of Windows and a new type of processor.

Can anyone install this piece of work? With non-zero probability you will need MKL at some point in your life. It is fast and specifically gains from multicore processors. Even if it saves you 3 seconds per day, you will save 24 hours over an entire life. That's 3 working days, or around a thousand bucks wasted if you ignore it 😃

mecej4

Posts: 1911

29 May 2021 11:37 #27874

It has been over 15 years since I acquired a PC with an AMD CPU, so I cannot try out your program and MKL on such a machine. Errors of the type that you report can be caused by trying to run Intel-specific instructions on an AMD CPU that does not support them. I remember that each time that I installed a new version of the Intel compiler on that PC I had to spend some time to ascertain which compiler options to use and still obtain an EXE that ran properly and fast on that AMD CPU (Athlon X2 - 4200+).

I suggest that you try the example after compiling the source using the Intel Fortran compiler. If you still receive the error traceback after the exception is taken, you can file a bug report on Intel's MKL forum, giving details on what compiler version and compiler options you used, and the version of MKL used.

LitusSaxonicum

Posts: 2284 Yateley, Hants, UK

29 May 2021 2:28 #27875

Look at Wikipedia ( https://en.wikipedia.org/wiki/Math_Kernel_Library ) for 'Performance and Vendor Lock-in'. It may help.

I have used AMD cpus in desktop computers I have built myself since Pentium days. For what I do (and I know that others may have a different experience) I get far superior performance compared to the Intel-based machines I was cursed with at the Uni where I worked. Perhaps at the very top-end, using the most cutting-edge software, it may be the other way round.

It wouldn't surprise me (it didn't when I read the section in the Wikipedia article) that Intel cripples its software on AMD processors. That's in the dirty tricks arena, but is it any surprise?

Eddie

DanRRight

Posts: 2877 South Pole, Antarctica

29 May 2021 8:13 #27878

I planned to check whether faster memory influences parallel linear algebra in the Intel case. Interestingly, another example, the LAIPE parallel linear algebra library made by the gFortran guy, does not depend on RAM speed, RAM MHz or latency at all. Faster RAM may change RAMdisk speed, for example, by 1/3.

JohnCampbell

Posts: 2526 Sydney

30 May 2021 1:53 #27879

Dan,

I have a Ryzen 5900X so I will try to test your program. This processor does not support AVX512 instructions. I am not familiar with Lapack and MKL libraries but they might not be familiar with (identify) the latest AMD processors.

mecej4 has provided details on the test he carried out using xslink64. I am still using FTN95 Ver 8.64.0 so might need to upgrade.

Eddie, there have been denials from Intel that AMD processors are restricted. I too was forced to use low spec Intel Xeon W3520 processors by a previous employer, who only believed HP salesmen. It was a revelation when I could choose alternatives.

Interestingly, I did a test of a gFortran .exe, generated for an i7-8700. This ran slower on the Ryzen, but when recompiled on the Ryzen using -march=native, it ran much faster than on the 8700. gFortran can detect the instruction set available on Ryzen processors (even ones it didn't know about), but perhaps Intel is not as aggressive for non-Intel processors. I am not sure that this is wrong. I am amazed at gFortran's support for different instruction sets, which would be a huge ask for FTN95 to contemplate. FTN95 does report some available instructions, as does CPUID.

mecej4

Posts: 1911

30 May 2021 10:33 #27880

John Campbell wrote: 'mecej4 has provided details on the test he carried out using xslink64. I am still using FTN95 Ver 8.64.0 so might need to upgrade.'

Let me clarify.

For the test example, tlapack.f90, I have no reason to believe that the older 8.64 compiler or the 32-bit slink64 will not suffice (I have not tried, since I have replaced them with the newer versions).

When you use the MKL libraries with a main program compiled with a non-Intel compiler, some of the speed features such as instruction set detection, OpenMP and other threading, etc., will not be available unless one writes code to enable those features and an API for doing the same is available.
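As one example of such an API, MKL exposes service routines for querying and setting its thread count. The sketch below assumes the program is linked against MKL (for instance through mkl_rt) and that the linker resolves mkl_get_max_threads and mkl_set_num_threads by these names; whether that works unchanged from FTN95 has not been checked here. The matrix size and the thread count of 4 are arbitrary illustration values.

```fortran
program mkl_threads_sketch
  ! Assumes linking against Intel MKL; the two mkl_* routines are MKL
  ! service routines, everything else is illustrative.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2000, nrhs = 1     ! arbitrary test size
  integer, external  :: mkl_get_max_threads
  real(dp), allocatable :: A(:,:), b(:)
  integer, allocatable  :: piv(:)
  integer :: info

  write (*,*) 'MKL max threads (default) = ', mkl_get_max_threads()

  call mkl_set_num_threads(4)                  ! request 4 threads (assumed value)
  write (*,*) 'MKL max threads (now)     = ', mkl_get_max_threads()

  allocate(A(n,n), b(n), piv(n))
  call random_number(A)
  call random_number(b)
  call dgesv(n, nrhs, A, n, piv, b, n, info)   ! runs with the requested threads
  write (*,*) 'dgesv info = ', info
end program mkl_threads_sketch
```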

I use the 64-bit slink64, which I renamed to 'xslink64', in order to give it a good workout and report bugs that I find (I found one, reported it, and that bug has already been fixed).

DanRRight

Posts: 2877 South Pole, Antarctica

8 Jul 2021 9:08 #28067

John, have you succeeded with installation of the Intel parallel MKL library?

By the way, is your new AMD-based computer rock stable? My AMD one crashes every two days, despite my not overclocking it (it does not allow it; almost no room is left for that, AMD has already squeezed everything out of it to look good vs Intel) and having water cooling. I am starting to regret that I took it. This is the third time I have bought AMD and then regretted it or returned it. Unfortunately Intel is too late with newer chips and motherboards.

mecej4

Posts: 1911

9 Jul 2021 10:16 #28068

Dan, how old is that computer with the AMD CPU? Please elaborate on what you mean by 'crash'. Is it your program or Windows that crashes? Have you looked into the details, using tools such as Event Viewer (eventvwr.exe)?

The traceback that you posted on May 28 shows, at the end,

7ffa3c6d4b89) db     0f,1f,44,00,00

which implies that the runtime package was unable to disassemble the instruction that caused the exception -- an instruction that the companion compiler generated. That sort of thing should not happen!

JohnCampbell

Posts: 2526 Sydney

9 Jul 2021 1:12 #28069

Dan,

I have not tried the MKL library.

I bought the 5900X with 3600 MHz memory in Dec-20, with XMP but no overclocking. Initially it kept crashing, so I replaced the memory with 3200 MHz memory and it has been stable ever since. I also had USB keyboard problems. I have updated the BIOS twice and it now runs well. It is much faster than Intel for my FE work.

I am doing large (1 to 3 GB memory) matrix calculations. It saturates at about 10 threads, but is still about 80% faster than the 8700K, which has a similar problem.

This could be a memory bandwidth limit, but I have tried adapting algorithms to use L3-cache-sized chunks. I use matmul tests as a way to identify ways of improving thread efficiency.

I also use a skyline linear equation solver (reducer) and multiple solution vectors, with differing efficiency, but all show the memory bottleneck effect at higher thread counts.

24 threads with dual channel memory is not efficient for my large memory array calcs, but I can't afford to investigate the alternatives.
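One way to see where such saturation sets in is a simple bandwidth test in the style of the STREAM triad, run at increasing thread counts. The sketch below assumes gFortran with -fopenmp; the array length and repeat count are arbitrary choices, picked only so that the arrays are far larger than any cache.

```fortran
program triad_bandwidth_sketch
  ! Rough memory-bandwidth probe: a(i) = b(i) + s*c(i) over arrays much
  ! larger than cache, timed for 1, 2, 4, ... threads.
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 20000000, nrep = 5   ! assumed sizes (3 x 160 MB)
  real(dp), allocatable :: a(:), b(:), c(:)
  real(dp) :: t0, t1, gbytes
  integer  :: i, rep, nt, maxt

  allocate(a(n), b(n), c(n))
  maxt = omp_get_max_threads()
  nt = 1
  do while (nt <= maxt)
     call omp_set_num_threads(nt)
     !$omp parallel do
     do i = 1, n
        a(i) = 0.0_dp;  b(i) = 1.0_dp;  c(i) = 2.0_dp
     end do
     !$omp end parallel do

     t0 = omp_get_wtime()
     do rep = 1, nrep
        !$omp parallel do
        do i = 1, n
           a(i) = b(i) + 3.0_dp*c(i)
        end do
        !$omp end parallel do
     end do
     t1 = omp_get_wtime()

     ! Two reads and one write of 8 bytes per element per repeat
     gbytes = 3.0_dp*8.0_dp*real(n,dp)*nrep*1.0e-9_dp
     write (*,'(1x,A,I3,2x,F8.2,A)') 'threads = ', nt, gbytes/(t1-t0), ' GB/s'
     nt = nt*2
  end do
  write (*,*) 'check value a(n) = ', a(n)   ! keeps the loops from being optimised away
end program triad_bandwidth_sketch
```

If the reported GB/s stops rising well before all threads are in use, adding threads to a memory-bound solver is unlikely to help without changing the algorithm to reuse cached data.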

LitusSaxonicum

Posts: 2284 Yateley, Hants, UK

9 Jul 2021 3:49 #28070

John,

The problem with a USB keyboard is probably not cpu related, but is a mainboard issue. I expect the memory issue is similar. Sometimes such issues relate to BIOS settings, but not always, and they are sometimes fixable with a BIOS update. Sometimes a RAM fault is because the sticks are not seated well enough.

As a 'for instance', my main machine has a Ryzen 2600 in an Asrock B450 board. This machine won't do a wakeup from the keyboard, but a much cheaper board (also Asrock, in this case an A320M) in my backup machine will, both with that cpu in it, or with a cheaper, slower cpu (an Athlon 3000). It's the mainboard and its firmware that is at fault.

Dan's issues are probably also board and chipset related rather than cpu.

Incidentally, the A320M boots faster than the B450 even though the latter has M.2 and the former only a SATA SSD.

mecej4

Posts: 1911

9 Jul 2021 4:13 #28071

One more item to check is the CMOS battery (usually a CR2032).

This February, during a cold wave we lost power and heating for almost 24 hours. After power came back, my desktop PC would not boot Windows properly. The BIOS settings were being reset to default values. Even though the PC was only four years old, the battery voltage had dropped to 0.4 V (normal: 3 V).

DanRRight

Posts: 2877 South Pole, Antarctica

9 Jul 2021 8:21 #28072

mecej4, all sh$t is absolutely new. The ASUS mobo has no crashing complaints in the reviews; RAM is dual 3600MHz CAS 16 and dual CAS 18 XMP memory.

Spacious PC case, 4 fans + 3 fans for water cooling. The NVMe drive has its own heatsink on top of it, with a fan to cool it to ~44C.

It crashes unexpectedly with no activity, apart from maybe 80-100GB of memory filled by different tasks, mostly idle, like 3-4 browsers and other stuff.

Ran a stress test, no problems.

Event Viewer shows no info besides stating at login that the computer recovered from an unexpected event.

We have hot days lately, 32C at home (which I like because I hate cold weather), probably 40C inside the PC box. The memory is very hot though, because of overvoltage to 1.45V from the usual 1.35V to achieve CAS 16. All 4 memory DIMMs are packed too close to each other, but they have heatsinks. Fans send air very strongly to cool them and the motherboard.

Still, I suspect the memory right now. Maybe I will need to update the BIOS. People complain about sudden crashes with the latest two generations of AMD Ryzens. The motherboards could also be adding uncertainty - I understand this is not just an AMD problem, but everything together works less reliably because AMD is less used in the world, so there are fewer complaints, which is a recipe for the proverbial devilry to easily squeeze in somewhere. Same problem as with FTN95 - you and a couple of others do, but otherwise few people send in their bug reports and suggestions for improvement 😃

Maybe it is a software problem too. I will remove the RAMdrive next, then half of the memory, then update the BIOS - this is my plan.

Will see how it behaves over the next few days, when it will be 44C outside.

John, one of the tests I just made, with the computer lightly running other things, shows a 3.5x speedup with LAIPE on the 5950X (16 cores) vs a 4770K (4 cores) overclocked to 4.4GHz.

On single-core tests the AMD is often even slower than my old laptops.

By the way, I also ran the single-core test by DaveGemini; he was a funny Fortran fan who years ago produced noise and hot air on Fortran forums. The test was compiled with the Intel compiler for Intel processors 15-20 years back. It ran some subtests but refused to run the LAPACK subtest. The same problem as with MKL, I suspect.

mecej4

Posts: 1911

9 Jul 2021 11:01 (Edited: 10 Jul 2021 7:57) #28073

Are you aware that adding 'heat sinks' with plastic components can reduce heat flow and, conversely, adding 'insulation' can increase heat flow? See https://www.nuclear-power.net/nuclear-engineering/heat-transfer/thermal-conduction/critical-thickness-of-insulation-critical-radius/ .

Check if the heat sinks on your memory modules have reduced the space between the modules so much as to reduce air flow.