forums.silverfrost.com

DanRRight · Posted: Sun Nov 10, 2019 9:05 pm Post subject: AVX512 and Linear Algebra

Has anyone tried the new version of the MKL library, which claims improvements and AVX512 support? Here is the code to try, which we tested with FTN95 with mecej4 a couple of years back. I'd like to get an Intel-compiled, AVX512-optimized EXE of this code to test the latest processors from AMD and Intel and decide which one is better. AMD does not support AVX512 yet (though it has 256-bit AVX2). The good thing about AMD is that it is cheaper and has a huge Level 3 cache, ~3-4x larger than Intel's, which may help to keep a large piece of the matrix inside. But Intel is better in having AVX512 (it is not clear, though, whether it has any effect in our case; in some other cases the speedup could be 20%, or in some specific cases 300%) and can run all cores at 5 GHz.

Also, lately you can find a lot of cheap but somewhat older workstations with supercomputer-grade Xeon processors with a large number of cores and large memory (e.g. HP Z820 Workstation, Intel Xeon 16 Core 2.6GHz, 128GB RAM, 500GB Solid State Drive). They abruptly became obsolete after AMD made 7nm server EPYC and workstation Ryzen processors with up to 64 cores, with a much smaller price tag and as capable as the Intel ones. New workstation processors from AMD with 16 and 32 cores will be available in a couple of weeks. Monopolist Intel is also slashing prices but is still trying to work in the opposite direction too, sometimes charging $1000 per core for the latest server chips, while Asia has already shown $1 per core with some ARM mobile processors.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

Dan,

One of the key parameters on these workstations could be memory speed and memory bandwidth, especially if arrays are larger than cache or you are using lots of threads. I would be careful with old disposed-of processors, as they would probably not have the memory bandwidth to support many threads for large arrays.

I used Xeon processors 5-10 years ago and found them to be very slow, although I probably did not know how to use them; much prefer i7.
The i7s I use (4 or 6 cores) do not support AVX512, even the Coffee Lake one!

AVX512 is on Xeon Phi, which is different from Xeon. It is also in a lot of other very recent 'Lakes (I get totally confused with the Intel processor names).

With the large array problems I have (2 GB - 16 GB) it is difficult to understand what architecture combination is best. I have found that if the arrays are not in cache, the claimed AVX performance does not materialize.

Another factor is that "many threads" often requires new algorithms. I struggle with load balancing between threads for my type of calculation (skyline solver for large linear equations). Other types of problems could be very different.

DanRRight · Posted: Tue Nov 12, 2019 11:56 pm Post subject:

Server processors may have multichannel RAM chipsets. Currently I've seen 8 channels. So the RAM might not be a problem there, whereas desktop Intel and AMD processors are mostly dual channel, and only recently has AMD started using more. So it is interesting how these new AMD processors will go. The 16-core Ryzen 3950X is dual channel though. A duopoly to hoard money from people. Even some recent mobile processors have 8 memory channels, by the way.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

Dan,

My knowledge on this topic is always only as good as my last project.
My latest project was to use a 16 GB skyline matrix and solve for many time steps on 10 options (threads). I tried both an i7-4790K (8 threads) and an i7-8700K (12 threads). With the 4790, there was a severe memory bottleneck for 2 passes of 5 threads, taking 9+ seconds per time step (times 2 passes), while the 8700 took about 3 to 4 seconds per time step. On the 8700, I then introduced 2 x !$OMP BARRIER, at the start and middle of each time step, which better aligned the memory usage between the threads in use, resulting in an average of 2.5 seconds per time step, which is about 10x faster than the 4790.
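To illustrate the barrier idea in the paragraph above, here is a minimal sketch, not the actual solver. The array `work`, the sizes `n`, `nsteps` and `nopt`, and the two dummy "phases" are all hypothetical stand-ins; the only thing taken from the description is the placement of `!$OMP BARRIER` at the start and middle of each time step, so that all threads sweep through memory in the same phase together.

```fortran
! Sketch only: two barriers per time step keep threads in the same phase.
! Array names, sizes and the dummy arithmetic are illustrative assumptions.
program barrier_sketch
  !$ use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 100000, nsteps = 3, nopt = 10
  real(dp), allocatable :: work(:,:)
  real(dp) :: total(nopt)
  integer :: step, iopt, tid, nthreads

  allocate (work(n, nopt))
  work = 1.0_dp
  total = 0.0_dp

  !$omp parallel private(step, iopt, tid, nthreads)
  ! Defaults apply when compiled without OpenMP (single thread)
  tid = 0
  nthreads = 1
  !$ tid = omp_get_thread_num()
  !$ nthreads = omp_get_num_threads()

  do step = 1, nsteps
    ! Barrier at the start of the time step: nobody runs ahead
    !$omp barrier
    ! Phase 1: each thread updates its own share of the options
    do iopt = tid + 1, nopt, nthreads
      work(:, iopt) = 0.5_dp * work(:, iopt) + 1.0_dp
    end do
    ! Barrier in the middle: all threads finish phase 1 before phase 2
    !$omp barrier
    ! Phase 2: each thread reduces its own options
    do iopt = tid + 1, nopt, nthreads
      total(iopt) = total(iopt) + sum(work(:, iopt))
    end do
  end do
  !$omp end parallel

  print '(a,10f12.1)', 'totals: ', total
end program barrier_sketch
```

With gFortran this would be built with `-fopenmp`; without that flag the `!$` lines are treated as comments and the program runs on one thread. Whether barriers help or hurt will depend on the problem and the memory system, as described above.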

In this example I was really surprised at the difference between the gen 4 and the gen 8 i7, which I understood is mainly due to memory <> cache transfer rate and capacity.

In other past testing of AVX instructions, I have found if the information (arrays) is not in cache, then the AVX advantage can be minimal. This can be addressed by arranging the computation so there is an increased probability the data is in the cache (modifies numerical algorithm).

Both these examples have shown me that use of SIMD (AVX) needs to be tuned to the numerical problem and the other performance limitations of the processor, not just the existence of AVX512. This can be done by adjusting the solution algorithm (e.g. cache blocking of the calcs or other adjustments to the OpenMP usage).

My examples use large arrays ( 2GB - 16GB + ) so more intense calcs on smaller arrays may be different and have different bottlenecks to overcome to approach the quoted AVX rates.

There is another interesting example of MATMUL for large matrices in gFortran Ver 7 (e.g. Real*8, dimension(8000,8000) :: a,b,c ; c = MATMUL (a,b) ), where the solution involved partitioning the MATMUL into sub-matrices aa(4,4) and bb(4,4) which fit into L1 cache. This produces about a 10x speed improvement using AVX2 instructions over the previous compiler version.
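A rough sketch of the partitioning idea follows. This is not gFortran's internal MATMUL implementation; it is a plain hand-written tiled multiply, and the matrix size `n` and tile size `nb` are assumptions picked only to keep the example small. It shows the general technique of working on sub-blocks so that the data is reused while it is still in cache, and checks the result against the intrinsic MATMUL.

```fortran
! Sketch only: cache-blocked matrix multiply, checked against MATMUL.
! The matrix size n and tile size nb are illustrative assumptions.
program blocked_matmul
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 256, nb = 32
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:), cref(:,:)
  integer :: i0, j0, k0, i, j, k

  allocate (a(n,n), b(n,n), c(n,n), cref(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0_dp

  ! Outer three loops walk over tiles; inner three work inside one tile,
  ! so each nb x nb piece of a, b and c is reused while it is in cache.
  do j0 = 1, n, nb
    do k0 = 1, n, nb
      do i0 = 1, n, nb
        do j = j0, min(j0 + nb - 1, n)
          do k = k0, min(k0 + nb - 1, n)
            ! Innermost loop runs down a column (unit stride), which suits SIMD
            do i = i0, min(i0 + nb - 1, n)
              c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end do
          end do
        end do
      end do
    end do
  end do

  ! Compare with the intrinsic; the difference should be rounding noise only
  cref = matmul(a, b)
  print '(a,es10.3)', 'max abs difference vs MATMUL: ', maxval(abs(c - cref))
end program blocked_matmul
```

The best tile size depends on the cache sizes of the processor in use, so `nb` would need to be tuned by experiment rather than taken from this sketch.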

Getting AVX or AVX512 to produce the claimed performance heavily depends on getting the arrays into cache, at the rate required. Identifying how to do this can be difficult and does depend on the processor and memory mix being used. Unfortunately, for me, it is a learning experience with each new type of project. (With increased number of threads, the shared memory transfer rates also need to be increased.)

Old processors with older, slower memory look to be a very unlikely win for my type of calculations.

DanRRight · Posted: Thu May 13, 2021 6:43 pm Post subject:

Downloaded latest Intel MKL (it's free, btw), compiled this code above with it

mecej4 · Joined: 31 Oct 2006 Posts: 1886

I ran the Lapack example code given above, and it ran fine.

I used the latest versions (as of May 2021) of FTN95, XSLINK64 (this is the 64-bit SLINK64 Version 3.02, which I renamed in order to keep the older SLINK64 as a backup) and Intel MKL.

DanRRight · Posted: Fri May 28, 2021 6:57 am Post subject:

Thanks Mecej4, will look again what's the problem with my setup

DanRRight · Posted: Fri May 28, 2021 10:13 pm Post subject:

Still having problems. Maybe MKL does not like a competitor's AMD processor?
First it complained about a path not found, then it started to crash. It crashes even if I place the EXE into the MKL dir with its DLLs.

mecej4 · Joined: 31 Oct 2006 Posts: 1886

It has been over 15 years since I acquired a PC with an AMD CPU, so I cannot try out your program and MKL on such a machine. Errors of the type that you report can be caused by trying to run Intel-specific instructions on an AMD CPU that does not support them. I remember that each time that I installed a new version of the Intel compiler on that PC I had to spend some time to ascertain which compiler options to use and still obtain an EXE that ran properly and fast on that AMD CPU (Athlon X2 - 4200+).

I suggest that you try the example after compiling the source using the Intel Fortran compiler. If you still receive the error traceback after the exception is taken, you can file a bug report on Intel's MKL forum, giving details on what compiler version and compiler options you used, and the version of MKL used.

LitusSaxonicum · Posted: Sat May 29, 2021 3:28 pm Post subject:

Look at Wikipedia ( https://en.wikipedia.org/wiki/Math_Kernel_Library ) for 'Performance and Vendor Lock-in'. It may help.

I have used AMD cpus in desktop computers I have built myself since Pentium days. For what I do (and I know that others may have a different experience) I get far superior performance compared to the Intel-based machines I was cursed with at the Uni where I worked. Perhaps at the very top-end, using the most cutting-edge software, it may be the other way round.

It wouldn't surprise me (it didn't when I read the section in the Wikipedia article) that Intel cripples its software on AMD processors. That's in the dirty tricks arena, but is it any surprise?

Eddie

DanRRight · Posted: Sat May 29, 2021 9:13 pm Post subject:

I planned to check whether faster memory influences parallel linear algebra in the Intel case. Interestingly, another example, the LAIPE parallel linear algebra library made by the gFortran guy, does not depend on RAM speed, RAM MHz or latency at all. Faster RAM may change RAMdisk speed, for example, by 1/3.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

Dan,

I have a Ryzen 5900X so I will try to test your program. This processor does not support AVX512 instructions. I am not familiar with Lapack and MKL libraries but they might not be familiar with (identify) the latest AMD processors.

mecej4 has provided details on the test he carried out using xslink64.
I am still using FTN95 Ver 8.64.0 so might need to upgrade.

Eddie, there have been denials from Intel that AMD processors are restricted. I too was forced to use low spec Intel Xeon W3520 processors by a previous employer, who only believed HP salesmen. It was a revelation when I could choose alternatives.

Interestingly, I did a test of a gFortran .exe, generated for an i7-8700. This ran slower on the Ryzen, but when recompiled on the Ryzen using -march=native, it ran much faster than on the 8700.
gFortran can detect the instruction set available on Ryzen processors (even ones it didn't know about) but perhaps Intel is not as aggressive for non-Intel processors. I am not sure whether this is right.
I am amazed at gFortran's support for different instruction sets, which would be a huge ask for FTN95 to contemplate. FTN95 does report some available instructions, as does CPUID.

mecej4 · Joined: 31 Oct 2006 Posts: 1886

John Campbell wrote: "mecej4 has provided details on the test he carried out using xslink64. I am still using FTN95 Ver 8.64.0 so might need to upgrade."

Let me clarify.

For the test example, tlapack.f90, I have no reason to believe that the older 8.64 compiler or the 32-bit slink64 will not suffice (I have not tried, since I have replaced them with the newer versions).

When you use the MKL libraries with a main program compiled with a non-Intel compiler, some of the speed features such as instruction set detection, OpenMP and other threading, etc., will not be available unless one writes code to enable those features and an API for doing the same is available.

I use the 64-bit slink64, which I renamed to "xslink64", in order to give it a good workout and report bugs that I find (I found one, reported it, and that bug has already been fixed).

DanRRight · Posted: Thu Jul 08, 2021 10:08 pm Post subject:

John, have you succeeded with the installation of the Intel parallel MKL library?

By the way, is your new AMD-based computer rock stable? My AMD one crashes every two days despite the fact that I do not overclock it (it does not allow it, almost no room is left for that, AMD squeezed everything from it already to look good vs Intel) and have water cooling. I am starting to regret that I took it. This is the third time I have bought AMD and then regretted it or returned it. Unfortunately Intel is too late with newer chips and motherboards.

mecej4 · Joined: 31 Oct 2006 Posts: 1886

Dan, how old is that computer with the AMD CPU? Please elaborate on what you mean by "crash". Is it your program or Windows that crashes? Have you looked into the details, using tools such as Event Viewer (eventvwr.exe)?

The traceback that you posted on May 28 shows, at the end,