forums.silverfrost.com

mecej4 · Joined: 31 Oct 2006 Posts: 1951 Location: USA

Dan, yes. Today. The program aborted before the last line was printed, since I do not have that much RAM (I have 8 GB).

If A = 0 has become a bottleneck, perhaps you should represent A as a sparse matrix. Do you have an estimate of the ratio [number of non-zero entries/(n_rows X n_columns]?

JohnCampbell · Joined: 16 Feb 2006 Posts: 2629 Location: Sydney

Dan,

I tested the following program using:
FTN95 Ver 8.62.0
gFortran 10.2.0

On i7-8700k with 32 GB memory and Samsung SSD 960 EVO 500GB

JohnCampbell · Joined: 16 Feb 2006 Posts: 2629 Location: Sydney

more on MATMUL / OpenMP

However, with 6 cores, there is only 1 memory, which is where the bottleneck now is. Not sure how more cores will help OpenMP.

MATMUL with large arrays is very memory intensive. Other calculations have different memory demand rates so might use more threads.

Next test is to use more cores with higher memory bandwidth, which implies new hardware, perhaps R9 3900X or more likely DDR5 memory ?

"A = 0." is very much a memory bottleneck so multi-thread would need to be targeted, with minimum benefit, as Eddie notes. You don't re-initialise thousands of times ?

DanRRight · Posted: Sun Aug 30, 2020 8:08 am Post subject:

John, Here are my results on 32GB DDR3 RAM computer 4.4GHz with Samsung SSD as caching drive

JohnCampbell · Joined: 16 Feb 2006 Posts: 2629 Location: Sydney

Dan,

We have demonstrated that addressing a 38GByte array with only 32 GBytes of physical memory works ok when pagefile.sys is on an SSD (M.2) drive. Note that the test was only for sequential use of the array, as random access would be much worse.

However, if you have [virtual] "matrix size is 100-150-200 GB" but use(address) only "30-40 GB", you should have more than 30-40GB of physical memory, else you will be resorting to frequent paging.
If you use "A = 0." on the full virtual matrix, that would be a disaster, while zeroing 40 GB chunks would also be a problem ( although zeroing 38GB array on 32 GB memory appears to be delayed but manageable in tests above)
You have to be careful when addressing a virtual matrix efficiently.

The Block example code I presented, was based on your description of "blocks" where I was trying to demonstrate a way of addressing the single block.
You could multi-thread the code, by using a seperate thread for each block, as below, although I am not sure that derived types are supported by !$OMP:

mecej4 · Joined: 31 Oct 2006 Posts: 1951 Location: USA

LitusSaxonicum · Posted: Sun Aug 30, 2020 2:58 pm Post subject:

I'm afraid that I did follow up the reference, and the problem wasn't that of the computer going to sleep, rather that the source made my eyes glaze over! It takes more than a mouse nudge to wake me from that state!

Eddie

JohnCampbell · Joined: 16 Feb 2006 Posts: 2629 Location: Sydney

Dan,

Another test of A=0, using OpenMP with 4 threads, which results in -44% to 53% reduction of run times, 34% for 38GByte ( not T/4, due to memory and other delays. ) Still an interesting result.

DanRRight · Posted: Tue Sep 01, 2020 12:03 am Post subject:

That was great result, John. Factor of 2 increase is obvious when it fits into the RAM and hence there is no swapping to the SSD. Possibly OMP uses dual channel configuration of RAM, while serial code does not. There exist 4, 6 and 8 -channel hardware with server processors and chipsets (funny, but 8-channel cases can be found in latest mobile chips, the PC world now is lagging), hence it is good to check this assumption in the future

JohnCampbell · Joined: 16 Feb 2006 Posts: 2629 Location: Sydney

What multi-thread "A = 0" demonstrates is that it does not take too many threads to bottleneck memory transfer capacity. ( apparently 2 ! )

OpenMP is only suited to calculations/algorithms where the combined memory transfer demands don't exceed the memory transfer bandwidth. Cache sharing can help, but big arrays challenge this. You can't just keep adding threads.
You have mentioned 64 thread EPYC, but unless the memory reads can be distributed over all available channels, the algorithm is not going to scale up to more threads. Does 8-channel mean multiple channels can access the same memory locations, or do allocated memory pages need to be distributed across channels to suit the algorithm ?

It is difficult to understand the practical limits of marketing claims.