forums.silverfrost.com

mecej4 · Joined: 31 Oct 2006 Posts: 1903

It is to be expected that /checkmate would force allocation of memory at the outset. Uninitialized variables, including some big arrays, have to be filled with special values so that, when the same variables are used later, their values can be compared with the special value to detect whether they have been initialized.
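
A minimal sketch of the sentinel idea described above, written in Python rather than Fortran. The byte pattern `UNDEFINED_BYTE` and the helper names are hypothetical and are not FTN95's actual implementation; the point is only that every byte must be written at allocation time, which is why the memory is committed at the outset.

```python
import struct

# Hypothetical sentinel byte; the real pattern FTN95 uses is not stated here.
UNDEFINED_BYTE = 0x80
UNDEFINED_INT32 = bytes([UNDEFINED_BYTE]) * 4


def checked_allocate(n_bytes):
    """Allocate a buffer and fill every byte with the sentinel."""
    return bytearray([UNDEFINED_BYTE]) * n_bytes


def write_int32(buf, index, value):
    """Store a 32-bit integer, overwriting the sentinel in that slot."""
    struct.pack_into("<i", buf, 4 * index, value)


def read_int32(buf, index):
    """Load a 32-bit integer, refusing slots that still hold the sentinel."""
    raw = bytes(buf[4 * index:4 * index + 4])
    if raw == UNDEFINED_INT32:
        raise ValueError(f"element {index} used before being defined")
    return struct.unpack("<i", raw)[0]
```

One limitation of any scheme like this sketch: a legitimately stored value whose bytes happen to equal the sentinel pattern would be reported as undefined.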

PaulLaidler · Posted: Fri Mar 30, 2018 12:40 pm Post subject:

ALLOCATE for 32 bit /CHECK uses its own memory allocation based on existing blocks of VirtualAlloc memory and sets to the "undefined" state when called.

ALLOCATE for 64 bit /CHECK uses GlobalAlloc/HeapAlloc and sets to the "undefined" state when called.

wahorger · Joined: 13 Oct 2014 Posts: 1269 Location: Morrison, CO, USA

Thanks for the explanation, Paul.

PaulLaidler · Posted: Mon Apr 02, 2018 12:28 pm Post subject:

Please go to the following post regarding new DLLs...

http://forums.silverfrost.com/viewtopic.php?p=24394#24394

dpannhorst · Joined: 29 Aug 2005 Posts: 165 Location: Berlin, Germany

The dropbox link to new dlls leads to an error.

Detlef Pannhorst

PaulLaidler · Posted: Fri Apr 06, 2018 7:08 pm Post subject:

Yes. The above link explains why the download has been removed.

DanRRight · Posted: Mon Apr 09, 2018 7:38 am Post subject: Re:

PaulLaidler · Posted: Mon Apr 09, 2018 10:59 am Post subject:

Please go to the following post regarding new DLLs.

http://forums.silverfrost.com/viewtopic.php?p=24467#24467

JohnCampbell · Joined: 16 Feb 2006 Posts: 2621 Location: Sydney

Dan,

Ver 8.3 provides more multi-threading options.
I am looking to see what I can achieve and will update shortly.

John

DanRRight · Posted: Mon Apr 09, 2018 10:09 pm Post subject:

Interesting. I would like to look, but I'm too busy to experiment right now. In the meantime, for you, John, Paul and those who have already started, I have a few questions about this parallel method:

1) What's new here compared to the previous method, which already allowed parallel threads to be started?

2) Was a LOCK mechanism implemented, like in FTN95 for .NET, that allows printing without the danger of threads crashing? This is a big problem during debugging, because a lot of I/O happens at that time.

3) How fast is this method compared to the parallel example for .NET I posted a few years back (see the link below, use my last demo), which showed an amazing and so far unexplained speedup of more than 6.2x on typical 4-core, 8-thread processors?

4) Has anyone already bought one of the new cheap 8-, 16- or even 32-core AMD processors? How fast is the method on AMD vs Intel?

Here is the URL for the FTN95 for .NET case.
http://forums.silverfrost.com/viewtopic.php?t=2534&highlight=net+multithreading

JohnCampbell · Joined: 16 Feb 2006 Posts: 2621 Location: Sydney

Dan,

Interesting questions, but I will try to answer a few of my own first.

Why try AMD when Intel is so cheap?
I just bought an i7-8700K, which has 6 cores and 12 threads. The important feature is that it supports 2666 MHz memory, which provides greater memory transfer bandwidth. It gives a noticeable improvement over the i7-4790K for a multi-threaded equation solution of a 300 Mb skyline matrix with 12 threads. The 4790K (4 cores, 8 threads) loses efficiency above 4 threads when hyper-threading, which I attribute to the slower 1600 MHz memory.

My use of multi-threading is fairly basic. The FTN95 approach does require some care when managing private variables. My approach is to immediately call a routine, which then allocates local variables for all private variables, while shared arrays are allocated before thread initiation to provide thread-based accumulators (even the thread ID must be private!). I am now trying to emulate SCHEDULE(DYNAMIC) and CRITICAL.
FTN95 threading could offer a lot of potential, as opening an OMP PARALLEL region can take 30,000 processor cycles on other compilers, which kills small-load threads.
I still have some work to do to complete this approach.
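
The structure described above can be sketched roughly as follows. This is Python, not FTN95, and every name in it (`worker`, `parallel_sum_of_squares`, `next_index`) is hypothetical. It shows three things: private variables living as locals of a routine called by each thread, a shared accumulator array allocated before the threads start with one slot per thread, and a lock-protected work counter standing in for CRITICAL and SCHEDULE(DYNAMIC).

```python
import threading


def worker(thread_id, work_items, partial_sums, next_index, lock):
    # Everything declared in here is local to this call, so it is
    # private to the thread (including thread_id itself).
    local_sum = 0.0
    while True:
        # Critical section: only one thread at a time takes the next item.
        with lock:
            i = next_index[0]
            next_index[0] += 1
        if i >= len(work_items):
            break
        local_sum += work_items[i] ** 2
    # Each thread writes only to its own slot of the shared array.
    partial_sums[thread_id] = local_sum


def parallel_sum_of_squares(work_items, n_threads=4):
    # Shared state is allocated before any thread is started.
    partial_sums = [0.0] * n_threads
    next_index = [0]  # shared counter handing out work dynamically
    lock = threading.Lock()
    threads = [
        threading.Thread(
            target=worker,
            args=(t, work_items, partial_sums, next_index, lock),
        )
        for t in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Reduction over the per-thread accumulators after all threads finish.
    return sum(partial_sums)
```

Because of Python's global interpreter lock this sketch illustrates the bookkeeping only, not any speedup; the same layout would have to be expressed with the FTN95 threading routines to measure performance.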

John

DanRRight · Posted: Thu Apr 12, 2018 12:59 am Post subject:

With computers, the minimal unit of measurement is a factor of 2. Two computers within a factor of 2 of each other in performance are essentially equal. Otherwise, if you think a 20% difference is a lot, you would have to buy a new computer with each and every 20% increase (which translates to every few months). This explains my questions below.

It would be interesting to test and find out what is better for large-scale linear algebra:

- double the number of cores, or
- double the speed of RAM, or
- quad-channel vs dual-channel memory architecture, or
- double the cache size, or
- double the hard drive speed?

Assuming the RAM size is not a problem, the last question is also not a problem. But there exist 4300 MHz Corsair DDR4 RAM modules, which are almost a factor of 2 faster than the typical 1600-2400 MHz ones. There exist 20-30 MB caches versus the typical 9-12 MB. There exist quad-channel memory transfer speeds, etc. What is performance mostly bound by when the matrix size is very large?
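
To put rough numbers on the RAM options above, here is a small back-of-the-envelope helper. It assumes the usual 64-bit (8-byte) DDR channel and treats the quoted "MHz" module ratings as transfers per second; the function name is hypothetical, and the result is a theoretical peak, not a measured rate.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, channels, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


# Compare the configurations mentioned in this thread.
for label, rate, channels in [
    ("1600, dual channel", 1600, 2),
    ("2666, dual channel", 2666, 2),
    ("2666, quad channel", 2666, 4),
    ("4300, dual channel", 4300, 2),
]:
    print(f"{label}: {peak_bandwidth_gb_s(rate, channels):.1f} GB/s")
```

By this estimate, doubling the channel count buys as much theoretical bandwidth as doubling the module speed, which is why both appear in the list of things worth testing.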

JohnCampbell · Joined: 16 Feb 2006 Posts: 2621 Location: Sydney

Dan,

All these are significant, as they are related.
I find the bottleneck is with transfers between memory and cache.
So speed of RAM and cache size are the most significant.

I am not familiar with "quad channel vs dual channel memory architecture" so if it affects transfer rates then that would be related.

"double amount of cores" would change the number of threads (?) so would be significant.

The other significant factor is modifying the calculation to minimise memory-to-cache transfers, i.e. using a cache-smart algorithm.
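
As an illustration of what "cache smart" can mean, here is a sketch of a blocked (tiled) matrix multiply in plain Python. It is not John's solver and not any compiler's MATMUL; the function name and the default block size of 4 are assumptions. The idea is that each small tile of the operands is reused many times while it is still in cache, instead of streaming whole rows and columns from memory for every result element.

```python
def matmul_blocked(a, b, block=4):
    """Multiply matrices a (n x m) and b (m x p), given as lists of rows,
    working on block x block tiles to improve data reuse."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                # Update one tile of c from one tile of a and one tile of b.
                for i in range(i0, min(i0 + block, n)):
                    row_c = c[i]
                    for k in range(k0, min(k0 + block, m)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(j0, min(j0 + block, p)):
                            row_c[j] += a_ik * row_b[j]
    return c
```

In an interpreted language the tiling brings no speed benefit; the sketch only shows the loop structure. In compiled code the block size would be tuned so that the working tiles fit in a given cache level.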

What is interesting is that performance is less affected by the processor clock rate, as the bottleneck is memory <> cache transfers.

What I am still trying to understand is how to use separate memory pages for each thread, as sharing pages between threads can affect memory coherence.
("Memory Coherence" is my latest unknown. The difficulty is that if you don't understand how this affects performance, it is difficult to construct a test that identifies the problem, especially demonstrating how to run without the problem.)

Has anyone experienced the improvement in MATMUL performance in gfortran Ver 7+ for large matrices? They have changed the algorithm: it now works on 4x4 sub-matrices and achieves, on a single thread, the performance that I achieve using 4 threads! Their approach is cache smart plus vector instructions, achieving surprising single-thread performance and demonstrating that there is much to learn about managing the multi-level cache architecture.

Still much to learn!

mecej4 · Joined: 31 Oct 2006 Posts: 1903

There was an interesting contribution by "Repeat Offender" in the Intel Fortran forum, in which he showed that doing arithmetic using AVX instructions instead of a straight table lookup enabled a program to run 400 times faster. The chosen task: converting the text of an e-bible, about 4.5 MB long, to upper case.

See https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/757222#comment-1918919 . You may have to sign in to make his post visible.
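
For readers who cannot open that post, the two formulations being compared look roughly like this. This is a Python sketch, not the code from the Intel forum and not AVX; the function names are hypothetical, and it assumes plain ASCII text. The table version does one memory lookup per character, while the arithmetic version uses only a compare and a subtract with no branch, which is the form that maps naturally onto vector instructions.

```python
# 256-entry translation table: lower-case ASCII letters map to upper case.
UPPER_TABLE = bytes((c - 32 if 97 <= c <= 122 else c) for c in range(256))


def upper_by_table(text: bytes) -> bytes:
    """Upper-case ASCII text with one table lookup per byte."""
    return bytes(UPPER_TABLE[c] for c in text)


def upper_by_arithmetic(text: bytes) -> bytes:
    """Upper-case ASCII text with branch-free arithmetic per byte.

    ((c - 97) & 0xFF) < 26 is true exactly for 'a'..'z'; the boolean
    is used as 0 or 1 to decide whether 32 is subtracted.
    """
    return bytes(c - 32 * (((c - 97) & 0xFF) < 26) for c in text)
```

Both functions give the same result for every byte value; any speed difference of the kind reported would only show up in compiled, vectorised code.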

DanRRight · Posted: Sat Apr 14, 2018 8:56 am Post subject:

No, the Intel forum does not require registration. By the way, their forums allow much larger pieces of source code to be posted, and the forum design looks more modern.

If our linear algebra is actually memory-bandwidth bound, then AVX may not influence performance much. What would be good to check is whether memory architecture matters or not. Today AMD announced their second iteration of 8 core 4 memory channel processors at an even cheaper price of $330. There are also rumors flying about 48- and 64-core AMD chips with 256 MB cache and an 8-channel memory architecture.

For memory-bound tasks, the optimum processor could have a low clock rate, as long as it has as many cores and as many memory channels as possible.