Author |
Message |
mecej4
Joined: 31 Oct 2006 Posts: 1899
|
Posted: Fri Mar 30, 2018 11:47 am Post subject: |
|
|
It is to be expected that /checkmate would force allocation of memory at the outset. Uninitialized variables, including some big arrays, have to be filled with special values so that, when the same variables are used later, their values can be compared with the special value to detect whether they have been initialized. |
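The idea described above can be sketched roughly as follows. This is only an illustration in Python, not FTN95's implementation: the marker value `UNDEFINED` and the class `CheckedArray` are hypothetical names chosen for the example, and the actual bit pattern FTN95 uses is not stated in this thread.

```python
# Hypothetical marker (0x80808080 read as a signed 32-bit integer).
# This is an assumption for illustration, not FTN95's documented value.
UNDEFINED = -2139062144


class CheckedArray:
    """Integer array whose elements start out marked as undefined."""

    def __init__(self, n):
        # Every element is written at allocation time, so the memory
        # has to be committed at the outset rather than on first use.
        self._data = [UNDEFINED] * n

    def __setitem__(self, i, value):
        self._data[i] = value

    def __getitem__(self, i):
        value = self._data[i]
        # Compare against the marker to detect a read before any write.
        if value == UNDEFINED:
            raise RuntimeError(f"element {i} used before being set")
        return value
```

A known limitation of any scheme like this sketch is that a legitimately stored value equal to the marker would be reported as undefined.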
|
|
 |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8210 Location: Salford, UK
|
Posted: Fri Mar 30, 2018 12:40 pm Post subject: |
|
|
ALLOCATE for 32-bit /CHECK uses its own memory allocation, based on existing blocks of VirtualAlloc memory, and sets the allocated memory to the "undefined" state when called.
ALLOCATE for 64-bit /CHECK uses GlobalAlloc/HeapAlloc and sets the allocated memory to the "undefined" state when called. |
|
|
 |
wahorger

Joined: 13 Oct 2014 Posts: 1257 Location: Morrison, CO, USA
|
Posted: Sun Apr 01, 2018 12:13 am Post subject: |
|
|
Thanks for the explanation, Paul. |
|
|
 |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8210 Location: Salford, UK
|
|
|
 |
dpannhorst
Joined: 29 Aug 2005 Posts: 165 Location: Berlin, Germany
|
Posted: Fri Apr 06, 2018 7:03 pm Post subject: |
|
|
The Dropbox link to the new DLLs leads to an error.
Detlef Pannhorst |
|
|
 |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8210 Location: Salford, UK
|
Posted: Fri Apr 06, 2018 7:08 pm Post subject: |
|
|
Yes. The above link explains why the download has been removed. |
|
|
 |
DanRRight
Joined: 10 Mar 2008 Posts: 2923 Location: South Pole, Antarctica
|
Posted: Mon Apr 09, 2018 7:38 am Post subject: Re: |
|
|
wahorger wrote: | I am observing that V8.30.0 is much faster at 32-bit compiling than 8.20.0. |
All versions were always superfast like no other compiler; I have automatically kept compilation speed results since 1999. Where other compilers spend 3 minutes, FTN95 compiles in 3 seconds. That happens many times per day. And since during program development (in my case this is the vast majority of the time spent) compilation and debugging speed are key, they are far more important than run time. Usually people say that they chose Fortran for its run speed. But without parallelisation and supercomputers this is absurd. In reality, if they use a PC, they lose most of their time on development. Lose just 3 seconds per day and by the end of your life you will have lost 24 hours. Actually we lose many, many hours per day; how much this adds up to over a lifetime is scary even to say. |
|
|
 |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8210 Location: Salford, UK
|
|
|
 |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2615 Location: Sydney
|
Posted: Mon Apr 09, 2018 1:29 pm Post subject: |
|
|
Dan,
Ver 8.3 provides more multi-threading options.
I am looking to see what I can achieve and will update shortly.
John |
|
|
 |
DanRRight
Joined: 10 Mar 2008 Posts: 2923 Location: South Pole, Antarctica
|
Posted: Mon Apr 09, 2018 10:09 pm Post subject: |
|
|
Interesting, I would like to look, but I'm too busy now to experiment. Meanwhile, for you, John, Paul and those who have already started, I have a few questions about this parallel method:
1) What's new here compared to the previous method, which allowed starting parallel threads?
2) Was the LOCK mechanism implemented as in FTN95 for .NET, allowing printing without the danger of threads crashing? This is the big problem during debugging because a lot of I/O happens at that time.
3) How fast is this method compared to the parallel example for .NET I posted a few years back (see the link below, use my last demo), which showed an amazing and still unexplained speedup of more than 6.2x on typical 4-core 8-thread processors?
4) Has anyone already bought the new cheap 8-, 16- or even 32-core AMD processors? How fast is the method on AMD vs Intel?
Here is the URL for the FTN95 for .NET case.
http://forums.silverfrost.com/viewtopic.php?t=2534&highlight=net+multithreading |
|
|
 |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2615 Location: Sydney
|
Posted: Tue Apr 10, 2018 3:47 am Post subject: |
|
|
Dan,
Interesting questions, but I will try to answer a few of my own first.
Why try using AMD when Intel is so cheap?
I just bought an i7-8700K, which has 6 cores for 12 threads. The important feature is that it supports 2666 MHz memory, which provides greater memory transfer bandwidth. It gives a noticeable improvement over the i7-4790K for multi-threaded equation solution of a 300 Mb skyline matrix with 12 threads. The 4790K (4 cores, 8 threads) loses efficiency above 4 threads when hyper-threading, which I attribute to the slower 1600 MHz memory.
My use of multi-threading is fairly basic. The FTN95 approach does require some care when managing private variables. My approach is to immediately call a routine, which then allocates local variables for all private variables, while shared arrays are allocated before thread initiation to provide thread-based accumulators (even the thread ID must be private!). I am now trying to emulate SCHEDULE(DYNAMIC) and CRITICAL.
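The private/shared split described above might look roughly like this. It is a sketch in Python rather than FTN95, and the names `worker`, `threaded_sum` and `partial` are hypothetical; it shows only the structure (Python threads will not give the speedup discussed here).

```python
import threading


def worker(tid, lo, hi, data, partial):
    # Everything declared inside this routine is local to the call,
    # so it is private to the thread (including tid itself).
    local_sum = 0.0
    for i in range(lo, hi):
        local_sum += data[i]
    # Each thread writes only to its own accumulator slot.
    partial[tid] = local_sum


def threaded_sum(data, n_threads=4):
    # Shared accumulator array, allocated before any thread starts.
    partial = [0.0] * n_threads
    chunk = (len(data) + n_threads - 1) // n_threads
    threads = []
    for tid in range(n_threads):
        lo = min(tid * chunk, len(data))
        hi = min(lo + chunk, len(data))
        t = threading.Thread(target=worker, args=(tid, lo, hi, data, partial))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    # Combine the per-thread accumulators after all threads finish.
    return sum(partial)
```

Because each thread touches only its own slot of `partial`, no lock is needed for the accumulation in this sketch; the static chunking here is the simple case, not an emulation of SCHEDULE(DYNAMIC).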
FTN95 threading could offer a lot of potential, as opening an OMP PARALLEL region can take 30,000 processor cycles on other compilers, which kills threads with small workloads.
I still have some work to do to complete this approach,
John |
|
|
 |
DanRRight
Joined: 10 Mar 2008 Posts: 2923 Location: South Pole, Antarctica
|
Posted: Thu Apr 12, 2018 12:59 am Post subject: |
|
|
With computers, the minimal unit of measurement is a factor of 2. Two computers within a factor of 2 in performance are essentially equal. Otherwise, if one thinks a 20% difference is a lot, one would have to buy a new computer with every 20% increase (which translates to every few months). This explains my questions below.
It would be interesting to test and find what is better for large-scale linear algebra:
- double the number of cores, or
- double the speed of RAM, or
- quad-channel vs dual-channel memory architecture, or
- double the cache size, or
- double the hard drive speed?
Assuming the RAM size is not a problem, the last question is also not a problem. But there exist 4300 MHz Corsair DDR4 RAM modules, which are almost a factor of 2 faster than the typical 1600-2400 MHz ones. There exist 20-30 MB caches versus the typical 9-12 MB. There exist quad-channel memory transfer speeds, etc. What is performance mostly bound by when the matrix size is very large? |
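For the RAM speed and channel count questions, a rough upper bound can be worked out from the module rating. The sketch below assumes the usual 64-bit (8-byte) data bus per channel and treats the quoted "MHz" figures as transfers per second; the function name `peak_bandwidth_gb_s` and the listed configurations are illustrative, and real sustained bandwidth is lower than these theoretical peaks.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, channels, bus_bytes=8):
    """Theoretical peak in GB/s: transfers/s x bytes per transfer x channels."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


# Illustrative configurations only (assumed, not measured).
configs = [
    ("1600 rating, dual channel", 1600, 2),
    ("2666 rating, dual channel", 2666, 2),
    ("2666 rating, quad channel", 2666, 4),
    ("4300 rating, dual channel", 4300, 2),
]

for name, rate, channels in configs:
    print(f"{name}: {peak_bandwidth_gb_s(rate, channels):.1f} GB/s")
```

On this estimate, doubling the channel count and doubling the transfer rate both double the peak figure, so neither is favoured on paper; only a measurement on the actual solver would separate them.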
|
|
 |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2615 Location: Sydney
|
Posted: Thu Apr 12, 2018 1:40 am Post subject: |
|
|
Dan,
All these are significant, as they are related.
I find the bottleneck is with transfers between memory and cache.
So speed of RAM and cache size are the most significant.
I am not familiar with "quad channel vs dual channel memory architecture" so if it affects transfer rates then that would be related.
"double amount of cores" would change the number of threads (?) so would be significant.
The other significant factor is modifying the calculation to minimise the memory-to-cache transfers, i.e. using a cache-smart algorithm.
What is interesting is that performance is less affected by the processor clock rate, as the bottleneck is memory <> cache transfers.
What I am still trying to understand is how to use separate memory pages for each thread, as sharing pages between threads can affect memory coherence.
("Memory Coherence" is my latest unknown. The difficulty is that if you don't understand how this affects performance, it is difficult to construct a test that identifies the problem, especially demonstrating how to run without the problem.)
Has anyone experienced the improvement in MATMUL performance in gFortran Ver 7+ for large matrices? They have changed the algorithm and it works on 4x4 sub-matrices and achieves performance on a single thread that I achieve using 4 threads ! Their approach is cache smart + vector instructions, achieving surprising single thread performance, demonstrating there is much to learn about managing the multi-level cache architecture.
still much to learn ! |
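To make "cache smart" concrete, here is a minimal sketch of a blocked (tiled) matrix multiply. It is not gFortran's MATMUL implementation, which is not shown in this thread; the function name `matmul_blocked` and the default tile size of 4 (echoing the 4x4 sub-matrices mentioned above) are assumptions, and plain Python will not show the speed benefit. It only illustrates the loop structure, and assumes non-empty rectangular inputs.

```python
def matmul_blocked(a, b, block=4):
    """Multiply a (n x m) by b (m x p), both lists of rows, tile by tile."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # The outer three loops walk over tiles; the inner three loops work
    # inside one tile, so the same small blocks of a, b and c are reused
    # many times while they are still resident in cache.
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                for i in range(i0, min(i0 + block, n)):
                    row_c = c[i]
                    for k in range(k0, min(k0 + block, m)):
                        aik = a[i][k]
                        row_b = b[k]
                        for j in range(j0, min(j0 + block, p)):
                            row_c[j] += aik * row_b[j]
    return c
```

In a compiled language the tile size would be tuned to the cache levels, and the innermost loop is where vector instructions would apply.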
|
|
 |
mecej4
Joined: 31 Oct 2006 Posts: 1899
|
|
|
 |
DanRRight
Joined: 10 Mar 2008 Posts: 2923 Location: South Pole, Antarctica
|
Posted: Sat Apr 14, 2018 8:56 am Post subject: |
|
|
No, Intel does not need registering. By the way, their forums allow posting much larger source code sizes. And the forum design also looks more modern.
If our linear algebra is actually memory-bandwidth bound, then AVX may not influence performance much. What would be good to check is whether memory architecture matters or not. Today AMD announced their second iteration of 8 core 4 memory channel processors at an even cheaper price of $330. Also, rumors are flying about 48- and 64-core AMD chips with 256 MB cache and an 8-channel memory architecture.
For memory-bound tasks, the optimum processor could run at any low MHz, with just as many cores and as many memory channels as possible. |
|
|
 |
|