How to cRAM as much computing power as possible into a program (Page 2 of 2)

mecej4
Joined: 31 Oct 2006, Posts: 1885
Posted: Sat Aug 29, 2020 5:26 am

Quote:
If it lags then this needs fixing which will benefit all users.

Ahem, it will benefit all users who write programs whose only purpose is to set a huge matrix to zero and do nothing else with the matrix.

In fact, the compiler could perform a data flow analysis and decide to skip the assignment completely, since the values in the matrix are not used subsequently. The size of the matrix is independent of what is in the matrix, as is the information needed to deallocate.
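
A minimal example of such a dead store (a sketch, not from the discussion above): since A is never read after the assignment, a data-flow-aware compiler could legally skip it.
Code:
 real*4, allocatable :: A(:,:)
 allocate (A(10000,10000))
 A = 0              ! dead store: A is never read afterwards
 deallocate (A)     ! only the allocation size matters for deallocation
 end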

That aside, Intel Fortran produces EXEs that run about twice as fast as FTN95-compiled EXEs, for sequential programs.

If your program is provided with properly formulated OpenMP directives or can use threaded libraries, that factor can become 2 X n_cores.
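
As a minimal illustration of such a directive (a sketch; whether WORKSHARE actually distributes the assignment across threads depends on the compiler, and the code must be built with OpenMP enabled, e.g. -fopenmp for gFortran):
Code:
 use omp_lib
 real*4, allocatable :: A(:,:)
 allocate (A(10000,10000))
 call omp_set_num_threads (4)
!$OMP PARALLEL WORKSHARE
 A = 0                          ! the compiler divides the array assignment among threads
!$OMP END PARALLEL WORKSHARE
 end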

The programmer would do better to make sure that good algorithms and data structures are used, that only needed calculations are done, and that the code is portable. Following this prescription enabled a speed-up by a factor of 270,000 in a recent exercise, covered in the Intel Fortran forum:

https://community.intel.com/t5/Intel-Fortran-Compiler/Problem-with-variable-that-change-their-values-without-apparent/td-p/1200298

Develop, fine-tune and debug with FTN95, and then try other compilers to gain speed.

LitusSaxonicum
Joined: 23 Aug 2005, Posts: 2388, Location: Yateley, Hants, UK
Posted: Sat Aug 29, 2020 1:54 pm

A speed-up of even 270,000 times is in practice useless if the execution time without the speed-up is rather small, say 1 second, and the program waits for a human to interact with it.

I frankly doubt that n cores give you an n-times speed-up, as there is only one memory space and a limited CPU-to-RAM channel, made worse if n is large.

There may be approaches other than a loop or nested loops, whether explicit or implicit (like setting array A = 0). Old-fashioned FTN77/386 had in-line routines MOVE@ and FILL@ that just might help, letting you overwrite more than one variable in one go.

Personally, I feel that if the array to be zeroed is small, the benefits from any fancy technique are likely to be small, and if the array is huge, then I have to question just what it is needed for. If you have a huge array and at the end of some operation it remains sparse, then there are techniques for dealing with that. The only benefit I can see from having the huge array is that addressing its cells is easy - and using that easiness is just being lazy.

Of course, in many cases, if the execution time is slow, who cares, so long as the program is running on a dedicated computer? There is a range of execution times that is unhelpful, i.e. the times where you wait for something to finish. If you know that the computer is taking an hour, you get on and do something useful in the time. If you get the runtime down to 10 minutes, you might just be tempted to sit it out, in which case you will fail one of the Kipling tests:

If you can fill the unforgiving minute
With sixty seconds' worth of distance run,

Eddie

DanRRight
Joined: 10 Mar 2008, Posts: 2813, Location: South Pole, Antarctica
Posted: Sun Aug 30, 2020 1:57 am

Mecej4, have you done the above test with serial Intel Fortran and gFortran and got the factor-of-2 speed-up?

mecej4
Joined: 31 Oct 2006, Posts: 1885
Posted: Sun Aug 30, 2020 2:20 am

Dan, yes. Today. The program aborted before the last line was printed, since I do not have that much RAM (I have 8 GB).

If A = 0 has become a bottleneck, perhaps you should represent A as a sparse matrix. Do you have an estimate of the ratio [number of non-zero entries / (n_rows X n_columns)]?
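
A minimal sketch for estimating that ratio, assuming a compiler that supports the F2003 kind= arguments of COUNT and SIZE (so the counts do not overflow 32-bit integers):
Code:
 real*4, allocatable :: A(:,:)
 integer*8 nz
 allocate (A(31623,31623))
 A = 0 ; A(1:1000,1:1000) = 1.0                 ! toy fill: a small dense corner
 nz = count(A /= 0.0, kind=8)                   ! kind=8: the count can exceed 2**31
 write (*,*) 'non-zero ratio =', real(nz,8)/real(size(A,kind=8),8)
 end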

JohnCampbell
Joined: 16 Feb 2006, Posts: 2554, Location: Sydney
Posted: Sun Aug 30, 2020 4:50 am

Dan,

I tested the following program using:
FTN95 Ver 8.62.0
gFortran 10.2.0

On an i7-8700K with 32 GB of memory and a Samsung SSD 960 EVO 500 GB:

Code:
 Real*4, allocatable :: A(:,:)
!
 integer*8 i0, i1, icount_rate, icount_max
 integer*4 i,j
 integer*1 k
 real*4    c, rj, mb
!
 open (10,file='log_a', position='append')
 write (10,10)
! 10 format (/'gFortran - Plato Release x64')
! 10 format (/'FTN95 - Plato Release x64')
! 10 format (/'gFortran zero_a.f90 -o zero_gf.exe -fimplicit-none -O2 -march=native')
 10 format (/'FTN95 zero_a.f90 /64 /opt /lgo')
!
 c = sqrt(10.)
 do i=6, 10
   j=nint(c**i)              ! the dimension grows by a factor of sqrt(10) each pass
   allocate (A(j,j))

   Call system_clock(i0, icount_rate, icount_max)
   A(:,:) = 0                ! the whole-array assignment being timed
   Call system_clock(i1, icount_rate, icount_max)

   rj = j ; Mb = rj**2*4/(1024.**2)
   k = i
   write ( *,*) k, ' Dim., Size_MB, Time', j, 4.*j*j/1e6, real(i1-i0)/icount_rate, Mb
   write (10,*) k, ' Dim., Size_MB, Time', j, 4.*j*j/1e6, real(i1-i0)/icount_rate, Mb
   deallocate (A)
 end do
 close (10)
! pause
 end



The output results are
Code:

gFortran - Plato Release x64
    6  Dim., Size_MB, Time        1000   4.00000000       9.33900010E-04   3.81469727   
    7  Dim., Size_MB, Time        3162   39.9929771       7.86439981E-03   38.1402740   
    8  Dim., Size_MB, Time       10000   400.000000       8.42124969E-02   381.469727   
    9  Dim., Size_MB, Time       31623   4000.05664      0.799299777       3814.75122   
   10  Dim., Size_MB, Time      100000   40000.0000       21.1393280       38146.9727   

FTN95 - Plato Release x64
            6 Dim., Size_MB, Time        1000     4.00000        1.800000E-03     3.81470
            7 Dim., Size_MB, Time        3162     39.9930        1.710000E-02     38.1403
            8 Dim., Size_MB, Time       10000     400.000        0.172500         381.470
            9 Dim., Size_MB, Time       31623     4000.06         1.61070         3814.75
           10 Dim., Size_MB, Time      100000     40000.0         27.0905         38147.0

gFortran zero_a.f90 -o zero_gf.exe -fimplicit-none -O2 -march=native
    6  Dim., Size_MB, Time        1000   4.00000000       5.49999997E-04   3.81469727   
    7  Dim., Size_MB, Time        3162   39.9929771       5.52489981E-03   38.1402740   
    8  Dim., Size_MB, Time       10000   400.000000       5.37659004E-02   381.469727   
    9  Dim., Size_MB, Time       31623   4000.05664      0.522874594       3814.75122   
   10  Dim., Size_MB, Time      100000   40000.0000       15.9236097       38146.9727   

FTN95 zero_a.f90 /64 /opt /lgo
            6 Dim., Size_MB, Time        1000     4.00000        8.000000E-04     3.81470
            7 Dim., Size_MB, Time        3162     39.9930        7.800000E-03     38.1403
            8 Dim., Size_MB, Time       10000     400.000        7.610000E-02     381.470
            9 Dim., Size_MB, Time       31623     4000.06        0.754100         3814.75
           10 Dim., Size_MB, Time      100000     40000.0         17.8507         38147.0

I am surprised by the results:
1) FTN95 is not much slower than gFortran.
2) The array is 38 GBytes on a 32 GB PC, yet there is little paging delay. It would be much different on an HDD.
3) Any computation on A would take much longer.
4) Integer*1 k ; write (10,*) k has a problem on FTN95.

This is surprisingly fast for a 38 GB array!

Multi-threading has its benefits, but only where OpenMP is suited.
MATMUL is very well suited to OpenMP. It also makes intensive use of memory bandwidth with AVX.
MATMUL with FTN95 Ver 8.62 using Real*8 might get 3 GFlop/s (without AVX (AXPY8@), less than 1 GFlop/s).
Using 6 threads can get 50 GFlop/s with 6 cores (12 threads are about the same).
However, with 6 cores there is still only one memory bus.
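
A minimal sketch of how such a MATMUL rate can be measured (not the benchmark behind the figures above; note real(n,8)**3, since n**3 would overflow a 32-bit integer):
Code:
 real*8, allocatable :: A(:,:), B(:,:), C(:,:)
 integer*8 t0, t1, rate
 integer*4, parameter :: n = 2000
 allocate (A(n,n), B(n,n), C(n,n))
 call random_number (A) ; call random_number (B)
 call system_clock (t0, rate)
 C = matmul (A, B)                              ! 2*n**3 floating-point operations
 call system_clock (t1)
 write (*,*) 'GFlop/s =', 2.0d0*real(n,8)**3/(real(t1-t0,8)/rate)/1.0d9
 end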

JohnCampbell
Joined: 16 Feb 2006, Posts: 2554, Location: Sydney
Posted: Sun Aug 30, 2020 5:17 am

More on MATMUL / OpenMP:

However, with 6 cores there is still only one memory bus, which is where the bottleneck now is. I am not sure how more cores would help OpenMP.

MATMUL with large arrays is very memory-intensive. Other calculations have different memory demand rates, so they might make use of more threads.

The next test is to use more cores with higher memory bandwidth, which implies new hardware: perhaps a Ryzen 9 3900X, or more likely DDR5 memory?

"A = 0." is very much memory-bound, so multi-threading would need to be targeted, with minimal benefit, as Eddie notes. You don't re-initialise thousands of times, do you?

DanRRight
Joined: 10 Mar 2008, Posts: 2813, Location: South Pole, Antarctica
Posted: Sun Aug 30, 2020 8:08 am

John, here are my results on a 32 GB DDR3 RAM computer at 4.4 GHz, with a Samsung SSD as the caching drive:

Code:
 Dim, Size_MB, Time        1000     4.00000        2.500000E-03
 Dim, Size_MB, Time        3162     39.9930        2.040000E-02
 Dim, Size_MB, Time       10000     400.000        9.660000E-02
 Dim, Size_MB, Time       31623     4000.06         1.10660   
 Dim, Size_MB, Time      100000     40000.0         27.5506   


Does FTN95 for .NET run in 64-bit mode? I'd try to multitask the A=0 case using the FTN95 for .NET multitasking demo I posted here in 2013. It showed amazing multithreading capabilities back then, which remain unexplained to this day.
http://forums.silverfrost.com/viewtopic.php?t=2534&highlight=net+multithreading


Mecej4: I do not want to complicate the task with the specifics of sparse-matrix block sizes. I will just comment that sometimes the matrix size is 100-150-200 GB while the actually used size is 30-40 GB. Now you will understand my pain when I had to load 30-100 such files, zeroing the matrix each time before loading the next data file. As I said, I fixed this by zeroing only the matrix elements that had just been used, before loading the new data file. But I want to speed things up even more.
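
In code, that fix amounts to something like the following sketch, where iused and jused are hypothetical bookkeeping variables recording the extent touched by the previous file:
Code:
 real*4, allocatable :: A(:,:)
 integer*4 iused, jused             ! hypothetical: extent populated by the last file
 allocate (A(100000,100000))        ! 40 GB virtual allocation (needs a large page file)
 iused = 30000 ; jused = 30000      ! suppose only a 30000 x 30000 corner was used
 A(1:iused,1:jused) = 0             ! zero ~3.6 GB instead of the full 40 GB
 end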

JohnCampbell
Joined: 16 Feb 2006, Posts: 2554, Location: Sydney
Posted: Sun Aug 30, 2020 10:08 am

Dan,

We have demonstrated that addressing a 38 GByte array with only 32 GBytes of physical memory works OK when pagefile.sys is on an SSD (M.2) drive. Note that the test was only for sequential use of the array; random access would be much worse.

However, if you have a [virtual] "matrix size of 100-150-200 GB" but use (address) only "30-40 GB", you should have more than 30-40 GB of physical memory, else you will be resorting to frequent paging.
If you use "A = 0." on the full virtual matrix, that would be a disaster, and zeroing 40 GB chunks would also be a problem (although zeroing a 38 GB array in 32 GB of memory appears to be delayed but manageable in the tests above).
You have to be careful to address a virtual matrix efficiently.

The block example code I presented was based on your description of "blocks", where I was trying to demonstrate a way of addressing a single block.
You could multi-thread the code by using a separate thread for each block, as below, although I am not sure that derived types are supported by !$OMP:
Code:
!  now process the defined blocks using OMP
!  ( block_array_records is the derived-type block list built in the code posted earlier )
!
      call omp_set_num_threads (4)
!
!$OMP PARALLEL DO   &
!$OMP& SHARED ( block_array_records, max_blocks )  &
!$OMP& PRIVATE ( i )  &
!$OMP& SCHEDULE (DYNAMIC)
       do i = 1, max_blocks
        if ( block_array_records(i)%block_size <= 0 ) cycle
!
        call process_block ( i, block_array_records(i)%block_size, block_array_records(i)%block )
!
       end do ! i
!$OMP END PARALLEL DO
!
    end

    subroutine process_block ( i, n, block_array )
    integer*4 i, n, ne, id, j, k
    integer*4 block_array(n,n)
    integer*4, external :: omp_get_thread_num
!
    id = omp_get_thread_num ()
    ne = 0
    do j = 1,n
      do k = 1,n
        if ( block_array(k,j) /= i ) ne = ne+1
      end do
    end do
    write (*,11) 'Thread = ',id,'Block size = ',n,'errors = ',ne
 11 format (3(a,i0,3x))
    end subroutine process_block
It appears to work !!

However, if you expect to have all blocks on pagefile.sys, it is important to recognise that the combined memory demand of all active threads' blocks must fit in physical memory for reasonable performance.
While OpenMP runs on multiple cores, there are only one or two memory channels shared by all the threads.
I might be wrong, but overcoming memory bottlenecks is my current problem with OpenMP.

mecej4
Joined: 31 Oct 2006, Posts: 1885
Posted: Sun Aug 30, 2020 12:22 pm

LitusSaxonicum wrote:
A speed-up of even 270,000 times is in practice useless if the execution time without the speed-up is rather small, say 1 second, and the program waits for a human to interact with it.


If you follow the link that I provided along with that statement, you will see that the run time before the speed-up was approximately six hours. The only human interaction needed was to type the EXE name and wait for it to finish (and to shake the mouse now and then to prevent the computer from going to sleep).

Quote:
I frankly doubt that n cores give you an n-times speed-up, as there is only one memory space and a limited CPU-to-RAM channel, made worse if n is large.


Usually that applies, but in a recent case (another long thread in the Intel Fortran forum), I was surprised to find that with OpenMP the speed-up was proportional to the number of threads. The calculation involved DO loops containing recursive calls in which the calculation for one DO index was independent of the calculation for any other index, and there was no memory contention. To obtain the best speed-up, it was necessary to estimate the index ranges to be allotted to each thread so as to keep all cores equally busy.
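
A generic way to obtain that balancing without hand-computed ranges (a sketch, not the code from that thread) is a DYNAMIC schedule with a small chunk size, so that threads finishing cheap iterations pull more work:
Code:
 program balance
 implicit none
 integer*4 :: i
 real*8 :: work(1000)
!$OMP PARALLEL DO SCHEDULE (DYNAMIC, 8)
 do i = 1, 1000
   work(i) = expensive(i)           ! cost grows with i, so a static split would imbalance
 end do
!$OMP END PARALLEL DO
 write (*,*) sum(work)
 contains
   real*8 function expensive(k)     ! stand-in for a call whose cost varies with its index
     integer*4, intent(in) :: k
     integer*4 :: m
     expensive = 0
     do m = 1, k*k                  ! quadratic cost in the index
       expensive = expensive + 1.0d0/m
     end do
   end function expensive
 end program balance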

The purpose of the program in question was to repeat the findings in this remarkable journal article [Bull. Amer. Math. Soc. 72 (6): 1079]:

[image of the journal article]
There does exist the rare phenomenon of super-linear speed-up, although I have not seen it myself so far:

https://en.wikipedia.org/wiki/Speedup#Super-linear_speedup

Credit: Thanks to John S. for his recipe for including images!

LitusSaxonicum
Joined: 23 Aug 2005, Posts: 2388, Location: Yateley, Hants, UK
Posted: Sun Aug 30, 2020 2:58 pm

I'm afraid that I did follow up the reference, and the problem wasn't the computer going to sleep, but rather that the source made my eyes glaze over! It takes more than a mouse nudge to wake me from that state!

Eddie

JohnCampbell
Joined: 16 Feb 2006, Posts: 2554, Location: Sydney
Posted: Mon Aug 31, 2020 7:15 am

Dan,

Another test of A=0, using OpenMP with 4 threads. The change in run time ranges from a 44% increase (for the smallest array) to a 53% reduction, and is a 34% reduction for the 38 GByte case (not T/4, due to memory and other delays). Still an interesting result.
Code:
 Real*4, allocatable :: A(:,:)
!
 integer*8 i0, i1, icount_rate, icount_max
 integer*4 i,j,j1,j2
 integer*2 k
 real*4    c, rj, mb
!
 open (10,file='log_a', position='append')
 write (10,10)
! 10 format (/'gFortran - Plato Release x64')
! 10 format (/'FTN95 - Plato Release x64')
! 10 format (/'FTN95 zero_a.f90 /64 /opt /lgo')
 10 format (/'gFortran zero_omp.f90 -o zero_omp.exe -fimplicit-none -O2 -march=native -fopenmp')
!
 call omp_set_num_threads (4)

 c = sqrt(10.)
 c = sqrt(c)
 do i=12, 20
   j=nint(c**i)
   allocate (A(j,j))

   Call system_clock(i0, icount_rate, icount_max)
!
!$OMP PARALLEL DO      &
!$OMP& SHARED ( A,j )  &
!$OMP& PRIVATE ( k,j1,j2 )
       do k = 1,4                ! one quarter of the columns per thread
         select case (k)
           case(1)
             j1 = 1
             j2 = j/4
           case(2)
             j1 = j/4+1
             j2 = j/2
           case (3)
             j1 = j/2+1
             j2 = j-j/4
           case (4)
             j1 = j-j/4+1
             j2 = j
         end select
         A(:,j1:j2) = 0
       end do ! k
!$OMP END PARALLEL DO
!
   Call system_clock(i1, icount_rate, icount_max)
!
   rj = j ; Mb = rj**2*4/(1024.**2)
   k = i
   write ( *,*) k, ' Dim., Size_MB, Time', j, 4.*j*j/1e6, real(i1-i0)/icount_rate, Mb
   write (10,*) k, ' Dim., Size_MB, Time', j, 4.*j*j/1e6, real(i1-i0)/icount_rate, Mb
   deallocate (A)
 end do

 close (10)
! pause
 end

The gFortran run times are:
Code:
gFortran zero_a.f90 -o zero_gf.exe -fimplicit-none -O2 -march=native
    6  Dim., Size_MB, Time        1000   4.00000000       5.49999997E-04   3.81469727   
    7  Dim., Size_MB, Time        3162   39.9929771       5.52489981E-03   38.1402740   
    8  Dim., Size_MB, Time       10000   400.000000       5.37659004E-02   381.469727   
    9  Dim., Size_MB, Time       31623   4000.05664      0.522874594       3814.75122   
   10  Dim., Size_MB, Time      100000   40000.0000       15.9236097       38146.9727   

gFortran zero_omp.f90 -o zero_omp.exe -fimplicit-none -O2 -march=native -fopenmp
     12  Dim., Size_MB, Time        1000   4.00000000       7.91000028E-04   3.81469727   
     13  Dim., Size_MB, Time        1778   12.6451359       1.12929998E-03   12.0593414   
     14  Dim., Size_MB, Time        3162   39.9929771       2.64460011E-03   38.1402740   
     15  Dim., Size_MB, Time        5623   126.472511       8.01560003E-03   120.613586   
     16  Dim., Size_MB, Time       10000   400.000000       2.62054000E-02   381.469727   
     17  Dim., Size_MB, Time       17783   1264.94043       7.87720010E-02   1206.34119   
     18  Dim., Size_MB, Time       31623   4000.05664      0.243141100       3814.75122   
     19  Dim., Size_MB, Time       56234   12649.0508      0.798558712       12063.0752   
     20  Dim., Size_MB, Time      100000   40000.0000       10.5303001       38146.9727   

DanRRight
Joined: 10 Mar 2008, Posts: 2813, Location: South Pole, Antarctica
Posted: Tue Sep 01, 2020 12:03 am

That was a great result, John. The factor-of-2 increase is obvious when the array fits into RAM and hence there is no swapping to the SSD. Possibly OMP uses the dual-channel configuration of the RAM while serial code does not. There exists 4-, 6- and 8-channel hardware with server processors and chipsets (funny, but 8-channel designs can be found in the latest mobile chips; the PC world is now lagging), so it would be good to check this assumption in the future.

JohnCampbell
Joined: 16 Feb 2006, Posts: 2554, Location: Sydney
Posted: Tue Sep 01, 2020 6:47 am

What the multi-threaded "A = 0" test demonstrates is that it does not take many threads to saturate the memory transfer capacity (apparently two!).

OpenMP is only suited to calculations/algorithms whose combined memory transfer demands don't exceed the memory bandwidth. Cache sharing can help, but big arrays challenge this. You can't just keep adding threads.
You have mentioned the 64-thread EPYC, but unless the memory reads can be distributed over all available channels, the algorithm is not going to scale up to more threads. Does 8-channel mean that multiple channels can access the same memory locations, or do allocated memory pages need to be distributed across the channels to suit the algorithm?

It is difficult to understand the practical limits of marketing claims.