Silverfrost Forums

Welcome to our forums

Using threads

4 Dec 2017 9:07 #20933

A few of my applications use a genetic algorithm for optimisation purposes, a slightly modified version of the veritable F77 routine PIKAIA http://www.hao.ucar.edu/modeling/pikaia/pikaia.php. The major overhead with GAs is the evaluation of the fitness function, and I am am aware of timeconsuming interval searches (for an overall minimum of maximum) within my fitness functions which consume a lot of CPU time. These have become a larger burden as I've tired to solve more complex problems.

A few years ago there was some discussion on the forum about using threads and a few sample programs were posted. Looking back at these last week I wondered if there was an opportunity to speed up some of these time consuming interval searches. The code below is my first attempt at this.

Basically, I am looking for the local maximum in an interval search between s = 0 and s = 1 (this is a made up example - I know that this particular test function can be solved analytically for Tmax). There are two subroutines, one searches between 0 and 0.5 and the other searches between 0.5 and 1. In the first instance they are called one after the other, and then they are run in parallel as threads.

Can somebody confirm if I have got the locking and unlocking correct? I am making local copies of all global variables and using these locals in the thread. Is this a sensible approach?

The code appears to run correctly under Win32 release, although the results under x64 release are a little disappointing:

Option         Serial  Parallel  Speedup
               (s)     (s)
Release win32, 4.26,   2.22,     x 1.92
Release x64,   6.76,   7.34,     x 0.92

Has anybody looked at speed of code compiled with x64 and compared to win32? Note that using threads, in this example is actually slower than running the code serially. Is there a reason for this?

4 Dec 2017 9:10 #20934

The sample code:-

module data_mod
implicit none
integer, parameter :: dp=kind(1.d0)
real(kind=dp) r1,x1,x2,r2,v1,ns,tmax1,tmax2,ds
integer,parameter::IO_LOCK = 42 
contains
subroutine search1(ithread)
include<windows.ins>
integer, intent(in) :: ithread
real(kind=dp) r1_loc, x1_loc, r2_loc, x2_loc, v1_loc, ns_loc, ds_loc
real(kind=dp) s, tmax_loc, t, k
complex(kind=dp) i
call lock@(IO_LOCK) ; write(6,*) 'Starting thread  ',ithread ; call unlock@(IO_LOCK)
call lock@(ithread)
r1_loc = r1 ; r2_loc = r2 ; x1_loc = x1 ; x2_loc = x2
v1_loc = v1 ; ns_loc = ns ; ds_loc = ds
call unlock@(ithread)
s  = epsilon(1.0)
tmax_loc = 0.d0
k = 90.d0/(4.d0*atan(1.d0)*ns_loc)
do while (s .le. 0.5d0)
  i = v1_loc/cmplx((r1_loc+r2_loc/s),(x1_loc+x2_loc),kind=dp)
  t = (k)*(abs(i)*abs(i)*(r2_loc/s))
  if (t .gt. tmax_loc) tmax_loc = t
  s = s + ds_loc
end do
call lock@(ithread) ; tmax1 = tmax_loc ; call unlock@(ithread)
call lock@(IO_LOCK) ; write(6,*) 'Completed thread ',ithread ; call unlock@(IO_LOCK)
end subroutine search1

subroutine search2(ithread)
include<windows.ins>
integer, intent(in) :: ithread
real(kind=dp) r1_loc, x1_loc, r2_loc, x2_loc, v1_loc, ns_loc, ds_loc
real(kind=dp) s, tmax_loc, t, k
complex(kind=dp) i
call lock@(IO_LOCK) ; write(6,*) 'Starting thread  ',ithread ; call unlock@(IO_LOCK)
call lock@(ithread)
r1_loc = r1 ; r2_loc = r2 ; x1_loc = x1 ; x2_loc = x2
v1_loc = v1 ; ns_loc = ns ; ds_loc = ds
call unlock@(ithread)
s  = 0.5d0
tmax_loc = 0.d0
k = 90.d0/(4.d0*atan(1.d0)*ns_loc) 
do while (s .le. 1.d0)
  i = v1_loc/cmplx((r1_loc+r2_loc/s),(x1_loc+x2_loc),kind=dp)
  t = (k)*(abs(i)*abs(i)*(r2_loc/s))
  if (t .gt. tmax_loc) tmax_loc = t
  s = s + ds_loc
end do
call lock@(ithread) ; tmax2 = tmax_loc ; call unlock@(ithread)
call lock@(IO_LOCK) ; write(6,*) 'Completed thread ',ithread ; call unlock@(IO_LOCK)
end subroutine search2

end data_mod

program main
use data_mod
implicit none
include<windows.ins>
real(kind=dp)tmax
real start_time, end_time, tserial, tparallel
integer(kind=7) h1, h2
r1 = 0.295d0 ; r2 = 0.150d0 ; x1 = 0.510d0 ; x2 = 0.210d0 ; v1 = 200.d0/sqrt(3.d0) ; ns = 1200.d0
ds = 1.d-8

write(6,*) 'Serial'
call clock@ (start_time)
call search1(1) ; call search2(2)
tmax=max(tmax1,tmax2)
write(6,*) 'Tmax ',tmax
call clock@ (end_time)
tserial = end_time - start_time
write(6,*) 'Elapsed time serial', tserial
write(6,*)

write(6,*) 'Parallel'
call clock@ (start_time)
h1 = start_thread@(search1,1) 
h2 = start_thread@(search2,2)
call wait_for_thread@(0)
tmax=max(tmax1,tmax2)
write(6,*) 'Tmax ',tmax
call clock@ (end_time)
tparallel = end_time - start_time
write(6,*) 'Elapsed time parallel', tparallel
write(6,*)
write(6,*) 'x increase in speed ', tserial/tparallel
end

All suggestions/comments welcome - this is a new area for me. Thanks Ken

4 Dec 2017 10:39 #20935

Interestingly compiling and running the code on a different machine produces the following timing results:-

Option         Serial  Parallel  Speedup 
                (s)     (s) 
 Release win32, 7.26,   3.15,     x 2.3 
 Release x64,  13.35,   8.86,     x 1.5

x64 is significantly slower than win32, but in this case using threads does reduce the time required.

5 Dec 2017 1:46 #20952

The results on my machine are similar to your first set.

I am using an Intel i7-4500U @ 1.80GHz.

The code for 64 bit LOCK@ etc. is identical to that for 32 bits.

The overall slowness of 64 bits might relate to the processing of complex expressions in this program but one would expect/hope for a similar comparative improvement with 2 threads and this appears not to be the case.

5 Dec 2017 2:44 #20953

Thanks Paul, It is indeed an unexpected result which you have confirmed. The first set to timings were on an machine with Intel I5-2410 @ 2.3 GHz, the second Intel I5-3340 @ 2.7 GHz.

A third set of timings from a machine with Intel I7-5506 @ 2.4 GHz are:-

Option         Serial  Parallel  Speedup 
                 (s)     (s) 
  Release win32, 1,69,   1,02,     x 1.65 
  Release x64,   4.75,   4.98,     x 0.95

So 3 out of 4 tests on different machines show no gain adopting threads with x64 for this particular code. I will investigate further.

Ken

5 Dec 2017 8:40 #20955

Below is a simplified version of my earlier code, in which there is no manipulation of complex variables. Running this code suggests that the overall slowness of 64 bits is not related to the processing of complex expressions, as the issues identified in my earlier post are still apparent.

implicit none
integer, parameter :: dp=kind(1.d0)
real(kind=dp) ds
integer,parameter::IO_LOCK = 42 
contains
subroutine search1(ithread)
include<windows.ins>
integer, intent(in) :: ithread
real(kind=dp) ds_loc
real(kind=dp) s
call lock@(IO_LOCK) ; write(6,*) 'Starting thread  ',ithread ; call unlock@(IO_LOCK)
call lock@(ithread)
ds_loc = ds
call unlock@(ithread)
s  = epsilon(1.d0)
do while (s .le. 0.5d0)
  s = s + ds_loc
end do
call lock@(IO_LOCK) ; write(6,*) 'Completed thread ',ithread ; call unlock@(IO_LOCK)
end subroutine search1

subroutine search2(ithread)
include<windows.ins>
integer, intent(in) :: ithread
real(kind=dp) ds_loc
real(kind=dp) s
call lock@(IO_LOCK) ; write(6,*) 'Starting thread  ',ithread  ; call unlock@(IO_LOCK)
call lock@(ithread)
ds_loc = ds
call unlock@(ithread)
s  = 0.5d0
do while (s .le. 1.d0)
  s = s + ds_loc
end do
call lock@(IO_LOCK) ; write(6,*) 'Completed thread ',ithread ; call unlock@(IO_LOCK)
end subroutine search2

subroutine main
include<windows.ins>
real start_time, end_time, tserial, tparallel
integer(kind=7) h1, h2
ds = 1.d-10
write(6,*) 'Serial'
call clock@ (start_time)
call search1(1) 
call search2(2)
call clock@ (end_time)
tserial = end_time - start_time
write(6,*) 'Elapsed time serial', tserial
write(6,*)

write(6,*) 'Parallel'
call clock@ (start_time)
h1 = start_thread@(search1,1)
h2 = start_thread@(search2,2)
call wait_for_thread@(0)
call clock@ (end_time)
tparallel = end_time - start_time
write(6,*) 'Elapsed time parallel', tparallel
write(6,*)
write(6,*) 'x increase in speed ', tserial/tparallel
end subroutine main

end data_mod

program main1
use data_mod
call main
end
6 Dec 2017 2:36 #20957

I am using a 2 core CPU and both cores are apparently being used for 64 bits as for 32 bits.

At the moment I don't know why the 64 bit case shows no improvement when multi-threading.

6 Dec 2017 6:53 #20958
Option         Serial  Parallel  Speedup 
               (s)     (s) 
Release win32, 2.03,   0.786,    x 2.58
Release x64,   3.71,   3.71,     x 1.00
 
10 Dec 2017 11:33 #20962

John,

I'm using 8.2, so clearly there is a difference with the earlier release. It may be the call lock@(ithread) and call unlock@(ithread)

Try changing all of these to call lock@(data_lock) and call unlock@(data_lock)

and add integer,parameter::IO_LOCK = 1, data_lock = 2 to the global definitions in the module.

As I said, this is all new to me and I'm slowly trying work out how to do this. With 4 threads I can get a speed up of 3.4 times (two dual core processors), with Win32 and /optimise.

Ken

11 Dec 2017 12:06 (Edited: 11 Dec 2017 12:56) #20965

Ken,

I tried your examples using FTN95 Ver 8.20 and fond /64 to be faster for second test but slower for first (COMPLEX) test.

I made a few changes to provide some more statistics.

  1. the timers record both elapsed and CPU time, to give an indication of the threading efficiencies.
  2. you only need local (non saved) copies of variables that change in the thread calculation. ds, r1,x1,ns etc don't need local copies as they are never changed, while s_loc, tmax_loc, t_loc, k_loc and i_loc need to have local copies.
  3. can demonstrate multi-thread without routine search2, as search1 is thread safe. (reduces code size, which is critical for this forum)

My batch file to test different compile options is

del %1.exe
ftn95 %1 /link /echo_options >> %1.log
%1 >> %1.log

del %1.exe
ftn95 %1 /link /64 /echo_options >> %1.log
%1 >> %1.log

del %1.exe
ftn95 %1 /link /opt /echo_options >> %1.log
%1 >> %1.log

del %1.exe
ftn95 %1 /link /64 /opt /echo_options >> %1.log
%1 >> %1.log

notepad %1.log

/echo_options is available in 8.20 and works well in this case.

  module data_mod 
   implicit none 
   integer, parameter :: dp=kind(1.d0) 
   real(kind=dp) ds 
   real(kind=dp) r1,x1,x2,r2,v1,ns,tmax(2)
   integer,parameter::IO_LOCK = 42 
   real*4 times(2)
   contains 
  
   subroutine time_start
     call get_times (times)
   end subroutine time_start
   
   subroutine time_end
     real*4    now(2)
     call get_times (now)
     times = now - times
   end subroutine time_end
   
   subroutine get_times (now)
     real*4 now(2)
     call elapse_time (now(1))
     call cpu_time (now(2))
   end subroutine get_times

   subroutine elapse_time (seconds)
     real*4 seconds
     integer*8 tick, rate
     call system_clock (tick, rate)
     seconds = dble(tick) / dble(rate)
   end subroutine elapse_time

   subroutine search1 (ithread) 
     include<windows.ins> 
     integer, intent(in) :: ithread 
     real(kind=dp) s_loc 
     integer n_do

       call lock@(IO_LOCK) 
     write(6,*) 'Starting A thread  ',ithread 
       call unlock@(IO_LOCK) 
     s_loc = epsilon(1.d0) 
     n_do  = 0
     do while (s_loc <= 0.5d0) 
       s_loc = s_loc + ds
       n_do  = n_do+1
     end do 
       call lock@(IO_LOCK) 
     write(6,*) 'Completed A thread ',ithread, n_do, s_loc
       call unlock@(IO_LOCK) 
   end subroutine search1 
  
   subroutine searchX (ithread) 
     include<windows.ins> 
     integer, intent(in) :: ithread
     real(kind=dp) s_loc, tmax_loc, t_loc, k_loc 
     complex(kind=dp) i_loc 
     integer n_do
   
       call lock@(IO_LOCK) 
     write(6,*) 'Starting O1 thread ',ithread
       call unlock@(IO_LOCK) 
     tmax_loc = 0.d0 
     k_loc = 90.d0/(4.d0*atan(1.d0)*ns) 
     s_loc = epsilon(1.0) 
     n_do  = 0
     do while (s_loc <= 0.5d0) 
       i_loc = v1/cmplx((r1+r2/s_loc),(x1+x2),kind=dp) 
       t_loc = (k_loc)*(abs(i_loc)*abs(i_loc)*(r2/s_loc)) 
       if (t_loc > tmax_loc) tmax_loc = t_loc 
       s_loc = s_loc + ds 
       n_do  = n_do+1
     end do 
       call lock@(ithread) 
     tmax(ithread) = tmax_loc 
       call unlock@(ithread)
       call lock@(IO_LOCK) 
     write(6,*) 'Completed O1 thread',ithread, n_do, tmax_loc
       call unlock@(IO_LOCK) 
   end subroutine searchX 
11 Dec 2017 12:11 #20966
   subroutine do_tests
     include<windows.ins> 
     real*4   tserial(2), tparallel(2) 
     integer(kind=7) h1, h2 
     integer kk
  
     r1 = 0.295d0 ; r2 = 0.150d0 ; x1 = 0.510d0 ; x2 = 0.210d0 ; v1 = 200.d0/sqrt(3.d0) ; ns = 1200.d0 

   do kk = 1,2
     write(6,*) 'Serial   ', kk
     call time_start
     if ( kk==2 ) then
       ds = 1.d-9
       call search1(1) 
       call search1(2) 
     else
       ds = 1.d-8
       call searchX (1) 
       call searchx (2) 
     end if
     call time_end
     tserial = times
     write(6,*) 'Elapsed time serial', tserial 
     write(6,*) 
    
     write(6,*) 'Parallel ', kk
     call time_start
     if ( kk==2 ) then
       h1 = start_thread@ (search1,1) 
       h2 = start_thread@ (search1,2) 
     else
       h1 = start_thread@ (searchX,1) 
       h2 = start_thread@ (searchx,2) 
     end if
     call wait_for_thread@ (0) 
     call time_end
     tparallel = times
     write(6,*) 'Elapsed time parallel', tparallel 
     write(6,*) 
     write(6,*) 'x increase in speed ', tserial/tparallel 
   end do

   end subroutine do_tests 

 end module data_mod 

 program main1 
   use data_mod 
   call do_tests
 end 
11 Dec 2017 12:15 #20967

results include [FTN95/Win32 Ver. 8.20.0 Copyright (c) Silverfrost Ltd 1993-2017]

[Current options] ERROR_NUMBERS;IMPLICIT_NONE;INTL;LINK;LOGL;  32-bit
                                    Elapse          CPU
 Serial   1 Elapsed time serial     4.04749         4.04043    
 Parallel 1 Elapsed time parallel   1.85291         3.66602    
           x increase in speed      2.18440         1.10213    

 Serial   2 Elapsed time serial     3.65540         3.63482    
 Parallel 2 Elapsed time parallel   1.85205         3.61922    
           x increase in speed      1.97370         1.00431    

[Current options] 64;ERROR_NUMBERS;IMPLICIT_NONE;INTL;LINK;LOGL;  64-bit

 Serial   1 Elapsed time serial     6.61194         6.55204    
 Parallel 1 Elapsed time parallel   3.34155         6.56764    
           x increase in speed      1.97870        0.997625    

 Serial   2 Elapsed time serial     3.36658         3.33842    
 Parallel 2 Elapsed time parallel   1.71411         3.32282    
           x increase in speed      1.96404         1.00469    
11 Dec 2017 7:01 #20968

Anyone tested results with lock free code? Locking is expensive. IO should be taken out from threads and memory structures should be organized to allow lock free code.

11 Dec 2017 10:45 #20969

John has correctly identified the source of the problem...

CLOCK@ gives the elapsed time for 32 bits but the CPU time for 64 bits. The CPU time combines the times for both threads.

The treading is working correctly and the reduction in elapsed time is comparable between 32 bits and 64 bits.

One way to correctly calculate the elapsed time is to call SYSTEM_CLOCK.

We should probably treat the issue with CLOCK@ as a bug in the 64 bit library although this will only be apparent when using threading.

11 Dec 2017 11:58 #20971

The following is an example where the number of threads can be varied more easily. The same routine is called for multiple threads, with the 'arg' indicating which thread is being called. It appear to work, giving the same answer.

 module data_mod 
   implicit none 
   integer, parameter :: dp=kind(1.d0) 
   real(kind=dp) ds 
   real(kind=dp) r1,x1,x2,r2,v1,ns,tmax(4),smax(4)
   integer,parameter::IO_LOCK = 42 
   integer max_threads
   real*4 times(2)
   contains 
  
   subroutine time_start
     call get_times (times)
   end subroutine time_start
   
   subroutine time_end
     real*4    now(2)
     call get_times (now)
     times = now - times
   end subroutine time_end
   
   subroutine get_times (now)
     real*4 now(2)
     call elapse_time (now(1))
     call cpu_time (now(2))
   end subroutine get_times

   subroutine elapse_time (seconds)
     real*4 seconds
     integer*8 tick, rate
     call system_clock (tick, rate)
     seconds = dble(tick) / dble(rate)
   end subroutine elapse_time

   subroutine search1 (ithread) 
     include<windows.ins> 
     integer, intent(in) :: ithread 
     real(kind=dp) s_loc, slim
     integer n_do

       call lock@(IO_LOCK) 
     write(6,*) 'Starting  A thread ',ithread 
       call unlock@(IO_LOCK) 

     slim  = dble(ithread)/dble(max_threads)
     s_loc = epsilon(1.0d0) + slim - 1.0d0/dble(max_threads)
     n_do  = 0
     do while (s_loc <= slim) 
       s_loc = s_loc + ds
       n_do  = n_do+1
     end do 

       call lock@(IO_LOCK) 
     write(6,*) 'Completed A thread ',ithread, n_do, s_loc
       call unlock@(IO_LOCK) 
   end subroutine search1 
  
   subroutine searchX (ithread) 
     include<windows.ins> 
     integer, intent(in) :: ithread
     real(kind=dp) s_loc, slim, tmax_loc, t_loc, k_loc, smax_loc
     complex(kind=dp) i_loc 
     integer n_do
   
       call lock@   (IO_LOCK) 
     write(6,*) 'Starting  C1 thread',ithread
       call unlock@ (IO_LOCK) 

     tmax_loc = 0.d0 
     k_loc = 90.d0/(4.d0*atan(1.d0)*ns) 
     slim  = dble(ithread)/dble(max_threads)
     s_loc = epsilon(1.0d0) + slim - 1.0d0/dble(max_threads)
     n_do  = 0
     do while (s_loc <= slim) 
       i_loc = v1/cmplx((r1+r2/s_loc),(x1+x2),kind=dp) 
!!!       t_loc = (k_loc)*(abs(i_loc)*abs(i_loc)*(r2/s_loc)) 
       t_loc = (k_loc)*((abs(i_loc)**2)*(r2/s_loc)) 
       if (t_loc > tmax_loc) then
         tmax_loc = t_loc 
         smax_loc = s_loc
       end if
       s_loc = s_loc + ds 
       n_do  = n_do+1
     end do 
!z       call lock@   (ithread)    ! ithread should be hh(ithread) ??
     tmax(ithread) = tmax_loc 
     smax(ithread) = smax_loc
!z       call unlock@ (ithread)

       call lock@   (IO_LOCK) 
     write(6,*) 'Completed C1 thread',ithread, n_do, tmax_loc
       call unlock@ (IO_LOCK) 
   end subroutine searchX 
11 Dec 2017 11:59 #20972

The test routine is: subroutine do_tests include<windows.ins> real*4 tserial(2), tparallel(2) integer(kind=7) h1,h2,h3,h4, hh(4) integer kk, j, arg(4)

     r1 = 0.295d0 ; r2 = 0.150d0 ; x1 = 0.510d0 ; x2 = 0.210d0 ; v1 = 200.d0/sqrt(3.d0) ; ns = 1200.d0 
     max_threads = 4

   do kk = 1,2
     write(6,*) ' ' 
     write(6,11) 'Serial  ', kk
     call time_start
     if ( kk==2 ) then
       ds = 1.d-9
       call search1 (1) 
       call search1 (2) 
       call search1 (3) 
       call search1 (4) 
     else
       ds = 1.d-8
       call searchX (1) 
       call searchx (2) 
       call searchx (3) 
       call searchx (4) 
     end if
     call time_end
     tserial = times
     write(6,11) ' Serial  ', kk, ' Elapsed time ', tserial 
11   format (a,i2,a,2f9.5)   
    
     write(6,11) 'Parallel', kk
     call time_start
     if ( kk==2 ) then
       h1 = start_thread@ (search1,1) 
       h2 = start_thread@ (search1,2) 
       h3 = start_thread@ (search1,3) 
       h4 = start_thread@ (search1,4) 
     else
       h1 = start_thread@ (searchX,1) 
       h2 = start_thread@ (searchx,2) 
       h3 = start_thread@ (searchx,3) 
       h4 = start_thread@ (searchx,4) 
     end if
     call wait_for_thread@ (0) 
     call time_end
     tparallel = times
     write(6,11) ' Parallel', kk, ' Elapsed time ', tparallel 

     write(6,12) '     x increase in speed ', tserial/tparallel 
     if ( kk==1 ) then
       j = maxloc( tmax(1:max_threads), 1 )
       write (*,*) 'max val is',tmax(j),' at s=',smax(j)
     end if

!  Test of calling start_thread@ in a DO loop    
!   hh(j) = start_thread@ (search1,j) failed, as j is changing its value
!   while arg(j) appears to transfer correct address
!
     write(6,11) 'Test arg', kk
     call time_start
     if ( kk==2 ) then
       do j = 1,max_threads
         arg(j) = j
         hh(j)  = start_thread@ (search1,arg(j))
       end do
     else
       do j = 1,max_threads
         arg(j) = j
         hh(j)  = start_thread@ (searchx,arg(j)) 
       end do
     end if
     call wait_for_thread@ (0) 
     call time_end
     tparallel = times
     write(6,11) ' Test arg', kk, ' Elapsed time ', tparallel 

     write(6,12) '     x increase in speed ', tserial/tparallel 
     if ( kk==1 ) then
       j = maxloc( tmax(1:max_threads), 1 )
       write (*,*) 'max val is',tmax(j),' at s=',smax(j)
     end if

12   format (a,2f9.5)   
   end do

   end subroutine do_tests 

 end module data_mod 

 program main1 
   use data_mod 
   write (*,*) '[Ken_thread2b.f90 ]'
   call do_tests
 end 

I ran it on my pc that supports 4 threads.

11 Dec 2017 3:35 #20973

Ken's original code raises two further issues.

  1. The code fragment

    call lock@(ithread) ds_loc = ds call unlock@(ithread)

appears in two places where the actual value of ithread is different. This is probably not what is intended. In working code (rather than this cut-down demo) the variable ds would be shared and needing read/write protection. In which case the locking ID would need to be the same in both instances.

  1. The other issue is for us to sort out. The use of start_thread@ in a 64 bit environment raises the question: what makes a call 'thread safe' for 64 bits? We need to look into this and (at least) provide some documentation that explains what kind of call is considered to be safe.
11 Dec 2017 7:49 #20974

John-Silver

Some of the results in this Forum thread are erroneous. If correct results are 60% slower then that will be a particular case. In general my experience is that 64 bit results are usually on a par with 32 bit results (for speed).

The real motivation for moving to 64 bits is the increase in address space.

start_thread@ is for Win32 and x64 and not .NET.

Yes, multi-threading can be tricky.

11 Dec 2017 8:53 #20976

Paul, John,

Thanks to you both for comments and detective work.

I can see how to make this work now.

Ken

12 Dec 2017 12:27 #20977

Based on the examples I have posted, there does not appear to be significant performance problems with 64-bit multi-threading.

I have used two of Ken's examples, I called ; Search1 : simple loop count, where 64-bit is faster SearchX : loop of COMPLEX calculation, where 64-bit is slower.

I have not identified the problem, but I suspect it is the use of cmplx (x,y,kind=dp) or COMPLEX i_loc in the following: i_loc = v1/cmplx((r1+r2/s_loc),(x1+x2),kind=dp) t_loc = (k_loc)((abs(i_loc)**2)(r2/s_loc))

I rarely use complex type variables, so am not familiar with this type of problem.

My 'Test arg' example posted above is an interesting example of multi-threading, as for my first try below with j as the arg value, this fails as the variable 'j' changes through the loop and so changes in each of the initiated threads. This is because the same address of the variable j is transferred to each thread: do j = 1,max_threads hh(j) = start_thread@ (searchx,j) end do

By changing to arg(j), a unique address for arg(j) is transferred to each thread and so it appears to work ok do j = 1,max_threads arg(j) = j hh(j) = start_thread@ (searchx,arg(j)) end do

Further tests need to be done for varying max_threads to identify the multi-thread efficiency and overheads. My pc's support up to 4 or 8 threads. ( It has been a long time since I used a pc that supported only 2 threads. )

John

Please login to reply.