forums.silverfrost.com Forum Index forums.silverfrost.com
Welcome to the Silverfrost forums
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Using threads
Goto page 1, 2, 3  Next
 
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> Support
View previous topic :: View next topic  
Author Message
Kenneth_Smith



Joined: 18 May 2012
Posts: 697
Location: Hamilton, Lanarkshire, Scotland.

PostPosted: Mon Dec 04, 2017 10:07 pm    Post subject: Using threads Reply with quote

A few of my applications use a genetic algorithm for optimisation purposes, a slightly modified version of the veritable F77 routine PIKAIA http://www.hao.ucar.edu/modeling/pikaia/pikaia.php. The major overhead with GAs is the evaluation of the fitness function, and I am am aware of timeconsuming interval searches (for an overall minimum of maximum) within my fitness functions which consume a lot of CPU time. These have become a larger burden as I've tired to solve more complex problems.

A few years ago there was some discussion on the forum about using threads and a few sample programs were posted. Looking back at these last week I wondered if there was an opportunity to speed up some of these time consuming interval searches. The code below is my first attempt at this.

Basically, I am looking for the local maximum in an interval search between s = 0 and s = 1 (this is a made up example - I know that this particular test function can be solved analytically for Tmax). There are two subroutines, one searches between 0 and 0.5 and the other searches between 0.5 and 1. In the first instance they are called one after the other, and then they are run in parallel as threads.

Can somebody confirm if I have got the locking and unlocking correct? I am making local copies of all global variables and using these locals in the thread. Is this a sensible approach?

The code appears to run correctly under Win32 release, although the results under x64 release are a little disappointing:

Code:
Option         Serial  Parallel  Speedup
               (s)     (s)
Release win32, 4.26,   2.22,     x 1.92
Release x64,   6.76,   7.34,     x 0.92


Has anybody looked at speed of code compiled with x64 and compared to win32? Note that using threads, in this example is actually slower than running the code serially. Is there a reason for this?
Back to top
View user's profile Send private message Visit poster's website
Kenneth_Smith



Joined: 18 May 2012
Posts: 697
Location: Hamilton, Lanarkshire, Scotland.

PostPosted: Mon Dec 04, 2017 10:10 pm    Post subject: Reply with quote

The sample code:-

Code:
module data_mod
implicit none
integer, parameter :: dp=kind(1.d0)
real(kind=dp) r1,x1,x2,r2,v1,ns,tmax1,tmax2,ds
integer,parameter::IO_LOCK = 42
contains
subroutine search1(ithread)
include<windows.ins>
integer, intent(in) :: ithread
real(kind=dp) r1_loc, x1_loc, r2_loc, x2_loc, v1_loc, ns_loc, ds_loc
real(kind=dp) s, tmax_loc, t, k
complex(kind=dp) i
call lock@(IO_LOCK) ; write(6,*) 'Starting thread  ',ithread ; call unlock@(IO_LOCK)
call lock@(ithread)
r1_loc = r1 ; r2_loc = r2 ; x1_loc = x1 ; x2_loc = x2
v1_loc = v1 ; ns_loc = ns ; ds_loc = ds
call unlock@(ithread)
s  = epsilon(1.0)
tmax_loc = 0.d0
k = 90.d0/(4.d0*atan(1.d0)*ns_loc)
do while (s .le. 0.5d0)
  i = v1_loc/cmplx((r1_loc+r2_loc/s),(x1_loc+x2_loc),kind=dp)
  t = (k)*(abs(i)*abs(i)*(r2_loc/s))
  if (t .gt. tmax_loc) tmax_loc = t
  s = s + ds_loc
end do
call lock@(ithread) ; tmax1 = tmax_loc ; call unlock@(ithread)
call lock@(IO_LOCK) ; write(6,*) 'Completed thread ',ithread ; call unlock@(IO_LOCK)
end subroutine search1

subroutine search2(ithread)
include<windows.ins>
integer, intent(in) :: ithread
real(kind=dp) r1_loc, x1_loc, r2_loc, x2_loc, v1_loc, ns_loc, ds_loc
real(kind=dp) s, tmax_loc, t, k
complex(kind=dp) i
call lock@(IO_LOCK) ; write(6,*) 'Starting thread  ',ithread ; call unlock@(IO_LOCK)
call lock@(ithread)
r1_loc = r1 ; r2_loc = r2 ; x1_loc = x1 ; x2_loc = x2
v1_loc = v1 ; ns_loc = ns ; ds_loc = ds
call unlock@(ithread)
s  = 0.5d0
tmax_loc = 0.d0
k = 90.d0/(4.d0*atan(1.d0)*ns_loc)
do while (s .le. 1.d0)
  i = v1_loc/cmplx((r1_loc+r2_loc/s),(x1_loc+x2_loc),kind=dp)
  t = (k)*(abs(i)*abs(i)*(r2_loc/s))
  if (t .gt. tmax_loc) tmax_loc = t
  s = s + ds_loc
end do
call lock@(ithread) ; tmax2 = tmax_loc ; call unlock@(ithread)
call lock@(IO_LOCK) ; write(6,*) 'Completed thread ',ithread ; call unlock@(IO_LOCK)
end subroutine search2

end data_mod

program main
use data_mod
implicit none
include<windows.ins>
real(kind=dp)tmax
real start_time, end_time, tserial, tparallel
integer(kind=7) h1, h2
r1 = 0.295d0 ; r2 = 0.150d0 ; x1 = 0.510d0 ; x2 = 0.210d0 ; v1 = 200.d0/sqrt(3.d0) ; ns = 1200.d0
ds = 1.d-8

write(6,*) 'Serial'
call clock@ (start_time)
call search1(1) ; call search2(2)
tmax=max(tmax1,tmax2)
write(6,*) 'Tmax ',tmax
call clock@ (end_time)
tserial = end_time - start_time
write(6,*) 'Elapsed time serial', tserial
write(6,*)

write(6,*) 'Parallel'
call clock@ (start_time)
h1 = start_thread@(search1,1)
h2 = start_thread@(search2,2)
call wait_for_thread@(0)
tmax=max(tmax1,tmax2)
write(6,*) 'Tmax ',tmax
call clock@ (end_time)
tparallel = end_time - start_time
write(6,*) 'Elapsed time parallel', tparallel
write(6,*)
write(6,*) 'x increase in speed ', tserial/tparallel
end


All suggestions/comments welcome - this is a new area for me.
Thanks
Ken
Back to top
View user's profile Send private message Visit poster's website
Kenneth_Smith



Joined: 18 May 2012
Posts: 697
Location: Hamilton, Lanarkshire, Scotland.

PostPosted: Mon Dec 04, 2017 11:39 pm    Post subject: Reply with quote

Interestingly compiling and running the code on a different machine produces the following timing results:-

Code:
Option         Serial  Parallel  Speedup
                (s)     (s)
 Release win32, 7.26,   3.15,     x 2.3
 Release x64,  13.35,   8.86,     x 1.5


x64 is significantly slower than win32, but in this case using threads does reduce the time required.
Back to top
View user's profile Send private message Visit poster's website
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7912
Location: Salford, UK

PostPosted: Tue Dec 05, 2017 2:46 pm    Post subject: Reply with quote

The results on my machine are similar to your first set.

I am using an Intel i7-4500U @ 1.80GHz.

The code for 64 bit LOCK@ etc. is identical to that for 32 bits.

The overall slowness of 64 bits might relate to the processing of complex expressions in this program but one would expect/hope for a similar comparative improvement with 2 threads and this appears not to be the case.
Back to top
View user's profile Send private message AIM Address
Kenneth_Smith



Joined: 18 May 2012
Posts: 697
Location: Hamilton, Lanarkshire, Scotland.

PostPosted: Tue Dec 05, 2017 3:44 pm    Post subject: Reply with quote

Thanks Paul, It is indeed an unexpected result which you have confirmed.
The first set to timings were on an machine with Intel I5-2410 @ 2.3 GHz, the second Intel I5-3340 @ 2.7 GHz.

A third set of timings from a machine with Intel I7-5506 @ 2.4 GHz are:-

Code:
Option         Serial  Parallel  Speedup
                 (s)     (s)
  Release win32, 1,69,   1,02,     x 1.65
  Release x64,   4.75,   4.98,     x 0.95


So 3 out of 4 tests on different machines show no gain adopting threads with x64 for this particular code. I will investigate further.

Ken
Back to top
View user's profile Send private message Visit poster's website
Kenneth_Smith



Joined: 18 May 2012
Posts: 697
Location: Hamilton, Lanarkshire, Scotland.

PostPosted: Tue Dec 05, 2017 9:40 pm    Post subject: Reply with quote

Below is a simplified version of my earlier code, in which there is no manipulation of complex variables. Running this code suggests that the overall slowness of 64 bits is not related to the processing of complex expressions, as the issues identified in my earlier post are still apparent.

Code:

implicit none
integer, parameter :: dp=kind(1.d0)
real(kind=dp) ds
integer,parameter::IO_LOCK = 42
contains
subroutine search1(ithread)
include<windows.ins>
integer, intent(in) :: ithread
real(kind=dp) ds_loc
real(kind=dp) s
call lock@(IO_LOCK) ; write(6,*) 'Starting thread  ',ithread ; call unlock@(IO_LOCK)
call lock@(ithread)
ds_loc = ds
call unlock@(ithread)
s  = epsilon(1.d0)
do while (s .le. 0.5d0)
  s = s + ds_loc
end do
call lock@(IO_LOCK) ; write(6,*) 'Completed thread ',ithread ; call unlock@(IO_LOCK)
end subroutine search1

subroutine search2(ithread)
include<windows.ins>
integer, intent(in) :: ithread
real(kind=dp) ds_loc
real(kind=dp) s
call lock@(IO_LOCK) ; write(6,*) 'Starting thread  ',ithread  ; call unlock@(IO_LOCK)
call lock@(ithread)
ds_loc = ds
call unlock@(ithread)
s  = 0.5d0
do while (s .le. 1.d0)
  s = s + ds_loc
end do
call lock@(IO_LOCK) ; write(6,*) 'Completed thread ',ithread ; call unlock@(IO_LOCK)
end subroutine search2

subroutine main
include<windows.ins>
real start_time, end_time, tserial, tparallel
integer(kind=7) h1, h2
ds = 1.d-10
write(6,*) 'Serial'
call clock@ (start_time)
call search1(1)
call search2(2)
call clock@ (end_time)
tserial = end_time - start_time
write(6,*) 'Elapsed time serial', tserial
write(6,*)

write(6,*) 'Parallel'
call clock@ (start_time)
h1 = start_thread@(search1,1)
h2 = start_thread@(search2,2)
call wait_for_thread@(0)
call clock@ (end_time)
tparallel = end_time - start_time
write(6,*) 'Elapsed time parallel', tparallel
write(6,*)
write(6,*) 'x increase in speed ', tserial/tparallel
end subroutine main

end data_mod

program main1
use data_mod
call main
end
Back to top
View user's profile Send private message Visit poster's website
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7912
Location: Salford, UK

PostPosted: Wed Dec 06, 2017 3:36 pm    Post subject: Reply with quote

I am using a 2 core CPU and both cores are apparently being used for 64 bits as for 32 bits.

At the moment I don't know why the 64 bit case shows no improvement when multi-threading.
Back to top
View user's profile Send private message AIM Address
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

PostPosted: Wed Dec 06, 2017 7:53 pm    Post subject: Reply with quote

Code:

Option         Serial  Parallel  Speedup
               (s)     (s)
Release win32, 2.03,   0.786,    x 2.58
Release x64,   3.71,   3.71,     x 1.00
 
Back to top
View user's profile Send private message
John-Silver



Joined: 30 Jul 2013
Posts: 1520
Location: Aerospace Valley

PostPosted: Sat Dec 09, 2017 11:18 pm    Post subject: Reply with quote

I've tried running the first of KenS codes abóve and it works ok for x32, but when I try running x64 the program it appears to compile ok, and link ok, but 'hangs' interminably doing nothing and I can't even close the process from Windows Task Manager!!!

I'm running from command line. v8.1 with the last released dlls14

any ideas why this is happening ?
Back to top
View user's profile Send private message
Kenneth_Smith



Joined: 18 May 2012
Posts: 697
Location: Hamilton, Lanarkshire, Scotland.

PostPosted: Sun Dec 10, 2017 12:33 pm    Post subject: Reply with quote

John,

I'm using 8.2, so clearly there is a difference with the earlier release. It may be the call lock@(ithread) and call unlock@(ithread)

Try changing all of these to call lock@(data_lock) and call unlock@(data_lock)

and add integer,parameter::IO_LOCK = 1, data_lock = 2 to the global definitions in the module.

As I said, this is all new to me and I'm slowly trying work out how to do this. With 4 threads I can get a speed up of 3.4 times (two dual core processors), with Win32 and /optimise.

Ken
Back to top
View user's profile Send private message Visit poster's website
JohnCampbell



Joined: 16 Feb 2006
Posts: 2551
Location: Sydney

PostPosted: Mon Dec 11, 2017 1:06 am    Post subject: Reply with quote

Ken,

I tried your examples using FTN95 Ver 8.20 and fond /64 to be faster for second test but slower for first (COMPLEX) test.

I made a few changes to provide some more statistics.

1) the timers record both elapsed and CPU time, to give an indication of the threading efficiencies.
2) you only need local (non saved) copies of variables that change in the thread calculation.
ds, r1,x1,ns etc don't need local copies as they are never changed,
while s_loc, tmax_loc, t_loc, k_loc and i_loc need to have local copies.
3) can demonstrate multi-thread without routine search2, as search1 is thread safe. (reduces code size, which is critical for this forum)

My batch file to test different compile options is
Code:
del %1.exe
ftn95 %1 /link /echo_options >> %1.log
%1 >> %1.log

del %1.exe
ftn95 %1 /link /64 /echo_options >> %1.log
%1 >> %1.log

del %1.exe
ftn95 %1 /link /opt /echo_options >> %1.log
%1 >> %1.log

del %1.exe
ftn95 %1 /link /64 /opt /echo_options >> %1.log
%1 >> %1.log

notepad %1.log

/echo_options is available in 8.20 and works well in this case.

Code:
  module data_mod
   implicit none
   integer, parameter :: dp=kind(1.d0)
   real(kind=dp) ds
   real(kind=dp) r1,x1,x2,r2,v1,ns,tmax(2)
   integer,parameter::IO_LOCK = 42
   real*4 times(2)
   contains
 
   subroutine time_start
     call get_times (times)
   end subroutine time_start
   
   subroutine time_end
     real*4    now(2)
     call get_times (now)
     times = now - times
   end subroutine time_end
   
   subroutine get_times (now)
     real*4 now(2)
     call elapse_time (now(1))
     call cpu_time (now(2))
   end subroutine get_times

   subroutine elapse_time (seconds)
     real*4 seconds
     integer*8 tick, rate
     call system_clock (tick, rate)
     seconds = dble(tick) / dble(rate)
   end subroutine elapse_time

   subroutine search1 (ithread)
     include<windows.ins>
     integer, intent(in) :: ithread
     real(kind=dp) s_loc
     integer n_do

       call lock@(IO_LOCK)
     write(6,*) 'Starting A thread  ',ithread
       call unlock@(IO_LOCK)
     s_loc = epsilon(1.d0)
     n_do  = 0
     do while (s_loc <= 0.5d0)
       s_loc = s_loc + ds
       n_do  = n_do+1
     end do
       call lock@(IO_LOCK)
     write(6,*) 'Completed A thread ',ithread, n_do, s_loc
       call unlock@(IO_LOCK)
   end subroutine search1
 
   subroutine searchX (ithread)
     include<windows.ins>
     integer, intent(in) :: ithread
     real(kind=dp) s_loc, tmax_loc, t_loc, k_loc
     complex(kind=dp) i_loc
     integer n_do
   
       call lock@(IO_LOCK)
     write(6,*) 'Starting O1 thread ',ithread
       call unlock@(IO_LOCK)
     tmax_loc = 0.d0
     k_loc = 90.d0/(4.d0*atan(1.d0)*ns)
     s_loc = epsilon(1.0)
     n_do  = 0
     do while (s_loc <= 0.5d0)
       i_loc = v1/cmplx((r1+r2/s_loc),(x1+x2),kind=dp)
       t_loc = (k_loc)*(abs(i_loc)*abs(i_loc)*(r2/s_loc))
       if (t_loc > tmax_loc) tmax_loc = t_loc
       s_loc = s_loc + ds
       n_do  = n_do+1
     end do
       call lock@(ithread)
     tmax(ithread) = tmax_loc
       call unlock@(ithread)
       call lock@(IO_LOCK)
     write(6,*) 'Completed O1 thread',ithread, n_do, tmax_loc
       call unlock@(IO_LOCK)
   end subroutine searchX


Last edited by JohnCampbell on Mon Dec 11, 2017 1:56 am; edited 4 times in total
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2551
Location: Sydney

PostPosted: Mon Dec 11, 2017 1:11 am    Post subject: Reply with quote

Code:
   subroutine do_tests
     include<windows.ins>
     real*4   tserial(2), tparallel(2)
     integer(kind=7) h1, h2
     integer kk
 
     r1 = 0.295d0 ; r2 = 0.150d0 ; x1 = 0.510d0 ; x2 = 0.210d0 ; v1 = 200.d0/sqrt(3.d0) ; ns = 1200.d0

   do kk = 1,2
     write(6,*) 'Serial   ', kk
     call time_start
     if ( kk==2 ) then
       ds = 1.d-9
       call search1(1)
       call search1(2)
     else
       ds = 1.d-8
       call searchX (1)
       call searchx (2)
     end if
     call time_end
     tserial = times
     write(6,*) 'Elapsed time serial', tserial
     write(6,*)
   
     write(6,*) 'Parallel ', kk
     call time_start
     if ( kk==2 ) then
       h1 = start_thread@ (search1,1)
       h2 = start_thread@ (search1,2)
     else
       h1 = start_thread@ (searchX,1)
       h2 = start_thread@ (searchx,2)
     end if
     call wait_for_thread@ (0)
     call time_end
     tparallel = times
     write(6,*) 'Elapsed time parallel', tparallel
     write(6,*)
     write(6,*) 'x increase in speed ', tserial/tparallel
   end do

   end subroutine do_tests

 end module data_mod

 program main1
   use data_mod
   call do_tests
 end
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2551
Location: Sydney

PostPosted: Mon Dec 11, 2017 1:15 am    Post subject: Reply with quote

results include
Code:
 [FTN95/Win32 Ver. 8.20.0 Copyright (c) Silverfrost Ltd 1993-2017]

[Current options] ERROR_NUMBERS;IMPLICIT_NONE;INTL;LINK;LOGL;  32-bit
                                    Elapse          CPU
 Serial   1 Elapsed time serial     4.04749         4.04043   
 Parallel 1 Elapsed time parallel   1.85291         3.66602   
           x increase in speed      2.18440         1.10213   

 Serial   2 Elapsed time serial     3.65540         3.63482   
 Parallel 2 Elapsed time parallel   1.85205         3.61922   
           x increase in speed      1.97370         1.00431   

[Current options] 64;ERROR_NUMBERS;IMPLICIT_NONE;INTL;LINK;LOGL;  64-bit

 Serial   1 Elapsed time serial     6.61194         6.55204   
 Parallel 1 Elapsed time parallel   3.34155         6.56764   
           x increase in speed      1.97870        0.997625   

 Serial   2 Elapsed time serial     3.36658         3.33842   
 Parallel 2 Elapsed time parallel   1.71411         3.32282   
           x increase in speed      1.96404         1.00469   

Back to top
View user's profile Send private message
jalih



Joined: 30 Jul 2012
Posts: 196

PostPosted: Mon Dec 11, 2017 8:01 am    Post subject: Reply with quote

Anyone tested results with lock free code? Locking is expensive. IO should be taken out from threads and memory structures should be organized to allow lock free code.
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7912
Location: Salford, UK

PostPosted: Mon Dec 11, 2017 11:45 am    Post subject: Reply with quote

John has correctly identified the source of the problem...

CLOCK@ gives the elapsed time for 32 bits but the CPU time for 64 bits.
The CPU time combines the times for both threads.

The treading is working correctly and the reduction in elapsed time is comparable between 32 bits and 64 bits.

One way to correctly calculate the elapsed time is to call SYSTEM_CLOCK.

We should probably treat the issue with CLOCK@ as a bug in the 64 bit library although this will only be apparent when using threading.
Back to top
View user's profile Send private message AIM Address
Display posts from previous:   
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> Support All times are GMT + 1 Hour
Goto page 1, 2, 3  Next
Page 1 of 3

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group