|
forums.silverfrost.com Welcome to the Silverfrost forums
|
View previous topic :: View next topic |
Author |
Message |
Kenneth_Smith
Joined: 18 May 2012 Posts: 713 Location: Hamilton, Lanarkshire, Scotland.
|
Posted: Mon Dec 04, 2017 10:07 pm Post subject: Using threads |
|
|
A few of my applications use a genetic algorithm for optimisation purposes, a slightly modified version of the veritable F77 routine PIKAIA http://www.hao.ucar.edu/modeling/pikaia/pikaia.php. The major overhead with GAs is the evaluation of the fitness function, and I am am aware of timeconsuming interval searches (for an overall minimum of maximum) within my fitness functions which consume a lot of CPU time. These have become a larger burden as I've tired to solve more complex problems.
A few years ago there was some discussion on the forum about using threads and a few sample programs were posted. Looking back at these last week I wondered if there was an opportunity to speed up some of these time consuming interval searches. The code below is my first attempt at this.
Basically, I am looking for the local maximum in an interval search between s = 0 and s = 1 (this is a made up example - I know that this particular test function can be solved analytically for Tmax). There are two subroutines, one searches between 0 and 0.5 and the other searches between 0.5 and 1. In the first instance they are called one after the other, and then they are run in parallel as threads.
Can somebody confirm if I have got the locking and unlocking correct? I am making local copies of all global variables and using these locals in the thread. Is this a sensible approach?
The code appears to run correctly under Win32 release, although the results under x64 release are a little disappointing:
Code: | Option Serial Parallel Speedup
(s) (s)
Release win32, 4.26, 2.22, x 1.92
Release x64, 6.76, 7.34, x 0.92 |
Has anybody looked at speed of code compiled with x64 and compared to win32? Note that using threads, in this example is actually slower than running the code serially. Is there a reason for this? |
|
Back to top |
|
|
Kenneth_Smith
Joined: 18 May 2012 Posts: 713 Location: Hamilton, Lanarkshire, Scotland.
|
Posted: Mon Dec 04, 2017 10:10 pm Post subject: |
|
|
The sample code:-
Code: | module data_mod
implicit none
integer, parameter :: dp=kind(1.d0)
real(kind=dp) r1,x1,x2,r2,v1,ns,tmax1,tmax2,ds
integer,parameter::IO_LOCK = 42
contains
subroutine search1(ithread)
include<windows.ins>
integer, intent(in) :: ithread
real(kind=dp) r1_loc, x1_loc, r2_loc, x2_loc, v1_loc, ns_loc, ds_loc
real(kind=dp) s, tmax_loc, t, k
complex(kind=dp) i
call lock@(IO_LOCK) ; write(6,*) 'Starting thread ',ithread ; call unlock@(IO_LOCK)
call lock@(ithread)
r1_loc = r1 ; r2_loc = r2 ; x1_loc = x1 ; x2_loc = x2
v1_loc = v1 ; ns_loc = ns ; ds_loc = ds
call unlock@(ithread)
s = epsilon(1.0)
tmax_loc = 0.d0
k = 90.d0/(4.d0*atan(1.d0)*ns_loc)
do while (s .le. 0.5d0)
i = v1_loc/cmplx((r1_loc+r2_loc/s),(x1_loc+x2_loc),kind=dp)
t = (k)*(abs(i)*abs(i)*(r2_loc/s))
if (t .gt. tmax_loc) tmax_loc = t
s = s + ds_loc
end do
call lock@(ithread) ; tmax1 = tmax_loc ; call unlock@(ithread)
call lock@(IO_LOCK) ; write(6,*) 'Completed thread ',ithread ; call unlock@(IO_LOCK)
end subroutine search1
subroutine search2(ithread)
include<windows.ins>
integer, intent(in) :: ithread
real(kind=dp) r1_loc, x1_loc, r2_loc, x2_loc, v1_loc, ns_loc, ds_loc
real(kind=dp) s, tmax_loc, t, k
complex(kind=dp) i
call lock@(IO_LOCK) ; write(6,*) 'Starting thread ',ithread ; call unlock@(IO_LOCK)
call lock@(ithread)
r1_loc = r1 ; r2_loc = r2 ; x1_loc = x1 ; x2_loc = x2
v1_loc = v1 ; ns_loc = ns ; ds_loc = ds
call unlock@(ithread)
s = 0.5d0
tmax_loc = 0.d0
k = 90.d0/(4.d0*atan(1.d0)*ns_loc)
do while (s .le. 1.d0)
i = v1_loc/cmplx((r1_loc+r2_loc/s),(x1_loc+x2_loc),kind=dp)
t = (k)*(abs(i)*abs(i)*(r2_loc/s))
if (t .gt. tmax_loc) tmax_loc = t
s = s + ds_loc
end do
call lock@(ithread) ; tmax2 = tmax_loc ; call unlock@(ithread)
call lock@(IO_LOCK) ; write(6,*) 'Completed thread ',ithread ; call unlock@(IO_LOCK)
end subroutine search2
end data_mod
program main
use data_mod
implicit none
include<windows.ins>
real(kind=dp)tmax
real start_time, end_time, tserial, tparallel
integer(kind=7) h1, h2
r1 = 0.295d0 ; r2 = 0.150d0 ; x1 = 0.510d0 ; x2 = 0.210d0 ; v1 = 200.d0/sqrt(3.d0) ; ns = 1200.d0
ds = 1.d-8
write(6,*) 'Serial'
call clock@ (start_time)
call search1(1) ; call search2(2)
tmax=max(tmax1,tmax2)
write(6,*) 'Tmax ',tmax
call clock@ (end_time)
tserial = end_time - start_time
write(6,*) 'Elapsed time serial', tserial
write(6,*)
write(6,*) 'Parallel'
call clock@ (start_time)
h1 = start_thread@(search1,1)
h2 = start_thread@(search2,2)
call wait_for_thread@(0)
tmax=max(tmax1,tmax2)
write(6,*) 'Tmax ',tmax
call clock@ (end_time)
tparallel = end_time - start_time
write(6,*) 'Elapsed time parallel', tparallel
write(6,*)
write(6,*) 'x increase in speed ', tserial/tparallel
end |
All suggestions/comments welcome - this is a new area for me.
Thanks
Ken |
|
Back to top |
|
|
Kenneth_Smith
Joined: 18 May 2012 Posts: 713 Location: Hamilton, Lanarkshire, Scotland.
|
Posted: Mon Dec 04, 2017 11:39 pm Post subject: |
|
|
Interestingly compiling and running the code on a different machine produces the following timing results:-
Code: | Option Serial Parallel Speedup
(s) (s)
Release win32, 7.26, 3.15, x 2.3
Release x64, 13.35, 8.86, x 1.5 |
x64 is significantly slower than win32, but in this case using threads does reduce the time required. |
|
Back to top |
|
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8018 Location: Salford, UK
|
Posted: Tue Dec 05, 2017 2:46 pm Post subject: |
|
|
The results on my machine are similar to your first set.
I am using an Intel i7-4500U @ 1.80GHz.
The code for 64 bit LOCK@ etc. is identical to that for 32 bits.
The overall slowness of 64 bits might relate to the processing of complex expressions in this program but one would expect/hope for a similar comparative improvement with 2 threads and this appears not to be the case. |
|
Back to top |
|
|
Kenneth_Smith
Joined: 18 May 2012 Posts: 713 Location: Hamilton, Lanarkshire, Scotland.
|
Posted: Tue Dec 05, 2017 3:44 pm Post subject: |
|
|
Thanks Paul, It is indeed an unexpected result which you have confirmed.
The first set to timings were on an machine with Intel I5-2410 @ 2.3 GHz, the second Intel I5-3340 @ 2.7 GHz.
A third set of timings from a machine with Intel I7-5506 @ 2.4 GHz are:-
Code: | Option Serial Parallel Speedup
(s) (s)
Release win32, 1,69, 1,02, x 1.65
Release x64, 4.75, 4.98, x 0.95 |
So 3 out of 4 tests on different machines show no gain adopting threads with x64 for this particular code. I will investigate further.
Ken |
|
Back to top |
|
|
Kenneth_Smith
Joined: 18 May 2012 Posts: 713 Location: Hamilton, Lanarkshire, Scotland.
|
Posted: Tue Dec 05, 2017 9:40 pm Post subject: |
|
|
Below is a simplified version of my earlier code, in which there is no manipulation of complex variables. Running this code suggests that the overall slowness of 64 bits is not related to the processing of complex expressions, as the issues identified in my earlier post are still apparent.
Code: |
implicit none
integer, parameter :: dp=kind(1.d0)
real(kind=dp) ds
integer,parameter::IO_LOCK = 42
contains
subroutine search1(ithread)
include<windows.ins>
integer, intent(in) :: ithread
real(kind=dp) ds_loc
real(kind=dp) s
call lock@(IO_LOCK) ; write(6,*) 'Starting thread ',ithread ; call unlock@(IO_LOCK)
call lock@(ithread)
ds_loc = ds
call unlock@(ithread)
s = epsilon(1.d0)
do while (s .le. 0.5d0)
s = s + ds_loc
end do
call lock@(IO_LOCK) ; write(6,*) 'Completed thread ',ithread ; call unlock@(IO_LOCK)
end subroutine search1
subroutine search2(ithread)
include<windows.ins>
integer, intent(in) :: ithread
real(kind=dp) ds_loc
real(kind=dp) s
call lock@(IO_LOCK) ; write(6,*) 'Starting thread ',ithread ; call unlock@(IO_LOCK)
call lock@(ithread)
ds_loc = ds
call unlock@(ithread)
s = 0.5d0
do while (s .le. 1.d0)
s = s + ds_loc
end do
call lock@(IO_LOCK) ; write(6,*) 'Completed thread ',ithread ; call unlock@(IO_LOCK)
end subroutine search2
subroutine main
include<windows.ins>
real start_time, end_time, tserial, tparallel
integer(kind=7) h1, h2
ds = 1.d-10
write(6,*) 'Serial'
call clock@ (start_time)
call search1(1)
call search2(2)
call clock@ (end_time)
tserial = end_time - start_time
write(6,*) 'Elapsed time serial', tserial
write(6,*)
write(6,*) 'Parallel'
call clock@ (start_time)
h1 = start_thread@(search1,1)
h2 = start_thread@(search2,2)
call wait_for_thread@(0)
call clock@ (end_time)
tparallel = end_time - start_time
write(6,*) 'Elapsed time parallel', tparallel
write(6,*)
write(6,*) 'x increase in speed ', tserial/tparallel
end subroutine main
end data_mod
program main1
use data_mod
call main
end
|
|
|
Back to top |
|
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8018 Location: Salford, UK
|
Posted: Wed Dec 06, 2017 3:36 pm Post subject: |
|
|
I am using a 2 core CPU and both cores are apparently being used for 64 bits as for 32 bits.
At the moment I don't know why the 64 bit case shows no improvement when multi-threading. |
|
Back to top |
|
|
DanRRight
Joined: 10 Mar 2008 Posts: 2866 Location: South Pole, Antarctica
|
Posted: Wed Dec 06, 2017 7:53 pm Post subject: |
|
|
Code: |
Option Serial Parallel Speedup
(s) (s)
Release win32, 2.03, 0.786, x 2.58
Release x64, 3.71, 3.71, x 1.00
|
|
|
Back to top |
|
|
Kenneth_Smith
Joined: 18 May 2012 Posts: 713 Location: Hamilton, Lanarkshire, Scotland.
|
Posted: Sun Dec 10, 2017 12:33 pm Post subject: |
|
|
John,
I'm using 8.2, so clearly there is a difference with the earlier release. It may be the call lock@(ithread) and call unlock@(ithread)
Try changing all of these to call lock@(data_lock) and call unlock@(data_lock)
and add integer,parameter::IO_LOCK = 1, data_lock = 2 to the global definitions in the module.
As I said, this is all new to me and I'm slowly trying work out how to do this. With 4 threads I can get a speed up of 3.4 times (two dual core processors), with Win32 and /optimise.
Ken |
|
Back to top |
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2585 Location: Sydney
|
Posted: Mon Dec 11, 2017 1:06 am Post subject: |
|
|
Ken,
I tried your examples using FTN95 Ver 8.20 and fond /64 to be faster for second test but slower for first (COMPLEX) test.
I made a few changes to provide some more statistics.
1) the timers record both elapsed and CPU time, to give an indication of the threading efficiencies.
2) you only need local (non saved) copies of variables that change in the thread calculation.
ds, r1,x1,ns etc don't need local copies as they are never changed,
while s_loc, tmax_loc, t_loc, k_loc and i_loc need to have local copies.
3) can demonstrate multi-thread without routine search2, as search1 is thread safe. (reduces code size, which is critical for this forum)
My batch file to test different compile options is
Code: | del %1.exe
ftn95 %1 /link /echo_options >> %1.log
%1 >> %1.log
del %1.exe
ftn95 %1 /link /64 /echo_options >> %1.log
%1 >> %1.log
del %1.exe
ftn95 %1 /link /opt /echo_options >> %1.log
%1 >> %1.log
del %1.exe
ftn95 %1 /link /64 /opt /echo_options >> %1.log
%1 >> %1.log
notepad %1.log |
/echo_options is available in 8.20 and works well in this case.
Code: | module data_mod
implicit none
integer, parameter :: dp=kind(1.d0)
real(kind=dp) ds
real(kind=dp) r1,x1,x2,r2,v1,ns,tmax(2)
integer,parameter::IO_LOCK = 42
real*4 times(2)
contains
subroutine time_start
call get_times (times)
end subroutine time_start
subroutine time_end
real*4 now(2)
call get_times (now)
times = now - times
end subroutine time_end
subroutine get_times (now)
real*4 now(2)
call elapse_time (now(1))
call cpu_time (now(2))
end subroutine get_times
subroutine elapse_time (seconds)
real*4 seconds
integer*8 tick, rate
call system_clock (tick, rate)
seconds = dble(tick) / dble(rate)
end subroutine elapse_time
subroutine search1 (ithread)
include<windows.ins>
integer, intent(in) :: ithread
real(kind=dp) s_loc
integer n_do
call lock@(IO_LOCK)
write(6,*) 'Starting A thread ',ithread
call unlock@(IO_LOCK)
s_loc = epsilon(1.d0)
n_do = 0
do while (s_loc <= 0.5d0)
s_loc = s_loc + ds
n_do = n_do+1
end do
call lock@(IO_LOCK)
write(6,*) 'Completed A thread ',ithread, n_do, s_loc
call unlock@(IO_LOCK)
end subroutine search1
subroutine searchX (ithread)
include<windows.ins>
integer, intent(in) :: ithread
real(kind=dp) s_loc, tmax_loc, t_loc, k_loc
complex(kind=dp) i_loc
integer n_do
call lock@(IO_LOCK)
write(6,*) 'Starting O1 thread ',ithread
call unlock@(IO_LOCK)
tmax_loc = 0.d0
k_loc = 90.d0/(4.d0*atan(1.d0)*ns)
s_loc = epsilon(1.0)
n_do = 0
do while (s_loc <= 0.5d0)
i_loc = v1/cmplx((r1+r2/s_loc),(x1+x2),kind=dp)
t_loc = (k_loc)*(abs(i_loc)*abs(i_loc)*(r2/s_loc))
if (t_loc > tmax_loc) tmax_loc = t_loc
s_loc = s_loc + ds
n_do = n_do+1
end do
call lock@(ithread)
tmax(ithread) = tmax_loc
call unlock@(ithread)
call lock@(IO_LOCK)
write(6,*) 'Completed O1 thread',ithread, n_do, tmax_loc
call unlock@(IO_LOCK)
end subroutine searchX |
Last edited by JohnCampbell on Mon Dec 11, 2017 1:56 am; edited 4 times in total |
|
Back to top |
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2585 Location: Sydney
|
Posted: Mon Dec 11, 2017 1:11 am Post subject: |
|
|
Code: | subroutine do_tests
include<windows.ins>
real*4 tserial(2), tparallel(2)
integer(kind=7) h1, h2
integer kk
r1 = 0.295d0 ; r2 = 0.150d0 ; x1 = 0.510d0 ; x2 = 0.210d0 ; v1 = 200.d0/sqrt(3.d0) ; ns = 1200.d0
do kk = 1,2
write(6,*) 'Serial ', kk
call time_start
if ( kk==2 ) then
ds = 1.d-9
call search1(1)
call search1(2)
else
ds = 1.d-8
call searchX (1)
call searchx (2)
end if
call time_end
tserial = times
write(6,*) 'Elapsed time serial', tserial
write(6,*)
write(6,*) 'Parallel ', kk
call time_start
if ( kk==2 ) then
h1 = start_thread@ (search1,1)
h2 = start_thread@ (search1,2)
else
h1 = start_thread@ (searchX,1)
h2 = start_thread@ (searchx,2)
end if
call wait_for_thread@ (0)
call time_end
tparallel = times
write(6,*) 'Elapsed time parallel', tparallel
write(6,*)
write(6,*) 'x increase in speed ', tserial/tparallel
end do
end subroutine do_tests
end module data_mod
program main1
use data_mod
call do_tests
end |
|
|
Back to top |
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2585 Location: Sydney
|
Posted: Mon Dec 11, 2017 1:15 am Post subject: |
|
|
results include Code: | [FTN95/Win32 Ver. 8.20.0 Copyright (c) Silverfrost Ltd 1993-2017]
[Current options] ERROR_NUMBERS;IMPLICIT_NONE;INTL;LINK;LOGL; 32-bit
Elapse CPU
Serial 1 Elapsed time serial 4.04749 4.04043
Parallel 1 Elapsed time parallel 1.85291 3.66602
x increase in speed 2.18440 1.10213
Serial 2 Elapsed time serial 3.65540 3.63482
Parallel 2 Elapsed time parallel 1.85205 3.61922
x increase in speed 1.97370 1.00431
[Current options] 64;ERROR_NUMBERS;IMPLICIT_NONE;INTL;LINK;LOGL; 64-bit
Serial 1 Elapsed time serial 6.61194 6.55204
Parallel 1 Elapsed time parallel 3.34155 6.56764
x increase in speed 1.97870 0.997625
Serial 2 Elapsed time serial 3.36658 3.33842
Parallel 2 Elapsed time parallel 1.71411 3.32282
x increase in speed 1.96404 1.00469
|
|
|
Back to top |
|
|
jalih
Joined: 30 Jul 2012 Posts: 196
|
Posted: Mon Dec 11, 2017 8:01 am Post subject: |
|
|
Anyone tested results with lock free code? Locking is expensive. IO should be taken out from threads and memory structures should be organized to allow lock free code. |
|
Back to top |
|
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8018 Location: Salford, UK
|
Posted: Mon Dec 11, 2017 11:45 am Post subject: |
|
|
John has correctly identified the source of the problem...
CLOCK@ gives the elapsed time for 32 bits but the CPU time for 64 bits.
The CPU time combines the times for both threads.
The treading is working correctly and the reduction in elapsed time is comparable between 32 bits and 64 bits.
One way to correctly calculate the elapsed time is to call SYSTEM_CLOCK.
We should probably treat the issue with CLOCK@ as a bug in the 64 bit library although this will only be apparent when using threading. |
|
Back to top |
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2585 Location: Sydney
|
Posted: Mon Dec 11, 2017 12:58 pm Post subject: |
|
|
The following is an example where the number of threads can be varied more easily.
The same routine is called for multiple threads, with the "arg" indicating which thread is being called.
It appear to work, giving the same answer.
Code: | module data_mod
implicit none
integer, parameter :: dp=kind(1.d0)
real(kind=dp) ds
real(kind=dp) r1,x1,x2,r2,v1,ns,tmax(4),smax(4)
integer,parameter::IO_LOCK = 42
integer max_threads
real*4 times(2)
contains
subroutine time_start
call get_times (times)
end subroutine time_start
subroutine time_end
real*4 now(2)
call get_times (now)
times = now - times
end subroutine time_end
subroutine get_times (now)
real*4 now(2)
call elapse_time (now(1))
call cpu_time (now(2))
end subroutine get_times
subroutine elapse_time (seconds)
real*4 seconds
integer*8 tick, rate
call system_clock (tick, rate)
seconds = dble(tick) / dble(rate)
end subroutine elapse_time
subroutine search1 (ithread)
include<windows.ins>
integer, intent(in) :: ithread
real(kind=dp) s_loc, slim
integer n_do
call lock@(IO_LOCK)
write(6,*) 'Starting A thread ',ithread
call unlock@(IO_LOCK)
slim = dble(ithread)/dble(max_threads)
s_loc = epsilon(1.0d0) + slim - 1.0d0/dble(max_threads)
n_do = 0
do while (s_loc <= slim)
s_loc = s_loc + ds
n_do = n_do+1
end do
call lock@(IO_LOCK)
write(6,*) 'Completed A thread ',ithread, n_do, s_loc
call unlock@(IO_LOCK)
end subroutine search1
subroutine searchX (ithread)
include<windows.ins>
integer, intent(in) :: ithread
real(kind=dp) s_loc, slim, tmax_loc, t_loc, k_loc, smax_loc
complex(kind=dp) i_loc
integer n_do
call lock@ (IO_LOCK)
write(6,*) 'Starting C1 thread',ithread
call unlock@ (IO_LOCK)
tmax_loc = 0.d0
k_loc = 90.d0/(4.d0*atan(1.d0)*ns)
slim = dble(ithread)/dble(max_threads)
s_loc = epsilon(1.0d0) + slim - 1.0d0/dble(max_threads)
n_do = 0
do while (s_loc <= slim)
i_loc = v1/cmplx((r1+r2/s_loc),(x1+x2),kind=dp)
!!! t_loc = (k_loc)*(abs(i_loc)*abs(i_loc)*(r2/s_loc))
t_loc = (k_loc)*((abs(i_loc)**2)*(r2/s_loc))
if (t_loc > tmax_loc) then
tmax_loc = t_loc
smax_loc = s_loc
end if
s_loc = s_loc + ds
n_do = n_do+1
end do
!z call lock@ (ithread) ! ithread should be hh(ithread) ??
tmax(ithread) = tmax_loc
smax(ithread) = smax_loc
!z call unlock@ (ithread)
call lock@ (IO_LOCK)
write(6,*) 'Completed C1 thread',ithread, n_do, tmax_loc
call unlock@ (IO_LOCK)
end subroutine searchX
|
|
|
Back to top |
|
|
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
Powered by phpBB © 2001, 2005 phpBB Group
|