 |
forums.silverfrost.com Welcome to the Silverfrost forums
|
View previous topic :: View next topic |
Author |
Message |
DanRRight
Joined: 10 Mar 2008 Posts: 2923 Location: South Pole, Antarctica
|
Posted: Thu Apr 25, 2013 5:59 pm Post subject: Two multithreading programs |
|
|
To run the programs in parallel this compiler has one neat trick related to NET which i'd like to use. Here are two small programs which illustrate the approach. One is straightforward, it uses separate subroutines for each thread and works fine, it gives you speedup almost proportional to amount of processors. And another, of more general form, simpler and smaller and where you can divide the workload on arbitrary amount of threads, and hence much more practical, by unknown reason is unstable.
What is in the program - just the DO loops doing some fake simulation 200M times. First do loop works on single processor and gives you estimation of usual single-threaded CPU time. Then in the first program, which works fine, i explicitly start 4 threads (my PC has 4-core processor, and though it can run 8 independent threads the speedup is not 8 times of course because of only 4 floating point units per CPU) each thread takes 1/4 of the DO loop cycles. In the second program, which is unstable, I use another DO loop which creates and starts 4 threads using one single subroutine where the workload is automatically divided on needed amount of threads (i took 4 by mentioned above reason).
While first program sustains any torture, but is not how parallel programs have to be written with 10, 100 or more processors, with the second I can not even get notification that first thread is started...
Please run both programs and tell me what you get. Any hints of making the second example stable are appreciated.
/* Code is almost 100% standard Fortran, except just one new command LOCK to avoid conflicts of threads which does not need explanation and calls to the couple FTN95 library functions. Also, after playing a bit with the code you can decrease first cycle 10 times so it will not annoy you.
By the way the example for just two threads is here http://forums.silverfrost.com/viewtopic.php?t=2176&postdays=0&postorder=asc&highlight=f2003&start=30
Code: |
! Multithreading example parallel4a.f95
! Dan R Right 2012
!
! Watch in Task Manager how four threads run1,... run4
! grab four processor cores working in parallel
!
! Compilation: ftn95 parallel4a.f95 /clr /link /multi_threaded
!
include <clearwin.ins>
EXTERNAL run1, run2, run3, run4
common /abc_/kEnded1, kended2, kended3, kended4
common /abc2_/itotal
common /threads_/nThreads
character*1 cha
!..... straight non-threaded run
print*,' Run w/o threads started'
nThreads=1
call clock@ (time_start)
itotal = 0
d=2.0
do i=1,200000000/nThreads
d=alog(exp(d))
itotal = itotal +1
enddo
call clock@ (time_finish)
print*, 'Elapsed time without threads=', time_finish-time_start, itotal
!...multithreaded
nThreads = 4
call clock@ (time_start)
itotal = 0
CALL CREATE_THREAD@(run1,21)
CALL CREATE_THREAD@(run2,22)
CALL CREATE_THREAD@(run3,23)
CALL CREATE_THREAD@(run4,24)
!...wait till all threads end
do while (kEnded1==0.or.kEnded2==0.or.kEnded3==0.or.kEnded4==0)
call sleep1@(0.1)
enddo
call clock@ (time_finish)
print*, 'Elapsed time with threads=', time_finish-time_start, itotal
print*, 'Enter any key+Enter to exit'
read(*,*) cha
END
!=============================================================
subroutine run1()
include <clearwin.ins>
common /abc_/kEnded1, kended2, kended3, kended4
common /abc2_/itotal
common /threads_/nThreads
lock; print*,'Thr.1 started'; end lock
d =2.0
itot =0
kEnded1=0
do i=1,200000000/nThreads
d=alog(exp(d))
itot = itot+1
enddo
lock; itotal = itotal +itot ;end lock
kEnded1=1
lock; print*,'Thr.1 ended' ; end lock
end
|
Last edited by DanRRight on Thu Apr 25, 2013 8:42 pm; edited 17 times in total |
|
Back to top |
|
 |
DanRRight
Joined: 10 Mar 2008 Posts: 2923 Location: South Pole, Antarctica
|
Posted: Thu Apr 25, 2013 6:01 pm Post subject: |
|
|
...continue...END OF THE FIRST PROGRAM
Code: |
!-----------------------------------------
subroutine run2()
include <clearwin.ins>
common /abc_/kEnded1, kended2, kended3, kended4
common /abc2_/itotal
common /threads_/nThreads
lock; print*,'Thr.2 started' ;end lock
d =2.0
itot =0
kEnded2=0
do i=1,200000000/nThreads
d=alog(exp(d))
itot = itot+1
enddo
lock; itotal = itotal +itot ;end lock
kEnded2=1
lock; print*,'Thr.2 ended' ;end lock
end
!---------------------------------------------
subroutine run3()
include <clearwin.ins>
common /abc_/kEnded1, kended2, kended3, kended4
common /abc2_/itotal
common /threads_/nThreads
lock; print*,'Thr.3 started' ;end lock
d =2.0
itot =0
kEnded3=0
do i=1,200000000/nThreads
d=alog(exp(d))
itot = itot+1
enddo
lock; itotal = itotal +itot ;end lock
kEnded3=1
lock; print*,'Thr.3 ended' ;end lock
end
!---------------------------------------------
subroutine run4()
include <clearwin.ins>
common /abc_/kEnded1, kended2, kended3, kended4
common /abc2_/itotal
common /threads_/nThreads
lock; print*,'Thr.4 started' ;end lock
d =2.0
itot =0
kEnded4=0
do i=1,200000000/nThreads
d=alog(exp(d))
itot = itot+1
enddo
lock; itotal = itotal +itot ;end lock
kEnded4=1
lock; print*,'Thr.4 ended' ;end lock
end
|
Last edited by DanRRight on Thu Apr 25, 2013 6:31 pm; edited 1 time in total |
|
Back to top |
|
 |
DanRRight
Joined: 10 Mar 2008 Posts: 2923 Location: South Pole, Antarctica
|
Posted: Thu Apr 25, 2013 6:03 pm Post subject: |
|
|
HERE IS SECOND PROGRAM
Code: |
! Multithreading example parallelN.f95
! Dan R Right 2013
!
! Watch in Task Manager how threads
! grab processor cores working in parallel
!
! Compilation: ftn95 parallelN.f95 /clr /link /multi_threaded
include <clearwin.ins>
EXTERNAL runN
parameter (nThrMax=8)
common /threads_/nEmployedThreads, iCurrThread, kEnded(nThrMax)
common /abc1_/itotal1, itotal2, itot(nThrMax)
character*1 cha
!................ straight non-threaded run
kEnded(:) = 1
itot(:) = 0
print*,' Run w/o threads started'
nEmployedThreads=1
iCurrThread = 1
call clock@ (time_start)
itotal1 = 0
itot(iCurrThread)=0
d=2.22
do i=1,200000000/nEmployedThreads
d=alog(exp(d))
itotal1 = itotal1 + 1
itot(iCurrThread) = itot(iCurrThread) + 1
enddo
call clock@ (time_finish)
print*, 'Elapsed time without threads=', time_finish-time_start
print*, 'Checksum: totals using two methods = ', itotal1, itot(iCurrThread)
!...................multithreaded .............
nEmployedThreads = 4
call clock@ (time_start)
itotal1 = 0
itotal2 = 0
itot(:) = 0
do iCurrThread = 1, nEmployedThreads
CALL CREATE_THREAD@(runN,20+iCurrThread)
enddo
!...wait till all threads end
do while (minval(kEnded)==0)
call sleep1@(0.1)
enddo
call clock@ (time_finish)
print*, 'Elapsed time with threads=', time_finish-time_start
print*, 'Checksum: totals using three methods = ', itotal1, itotal2, sum(itot)
print*, 'Enter any key+Enter to exit'
read(*,*) cha
END
! =====================================================================
subroutine runN()
include <clearwin.ins>
parameter (nThrMax=8)
common /threads_/nEmployedThreads, iCurrThread, kEnded(nThrMax)
common /abc1_/itotal1, itotal2, itot(nThrMax)
ithr = iCurrThread
lock; print*,'Thr. started = ', ithr ; end lock
d =2.22
itot (ithr)=0
itot2 =0
kEnded(ithr)=0
do i=1,200000000/nEmployedThreads
d=alog(exp(d))
itot(ithr) = itot(ithr)+1
itot2 = itot2 + 1
enddo
lock; itotal1 = itotal1 + itot(ithr) ;end lock
lock; itotal2 = itotal2 + itot2 ;end lock
lock; print*, 'Thr. ended = ', ithr ;end lock
kEnded(ithr)=1
end
|
|
|
Back to top |
|
 |
davidb
Joined: 17 Jul 2009 Posts: 560 Location: UK
|
Posted: Thu Apr 25, 2013 7:32 pm Post subject: |
|
|
In your second program, you should pass the thread number as an argument to the subroutine runN? This is probably what is wrong.
If I can, I will post a working example of multi-threading with FTN95 in a few days; just a little busy at the moment. _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl |
|
Back to top |
|
 |
DanRRight
Joined: 10 Mar 2008 Posts: 2923 Location: South Pole, Antarctica
|
Posted: Thu Apr 25, 2013 8:13 pm Post subject: |
|
|
Thanks for the critical point. Yes, adding integer argument to RunN (m) can reveal the thread handle (the one which is = 20+iCurrThread), which could be used for thread counting.
When I added argument m to RunN and changed one line
ithr = iCurrThread
to
ithr = m-20
all started working OK. So any guess why iCurrThread passed via common was wrong? Is there a chance that anything passed via common could be wrong ? When we clear that and formulate some simple rules what's to do and what's not i think anyone will be able safely write parallel programs using FTN95 |
|
Back to top |
|
 |
davidb
Joined: 17 Jul 2009 Posts: 560 Location: UK
|
Posted: Thu Apr 25, 2013 10:35 pm Post subject: |
|
|
Data in common will probably be Shared amongst the threads, so it isn't possible for each thread to have different sets of common variables.
On the other hand, arguments passed as arguments, and local variables in the subroutine, will be stored on the stack. There should be a different stack for each thread.
However, this isn't the point. The subroutine runN requires an integer argument is passed to it. If you leave this out you will be potentially corrupting the call stack frame at each call and the behaviour you wil get will be unpredictable.
You may find it helpful to read up on the difference between shared and private variables in OpenMP as the distinction used there is what you need to be aiming for with your own codes using FTN95. For example, you will learn it is OK for each thread to read data in shared variables, but to write requires the use of locks. _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl |
|
Back to top |
|
 |
DanRRight
Joined: 10 Mar 2008 Posts: 2923 Location: South Pole, Antarctica
|
Posted: Fri Apr 26, 2013 1:37 am Post subject: Re: |
|
|
davidb wrote: | Data in common will probably be Shared amongst the threads, so it isn't possible for each thread to have different sets of common variables. |
That transfer of data via common block used that exactly property. Again, as it is seen from the code above, the DO loop sets the variable iCurrThread which is in the common block, then launches the thread with RunN which also has this common block inside, the thread picks up this value, assignes it to local variable ithr to keep it unchanged for this specific thread for further use and what we have got was unbelievable: the variable was immediately different from one it was sent!
So the reason most probably as you have mentioned was corruption due to absent argument in RunM though the first code was working OK without this argument no matter how many threads used. Or the reason was different because the iCurrThread variable is still passed via common with wrong value? But (I am still curious) may be someone got the problem even with the first program example above?
Update: Looks like threads are launched by the DO loop and the code inside the threads start running absolutely independently, no one waits for another and we can not tell in which order. So i suspect this is the reason of wrong iCurrThread: the DO loop succeeds to launch two threads before the first thread picks up the variable. |
|
Back to top |
|
 |
davidb
Joined: 17 Jul 2009 Posts: 560 Location: UK
|
Posted: Fri Apr 26, 2013 7:16 pm Post subject: |
|
|
You cannot pass the thread number in COMMON. When you run subroutines using different threads, they run "concurrently" in PARALLEL, or at least overlap with each other in time. This is the whole point of using multiple threads on different processors or cores - to get some parts of your code running in parallel.
In your second code, each iteration of the loop sets up one thread and sets up iCurrThread (the thread number) in COMMON. The first iteration of your loop calls runN which enters an expensive DO Loop (i=1, 200000000). At the start of this loop, the value of iCurrThread in memory is correct. However, before this loop can complete, a second call to runN is made with a different thread. This second call changes the value of iCurrThread whilst the first thread is still running runN which leads to incorrect results. The problem is made worse when the third and fourth threads start up.
The only way to fix this is to pass the thread number as an argument to the subroutine you want to call. Each thread will then pass a different value on the stack and each call of runN can safely run in parallel.
Your code is a little bit unusual in the way you have partitioned the work between threads. I have made another Post showing how to do this in a more general way, which people may find useful.
I have run your modified code 2 and it does give me an improvement, but I only have 2 cores here. I would be interested to see what performance you are getting with 1 thread, 4 threads and 8 threads, where the last is using hyperthreading. _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl |
|
Back to top |
|
 |
jalih
Joined: 30 Jul 2012 Posts: 196
|
Posted: Sat Apr 27, 2013 4:26 pm Post subject: Re: |
|
|
davidb wrote: |
Your code is a little bit unusual in the way you have partitioned the work between threads.
|
It's a little bit hard to follow also... Some multithreaded matrix multiplication test would be interesting.
I wrote a simple multithreaded 1000x1000 sized square matrix multiplication test using PL/I-language, available here with source and binary. |
|
Back to top |
|
 |
DanRRight
Joined: 10 Mar 2008 Posts: 2923 Location: South Pole, Antarctica
|
Posted: Sat Apr 27, 2013 5:54 pm Post subject: |
|
|
Well, here is the same code #2 after simplification, beautification and syntactic sugar together with the runtime results. Total CPU time is elapsed time * amount of threads. Computer - i7 laptop, 2.2GHz, 4 cores, 8 threads set in BIOS, Windows7.
It can not be simpler and more self-explanatory than that. Code creates independent threads running subroutine runN (iThrHandle) using CALL CREATE_THREAD@(runN,20+i) where iThrHandle = 20+i is thread handle and follows the moment they finish when all flags kThreadEnded are =1. Exactly the same has been generally done by any other parallelization methods like OpenMP, CUDA etc when the threads could be independent. Add into subroutine RunN any specific task you want, Monte-Carlo, matrix operations, parallelizing DO loops etc partitioning them the way you want. Set when start the thread kThreadEnded(ithr) = 0 and make it =1 when finished. That's all. I set parameter nThrMax=8, use your own CPU cores/threads count. You may also consider adding or removing tiny delay when launching the threads (see the line call sleep1@(0.02) ) if "threads jam" (?) conflict takes place when amount of threads is very large (why? this is the question to FTN95 developers. This is second time i find necessity for tiny delay with NET, first time it was for Clearwin+ GUI Run-Pause-Stop code, it was posted here 3-4 years ago).
And the results as a function of # of threads are great, the speedup on 4 cores is like on almost 8-core CPU. Isn't this amazing?
1 Elapsed time, total CPU time= 15.6210 15.6210
2 Elapsed time, total CPU time= 7.94200 15.8840
4 Elapsed time, total CPU time= 4.07800 16.3120
5 Elapsed time, total CPU time= 3.39700 16.9850
6 Elapsed time, total CPU time= 2.98200 17.8920
8 Elapsed time, total CPU time= 2.49800 19.9840
Code: |
! Compilation: ftn95 parallelN4.f95 /clr /link /multi_threaded
! Code is yours. Use it at your own risk
!
include <clearwin.ins>
EXTERNAL runN
parameter (nThrMax=8)
common /threads_/nEmployedThreads, kThreadEnded(nThrMax)
character*80 getenv@
write(*,*) 'Processor ', getenv@('PROCESSOR_IDENTIFIER')
READ(getenv@('NUMBER_OF_PROCESSORS'),*) n_processorsTotal
write(*,*) ' Max number of threads=', n_processorsTotal
1 print*,' Enter number of parallel threads <= 8'
read(*,*) nEmployedThreads
if(nEmployedThreads.lt.1.or.nEmployedThreads.gt.8) nEmployedThreads=4
call clock@ (time_start)
!...set a flag of thread finished
kThreadEnded(:)=1
do i = 1, nEmployedThreads
CALL CREATE_THREAD@(runN,20+i)
call sleep1@(0.02)
enddo
!...wait till all threads finish
do while (minval(kThreadEnded)==0)
call sleep1@(0.1)
enddo
call clock@ (time_finish)
time = time_finish-time_start
time2= time * nEmployedThreads
print*, 'Elapsed time, total CPU time=', time, time2
goto 1
END
! =======================================================
subroutine runN (iThrHandle)
include <clearwin.ins>
parameter (nThrMax=8)
common /threads_/nEmployedThreads, kThreadEnded(nThrMax)
ithr = iThrHandle-20
kThreadEnded(ithr) = 0
lock; print*,'Started thread # ', ithr ; end lock
d =2.22
do i=1,200000000/nEmployedThreads
d=alog(exp(d))
enddo
lock; print*, 'Ended thread # ', ithr ; end lock
kThreadEnded(ithr)=1
end |
|
|
Back to top |
|
 |
DanRRight
Joined: 10 Mar 2008 Posts: 2923 Location: South Pole, Antarctica
|
Posted: Sun Jun 23, 2013 9:42 pm Post subject: |
|
|
Wondering why this functionality is available only in NET...Would be great if Silverfrost made it for regular x86 mode. |
|
Back to top |
|
 |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8210 Location: Salford, UK
|
Posted: Mon Jun 24, 2013 8:30 am Post subject: |
|
|
I have this as a high priority on my list of things to do. |
|
Back to top |
|
 |
DanRRight
Joined: 10 Mar 2008 Posts: 2923 Location: South Pole, Antarctica
|
Posted: Tue Jun 25, 2013 8:22 am Post subject: |
|
|
THAT deserves BIG THANKS! |
|
Back to top |
|
 |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8210 Location: Salford, UK
|
Posted: Wed Jul 03, 2013 3:52 pm Post subject: |
|
|
I have uploaded a new beta version of salflibc.dll to http://www.silverfrost.com/beta/salflibc.exe
This contains new Win32 thread functions illustrated in the following sample.
The new routines are described in the file cwplus.enh that is included in the download.
Code: | module threadMod
c_external start_thread@ '__start_thread' (REF,REF):integer*4
c_external wait_for_thread@ '__wait_for_thread' (VAL)
c_external lock@ '__lock' (VAL)
c_external unlock@ '__unlock' (VAL)
integer,parameter::IO_LOCK = 42 !Any value. Your choice
contains
subroutine threadFunc(count)
integer count,start
if(count < 0) return !Illustrates an abort
start = count
do while(count > 0)
call sleep1@(1.0)
call lock@(IO_LOCK)
print*, "threadFunc ", start, count
call unlock@(IO_LOCK)
count = count - 1
end do
end subroutine threadFunc
end threadMod
program Threads
use threadMod
integer hThread(3)
hThread(1) = start_thread@(threadFunc, 6) !Run for 6 seconds
hThread(2) = start_thread@(threadFunc, 3) !Run for 3 seconds
hThread(3) = start_thread@(threadFunc, 7) !Run for 7 seconds
call wait_for_thread@(hThread(2))
call wait_for_thread@(0)
end
|
|
|
Back to top |
|
 |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8210 Location: Salford, UK
|
Posted: Thu Jul 04, 2013 6:28 am Post subject: |
|
|
I have edited the above sample program and removed a call to ExitThread that was not needed because the wrapper function in the library makes this call. |
|
Back to top |
|
 |
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
Powered by phpBB © 2001, 2005 phpBB Group
|