Silverfrost Forums

Welcome to our forums

Two multithreading programs

25 Apr 2013 4:59 (Edited: 25 Apr 2013 7:42) #12101

To run the programs in parallel this compiler has one neat trick related to NET which i'd like to use. Here are two small programs which illustrate the approach. One is straightforward, it uses separate subroutines for each thread and works fine, it gives you speedup almost proportional to amount of processors. And another, of more general form, simpler and smaller and where you can divide the workload on arbitrary amount of threads, and hence much more practical, by unknown reason is unstable.

What is in the program - just the DO loops doing some fake simulation 200M times. First do loop works on single processor and gives you estimation of usual single-threaded CPU time. Then in the first program, which works fine, i explicitly start 4 threads (my PC has 4-core processor, and though it can run 8 independent threads the speedup is not 8 times of course because of only 4 floating point units per CPU) each thread takes 1/4 of the DO loop cycles. In the second program, which is unstable, I use another DO loop which creates and starts 4 threads using one single subroutine where the workload is automatically divided on needed amount of threads (i took 4 by mentioned above reason). While first program sustains any torture, but is not how parallel programs have to be written with 10, 100 or more processors, with the second I can not even get notification that first thread is started...

Please run both programs and tell me what you get. Any hints of making the second example stable are appreciated.

/* Code is almost 100% standard Fortran, except just one new command LOCK to avoid conflicts of threads which does not need explanation and calls to the couple FTN95 library functions. Also, after playing a bit with the code you can decrease first cycle 10 times so it will not annoy you. By the way the example for just two threads is here https://forums.silverfrost.com/Forum/Topic/1894&postdays=0&postorder=asc&highlight=f2003&start=30

! Multithreading example parallel4a.f95 
! Dan R Right 2012 
! 
! Watch in Task Manager how four threads run1,... run4 
! grab four  processor cores working in parallel 
! 
! Compilation: ftn95 parallel4a.f95 /clr /link /multi_threaded 
! 
  include <clearwin.ins> 
  EXTERNAL run1, run2, run3, run4 
  common /abc_/kEnded1, kended2, kended3, kended4  
  common /abc2_/itotal
  common /threads_/nThreads
  character*1 cha 

!..... straight non-threaded run 

  print*,' Run w/o threads started' 
  nThreads=1
  call clock@ (time_start) 
     itotal = 0
     d=2.0 

      do i=1,200000000/nThreads
       d=alog(exp(d)) 
       itotal = itotal +1	
      enddo 

  call clock@ (time_finish) 
  print*, 'Elapsed time without threads=', time_finish-time_start, itotal

!...multithreaded 
  nThreads = 4
  call clock@ (time_start) 
 
    itotal = 0
    CALL CREATE_THREAD@(run1,21) 
    CALL CREATE_THREAD@(run2,22) 
    CALL CREATE_THREAD@(run3,23) 
    CALL CREATE_THREAD@(run4,24) 

 !...wait till all threads end 
  do while (kEnded1==0.or.kEnded2==0.or.kEnded3==0.or.kEnded4==0) 
    call sleep1@(0.1) 
  enddo 



  call clock@ (time_finish) 

  print*, 'Elapsed time with threads=', time_finish-time_start, itotal 
  print*, 'Enter any key+Enter to exit' 
  read(*,*) cha 
  END 

!=============================================================

  subroutine run1() 
  include <clearwin.ins> 
  common /abc_/kEnded1, kended2, kended3, kended4  
  common /abc2_/itotal
  common /threads_/nThreads

     lock;   print*,'Thr.1 started'; end lock 
     d      =2.0 
     itot   =0
     kEnded1=0 

      do i=1,200000000/nThreads
       d=alog(exp(d)) 
       itot = itot+1
      enddo 

     lock; itotal = itotal +itot  ;end lock 
     kEnded1=1 
     lock;   print*,'Thr.1 ended' ; end lock 
  end 
25 Apr 2013 5:01 (Edited: 25 Apr 2013 5:31) #12102

...continue...END OF THE FIRST PROGRAM

!-----------------------------------------
  subroutine run2() 
  include <clearwin.ins> 
  common /abc_/kEnded1, kended2, kended3, kended4  
  common /abc2_/itotal
  common /threads_/nThreads

     lock;   print*,'Thr.2 started' ;end lock 
     d      =2.0 
     itot   =0
     kEnded2=0 

      do i=1,200000000/nThreads
       d=alog(exp(d))
       itot = itot+1
      enddo 

     lock; itotal = itotal +itot  ;end lock 
     kEnded2=1 
     lock;   print*,'Thr.2 ended' ;end lock 
  end

!--------------------------------------------- 
  subroutine run3() 
  include <clearwin.ins> 
  common /abc_/kEnded1, kended2, kended3, kended4  
  common /abc2_/itotal
  common /threads_/nThreads

     lock;   print*,'Thr.3 started' ;end lock 
     d      =2.0 
     itot   =0
     kEnded3=0 

      do i=1,200000000/nThreads 
       d=alog(exp(d))
       itot = itot+1
      enddo 

     lock; itotal = itotal +itot  ;end lock 
     kEnded3=1 
     lock;   print*,'Thr.3 ended' ;end lock 
  end

!--------------------------------------------- 
  subroutine run4() 
  include <clearwin.ins> 
  common /abc_/kEnded1, kended2, kended3, kended4  
  common /abc2_/itotal
  common /threads_/nThreads

     lock;   print*,'Thr.4 started' ;end lock 
     d      =2.0 
     itot   =0
     kEnded4=0 

      do i=1,200000000/nThreads
       d=alog(exp(d))
       itot = itot+1
      enddo 

     lock; itotal = itotal +itot  ;end lock 
     kEnded4=1 
     lock;   print*,'Thr.4 ended' ;end lock 
  end
25 Apr 2013 5:03 #12103

HERE IS SECOND PROGRAM

! Multithreading example parallelN.f95 
! Dan R Right 2013 
! 
! Watch in Task Manager how threads 
! grab processor cores working in parallel 
! 
! Compilation: ftn95 parallelN.f95 /clr /link /multi_threaded 
 
  include <clearwin.ins> 
  EXTERNAL runN
  parameter (nThrMax=8)
  common /threads_/nEmployedThreads, iCurrThread, kEnded(nThrMax)
  common /abc1_/itotal1, itotal2, itot(nThrMax)
  character*1 cha 

 !................ straight non-threaded run 

  kEnded(:) = 1
  itot(:)   = 0

  print*,' Run w/o threads started' 
  nEmployedThreads=1
  iCurrThread = 1

  call clock@ (time_start) 
     itotal1 = 0
     itot(iCurrThread)=0
     d=2.22 

      do i=1,200000000/nEmployedThreads
       d=alog(exp(d)) 
       itotal1 = itotal1 + 1	
       itot(iCurrThread) = itot(iCurrThread) + 1 
      enddo 

  call clock@ (time_finish) 
  print*, 'Elapsed time without threads=', time_finish-time_start
  print*, 'Checksum: totals using two methods = ', itotal1, itot(iCurrThread)

!...................multithreaded .............

  nEmployedThreads = 4
  call clock@ (time_start) 
 
    itotal1 = 0
    itotal2 = 0
    itot(:) = 0

   do iCurrThread = 1, nEmployedThreads
    CALL CREATE_THREAD@(runN,20+iCurrThread) 
   enddo

 !...wait till all threads end 
  do while (minval(kEnded)==0) 
    call sleep1@(0.1) 
  enddo 

  call clock@ (time_finish) 

  print*, 'Elapsed time with threads=', time_finish-time_start
  print*, 'Checksum: totals using three methods = ', itotal1, itotal2, sum(itot)
  print*, 'Enter any key+Enter to exit' 
  read(*,*) cha 
  END 

! =====================================================================
  subroutine runN() 
  include <clearwin.ins> 
  parameter (nThrMax=8)
  common /threads_/nEmployedThreads, iCurrThread, kEnded(nThrMax)
  common /abc1_/itotal1, itotal2, itot(nThrMax)

     ithr = iCurrThread
     lock;   print*,'Thr. started = ', ithr ; end lock 

     d      =2.22 
     itot  (ithr)=0
     itot2       =0
     kEnded(ithr)=0 

      do i=1,200000000/nEmployedThreads
       d=alog(exp(d)) 
       itot(ithr) = itot(ithr)+1
       itot2      = itot2 + 1 
      enddo 

     lock; itotal1 = itotal1 + itot(ithr)  ;end lock 
     lock; itotal2 = itotal2 + itot2       ;end lock 
     lock; print*, 'Thr.  ended = ', ithr  ;end lock 

     kEnded(ithr)=1 
  end 
25 Apr 2013 6:32 #12104

In your second program, you should pass the thread number as an argument to the subroutine runN? This is probably what is wrong.

If I can, I will post a working example of multi-threading with FTN95 in a few days; just a little busy at the moment.

25 Apr 2013 7:13 #12105

Thanks for the critical point. Yes, adding integer argument to RunN (m) can reveal the thread handle (the one which is = 20+iCurrThread), which could be used for thread counting.

When I added argument m to RunN and changed one line ithr = iCurrThread

to ithr = m-20

all started working OK. So any guess why iCurrThread passed via common was wrong? Is there a chance that anything passed via common could be wrong ? When we clear that and formulate some simple rules what's to do and what's not i think anyone will be able safely write parallel programs using FTN95

25 Apr 2013 9:35 #12106

Data in common will probably be Shared amongst the threads, so it isn't possible for each thread to have different sets of common variables.

On the other hand, arguments passed as arguments, and local variables in the subroutine, will be stored on the stack. There should be a different stack for each thread.

However, this isn't the point. The subroutine runN requires an integer argument is passed to it. If you leave this out you will be potentially corrupting the call stack frame at each call and the behaviour you wil get will be unpredictable.

You may find it helpful to read up on the difference between shared and private variables in OpenMP as the distinction used there is what you need to be aiming for with your own codes using FTN95. For example, you will learn it is OK for each thread to read data in shared variables, but to write requires the use of locks.

26 Apr 2013 12:37 #12107

Quoted from davidb Data in common will probably be Shared amongst the threads, so it isn't possible for each thread to have different sets of common variables.

That transfer of data via common block used that exactly property. Again, as it is seen from the code above, the DO loop sets the variable iCurrThread which is in the common block, then launches the thread with RunN which also has this common block inside, the thread picks up this value, assignes it to local variable ithr to keep it unchanged for this specific thread for further use and what we have got was unbelievable: the variable was immediately different from one it was sent! So the reason most probably as you have mentioned was corruption due to absent argument in RunM though the first code was working OK without this argument no matter how many threads used. Or the reason was different because the iCurrThread variable is still passed via common with wrong value? But (I am still curious) may be someone got the problem even with the first program example above?

Update: Looks like threads are launched by the DO loop and the code inside the threads start running absolutely independently, no one waits for another and we can not tell in which order. So i suspect this is the reason of wrong iCurrThread: the DO loop succeeds to launch two threads before the first thread picks up the variable.

26 Apr 2013 6:16 #12109

You cannot pass the thread number in COMMON. When you run subroutines using different threads, they run 'concurrently' in PARALLEL, or at least overlap with each other in time. This is the whole point of using multiple threads on different processors or cores - to get some parts of your code running in parallel.

In your second code, each iteration of the loop sets up one thread and sets up iCurrThread (the thread number) in COMMON. The first iteration of your loop calls runN which enters an expensive DO Loop (i=1, 200000000). At the start of this loop, the value of iCurrThread in memory is correct. However, before this loop can complete, a second call to runN is made with a different thread. This second call changes the value of iCurrThread whilst the first thread is still running runN which leads to incorrect results. The problem is made worse when the third and fourth threads start up.

The only way to fix this is to pass the thread number as an argument to the subroutine you want to call. Each thread will then pass a different value on the stack and each call of runN can safely run in parallel.

Your code is a little bit unusual in the way you have partitioned the work between threads. I have made another Post showing how to do this in a more general way, which people may find useful.

I have run your modified code 2 and it does give me an improvement, but I only have 2 cores here. I would be interested to see what performance you are getting with 1 thread, 4 threads and 8 threads, where the last is using hyperthreading.

27 Apr 2013 3:26 #12115

Quoted from davidb

Your code is a little bit unusual in the way you have partitioned the work between threads.

It's a little bit hard to follow also... Some multithreaded matrix multiplication test would be interesting.

I wrote a simple multithreaded 1000x1000 sized square matrix multiplication test using PL/I-language, available here with source and binary.

27 Apr 2013 4:54 #12116

Well, here is the same code #2 after simplification, beautification and syntactic sugar together with the runtime results. Total CPU time is elapsed time * amount of threads. Computer - i7 laptop, 2.2GHz, 4 cores, 8 threads set in BIOS, Windows7.

It can not be simpler and more self-explanatory than that. Code creates independent threads running subroutine runN (iThrHandle) using CALL CREATE_THREAD@(runN,20+i) where iThrHandle = 20+i is thread handle and follows the moment they finish when all flags kThreadEnded are =1. Exactly the same has been generally done by any other parallelization methods like OpenMP, CUDA etc when the threads could be independent. Add into subroutine RunN any specific task you want, Monte-Carlo, matrix operations, parallelizing DO loops etc partitioning them the way you want. Set when start the thread kThreadEnded(ithr) = 0 and make it =1 when finished. That's all. I set parameter nThrMax=8, use your own CPU cores/threads count. You may also consider adding or removing tiny delay when launching the threads (see the line call sleep1@(0.02) ) if 'threads jam' (?) conflict takes place when amount of threads is very large (why? this is the question to FTN95 developers. This is second time i find necessity for tiny delay with NET, first time it was for Clearwin+ GUI Run-Pause-Stop code, it was posted here 3-4 years ago).

And the results as a function of # of threads are great, the speedup on 4 cores is like on almost 8-core CPU. Isn't this amazing?

1 Elapsed time, total CPU time= 15.6210 15.6210 2 Elapsed time, total CPU time= 7.94200 15.8840 4 Elapsed time, total CPU time= 4.07800 16.3120 5 Elapsed time, total CPU time= 3.39700 16.9850 6 Elapsed time, total CPU time= 2.98200 17.8920 8 Elapsed time, total CPU time= 2.49800 19.9840

! Compilation: ftn95 parallelN4.f95 /clr /link /multi_threaded 
! Code is yours. Use it at your own risk
!
  
  include <clearwin.ins> 
  EXTERNAL runN 
  parameter (nThrMax=8) 
  common /threads_/nEmployedThreads, kThreadEnded(nThrMax) 
  character*80 getenv@

  write(*,*) 'Processor ', getenv@('PROCESSOR_IDENTIFIER')
  READ(getenv@('NUMBER_OF_PROCESSORS'),*) n_processorsTotal
  write(*,*) ' Max number of threads=', n_processorsTotal 

1 print*,' Enter number of parallel threads <= 8' 
  read(*,*)   nEmployedThreads 
  if(nEmployedThreads.lt.1.or.nEmployedThreads.gt.8) nEmployedThreads=4

  call clock@ (time_start) 

!...set a flag of thread finished
  kThreadEnded(:)=1 

  do i = 1, nEmployedThreads 
     CALL CREATE_THREAD@(runN,20+i) 
     call sleep1@(0.02) 
  enddo 

!...wait till all threads finish 
  do while (minval(kThreadEnded)==0) 
    call sleep1@(0.1) 
  enddo 

  call clock@ (time_finish) 

  time = time_finish-time_start
  time2= time * nEmployedThreads
  print*, 'Elapsed time, total CPU time=', time, time2

  goto 1

  END 

! =======================================================
  subroutine runN (iThrHandle) 
  include <clearwin.ins> 
  parameter (nThrMax=8) 
  common /threads_/nEmployedThreads, kThreadEnded(nThrMax) 

  ithr = iThrHandle-20
  kThreadEnded(ithr) = 0 

  lock;   print*,'Started thread # ', ithr ; end lock 

  d      =2.22 

  do i=1,200000000/nEmployedThreads 
    d=alog(exp(d)) 
  enddo 

  lock;     print*, 'Ended   thread # ', ithr  ;     end lock 
  kThreadEnded(ithr)=1 
  end 
23 Jun 2013 8:42 #12463

Wondering why this functionality is available only in NET...Would be great if Silverfrost made it for regular x86 mode.

24 Jun 2013 7:30 #12467

I have this as a high priority on my list of things to do.

25 Jun 2013 7:22 #12481

THAT deserves BIG THANKS!

3 Jul 2013 2:52 #12552

I have uploaded a new beta version of salflibc.dll to http://www.silverfrost.com/beta/salflibc.exe

This contains new Win32 thread functions illustrated in the following sample. The new routines are described in the file cwplus.enh that is included in the download.

module threadMod
 c_external start_thread@    '__start_thread' (REF,REF):integer*4
 c_external wait_for_thread@ '__wait_for_thread' (VAL)
 c_external lock@            '__lock' (VAL)
 c_external unlock@          '__unlock' (VAL)
 integer,parameter::IO_LOCK = 42    !Any value. Your choice
contains
 subroutine threadFunc(count)
  integer count,start
  if(count < 0) return  !Illustrates an abort
  start = count
  do while(count > 0)
    call sleep1@(1.0)
    call lock@(IO_LOCK)
    print*, 'threadFunc ', start, count
    call unlock@(IO_LOCK)
    count = count - 1
  end do
 end subroutine threadFunc
end threadMod

program Threads
use threadMod
integer hThread(3)
hThread(1) = start_thread@(threadFunc, 6) !Run for 6 seconds
hThread(2) = start_thread@(threadFunc, 3) !Run for 3 seconds
hThread(3) = start_thread@(threadFunc, 7) !Run for 7 seconds
call wait_for_thread@(hThread(2))
call wait_for_thread@(0)
end
4 Jul 2013 5:28 #12558

I have edited the above sample program and removed a call to ExitThread that was not needed because the wrapper function in the library makes this call.

4 Jul 2013 2:51 (Edited: 4 Jul 2013 3:26) #12561

Thanks Paul for the efforts, this implementation makes parallelization as easy as 2x2.
My first observations. The parameter passed to subroutine threadFunc is somehow changed in between by one as you can see in this example. This probably needs your attention, it is important to have exactly the same values

module threadMod 
  c_external start_thread@    '__start_thread' (REF,REF):integer*4 
  c_external wait_for_thread@ '__wait_for_thread' (VAL) 
  c_external lock@            '__lock' (VAL) 
  c_external unlock@          '__unlock' (VAL) 
  integer,parameter::IO_LOCK = 42    !Any value. Your choice 
  integer ::  nEmployedThreads   

 contains 
  subroutine threadFunc(ithread) 
  call lock@(IO_LOCK);  print*, 'Started thread # ', ithread; call unlock@(IO_LOCK) 

   d=2.22 
   do i=1,200000000/nEmployedThreads 
     d=alog(exp(d)) 
   enddo 

  call lock@(IO_LOCK);  print*, 'Ended thread # ', ithread;  call unlock@(IO_LOCK) 

  end subroutine threadFunc 
 end threadMod 
!-------------------------------------------------------------------------------
 program Threads2 
 use threadMod 
 integer hThread(8) 

    TimeFor1Thread = 1.e20

100 print*,' Enter number of parallel threads <= 8. Run one thread few times first' 
   read(*,*)   nEmployedThreads 
   if(nEmployedThreads.lt.1.or.nEmployedThreads.gt.8) nEmployedThreads=4 

   call clock@ (time_start)

   do i = 1, nEmployedThreads 
      hThread(i) = start_thread@(threadFunc, i) 
   enddo

 call wait_for_thread@(0) 

   call clock@ (time_finish) 

   time = time_finish-time_start 
   time2= time * nEmployedThreads 
   if(nEmployedThreads.eq.1) then
    if(TimeFor1Thread.gt.time) TimeFor1Thread=time
   endif

   print*, 'Elapsed time, total CPU time=', time, time2
   print*, 'SPEEDUP=', TimeFor1Thread/time

   goto 100

 end

Another observation. The speedups i get are much smaller then amount of processors and way smaller then with your incredible and not yet completely explainable NET implementation. I get 6.6x speedups in NET and only 2.6x with this x86 method on my 4 core/8thread PC. The NET code is shown above at the beginning of this thread but i modified it a bit to demonstrate speedups easily. That says that some fine tuning of this new method is still needed

! Compilation: ftn95 filename.f95 /clr /link /multi_threaded 
   
   include <clearwin.ins> 
   EXTERNAL runN 
   parameter (nThrMax=8) 
   common /threads_/nEmployedThreads, kThreadEnded(nThrMax) 
   character*80 getenv@ 

   write(*,*) 'Processor ', getenv@('PROCESSOR_IDENTIFIER') 
   READ(getenv@('NUMBER_OF_PROCESSORS'),*) n_processorsTotal 
   write(*,*) ' Max number of threads=', n_processorsTotal 

    TimeFor1Thread = 1.e20
 100 print*,' Enter number of parallel threads <= 8. Run one thread few times first' 
   read(*,*)   nEmployedThreads 
   if(nEmployedThreads.lt.1.or.nEmployedThreads.gt.8) nEmployedThreads=4 

   call clock@ (time_start) 

 !...set a flag of thread finished 
   kThreadEnded(:)=1 

   do i = 1, nEmployedThreads 
      CALL CREATE_THREAD@(runN,i) 
      call sleep1@(0.02) 
   enddo 

 !...wait till all threads finish 
   do while (minval(kThreadEnded)==0) 
     call sleep1@(0.1) 
   enddo 

   call clock@ (time_finish) 

   time = time_finish-time_start 
   time2= time * nEmployedThreads 
   if(nEmployedThreads.eq.1) then
    if(TimeFor1Thread.gt.time) TimeFor1Thread=time
   endif
   print*, 'Elapsed time, total CPU time=', time, time2
   print*, 'SPEEDUP=', TimeFor1Thread/time


   goto 100 

   END 
4 Jul 2013 3:18 #12562

Continuation of the NET code

!==============================================
   subroutine runN (iThrHandle) 
   include <clearwin.ins> 
   parameter (nThrMax=8) 
   common /threads_/nEmployedThreads, kThreadEnded(nThrMax) 

   kThreadEnded(iThrHandle) = 0 

   lock;   print*,'Started thread # ', iThrHandle ; end lock 

   d=2.22 
   do i=1,200000000/nEmployedThreads 
     d=alog(exp(d)) 
   enddo 

   lock;     print*, 'Ended   thread # ', iThrHandle  ;     end lock 
   kThreadEnded(iThrHandle)=1 
   end
4 Jul 2013 5:22 #12563

Quoted from DanRRight My first observations. The parameter passed to subroutine threadFunc is somehow changed in between by one as you can see in this example. This probably needs your attention, it is important to have exactly the same values

Dan,

You can't use loop counter directly as thread parameter. Remember, you are passing a pointer to parameter for the thread, not a a value. Now all your threads share the same pointer as parameter and may also run at any order.

You should put all thread parameters into array and use the loop counter as array index counter:

hThread(i) = start_thread@(threadFunc, params(i)) 
4 Jul 2013 6:33 (Edited: 5 Jul 2013 10:30) #12564

Yes, I remember that from your approach. But I hope Paul worked that out to remove using LOC somehow. May be it's just me but i find using pointer adds some mind melting twist to the whole generally simple idea, or at least rises the question why which can potentially stop people from trying new things if not explained well.

P.S. Anyway,looks like Paul's approach does not need LOC, but still needs an array. Decently, I need a fresh head to understand why. The modified code which shows threads correctly is here

module threadMod 
   c_external start_thread@    '__start_thread' (REF,REF):integer*4 
   c_external wait_for_thread@ '__wait_for_thread' (VAL) 
   c_external lock@            '__lock' (VAL) 
   c_external unlock@          '__unlock' (VAL) 
   integer,parameter::IO_LOCK = 42    !Any value. Your choice 
   integer ::  nEmployedThreads    


  contains 
   subroutine threadFunc(ithread) 
   call lock@(IO_LOCK);  print*, 'Started thread # ', ithread; call unlock@(IO_LOCK) 

    d=2.22 
    do i=1,200000000/nEmployedThreads 
      d=alog(exp(d)) 
    enddo 

   call lock@(IO_LOCK);  print*, 'Ended thread # ', ithread;  call unlock@(IO_LOCK) 

   end subroutine threadFunc 
  end threadMod 
 !------------------------------------------------------------------------------- 
  program Threads2 
  use threadMod 
  integer hThread(8)
   integer :: iThreadNo(8) = (/1,2,3,4,5,6,7,8/)

    print*, 'Wait ...testing pure no-thread case'
    call clock@ (time_start) 
    nEmployedThreads = 1
    d=2.22 
    do i=1,200000000/nEmployedThreads 
      d=alog(exp(d)) 
    enddo 

    call clock@ (time_finish) 
    time = time_finish-time_start 
    print*, 'Pure no-thread case time=', time

     TimeFor1Thread=time 


 100 print*,' Enter number of parallel threads <= 8' 
    read(*,*)   nEmployedThreads 
    if(nEmployedThreads.lt.1.or.nEmployedThreads.gt.8) nEmployedThreads=4 

    call clock@ (time_start) 

    do i = 1, nEmployedThreads 
       hThread(i) = start_thread@(threadFunc, iThreadNo(i) )
    enddo 

  call wait_for_thread@(0) 

    call clock@ (time_finish) 

    time = time_finish-time_start 
    time2= time * nEmployedThreads 
    if(nEmployedThreads.eq.1) then 
     if(TimeFor1Thread.gt.time) TimeFor1Thread=time 
    endif 

    print*, 'Elapsed time, total CPU time=', time, time2 
    print*, 'SPEEDUP=', TimeFor1Thread/time 

    goto 100 

  end
4 Jul 2013 6:34 #12565

I have tested Jalih matrix multiplication program using his routines and his DLL and compared the results with those obtained using the new routines. The results are the same and using two processors I get half the single processor time as expected.

There is very little to optimise. Start_thread@ has almost no overhead and just calls on CreateThread. Lock@ uses a Critical Section approach and, though this may not be optimal, it should have little effect on the performance.

If there is clear evidence that .NET does much better then I will have to get inside the .NET code and find out what it is doing.

Please login to reply.