forums.silverfrost.com
Two multithreading programs
DanRRight



Posted: Thu Jul 04, 2013 3:51 pm

Thanks, Paul, for the effort. This implementation makes parallelization as easy as 2x2.
My first observation: the parameter passed to subroutine threadFunc somehow changes by one along the way, as this example shows. This probably needs your attention, since each thread must receive exactly the value that was passed.

Code:
module threadMod
  c_external start_thread@    '__start_thread' (REF,REF):integer*4
  c_external wait_for_thread@ '__wait_for_thread' (VAL)
  c_external lock@            '__lock' (VAL)
  c_external unlock@          '__unlock' (VAL)
  integer,parameter::IO_LOCK = 42    !Any value. Your choice
  integer ::  nEmployedThreads   

 contains
  subroutine threadFunc(ithread)
  call lock@(IO_LOCK);  print*, 'Started thread # ', ithread; call unlock@(IO_LOCK)

   d=2.22
   do i=1,200000000/nEmployedThreads
     d=alog(exp(d))
   enddo

  call lock@(IO_LOCK);  print*, 'Ended thread # ', ithread;  call unlock@(IO_LOCK)

  end subroutine threadFunc
 end module threadMod
!-------------------------------------------------------------------------------
 program Threads2
 use threadMod
 integer hThread(8)

    TimeFor1Thread = 1.e20

100 print*,' Enter number of parallel threads <= 8. Run one thread few times first'
   read(*,*)   nEmployedThreads
   if(nEmployedThreads.lt.1.or.nEmployedThreads.gt.8) nEmployedThreads=4

   call clock@ (time_start)

   do i = 1, nEmployedThreads
      hThread(i) = start_thread@(threadFunc, i)
   enddo

 call wait_for_thread@(0)

   call clock@ (time_finish)

   time = time_finish-time_start
   time2= time * nEmployedThreads
   if(nEmployedThreads.eq.1) then
    if(TimeFor1Thread.gt.time) TimeFor1Thread=time
   endif

   print*, 'Elapsed time, total CPU time=', time, time2
   print*, 'SPEEDUP=', TimeFor1Thread/time

   goto 100

 end


Another observation: the speedups I get are much smaller than the number of processors, and far smaller than with your incredible and not yet fully explained .NET implementation. I get 6.6x speedups in .NET and only 2.6x with this x86 method on my 4-core/8-thread PC. The .NET code is shown at the beginning of this thread, but I modified it a bit to demonstrate the speedups easily. This suggests that some fine tuning of the new method is still needed.

Code:
! Compilation: ftn95 filename.f95 /clr /link /multi_threaded
   
   include <clearwin.ins>
   EXTERNAL runN
   parameter (nThrMax=8)
   common /threads_/nEmployedThreads, kThreadEnded(nThrMax)
   character*80 getenv@

   write(*,*) 'Processor ', getenv@('PROCESSOR_IDENTIFIER')
   READ(getenv@('NUMBER_OF_PROCESSORS'),*) n_processorsTotal
   write(*,*) ' Max number of threads=', n_processorsTotal

    TimeFor1Thread = 1.e20
 100 print*,' Enter number of parallel threads <= 8. Run one thread few times first'
   read(*,*)   nEmployedThreads
   if(nEmployedThreads.lt.1.or.nEmployedThreads.gt.8) nEmployedThreads=4

   call clock@ (time_start)

 !...set a flag of thread finished
   kThreadEnded(:)=1

   do i = 1, nEmployedThreads
      CALL CREATE_THREAD@(runN,i)
      call sleep1@(0.02)
   enddo

 !...wait till all threads finish
   do while (minval(kThreadEnded)==0)
     call sleep1@(0.1)
   enddo

   call clock@ (time_finish)

   time = time_finish-time_start
   time2= time * nEmployedThreads
   if(nEmployedThreads.eq.1) then
    if(TimeFor1Thread.gt.time) TimeFor1Thread=time
   endif
   print*, 'Elapsed time, total CPU time=', time, time2
   print*, 'SPEEDUP=', TimeFor1Thread/time


   goto 100

   END


Last edited by DanRRight on Thu Jul 04, 2013 4:26 pm; edited 3 times in total
DanRRight



Posted: Thu Jul 04, 2013 4:18 pm

Continuation of the NET code
Code:
 
!==============================================
   subroutine runN (iThrHandle)
   include <clearwin.ins>
   parameter (nThrMax=8)
   common /threads_/nEmployedThreads, kThreadEnded(nThrMax)

   kThreadEnded(iThrHandle) = 0

   lock;   print*,'Started thread # ', iThrHandle ; end lock

   d=2.22
   do i=1,200000000/nEmployedThreads
     d=alog(exp(d))
   enddo

   lock;     print*, 'Ended   thread # ', iThrHandle  ;     end lock
   kThreadEnded(iThrHandle)=1
   end
jalih



Posted: Thu Jul 04, 2013 6:22 pm

DanRRight wrote:
My first observation: the parameter passed to subroutine threadFunc somehow changes by one along the way, as this example shows. This probably needs your attention, since each thread must receive exactly the value that was passed.


Dan,

You can't use the loop counter directly as the thread parameter. Remember, you are passing a pointer to the parameter, not a value. As written, all your threads share the same pointer as their parameter, and they may also run in any order.

You should put all thread parameters into an array and use the loop counter as the array index:

Code:
hThread(i) = start_thread@(threadFunc, params(i))
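
A minimal sketch of why the shared loop counter misbehaves, using standard Fortran pointers as a stand-in for threads. This is an illustration under assumptions, not the FTN95 thread routines: no thread is started here, and the names arg_ref, byCounter, byArray and params are made up for the example. Each "thread" only keeps a reference to its argument and reads it later, as a real thread may do if it starts running after the loop has moved on.

```fortran
program shared_counter_demo
  implicit none

  ! Hypothetical helper type: holds a reference to an integer argument
  type :: arg_ref
     integer, pointer :: p => null()
  end type arg_ref

  type(arg_ref)   :: byCounter(4), byArray(4)
  integer, target :: i           ! loop counter, shared by all references
  integer, target :: params(4)   ! one separate parameter per "thread"
  integer         :: k

  do i = 1, 4
     params(i) = i
     byCounter(i)%p => i          ! every entry refers to the same variable i
     byArray(i)%p   => params(i)  ! every entry refers to its own array element
  end do

  ! The loop has finished, so i now holds 5.
  ! Reading through byCounter gives the current value of i for every entry;
  ! reading through byArray gives the value stored for that entry (1 to 4).
  do k = 1, 4
     print *, 'via counter:', byCounter(k)%p, '  via array:', byArray(k)%p
  end do
end program shared_counter_demo
```

With real threads the moment at which each thread reads its argument is not fixed, so the value seen through the shared counter can be off by one or more, while the array elements keep the intended values.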
DanRRight



Posted: Thu Jul 04, 2013 7:33 pm

Yes, I remember that from your approach. But I hoped Paul had worked out a way to avoid using LOC. Maybe it's just me, but I find that using a pointer adds a mind-melting twist to an otherwise simple idea, or at least raises the question of why it is needed, which can stop people from trying new things if it is not explained well.

P.S. Anyway, it looks like Paul's approach does not need LOC but still needs an array. Honestly, I need a fresh head to understand why. The modified code, which shows the thread numbers correctly, is here:
Code:
module threadMod
   c_external start_thread@    '__start_thread' (REF,REF):integer*4
   c_external wait_for_thread@ '__wait_for_thread' (VAL)
   c_external lock@            '__lock' (VAL)
   c_external unlock@          '__unlock' (VAL)
   integer,parameter::IO_LOCK = 42    !Any value. Your choice
   integer ::  nEmployedThreads   


  contains
   subroutine threadFunc(ithread)
   call lock@(IO_LOCK);  print*, 'Started thread # ', ithread; call unlock@(IO_LOCK)

    d=2.22
    do i=1,200000000/nEmployedThreads
      d=alog(exp(d))
    enddo

   call lock@(IO_LOCK);  print*, 'Ended thread # ', ithread;  call unlock@(IO_LOCK)

   end subroutine threadFunc
  end module threadMod
 !-------------------------------------------------------------------------------
  program Threads2
  use threadMod
  integer hThread(8)
   integer :: iThreadNo(8) = (/1,2,3,4,5,6,7,8/)

    print*, 'Wait ...testing pure no-thread case'
    call clock@ (time_start)
    nEmployedThreads = 1
    d=2.22
    do i=1,200000000/nEmployedThreads
      d=alog(exp(d))
    enddo

    call clock@ (time_finish)
    time = time_finish-time_start
    print*, 'Pure no-thread case time=', time

     TimeFor1Thread=time


 100 print*,' Enter number of parallel threads <= 8'
    read(*,*)   nEmployedThreads
    if(nEmployedThreads.lt.1.or.nEmployedThreads.gt.8) nEmployedThreads=4

    call clock@ (time_start)

    do i = 1, nEmployedThreads
       hThread(i) = start_thread@(threadFunc, iThreadNo(i) )
    enddo

  call wait_for_thread@(0)

    call clock@ (time_finish)

    time = time_finish-time_start
    time2= time * nEmployedThreads
    if(nEmployedThreads.eq.1) then
     if(TimeFor1Thread.gt.time) TimeFor1Thread=time
    endif

    print*, 'Elapsed time, total CPU time=', time, time2
    print*, 'SPEEDUP=', TimeFor1Thread/time

    goto 100

  end


Last edited by DanRRight on Fri Jul 05, 2013 11:30 am; edited 5 times in total
PaulLaidler
Site Admin


Posted: Thu Jul 04, 2013 7:34 pm

I have tested Jalih's matrix multiplication program using his routines and his DLL, and compared the results with those obtained using the new routines. The results are the same, and with two processors I get half the single-processor time, as expected.

There is very little to optimise. Start_thread@ has almost no overhead and simply calls CreateThread. Lock@ uses a Critical Section approach and, though this may not be optimal, it should have little effect on performance.

If there is clear evidence that .NET does much better, then I will have to get inside the .NET code and find out what it is doing.
DanRRight



Posted: Thu Jul 04, 2013 7:43 pm

Matrix multiplications are slower with large arrays and may have their own overheads due to limited memory bandwidth if the data spill out of the L1/L2/L3 caches, which may hide inefficiencies. I remember getting speedups with a wide spread, 2.5-4 on 4 cores, with Jalih's method. I have now implemented this method on another task, and although I get speedups on the order of 2-2.2 when I run it, I keep dreaming about the crazy .NET speedups that the example above showed. That example runs entirely inside the L1 cache.

We definitely have to do more testing. By the way, don't you see the same very large speedups in the .NET case on your computer?

Even this 10-line non-threaded code, extracted from the programs above, runs in 7.05 seconds in .NET mode as opposed to 9.01 seconds in the regular x86 case, an almost 30% speedup. Please check whether your mileage is the same.

Code:
 
    Program  NETisWayFaster
    call clock@ (time_start)
    nEmployedThreads = 1
    d=2.22
    do i=1,200000000/nEmployedThreads
      d=alog(exp(d))
    enddo

    call clock@ (time_finish)
    time = time_finish-time_start
    print*, 'Pure no-threaded case=', time

   end


I compiled this snippet in .NET mode with
ftn95 NETisWayFaster.f95 /clr /link /multi_threaded
and in x86 mode with
ftn95 NETisWayFaster.f95 /link /opt /P6

And everyone thought .NET was slow... .NET is a damn killing machine sleeping deep in the FTN95 internals :)
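
For reference, the two timings quoted above can be turned into a speedup factor and a run-time reduction with a few lines of arithmetic. This is only a sketch that plugs in the 9.01 s and 7.05 s figures from the post; the program and variable names are made up for illustration.

```fortran
program timing_ratio
  implicit none
  real, parameter :: t_x86 = 9.01   ! seconds, x86 build (figure from the post)
  real, parameter :: t_net = 7.05   ! seconds, .NET build (figure from the post)
  real :: speedup, time_saved

  speedup    = t_x86 / t_net             ! how many times faster the .NET run is
  time_saved = (t_x86 - t_net) / t_x86   ! fraction of the x86 run time saved

  print '(a,f5.2)',   'Speedup factor   = ', speedup
  print '(a,f5.1,a)', 'Run time reduced = ', 100.0*time_saved, ' %'
end program timing_ratio
```

The ratio 9.01/7.05 is about 1.28, which is the "almost 30%" quoted above; measured as saved run time, the same pair of numbers gives about 22%.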

Addition:
I tried to debug the multithreading code and everything seems to work fine (which is great about this method). The only problem is that wait_for_thread@ causes the SDBG debugger to open an assembler window (which goes away without problems after hitting F8 a few times). It would be nice if no assembler window appeared at all once the fine tuning and debugging of this method is complete, or at least if the assembler window appeared on top of the Fortran source window instead of closing it, as happens now.
All times are GMT + 1 Hour