 |
forums.silverfrost.com Welcome to the Silverfrost forums
DanRRight
Joined: 10 Mar 2008 Posts: 2923 Location: South Pole, Antarctica
|
Posted: Thu Jul 04, 2013 3:51 pm Post subject: |
|
|
Thanks, Paul, for the effort; this implementation makes parallelization as easy as 2x2.
My first observation: the parameter passed to subroutine threadFunc somehow changes by one along the way, as you can see in this example. This probably needs your attention; it is important that each thread receives exactly the value that was passed.
Code: | module threadMod
c_external start_thread@ '__start_thread' (REF,REF):integer*4
c_external wait_for_thread@ '__wait_for_thread' (VAL)
c_external lock@ '__lock' (VAL)
c_external unlock@ '__unlock' (VAL)
integer,parameter::IO_LOCK = 42 !Any value. Your choice
integer :: nEmployedThreads
contains
subroutine threadFunc(ithread)
call lock@(IO_LOCK); print*, 'Started thread # ', ithread; call unlock@(IO_LOCK)
d=2.22
do i=1,200000000/nEmployedThreads
d=alog(exp(d))
enddo
call lock@(IO_LOCK); print*, 'Ended thread # ', ithread; call unlock@(IO_LOCK)
end subroutine threadFunc
end module threadMod
!-------------------------------------------------------------------------------
program Threads2
use threadMod
integer hThread(8)
TimeFor1Thread = 1.e20
100 print*,' Enter number of parallel threads <= 8. Run one thread few times first'
read(*,*) nEmployedThreads
if(nEmployedThreads.lt.1.or.nEmployedThreads.gt.8) nEmployedThreads=4
call clock@ (time_start)
do i = 1, nEmployedThreads
hThread(i) = start_thread@(threadFunc, i)
enddo
call wait_for_thread@(0)
call clock@ (time_finish)
time = time_finish-time_start
time2= time * nEmployedThreads
if(nEmployedThreads.eq.1) then
if(TimeFor1Thread.gt.time) TimeFor1Thread=time
endif
print*, 'Elapsed time, total CPU time=', time, time2
print*, 'SPEEDUP=', TimeFor1Thread/time
goto 100
end |
Another observation: the speedups I get are much smaller than the number of processors, and far smaller than with your incredible and not yet fully explained .NET implementation. I get a 6.6x speedup in .NET and only 2.6x with this x86 method on my 4-core/8-thread PC. The .NET code is shown at the beginning of this thread, but I modified it a bit to demonstrate the speedups more easily. This suggests that the new method still needs some fine-tuning.
Code: | ! Compilation: ftn95 filename.f95 /clr /link /multi_threaded
include <clearwin.ins>
EXTERNAL runN
parameter (nThrMax=8)
common /threads_/nEmployedThreads, kThreadEnded(nThrMax)
character*80 getenv@
write(*,*) 'Processor ', getenv@('PROCESSOR_IDENTIFIER')
READ(getenv@('NUMBER_OF_PROCESSORS'),*) n_processorsTotal
write(*,*) ' Max number of threads=', n_processorsTotal
TimeFor1Thread = 1.e20
100 print*,' Enter number of parallel threads <= 8. Run one thread few times first'
read(*,*) nEmployedThreads
if(nEmployedThreads.lt.1.or.nEmployedThreads.gt.8) nEmployedThreads=4
call clock@ (time_start)
!...set a flag of thread finished
kThreadEnded(:)=1
do i = 1, nEmployedThreads
CALL CREATE_THREAD@(runN,i)
call sleep1@(0.02)
enddo
!...wait till all threads finish
do while (minval(kThreadEnded)==0)
call sleep1@(0.1)
enddo
call clock@ (time_finish)
time = time_finish-time_start
time2= time * nEmployedThreads
if(nEmployedThreads.eq.1) then
if(TimeFor1Thread.gt.time) TimeFor1Thread=time
endif
print*, 'Elapsed time, total CPU time=', time, time2
print*, 'SPEEDUP=', TimeFor1Thread/time
goto 100
END |
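To put the two speedups quoted above on the same scale, they can be divided by the number of execution units. This is only a small sketch in standard Fortran; the module and function names are made up for illustration, and the only inputs are the figures from this post (6.6x and 2.6x on a 4-core/8-thread PC).

```fortran
! Sketch: parallel efficiency = speedup / number of execution units.
! Names here are hypothetical; the numbers are the ones quoted in the post.
module efficiency_mod
  implicit none
contains
  pure function efficiency(speedup, n_units) result(e)
    real, intent(in)    :: speedup   ! measured speedup over one thread
    integer, intent(in) :: n_units   ! cores or hardware threads
    real :: e
    e = speedup / real(n_units)
  end function efficiency
end module efficiency_mod

program show_efficiency
  use efficiency_mod
  implicit none
  real, parameter :: s_net = 6.6, s_x86 = 2.6   ! speedups quoted above
  print '(a,f6.3)', '.NET per core   (4): ', efficiency(s_net, 4)
  print '(a,f6.3)', '.NET per thread (8): ', efficiency(s_net, 8)
  print '(a,f6.3)', 'x86  per core   (4): ', efficiency(s_x86, 4)
  print '(a,f6.3)', 'x86  per thread (8): ', efficiency(s_x86, 8)
end program show_efficiency
```

Per physical core this gives 1.65 for .NET and 0.65 for x86; per hardware thread, 0.825 and 0.325. A value above 1 per core suggests that the extra hardware threads contribute, or that the single-thread baseline was not the best case.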
Last edited by DanRRight on Thu Jul 04, 2013 4:26 pm; edited 3 times in total |
|
|
 |
DanRRight
Joined: 10 Mar 2008 Posts: 2923 Location: South Pole, Antarctica
|
Posted: Thu Jul 04, 2013 4:18 pm Post subject: |
|
|
Continuation of the NET code
Code: |
!==============================================
subroutine runN (iThrHandle)
include <clearwin.ins>
parameter (nThrMax=8)
common /threads_/nEmployedThreads, kThreadEnded(nThrMax)
kThreadEnded(iThrHandle) = 0
lock; print*,'Started thread # ', iThrHandle ; end lock
d=2.22
do i=1,200000000/nEmployedThreads
d=alog(exp(d))
enddo
lock; print*, 'Ended thread # ', iThrHandle ; end lock
kThreadEnded(iThrHandle)=1
end |
|
|
|
 |
jalih
Joined: 30 Jul 2012 Posts: 196
|
Posted: Thu Jul 04, 2013 6:22 pm Post subject: Re: |
|
|
DanRRight wrote: | My first observation: the parameter passed to subroutine threadFunc somehow changes by one along the way, as you can see in this example. This probably needs your attention; it is important that each thread receives exactly the value that was passed. |
Dan,
You can't use the loop counter directly as the thread parameter. Remember, you are passing a pointer to the parameter, not a value. As written, all your threads share the same pointer as their parameter, so each one sees whatever the counter holds at the moment it reads it, which is why the printed number can be off by one. The threads may also run in any order.
You should put all the thread parameters into an array and use the loop counter as the array index:
Code: | hThread(i) = start_thread@(threadFunc, params(i)) |
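The effect can be imitated without any threads. The following is only a sketch in standard Fortran, not the FTN95 thread routines: it stores the address of each argument in a pointer and reads it back after the loop has finished, which is roughly what a late-starting thread does. All names (alias_demo, arg_ref, share_counter, own_elements) are hypothetical.

```fortran
! Sketch: "pass an address now, read the value later", without threads.
module alias_demo
  implicit none
  integer, parameter :: n = 4           ! hypothetical number of threads
  type :: arg_ref
    integer, pointer :: p => null()     ! stands in for the address a thread gets
  end type arg_ref
contains
  ! Every "thread" receives the address of the same counter.
  subroutine share_counter(seen)
    integer, intent(out) :: seen(n)
    type(arg_ref) :: refs(n)
    integer, target :: counter          ! mimics the DO variable
    integer :: k
    counter = 1
    do k = 1, n
      refs(k)%p => counter              ! same address every time
      counter = counter + 1             ! ends at n+1, like a DO variable
    end do
    do k = 1, n
      seen(k) = refs(k)%p               ! read late: all see the final value
    end do
  end subroutine share_counter

  ! Every "thread" receives the address of its own array element.
  subroutine own_elements(seen)
    integer, intent(out) :: seen(n)
    type(arg_ref) :: refs(n)
    integer, target :: params(n)
    integer :: k
    do k = 1, n
      params(k) = k
      refs(k)%p => params(k)            ! distinct address per element
    end do
    do k = 1, n
      seen(k) = refs(k)%p               ! each one still sees its own value
    end do
  end subroutine own_elements
end module alias_demo

program show_alias
  use alias_demo
  implicit none
  integer :: seen(n)
  call share_counter(seen)
  print *, 'shared counter :', seen
  call own_elements(seen)
  print *, 'array elements :', seen
end program show_alias
```

With the shared counter every entry comes back as n+1; with the array each entry keeps its own index. With real threads the value seen through the shared counter depends on timing, so it will not always be the final one.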
|
|
|
 |
DanRRight
Joined: 10 Mar 2008 Posts: 2923 Location: South Pole, Antarctica
|
Posted: Thu Jul 04, 2013 7:33 pm Post subject: |
|
|
Yes, I remember that from your approach, but I hoped Paul had found a way to avoid using LOC. Maybe it's just me, but I find that using a pointer adds a mind-melting twist to an otherwise simple idea, or at least raises the question "why?", which can put people off trying new things if it is not explained well.
P.S. Anyway, it looks like Paul's approach does not need LOC but still needs an array. Honestly, I need a fresh head to understand why. The modified code, which reports the thread numbers correctly, is here:
Code: | module threadMod
c_external start_thread@ '__start_thread' (REF,REF):integer*4
c_external wait_for_thread@ '__wait_for_thread' (VAL)
c_external lock@ '__lock' (VAL)
c_external unlock@ '__unlock' (VAL)
integer,parameter::IO_LOCK = 42 !Any value. Your choice
integer :: nEmployedThreads
contains
subroutine threadFunc(ithread)
call lock@(IO_LOCK); print*, 'Started thread # ', ithread; call unlock@(IO_LOCK)
d=2.22
do i=1,200000000/nEmployedThreads
d=alog(exp(d))
enddo
call lock@(IO_LOCK); print*, 'Ended thread # ', ithread; call unlock@(IO_LOCK)
end subroutine threadFunc
end module threadMod
!-------------------------------------------------------------------------------
program Threads2
use threadMod
integer hThread(8)
integer :: iThreadNo(8) = (/1,2,3,4,5,6,7,8/)
print*, 'Wait ...testing pure no-thread case'
call clock@ (time_start)
nEmployedThreads = 1
d=2.22
do i=1,200000000/nEmployedThreads
d=alog(exp(d))
enddo
call clock@ (time_finish)
time = time_finish-time_start
print*, 'Pure no-thread case time=', time
TimeFor1Thread=time
100 print*,' Enter number of parallel threads <= 8'
read(*,*) nEmployedThreads
if(nEmployedThreads.lt.1.or.nEmployedThreads.gt.8) nEmployedThreads=4
call clock@ (time_start)
do i = 1, nEmployedThreads
hThread(i) = start_thread@(threadFunc, iThreadNo(i) )
enddo
call wait_for_thread@(0)
call clock@ (time_finish)
time = time_finish-time_start
time2= time * nEmployedThreads
if(nEmployedThreads.eq.1) then
if(TimeFor1Thread.gt.time) TimeFor1Thread=time
endif
print*, 'Elapsed time, total CPU time=', time, time2
print*, 'SPEEDUP=', TimeFor1Thread/time
goto 100
end |
Last edited by DanRRight on Fri Jul 05, 2013 11:30 am; edited 5 times in total |
|
|
 |
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8210 Location: Salford, UK
|
Posted: Thu Jul 04, 2013 7:34 pm Post subject: |
|
|
I have tested Jalih's matrix multiplication program using his routines and his DLL and compared the results with those obtained using the new routines. The results are the same, and using two processors I get half the single-processor time, as expected.
There is very little to optimise. Start_thread@ has almost no overhead and just calls on CreateThread. Lock@ uses a Critical Section approach and, though this may not be optimal, it should have little effect on the performance.
If there is clear evidence that .NET does much better then I will have to get inside the .NET code and find out what it is doing. |
|
|
 |
DanRRight
Joined: 10 Mar 2008 Posts: 2923 Location: South Pole, Antarctica
|
Posted: Thu Jul 04, 2013 7:43 pm Post subject: |
|
|
Matrix multiplications are slower with large arrays and may carry their own overhead from limited memory bandwidth once the data no longer fits in the L1/L2/L3 caches, which can hide inefficiencies in the threading itself. I remember getting speedups with a wide spread, 2.5x to 4x, on 4 cores with Jalih's method. I have now implemented this method on another task, and although I get speedups of the order of 2x to 2.2x, I keep dreaming about the crazy .NET speedups that the example above showed. That example runs entirely inside the L1 cache.
We definitely have to do more testing. By the way, don't you see the same very large speedups in the .NET case on your computer?
Even this non-threaded 10-line code, extracted from the codes above, takes 7.05 seconds in .NET mode as opposed to 9.01 seconds in the regular x86 case, a speedup of almost 30%. Please check whether your mileage is the same.
Code: |
Program NETisWayFaster
call clock@ (time_start)
nEmployedThreads = 1
d=2.22
do i=1,200000000/nEmployedThreads
d=alog(exp(d))
enddo
call clock@ (time_finish)
time = time_finish-time_start
print*, 'Pure no-threaded case=', time
end
|
I compiled this snippet in NET mode
ftn95 NETisWayFaster.f95 /clr /link /multi_threaded
and in x86 one
ftn95 NETisWayFaster.f95 /link /opt /P6
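For reference, the two timings quoted above (9.01 s for x86, 7.05 s for .NET) can be compared in two ways: as a ratio of run times and as a percentage of run time saved. This is a small sketch in standard Fortran; the module and function names are made up for illustration.

```fortran
! Sketch: compare two run times as a ratio and as a percentage saved.
! Names are hypothetical; the timings are the ones quoted in the post.
module timing_ratio
  implicit none
contains
  ! How many times faster the new run is than the baseline.
  pure function speedup(t_base, t_new) result(s)
    real, intent(in) :: t_base, t_new   ! seconds
    real :: s
    s = t_base / t_new
  end function speedup

  ! Percentage of the baseline run time that was saved.
  pure function time_saved_percent(t_base, t_new) result(p)
    real, intent(in) :: t_base, t_new   ! seconds
    real :: p
    p = 100.0 * (t_base - t_new) / t_base
  end function time_saved_percent
end module timing_ratio

program compare_runs
  use timing_ratio
  implicit none
  real, parameter :: t_x86 = 9.01, t_net = 7.05   ! seconds, as quoted
  print '(a,f6.3)', 'x86 time / .NET time = ', speedup(t_x86, t_net)
  print '(a,f6.1)', 'run time saved (%)   = ', time_saved_percent(t_x86, t_net)
end program compare_runs
```

The ratio comes out at about 1.28, so the .NET build is roughly 28% faster, while the run time itself is about 22% shorter. Both describe the same measurement.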
And everyone thought .NET was slow... .NET is a damn killing machine sleeping deep in the FTN95 internals.
Addition:
I tried debugging the multithreading code and everything seems to work fine (which is a great thing about this method). The only problem is that wait_for_thread@ causes the SDBG debugger to open an assembler window (which goes away without trouble after hitting F8 a few times). It would be nice if no assembler window appeared at all once the fine-tuning and debugging of this method is complete, or at least if the assembler window opened on top of the Fortran source window rather than replacing it, as it does now. |
|
|
 |