Silverfrost Forums

Welcome to our forums

New Topic \"NET\"

5 May 2013 11:03 (Edited: 5 May 2013 12:06) #12159

Thanks jalih, this works. Pity lock of threads does not work though. Also, one more moment puzzled me when i made the same code parallelized as in the NET example in the other thread substituting your matrix multiplication DO loop with with the DO loop from here

https://forums.silverfrost.com/Forum/Topic/2239

that speedup is only 3+ times versus 7+ times in NET case. Your matrix case also was around 3.9 times. Any clues why? Here is the text for convenience and simplicity

module test
  INCLUDE <windows.ins>
  STDCALL attach_thread 'attach_thread' (REF, VAL):integer*4
  STDCALL wait_object 'wait_object' (VAL):integer*4
  STDCALL close_handle 'close_handle' (VAL):integer*4
  STDCALL create_mutex 'create_mutex' (VAL):integer*4
  STDCALL release_mutex 'release_mutex' (VAL):integer*4

  integer, parameter :: threads = 8
  real d

  contains
    subroutine thread(ptr)
      integer :: ptr, i, j, x

      d =2.22 
      nEmployedThreads = 8
      do i=1,200000000/nEmployedThreads 
       d=alog(exp(d)) 
      enddo
      
      call ExitThread(0)
    end subroutine thread

end module test


WINAPP
  use test
  integer :: i, j, x
  integer :: thandle(threads)
  integer :: nEmployedThreads 

  write(*,*) 'Single threaded :'
  call clock(start)
      d =2.22 
      do i=1,200000000
       d=alog(exp(d)) 
      enddo
  call clock(finish)
  write(*,*) 'Total time in seconds:', finish-start

! Calculate work unit size for threads and assign starting positions for each thread

  write(*,*) 'Multi threaded  with 8 threads:'
  call clock(start)

! Start threads
  do i=1,threads,1
    thandle(i) = attach_thread(thread,loc(i))
  end do

! Wait for threads to finish
  do i=1,threads,1
    x = wait_object(thandle(i))
  end do

  call clock(finish)
  write(*,*) 'Total time in seconds:', finish-start
  write(*,*) 'All done.'

END
5 May 2013 11:43 #12160

Quoted from DanRRight Thanks jalih, this works. Pity lock of threads does not work though

Actually locking of threads work fine. Compile and try example below as console application.

Only writing into Clearwin window don't work currently, that is probably because handling of window messages is blocked while waiting for threads. I will write you a non-blocking version and post update soon.

module test
  INCLUDE <windows.ins>
  STDCALL attach_thread 'attach_thread' (REF, VAL):integer*4
  STDCALL wait_object 'wait_object' (VAL):integer*4
  STDCALL close_handle 'close_handle' (VAL):integer*4
  STDCALL create_mutex 'create_mutex' (VAL):integer*4
  STDCALL release_mutex 'release_mutex' (VAL):integer*4

  
  integer :: hMutex
  integer :: values(8) = (/1,2,3,4,5,6,7,8/)

  contains
    subroutine thread(ptr)
      integer :: ptr, i

      i = wait_object(hMutex)
      write(*,*) 'Hello from thread', ptr
      i=release_mutex(hMutex)
      call ExitThread(0)
    end subroutine thread

end module test


program mt
  use test
  implicit none

  integer :: i, x
  integer :: thandle(8)
  
  hMutex = create_mutex(1)
  write(*,*) 'Multithreading test'
  x=release_mutex(hMutex)

  do i=1,8,1
    thandle(i) = attach_thread(thread,loc(values(i)))
  end do

  do i=1,8,1
    x = wait_object(thandle(i))
  end do
  
  x = close_handle(hMutex)
  
  write(*,*) 'All done. Bye!'
end program mt
5 May 2013 12:24 #12161

Quoted from jalih

Only writing into Clearwin window don't work currently, that is probably because handling of window messages is blocked while waiting for threads. I will write you a non-blocking version and post update soon.

Added non-blocking version of wait_object() and new example

Have fun!

5 May 2013 12:51 #12162

Jalih, nice attempts, but I think too short thread confuses very much, seems the lock actually does not work in last example.Or may be works but threads are not launched. Please (always!) include this or similar snippet and see its actual launch and work for a second or two in Task Manager

      d =2.22 
      nEmployedThreads = 8
      do i=1,200000000/nEmployedThreads 
       d=alog(exp(d)) 
      enddo

After that code will never confuse if it works or not. Also, what do you think about speedup numbers as in my previous post couple hours ago?

5 May 2013 3:09 #12163

Quoted from DanRRight Jalih, nice attempts, but I think too short thread confuses very much, seems the lock actually does not work in last example.

Starting threads and locking should work fine in my last non-blocking thread wait example. I updated my previous example with your suggested code and program runs as it supposed to. Just re-download and test it.

picture here

You can more clearly see that locking really works by adding your extra code inside mutex guarded region, so threads execute one by one.

5 May 2013 6:59 (Edited: 6 May 2013 6:38) #12166

All works, Jalih, which is very good news, thanks for the efforts The last thing left is to understand why exactly the same NET example is almost twice (which i'd say is kind of surprising but unreasonable on 4 CPU cores because CPU may have 8 integer units perfect for 8 threads but it has only 4 fp FP ones and so increasing amount of threads to 8 should not give a boost to NET example but it DOES) faster. With my 4core/8threads processor i get NET executed in 2.5+ seconds versus 5 seconds with this last your modification using WinAPI. Getting speedup around 3 times is kind of smallish, good to expect it close to 4 but it is not unreasonable. Will do more testing and i of course invite others to do that too. Here is modified code of Jalih to be exact as NET one for the purpose of comparisons.

! Compilation:
! FTN95 main.f95
! SLINK main.obj t.dll

module test
  INCLUDE <windows.ins>
  STDCALL attach_thread 'attach_thread' (REF, VAL):integer*4
  STDCALL wait_object 'wait_object' (VAL):integer*4
  STDCALL check_object 'check_object' (VAL):integer*4
  STDCALL close_handle 'close_handle' (VAL):integer*4
  STDCALL create_mutex 'create_mutex' (VAL):integer*4
  STDCALL release_mutex 'release_mutex' (VAL):integer*4

  
  integer :: hMutex
  integer :: values(8) = (/1,2,3,4,5,6,7,8/)
  integer :: nEmployedThreads

  contains
    subroutine thread(ptr)
      integer :: ptr, i
      real d

      i = wait_object(hMutex)
      write(*,*) 'Starting calculation in thread', ptr
      i=release_mutex(hMutex)
      
      d = 2.22
      do i=i,200000000/nEmployedThreads,1
        d=alog(exp(d))
      end do
      
      call ExitThread(0)
    end subroutine thread

end module test

winapp
program mt
  use test
  implicit none

  integer :: i, x
  integer :: thandle(8)
  real d, finish, start
  

  hMutex = create_mutex(1)
  write(*,*) 'Multithreading test with up to 8 threads:'
  x=release_mutex(hMutex)

1 print*,' Enter number of parallel threads <= 8'
  read(*,*)   nEmployedThreads
  if(nEmployedThreads.lt.1.or.nEmployedThreads.gt.8) nEmployedThreads=4 

  call clock(start)

  do i=1,nEmployedThreads,1
    thandle(i) = attach_thread(thread,loc(values(i)))
  end do

  do i=1,nEmployedThreads,1
10 call temporary_yield@()
    x = check_object(thandle(i))
    if (x == 0) goto 10
  end do
  
  x = close_handle(hMutex)
  
  call clock(finish)
  write(*,*) 'Total time in seconds:', finish-start
  goto 1

end program mt

Questions

  1. I just noticed that even not running but waiting user input the code grabs CPU time. Why it is that?
  2. Do we need this lock in main program (versus in threads where we definitely need a lock)? Threads in main program should never collide and conflict, isn't it? hMutex = create_mutex(1) write(,) 'Multithreading test with up to 8 threads:' x=release_mutex(hMutex)
  3. Why BASIC was used in DLL to get WinAPI functions, aren't these functions available straight from FTN95?
6 May 2013 3:35 #12167

Quoted from DanRRight

Questions

  1. I just noticed that even not running but waiting user input the code grabs CPU time. Why it is that?

Program using ClearWin window must process messages, so main thread in this case can't just sleep waiting for threads to finish. If you would just make a console application, then the wait_object() could be used instead of polling with check_event() and situation would be better.

  1. Do we need this lock in main program (versus in threads where we definitely need a lock)? Threads in main program should never collide and conflict, isn't it? hMutex = create_mutex(1) write(,) 'Multithreading test with up to 8 threads:' x=release_mutex(hMutex)

You are right, the locking is only necessary after the threads have been attached (this is also necessary in main program). You can change the mutex creation line into hMutex = create_mutex(0) and remove the x = release_mutex(hMutex).

  1. Why BASIC was for used in DLL to get WinAPI functions, aren't these functions available straight from FTN95?

It's a personal preference. FTN95 don't have as complete header definitions as some other compilers do and I like the MiniBASIC's syntax. It's more readable than C and offers the same functionality.

6 May 2013 7:19 #12168

A bit more polishing and we are done. What you have been doing is very important. Single processor Fortran is almost dead. But the autoparallelization techniques are not in Fortran standard yet. So the only more or less portable approach at least within Windows is to use WinAPI (since the other way by using FTN95 for NET is not yet polished for extreme uses and OpenMP is not compatible with this compiler). I expect it should work with up to few dozen of threads OK on the good multi-core PC.

l'm OK to use third party libraries, but suspect that people will be reluctant to touch parallelization/multithreading with third party DLLs, so please think if possible to change with time BASIC's DLL to FTN95 existing definitions, with your knowledge of multiple languages that would be not that hard. That part was always not the best with this compiler. I know that Intel compiler has Fortran-friendly definitions of WinAPI functions, would be great if FTN95 developers adopted it too.

Is processing messages the reason of one core is not doing the calculations and so we see only 3 times speedup instead of at least 3.95 on 4 cores? (i lost your non-windows version, what is best to change in this code above to try and check this idea?)

May be Paul also will look at this and suggest something to improve this method and even include into the package, kind of standartizing it.

6 May 2013 6:01 #12171

I am keeping an eye on this conversation with the hope that I can include something in the FTN95 Win32 library.

6 May 2013 6:39 #12172

I made some small changes into my wrapper functions. Now check_object(hObject) returns directly what WaitForSingleObject(hObject, 0) returns, so now it returns 0 instead of 1 if the thread is finished.

Also because Dan don't seem to like basic, I re-wrote it in assembler. It makes winapi call directly using import library label, just for fun. DLL, sample for console application and application using ClearWin window are included and source for the DLL.

Available here

6 May 2013 10:28 #12174

Thanks jalih. I ran both these examples you provided and did not notice any significant differences, both give speedup around 3.15 finishing MT part in 4.75 s versus 15 sec in single thread. So Clearwin window despite grabbing the whole thread CPU time does not interfere with calculations. I hope you will find the reason for that small annoyance. It probably has the same nature as sleep@ versus sleep1@, one grabs CPU all the time and another not, Paul solved this few years ago. Somehow Clearwin+ itself does not take any CPU time, so developers will definitely help here.

Yes, assembler is probably better then BASIC. I forgot this kind of tricks from the times of FTN77 but it probably even can be used straight inside FTN95 Fortran text.

So being console application does not speeds up the run. 3.15x is also very very good, who would argue? Everyone with 4cores got 3 PCs as a present from jalih today. That alone makes it the major new feature of the decade. I am starting my 45236653-th attempt to parallelize the code today. Only one was successful with parallel algebra library from Equation dot com i use for more then decade. They also had parallel language MTASK but it did not have thread locks, so it was useless.

Usually i do not care if computers differ in performance by less then factor of 1.5-2. But the NET still somehow completes multithreading run in 2.3s. How FTN95 developers managed to achieve this true miracle 6.5x speedup is beyond my expertise. I bet no other classic parallel method like OpenMP will do that with FP calculations without combining regular FP and SSE or GPU like CUDA at the same time since i suppose FTN95 is not doing that. Doesn't NET also use WinAPI for multithreading? Developers definitely should not hide this but make it available with just the WinAPI approach too for their own and FTN95 glory. Can hacking squad here disassemble NET MulthiThreading code? jalih, davidb, John, others? 😃

7 May 2013 8:12 #12176

Dan and Jalih,

I have not been closely following this post, but from my scanning of the discussion, you now have a way of controling multiple threads in a FTN95_win_32 environment for a non-trivial calculation. Is it possible to provide a FTN95-Win32 sample that can use a multi-threaded aproach to multiply 2 large matrices ?

The two options to demonstrate could be: [C] = [A]' x [B] where the coefficients of C are dot products of the columns of A and B, or [C] = [A] x [B] where the coefficients of C are product of A_row and B_column.

Matrix multiplication is a good non-trivial example, as there can be no clashing between the threads and each thread can be allocated a partition of C, while sharing the memory of A and B. Assuming the number of threads could be 2 or 3 for I5 or 6 or 7 for I7, [C] could be partitioned by sets of columns.

Am I correct in my impression of what you have achieved is runing in an FTN95_win32 environment, and hopefilly not requiring C++ or Basic ?

I am not clear on the requirement for .NET or how memory can be shared between threads (arrays A and B), or allocated locally to a single thread (temporary rows of A or local variables). These types of memory management are essential.

Once this solution is demonstrated, it would be good to get some performance times for various sizes of A and B to demonstrate the thread management overhead. I would be pleased to develop some benchmark of this and provide a documented FTN95 example for others to use.

In previous OpenMP attempts I have tried, I have never got a good solution, as it often fails with either thread management overheads, or (I am told) cache clashes for addressing A and B. It would be good to try this thread managed option and see what is happening and hopefully document a multi-thread solution that others could adapt.

Hopefully I have understood what you have achieved. I would welcome your assistance and look forward to your comments.

John

7 May 2013 8:14 #12177

Hi Dan,

I added support for critical section object, this should give a little bit faster locking times than with mutex object.

Updated DLL with critical section usage example

7 May 2013 8:31 #12178

Quoted from JohnCampbell Dan and Jalih, Is it possible to provide a FTN95-Win32 sample that can use a multi-threaded aproach to multiply 2 large matrices ?

How about:

module test
  INCLUDE <windows.ins>
  STDCALL attach_thread 'attach_thread' (REF, VAL):integer*4
  STDCALL wait_object 'wait_object' (VAL):integer*4
  STDCALL check_object 'check_object' (VAL):integer*4
  STDCALL close_handle 'close_handle' (VAL):integer*4
  STDCALL create_mutex 'create_mutex' (VAL):integer*4
  STDCALL release_mutex 'release_mutex' (VAL):integer*4
  STDCALL init_criticalsection 'init_criticalsection' (REF)
  STDCALL enter_criticalsection 'enter_criticalsection' (REF)
  STDCALL leave_criticalsection 'leave_criticalsection' (REF)
  STDCALL delete_criticalsection 'delete_criticalsection' (REF)

  integer, parameter :: msize = 1000
  integer, parameter :: threads = 8

  integer :: work_unit
  integer, allocatable :: start_pos(:)
  real, dimension(:,:), allocatable :: A, B, C

  contains
    subroutine thread(ptr)
      integer :: ptr, i, j

      do i = ptr,ptr+work_unit-1,1
        do j = 1,msize,1
          C(i,j) = sum(a(i,:) * b(:,j))
        end do
      end do
      
      call ExitThread(0)
    end subroutine thread

end module test


winapp
program mt
  use test
  implicit none

  integer :: i, j, x
  integer :: thandle(threads)
  real(SELECTED_REAL_KIND(6,37)) :: start,finish

  allocate(A(msize,msize))
  allocate(B(msize,msize))
  allocate(C(msize,msize))

  ! put test data into matrices
  do i = 1,msize,1
    do j = 1,msize,1
      A(i,j) = 9999 * random@();
      B(i,j) = 9999 * random@();
    end do
  end do

  write(*,*) 'Single threaded 1000x1000 sized square matrix multiply:'
  call clock(start)
  
  do i = 1,msize,1
    do j = 1,msize,1
      C(i,j) = sum(a(i,:) * b(:,j))
    end do
  end do
  
  call clock(finish)
  write(*,*) 'Total time in seconds:', finish-start

  ! Calculate work unit size for threads and assign starting positions for each thread
  work_unit = msize/threads   
  allocate(start_pos(threads))

  start_pos(1) = 1
  do i = 2, threads, 1
    start_pos(i) = (i - 1) * work_unit + 1
  end do

  write(*,*) 'Multi threaded 1000x1000 sized square matrix multiply with 8 threads:'
  
  call clock(start)

  ! Start threads
  do i=1,threads,1
    thandle(i) = attach_thread(thread,loc(start_pos(i)))
  end do

  ! Wait for threads to finish
  do i=1,8,1
    10 call temporary_yield@()
    x = check_object(thandle(i))
    if (x /= 0) goto 10
  end do

  call clock(finish)
  write(*,*) 'Total time in seconds:', finish-start
  
  write(*,*) 'All done. Bye!'
  deallocate(A,B,C,start_pos)
end program mt

Am I correct in my impression of what you have achieved is runing in an FTN95_win32 environment, and hopefilly not requiring C++ or Basic ?

It's just a small FTN95 callable DLL. It's now written in assembler and compiled using MiniBASIC compiler.

7 May 2013 11:06 #12179

Jalih I see some speedup but it is within the noise ( 3.26 vs 3.15 or such). Is this strictly console method (criticalsection)?

Another thought - how about also to try same matrix method above with NET just to get more information about both methods? I will try when have time (i am under killing deadlines and am dead if not succeed to parallelize the code in next two days and not get results due to expected speedup), or may be John will help as most interested in matrix party 😃. NET may give us clue about the ultimate limits of possible

John, If your 3 matrices are larger combined then L3 cache size, which means larger then approximately 1000x1000 for real*8 and 24MB cache of some good processors then the trick is to partition by such chunks that completely fit into the cache. If execution of this chunk will be faster then the order of 1/100 second then you will not get any gain due to overhead. If forget about caches and all that and just go with larger and larger sizes then this method should still work giving you speedup. For example here are speedups for matrix sizes

100 2.5 250 2.96 500 3.07 750 3.14 1000 3.75 1250 4.15 1500 3.73 1750 3.52 2000 3.48 3000 4.02

Not bad at all, isn't it? Thanks Jalih again. But of course all also depends on specific cases, here processor is still doing a lot of work. Just the huge matrix multiplication may give you smaller speedup due to all mentioned above

7 May 2013 6:24 #12181

Quoted from DanRRight Is this strictly console method (criticalsection)?

No, it works in winapp as well. It provides synchronization similar to that provided by a mutex object, except that a critical section can be used only by the threads of a single process.

I will add app_mutex() function, that can be used to allow only one instance of the application to be running at the same time.

8 May 2013 1:05 #12184

Here is matrix code for NET. Since NET seems does not like allocatables or modules (no time to find why or where is my error), i wrote the code without them. Run it with one single thread first to get the statistics - minimum time will be taken as a single thread execution time to get speedup number (which is min single threaded time divided by min N-threaded time).

The multithreading capabilities of NET though are truly excellent, with 4cores/8 threads i get speedups around 3.95-4.05. The fact that we do not see numbers larger then 4 like on other NET example with truly independent threads is not unreasonable. But also possibly tells that threads in this matrix example may collide for read access doing sum
C(i,j) = sum(a(i,:) * b(:,j)) Other explanations are also possible

! Compilation: ftn95 amatr.f95 /clr /link /multi_threaded 
! Use this code at your own risk 
!
! Run first just one thread several times to get statistics - 
! min time will be taken as single thread excution time
!
  include <clearwin.ins> 
  EXTERNAL runN 
  parameter (nThrMax=8, msize=1000) 
  common /threads_/nEmployedThreads, iFlagThreadEnded(nThrMax) 
  common /matr_/ a(msize,msize),b(msize,msize),c(msize,msize)
  character*80 getenv@ 

  write(*,*) 'Processor ', getenv@('PROCESSOR_IDENTIFIER') 
  READ(getenv@('NUMBER_OF_PROCESSORS'),*) n_processorsTotal 
  write(*,*) ' Max number of threads=', n_processorsTotal 

!... put test data into matrices 
   do i = 1,msize,1 
     do j = 1,msize,1 
       A(i,j) = 9999 * random@(); 
       B(i,j) = 9999 * random@(); 
     end do 
   end do 

   timeST=10000

1 print*,' Enter number of parallel threads <= 8' 
  read(*,*)   nEmployedThreads 

  call clock@ (time_start) 

!...set a flag of thread finished 
  iFlagThreadEnded(:)=1 
  iFlagThreadEnded(1)=0 ! needed because threads are initialized too fast

  do i = 1, nEmployedThreads 
     CALL CREATE_THREAD@(runN, i) 
  enddo 

!...wait till all threads finish 
  do while (minval(iFlagThreadEnded)==0) 
    call sleep1@(0.1) 
  enddo 

  call clock@ (time_finish) 

  time = time_finish-time_start 
  time2= time * nEmployedThreads 
  if(time2.lt.timeST.and.nEmployedThreads.eq.1) timeST=time2
  print*, 'Runime, total CPU time, Speedup=', time, time2, timeST/time 

  goto 1 

  END 

 ======================================================= 
  subroutine runN (iThrHandle) 
  include <clearwin.ins> 
  parameter (nThrMax=8, msize=1000) 
  common /threads_/nEmployedThreads, iFlagThreadEnded(nThrMax) 
  common /matr_/ a(msize,msize),b(msize,msize),c(msize,msize)
  integer istart_pos(nThrMax)

   ithr = iThrHandle  
   iFlagThreadEnded(ithr) = 0 

   iwork_unit = msize/nEmployedThreads    
   istart_pos(1) = 1 
   do i = 2, nEmployedThreads, 1 
     istart_pos(i) = (i - 1) * iwork_unit + 1 
   end do 

!  lock;   print*,'Started thread # ', ithr ;   end lock 

       do i = istart_pos(ithr),istart_pos(ithr)+iwork_unit-1,1 
         do j = 1,msize,1 
           C(i,j) = sum(a(i,:) * b(:,j)) 
         end do 
       end do 

  iFlagThreadEnded(ithr)=1 
  end
8 May 2013 1:15 #12185

Jalih,

Thanks for the example and the link to the DLL. I hope to have some time in about 2 weeks and will look at this.

My plan is to:

  1. declare matrices A, B and A_tran
  2. Produce C = A x B and Ct = A_tran x B by both: a single thread, a multi thread approach, and a single thread using SSE2 instructions that David has supplied.
  3. compare these for accuracy and elapsed time for a range of matrix sizes, say 10k, 100k, 1mb, 10mb and 100mb

I will use ALLOCATE to scale up the matrix sizes. For A x B, I will create a temporary vector A_row to allow dot_product.

The multi thread approach would group by rows. I am assuming I would define 3 threads for my Core i5, each processing every 3rd row of A and C. I need to understand your thread calls so that the number of threads can vary. Potentially I can use the SSE instructions in each thread to combine mult-thread and vector instructions using FTN95.

I'll hopefully get back with a link to the code and run trace when I succeed.

John

PS : Dan, if this all works, then I will try to adapt my skyline solver using 3 threads and process 3 columns at a time, as equation.com

8 May 2013 4:05 #12186

John, I think this could be interesting to you. Didn't i tell you the most amazing tale/truth i discovered myself and even the author of parallel solver has no clue what and why this happen?

Doing tests i found that the skyline solver of Equation com running on AMD processors gives absolutely crazy speedups so that the worst cheapest multicore AMD laptop outperforms the best Intel PC by the factor of 2? Try it, may be it will help your code

8 May 2013 4:45 #12187

I added app_mutex() function into my DLL. You can use it as an easy way to allow only one instance of the application. I made module mt_dll, where I put all the STDCALL definitions.

My previous download link is updated...

example usage of the app_mutex() function, put somewhere at the start of program code:

  x = app_mutex()
  if (x == 0) then
    write(*,*) 'Only single instance of this application is allowed'
    stop
  end if
Please login to reply.