forums.silverfrost.com Forum Index forums.silverfrost.com
Welcome to the Silverfrost forums
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

New Topic "NET"
Goto page Previous  1, 2, 3, 4  Next
 
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> Suggestions
View previous topic :: View next topic  
Author Message
DanRRight



Joined: 10 Mar 2008
Posts: 1645
Location: South Pole, Antarctica

PostPosted: Mon May 06, 2013 11:28 pm    Post subject: Reply with quote

Thanks jalih. I ran both these examples you provided and did not notice any significant differences, both give speedup around 3.15 finishing MT part in 4.75 s versus 15 sec in single thread. So Clearwin window despite grabbing the whole thread CPU time does not interfere with calculations. I hope you will find the reason for that small annoyance. It probably has the same nature as sleep@ versus sleep1@, one grabs CPU all the time and another not, Paul solved this few years ago. Somehow Clearwin+ itself does not take any CPU time, so developers will definitely help here.

Yes, assembler is probably better then BASIC. I forgot this kind of tricks from the times of FTN77 but it probably even can be used straight inside FTN95 Fortran text.

So being console application does not speeds up the run. 3.15x is also very very good, who would argue? Everyone with 4cores got 3 PCs as a present from jalih today. That alone makes it the major new feature of the decade. I am starting my 45236653-th attempt to parallelize the code today. Only one was successful with parallel algebra library from Equation dot com i use for more then decade. They also had parallel language MTASK but it did not have thread locks, so it was useless.

Usually i do not care if computers differ in performance by less then factor of 1.5-2. But the NET still somehow completes multithreading run in 2.3s. How FTN95 developers managed to achieve this true miracle 6.5x speedup is beyond my expertise. I bet no other classic parallel method like OpenMP will do that with FP calculations without combining regular FP and SSE or GPU like CUDA at the same time since i suppose FTN95 is not doing that. Doesn't NET also use WinAPI for multithreading? Developers definitely should not hide this but make it available with just the WinAPI approach too for their own and FTN95 glory. Can hacking squad here disassemble NET MulthiThreading code? jalih, davidb, John, others? Smile
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 1835
Location: Sydney

PostPosted: Tue May 07, 2013 9:12 am    Post subject: Reply with quote

Dan and Jalih,

I have not been closely following this post, but from my scanning of the discussion, you now have a way of controling multiple threads in a FTN95_win_32 environment for a non-trivial calculation.
Is it possible to provide a FTN95-Win32 sample that can use a multi-threaded aproach to multiply 2 large matrices ?

The two options to demonstrate could be:
[C] = [A]' x [B] where the coefficients of C are dot products of the columns of A and B, or
[C] = [A] x [B] where the coefficients of C are product of A_row and B_column.

Matrix multiplication is a good non-trivial example, as there can be no clashing between the threads and each thread can be allocated a partition of C, while sharing the memory of A and B. Assuming the number of threads could be 2 or 3 for I5 or 6 or 7 for I7, [C] could be partitioned by sets of columns.

Am I correct in my impression of what you have achieved is runing in an FTN95_win32 environment, and hopefilly not requiring C++ or Basic ?

I am not clear on the requirement for .NET or how memory can be shared between threads (arrays A and B), or allocated locally to a single thread (temporary rows of A or local variables). These types of memory management are essential.

Once this solution is demonstrated, it would be good to get some performance times for various sizes of A and B to demonstrate the thread management overhead.
I would be pleased to develop some benchmark of this and provide a documented FTN95 example for others to use.

In previous OpenMP attempts I have tried, I have never got a good solution, as it often fails with either thread management overheads, or (I am told) cache clashes for addressing A and B. It would be good to try this thread managed option and see what is happening and hopefully document a multi-thread solution that others could adapt.

Hopefully I have understood what you have achieved. I would welcome your assistance and look forward to your comments.

John
Back to top
View user's profile Send private message
jalih



Joined: 30 Jul 2012
Posts: 189

PostPosted: Tue May 07, 2013 9:14 am    Post subject: Reply with quote

Hi Dan,

I added support for critical section object, this should give a little bit faster locking times than with mutex object.

Updated DLL with critical section usage example
Back to top
View user's profile Send private message
jalih



Joined: 30 Jul 2012
Posts: 189

PostPosted: Tue May 07, 2013 9:31 am    Post subject: Re: Reply with quote

JohnCampbell wrote:
Dan and Jalih,
Is it possible to provide a FTN95-Win32 sample that can use a multi-threaded aproach to multiply 2 large matrices ?

How about:
Code:

module test
  INCLUDE <windows.ins>
  STDCALL attach_thread 'attach_thread' (REF, VAL):integer*4
  STDCALL wait_object 'wait_object' (VAL):integer*4
  STDCALL check_object 'check_object' (VAL):integer*4
  STDCALL close_handle 'close_handle' (VAL):integer*4
  STDCALL create_mutex 'create_mutex' (VAL):integer*4
  STDCALL release_mutex 'release_mutex' (VAL):integer*4
  STDCALL init_criticalsection 'init_criticalsection' (REF)
  STDCALL enter_criticalsection 'enter_criticalsection' (REF)
  STDCALL leave_criticalsection 'leave_criticalsection' (REF)
  STDCALL delete_criticalsection 'delete_criticalsection' (REF)

  integer, parameter :: msize = 1000
  integer, parameter :: threads = 8

  integer :: work_unit
  integer, allocatable :: start_pos(:)
  real, dimension(:,:), allocatable :: A, B, C

  contains
    subroutine thread(ptr)
      integer :: ptr, i, j

      do i = ptr,ptr+work_unit-1,1
        do j = 1,msize,1
          C(i,j) = sum(a(i,:) * b(:,j))
        end do
      end do
     
      call ExitThread(0)
    end subroutine thread

end module test


winapp
program mt
  use test
  implicit none

  integer :: i, j, x
  integer :: thandle(threads)
  real(SELECTED_REAL_KIND(6,37)) :: start,finish

  allocate(A(msize,msize))
  allocate(B(msize,msize))
  allocate(C(msize,msize))

  ! put test data into matrices
  do i = 1,msize,1
    do j = 1,msize,1
      A(i,j) = 9999 * random@();
      B(i,j) = 9999 * random@();
    end do
  end do

  write(*,*) 'Single threaded 1000x1000 sized square matrix multiply:'
  call clock(start)
 
  do i = 1,msize,1
    do j = 1,msize,1
      C(i,j) = sum(a(i,:) * b(:,j))
    end do
  end do
 
  call clock(finish)
  write(*,*) 'Total time in seconds:', finish-start

  ! Calculate work unit size for threads and assign starting positions for each thread
  work_unit = msize/threads   
  allocate(start_pos(threads))

  start_pos(1) = 1
  do i = 2, threads, 1
    start_pos(i) = (i - 1) * work_unit + 1
  end do

  write(*,*) 'Multi threaded 1000x1000 sized square matrix multiply with 8 threads:'
 
  call clock(start)

  ! Start threads
  do i=1,threads,1
    thandle(i) = attach_thread(thread,loc(start_pos(i)))
  end do

  ! Wait for threads to finish
  do i=1,8,1
    10 call temporary_yield@()
    x = check_object(thandle(i))
    if (x /= 0) goto 10
  end do

  call clock(finish)
  write(*,*) 'Total time in seconds:', finish-start
 
  write(*,*) 'All done. Bye!'
  deallocate(A,B,C,start_pos)
end program mt


Quote:

Am I correct in my impression of what you have achieved is runing in an FTN95_win32 environment, and hopefilly not requiring C++ or Basic ?

It's just a small FTN95 callable DLL. It's now written in assembler and compiled using MiniBASIC compiler.
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 1645
Location: South Pole, Antarctica

PostPosted: Tue May 07, 2013 12:06 pm    Post subject: Reply with quote

Jalih
I see some speedup but it is within the noise ( 3.26 vs 3.15 or such). Is this strictly console method (criticalsection)?

Another thought - how about also to try same matrix method above with NET just to get more information about both methods? I will try when have time (i am under killing deadlines and am dead if not succeed to parallelize the code in next two days and not get results due to expected speedup), or may be John will help as most interested in matrix party Smile. NET may give us clue about the ultimate limits of possible

John,
If your 3 matrices are larger combined then L3 cache size, which means larger then approximately 1000x1000 for real*8 and 24MB cache of some good processors then the trick is to partition by such chunks that completely fit into the cache. If execution of this chunk will be faster then the order of 1/100 second then you will not get any gain due to overhead. If forget about caches and all that and just go with larger and larger sizes then this method should still work giving you speedup. For example here are speedups for matrix sizes

100 2.5
250 2.96
500 3.07
750 3.14
1000 3.75
1250 4.15
1500 3.73
1750 3.52
2000 3.48
3000 4.02

Not bad at all, isn't it? Thanks Jalih again. But of course all also depends on specific cases, here processor is still doing a lot of work. Just the huge matrix multiplication may give you smaller speedup due to all mentioned above
Back to top
View user's profile Send private message
jalih



Joined: 30 Jul 2012
Posts: 189

PostPosted: Tue May 07, 2013 7:24 pm    Post subject: Re: Reply with quote

DanRRight wrote:
Is this strictly console method (criticalsection)?

No, it works in winapp as well. It provides synchronization similar to that provided by a mutex object, except that a critical section can be used only by the threads of a single process.

I will add app_mutex() function, that can be used to allow only one instance of the application to be running at the same time.
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 1645
Location: South Pole, Antarctica

PostPosted: Wed May 08, 2013 2:05 am    Post subject: Reply with quote

Here is matrix code for NET. Since NET seems does not like allocatables or modules (no time to find why or where is my error), i wrote the code without them. Run it with one single thread first to get the statistics - minimum time will be taken as a single thread execution time to get speedup number (which is min single threaded time divided by min N-threaded time).

The multithreading capabilities of NET though are truly excellent, with 4cores/8 threads i get speedups around 3.95-4.05. The fact that we do not see numbers larger then 4 like on other NET example with truly independent threads is not unreasonable. But also possibly tells that threads in this matrix example may collide for read access doing sum
Code:
C(i,j) = sum(a(i,:) * b(:,j))
Other explanations are also possible

Code:

! Compilation: ftn95 amatr.f95 /clr /link /multi_threaded
! Use this code at your own risk
!
! Run first just one thread several times to get statistics -
! min time will be taken as single thread excution time
!
  include <clearwin.ins>
  EXTERNAL runN
  parameter (nThrMax=8, msize=1000)
  common /threads_/nEmployedThreads, iFlagThreadEnded(nThrMax)
  common /matr_/ a(msize,msize),b(msize,msize),c(msize,msize)
  character*80 getenv@

  write(*,*) 'Processor ', getenv@('PROCESSOR_IDENTIFIER')
  READ(getenv@('NUMBER_OF_PROCESSORS'),*) n_processorsTotal
  write(*,*) ' Max number of threads=', n_processorsTotal

!... put test data into matrices
   do i = 1,msize,1
     do j = 1,msize,1
       A(i,j) = 9999 * random@();
       B(i,j) = 9999 * random@();
     end do
   end do

   timeST=10000

1 print*,' Enter number of parallel threads <= 8'
  read(*,*)   nEmployedThreads

  call clock@ (time_start)

!...set a flag of thread finished
  iFlagThreadEnded(:)=1
  iFlagThreadEnded(1)=0 ! needed because threads are initialized too fast

  do i = 1, nEmployedThreads
     CALL CREATE_THREAD@(runN, i)
  enddo

!...wait till all threads finish
  do while (minval(iFlagThreadEnded)==0)
    call sleep1@(0.1)
  enddo

  call clock@ (time_finish)

  time = time_finish-time_start
  time2= time * nEmployedThreads
  if(time2.lt.timeST.and.nEmployedThreads.eq.1) timeST=time2
  print*, 'Runime, total CPU time, Speedup=', time, time2, timeST/time

  goto 1

  END

 =======================================================
  subroutine runN (iThrHandle)
  include <clearwin.ins>
  parameter (nThrMax=8, msize=1000)
  common /threads_/nEmployedThreads, iFlagThreadEnded(nThrMax)
  common /matr_/ a(msize,msize),b(msize,msize),c(msize,msize)
  integer istart_pos(nThrMax)

   ithr = iThrHandle 
   iFlagThreadEnded(ithr) = 0

   iwork_unit = msize/nEmployedThreads   
   istart_pos(1) = 1
   do i = 2, nEmployedThreads, 1
     istart_pos(i) = (i - 1) * iwork_unit + 1
   end do

!  lock;   print*,'Started thread # ', ithr ;   end lock

       do i = istart_pos(ithr),istart_pos(ithr)+iwork_unit-1,1
         do j = 1,msize,1
           C(i,j) = sum(a(i,:) * b(:,j))
         end do
       end do

  iFlagThreadEnded(ithr)=1
  end
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 1835
Location: Sydney

PostPosted: Wed May 08, 2013 2:15 am    Post subject: Reply with quote

Jalih,

Thanks for the example and the link to the DLL.
I hope to have some time in about 2 weeks and will look at this.

My plan is to:
1) declare matrices A, B and A_tran
2) Produce C = A x B and Ct = A_tran x B by both:
a single thread,
a multi thread approach, and
a single thread using SSE2 instructions that David has supplied.
3) compare these for accuracy and elapsed time for a range of matrix sizes, say 10k, 100k, 1mb, 10mb and 100mb

I will use ALLOCATE to scale up the matrix sizes.
For A x B, I will create a temporary vector A_row to allow dot_product.

The multi thread approach would group by rows. I am assuming I would define 3 threads for my Core i5, each processing every 3rd row of A and C. I need to understand your thread calls so that the number of threads can vary.
Potentially I can use the SSE instructions in each thread to combine mult-thread and vector instructions using FTN95.

I'll hopefully get back with a link to the code and run trace when I succeed.

John

PS : Dan, if this all works, then I will try to adapt my skyline solver using 3 threads and process 3 columns at a time, as equation.com
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 1645
Location: South Pole, Antarctica

PostPosted: Wed May 08, 2013 5:05 am    Post subject: Reply with quote

John, I think this could be interesting to you. Didn't i tell you the most amazing tale/truth i discovered myself and even the author of parallel solver has no clue what and why this happen?

Doing tests i found that the skyline solver of Equation com running on AMD processors gives absolutely crazy speedups so that the worst cheapest multicore AMD laptop outperforms the best Intel PC by the factor of 2? Try it, may be it will help your code
Back to top
View user's profile Send private message
jalih



Joined: 30 Jul 2012
Posts: 189

PostPosted: Wed May 08, 2013 5:45 am    Post subject: Reply with quote

I added app_mutex() function into my DLL. You can use it as an easy way to allow only one instance of the application. I made module mt_dll, where I put all the STDCALL definitions.

My previous download link is updated...

example usage of the app_mutex() function, put somewhere at the start of program code:
Code:

  x = app_mutex()
  if (x == 0) then
    write(*,*) 'Only single instance of this application is allowed'
    stop
  end if
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 1835
Location: Sydney

PostPosted: Wed May 08, 2013 8:17 am    Post subject: Reply with quote

Here is my code for doing the multi-thread matrix multiplication. I need to make it "thread wise". Vec_Sum is a Dot_Product, which can be replaced by David's SSE code.

John
Code:
      subroutine matmul_thread_test (A,B,C, chk, l,m,n, times)
!
!           A(1000000,100),   800 mb
!           B(100,10),          8 kb
!           C(1000000,10)      80 mb
!
      integer*4 l,m,n
      real*8    A(l,m), B(m,n), C(l,n), chk(l,n)
      real*4    times(2)
!
      integer*4 thread, n_thread
      real*4    err_max, ts(2), te(2)
      external  err_max
!
! 8)  If row of A is used - sequential Vec_Sum
!     3 threads
!
      C = 0
        call time_step (ts)
      n_thread = 3
      do thread = 1,n_thread
!
        call matmul_this_thread (thread, n_thread, A,B,C, l,m,n)
!
      end do
        call time_step (te)   ;   times = te - ts
        write (*,*) times, ' 8)  row Vector_sum Thread ', err_max (c, chk, l,n)
!
      end

      subroutine matmul_this_thread (thread, n_thread, A,B,C, l,m,n)
!
      integer*4 thread, n_thread
      integer*4 l,m,n
      real*8    A(l,m), B(m,n), C(l,n)
!
      integer*4 i,j
      real*8, dimension(:), allocatable :: row
      real*8    Vec_Sum
      external  Vec_Sum
!
      allocate ( row(m) )
!
        do i = thread,l,n_thread                     ! l = 1000000
          row(1:m) = A(i,1:m)                        ! m = 100
          do j = 1,n                                 ! n = 10
            C(i,j) = Vec_Sum (row, B(1,j), m)        ! m = 100
          end do
        end do
!
      deallocate ( row )
      end

      real*8 function Vec_Sum (A, B, n)
      integer*4 n, i
      real*8    A(n), B(n), s
!
      s = 0
      do i = 1,n
        s = s + a(i)*b(i)
      end do
      Vec_Sum = s
      end
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 1645
Location: South Pole, Antarctica

PostPosted: Wed May 08, 2013 8:12 pm    Post subject: Reply with quote

About debugging. Currently it is impossible to use SDBG debugger inside the thread subroutines. Debugger either crashes encountering instructions in third party DLL or does not provide textual information about variables inside the thread subroutine and the game is over. Why? How about including assembler code into Fortran text instead of DLL, will it debug then? Does FTN95 with Dev Studio (i do not use it) debug inside the threads?
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 5043
Location: Salford, UK

PostPosted: Wed May 08, 2013 8:55 pm    Post subject: Reply with quote

If the DLL is compiled with FTN95 or SCC in /DEBUG mode then you can probably use SDBG but you may need to set a break point from within a DLL code file. It might be a bit tricky but I think it will work.
Back to top
View user's profile Send private message
jalih



Joined: 30 Jul 2012
Posts: 189

PostPosted: Thu May 09, 2013 2:23 pm    Post subject: Reply with quote

I added a couple of more functions:

get_numcpu()
set_threadpriority()
get_threadpriority()


get_numcpu()
Returns the number of processors or cores as integer.

set_threadpriority(nPriority)
Sets the priority value for the current thread. Returns nonzero integer for success.

Some possible parameter values to try are:
THREAD_PRIORITY_LOWEST = -2
THREAD_PRIORITY_BELOW_NORMAL = -1
THREAD_PRIORITY_NORMAL = 0
THREAD_PRIORITY_ABOVE_NORMAL = 1
THREAD_PRIORITY_HIGHEST = 2
THREAD_PRIORITY_TIME_CRITICAL = 15
THREAD_PRIORITY_IDLE = -15

get_threadpriority()
Returns priority value of the current thread as integer.

Updated DLL package
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 1645
Location: South Pole, Antarctica

PostPosted: Thu May 09, 2013 10:58 pm    Post subject: Reply with quote

Jalih, From the words of Paul follows one important conclusion we have to take into account. I hope that in future all what you have done could be rewritten either to allow all definitions to be directly in the Fortran code or in C for SCC because no matter how great speedup will be achieved the ability to efficiently debug is more important. Only this compiler has this key feature no other compilers have - to find even nanoneedle in the hive. It's the developer's time versus computer time. The developer's time and nerves lost searching the bug are much more important and expensive then anything else. If it is possible (and i thought before that it is not, and only using print from the thread is doable) we can not lose possibility to use the debugger
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> Suggestions All times are GMT + 1 Hour
Goto page Previous  1, 2, 3, 4  Next
Page 3 of 4

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group