Silverfrost Forums

New Topic "NET"

8 May 2013 7:17 #12189

Here is my code for doing the multi-thread matrix multiplication. I need to make it 'thread wise'. Vec_Sum is a Dot_Product, which can be replaced by David's SSE code.

John

      subroutine matmul_thread_test (A,B,C, chk, l,m,n, times)
!
!           A(1000000,100),   800 mb
!           B(100,10),          8 kb
!           C(1000000,10)      80 mb
!
      integer*4 l,m,n
      real*8    A(l,m), B(m,n), C(l,n), chk(l,n)
      real*4    times(2)
!
      integer*4 thread, n_thread
      real*4    err_max, ts(2), te(2)
      external  err_max
!
! 8)  If row of A is used - sequential Vec_Sum
!     3 threads
!
      C = 0
        call time_step (ts)
      n_thread = 3
      do thread = 1,n_thread
!
        call matmul_this_thread (thread, n_thread, A,B,C, l,m,n)
!
      end do
        call time_step (te)   ;   times = te - ts
        write (*,*) times, ' 8)  row Vector_sum Thread ', err_max (c, chk, l,n)
!
      end

      subroutine matmul_this_thread (thread, n_thread, A,B,C, l,m,n)
!
      integer*4 thread, n_thread
      integer*4 l,m,n
      real*8    A(l,m), B(m,n), C(l,n)
!
      integer*4 i,j
      real*8, dimension(:), allocatable :: row
      real*8    Vec_Sum
      external  Vec_Sum
!
      allocate ( row(m) )
!
        do i = thread,l,n_thread                     ! l = 1000000
          row(1:m) = A(i,1:m)                        ! m = 100
          do j = 1,n                                 ! n = 10
            C(i,j) = Vec_Sum (row, B(1,j), m)        ! m = 100
          end do
        end do
!
      deallocate ( row )
      end

      real*8 function Vec_Sum (A, B, n)
      integer*4 n, i
      real*8    A(n), B(n), s
!
      s = 0
      do i = 1,n
        s = s + a(i)*b(i)
      end do
      Vec_Sum = s
      end
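
A minimal sketch, assuming standard Fortran 90 and hypothetical names (row_split_check, rows_covered_once), of why the row split `do i = thread,l,n_thread` in matmul_this_thread looks safe to run concurrently: every row index is visited by exactly one thread, so no two threads would write the same element of C. It only checks the index pattern and does not start any threads.

```fortran
module row_split_check
  implicit none
contains
  ! Returns .true. when the cyclic split "do i = thread, l, n_thread"
  ! visits every row 1..l exactly once across threads 1..n_thread.
  ! Assumes n_thread >= 1 (a zero stride is not a valid DO loop).
  logical function rows_covered_once (l, n_thread)
    integer, intent(in) :: l, n_thread
    integer :: hits(l)          ! visit counter per row (hypothetical helper)
    integer :: thread, i
    hits = 0
    do thread = 1, n_thread
      do i = thread, l, n_thread
        hits(i) = hits(i) + 1
      end do
    end do
    rows_covered_once = all (hits == 1)
  end function rows_covered_once
end module row_split_check
```

Calling `rows_covered_once (l, n_thread)` with a small l is enough to convince yourself; the local work array `row` in matmul_this_thread is allocated per call, so it is not shared either.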

8 May 2013 7:12 #12191

About debugging: currently it is impossible to use the SDBG debugger inside the thread subroutines. The debugger either crashes when it encounters instructions in the third-party DLL, or it provides no textual information about variables inside the thread subroutine, and the game is over. Why? If the assembler code were included directly in the Fortran source instead of in a DLL, would it debug then? Does FTN95 with Dev Studio (I do not use it) debug inside threads?

8 May 2013 7:55 #12192

If the DLL is compiled with FTN95 or SCC in /DEBUG mode then you can probably use SDBG but you may need to set a break point from within a DLL code file. It might be a bit tricky but I think it will work.

9 May 2013 1:23 #12193

I added a couple more functions:

get_numcpu()
set_threadpriority()
get_threadpriority()

get_numcpu()

Returns the number of processors or cores as an integer.

set_threadpriority(nPriority)

Sets the priority value for the current thread. Returns a nonzero integer on success.

Some possible parameter values to try are:

THREAD_PRIORITY_LOWEST = -2
THREAD_PRIORITY_BELOW_NORMAL = -1
THREAD_PRIORITY_NORMAL = 0
THREAD_PRIORITY_ABOVE_NORMAL = 1
THREAD_PRIORITY_HIGHEST = 2
THREAD_PRIORITY_TIME_CRITICAL = 15
THREAD_PRIORITY_IDLE = -15

get_threadpriority()

Returns the priority value of the current thread as an integer.

Updated DLL package

9 May 2013 9:58 #12194

Jalih, one important conclusion follows from Paul's words, and we have to take it into account. I hope that in future everything you have done can be rewritten so that all the definitions are either directly in the Fortran code or in C for SCC, because no matter how great a speedup is achieved, the ability to debug efficiently is more important. This compiler has a key feature no other compiler has: it can find even a nanoneedle in a haystack. It's the developer's time versus computer time. The developer's time and nerves lost hunting a bug are much more important and expensive than anything else. If it is possible (and I thought before that it was not, and that only printing from the thread was doable), we cannot lose the ability to use the debugger.

10 May 2013 4:47 #12196

I have not had the opportunity to look at this subject but I am hopeful that I will be able to include all of Jalih's good work in salflibc.dll.

10 May 2013 7:30 #12197

Quoted from DanRRight I hope that in future all what you have done could be rewritten either to allow all definitions to be directly in the Fortran code or in C for SCC because no matter how great speedup will be achieved the ability to efficiently debug is more important.

I haven't really played around much with the SDBG debugger, but I think it doesn't support debugging multiple threads.

As my DLL is just a wrapper for native Win32 API functions, it should not matter what language it is written in.

10 May 2013 10:12 #12199

Jalih, what inherently prevents SDBG from debugging in an initially defined thread? If you set a specific breakpoint condition in a specific thread, shouldn't the debugger stop at that condition and display whatever you want? Here, for example, you tell the debugger to stop in thread #3 when i=1000.

!...ptr is thread handle number

   do i=i,200000000/8,1
     d=alog(exp(d))
     if(i.eq.1000.and.ptr.eq.3) then
       i2=i+1
     endif
   end do

At the moment the debugger interrupts, let the other threads do whatever they want and even finish the run (or the debugger could stop the other threads too, whatever). Right now something is preventing the debugger from displaying debug information.

A debugger is an absolutely necessary thing. Without it you can only debug by printing from the code, and you can only write small, well-structured subprograms. The programmer also has to be well organized (that's not me, I make 3 errors and a dozen typos per line). If someone else touches it, the whole code is dead and you will never find where and why. But if the debugger could report thread conflicts, when two threads write to the same variable, that would be the best help ever, and writing multithreaded code would be super easy.

My initial experience with parallelizing real code? Right now I'm at the end of my second 16-hour day rebuilding a relatively small subroutine of 2K lines (a postprocessor part of a large code) into parallel code. It feels like I am in the middle of a dark forest, with hope permanently leaving me and coming back. I'm placing debug prints and planting fake code to find which variable is still not local and so conflicts, freezing the code. Thousands of freezes have been resolved by such blind carpet bombing and I'm just halfway done. I would not even have started if this were not my own code, in which I know each and every letter. But if the debugger worked in threads, or at least told me which variables are conflicting, I'd do that rebuild in 2 hours.

11 May 2013 7:37 #12202

Quoted from DanRRight Jalih, What inherently prevents SDBG to do the debug in initially defined thread? If set specific breakpoint condition in the specific thread shouldn't debugger stop at that condition and display whatever you want?

The problem is that SDBG knows nothing about the threads. Remember that all threads in a process share memory but have their own stacks.

Maybe trace buffer support could be added to SDBG to provide a log of thread events? A simple example and information are available here.

11 May 2013 12:51 #12203

I was fighting all day and night today to find the reason for a code freeze and crash inside a subroutine that is called by the main thread subroutine. The code of this subroutine is completely legitimate: all variables are defined, all written variables are local, and they do not use common blocks but pass their values back to the thread via the dummy argument list. Any suggestions as to why calling one more subroutine could cause a crash? For example, can even reading the same array variables (which are in a common block) by two competing threads cause a conflict and freeze? Or does FTN95 lack a switch like the /multi_threading one in NET that makes a thread-safe environment? Anyway, I cornered the conflict causing the freeze to a single line, with all variables printed just before the crash absolutely OK and local. Still I get the crash. Yes, the code is bouncing near an unsafe place by using an enlarged stack and /3GB, which could also be the reason.

My enthusiasm is still high (I hope the FTN95 developers will help), but right now things are not great after finding a couple of deadly pitfalls. Do not try to parallelize anything larger than a few tens of lines. The worst thing is that the method currently kills the debugger, and for large codes that means the game is over. Parallelizing numerical methods is the most that is worth doing right now.

A real application code needs a complete rebuild and is not parallelizable, because anything may freeze the code and finding where becomes an impossible task. The FTN95 developers are the first who need to go through the whole process from beginning to end. The debugger now has to show local and global variables separately. The most important addition to a parallel-threads debugger would be revealing 'writing to the same global variable' conflicts, pointing at the exact spot in the Fortran code. The code has become too fragile: any variable that is not set causes it to freeze. That is what is called 'bewitched'. LOL The debugger does not provide any information besides useless assembler text. Since SDBG is not working, the debugging process looks like an exact copy of how people debugged in the 1970s: 10000 recompilations, printing everything to the display line by line, cutting the code by 2, 4, 8 until the error is found, etc. The worst part is that after all legitimate reasons for a crash were sorted out, the code still crashes in some places for no reason. The good news is that we are gaining experience in parallel programming, which we are all moving to.

12 May 2013 12:56 #12207

Well... after a crazy week of non-stop debugging I give up until better times. The work was so absurdly intense that on Saturday evening I found, to my great surprise, that it was not Thursday LOL

It's impossible to find the source of the conflict. Probably some undefined variable exists, or threads write to the same variable, and there is almost no way to find WHERE without a debugger... The method disables the debugger entirely, everywhere and not just in threads, which is unacceptable for large codes and kills all the potential benefit of large speedups.

It's time for the developers to step in.

14 May 2013 5:29 (Edited: 16 May 2013 7:52) #12216

I made one step further (these large speedups are very appealing!) and found the reason for the crash in one specific place. It is a mystery to me, and it has followed me for 25 years. Basically, if the code has exp(-A) with A larger than 50-70, which causes underflow, then this may cause an access violation. I cannot confirm that with simple code, because it usually works OK, but that's definitely the place. When I applied the restriction A=min(A,50.) the error was gone. Is it possible that the processor issues an interrupt reporting the underflow, which is ignored by the compiler (as it generally should be), but when two interrupts are issued at the same time by two threads, this crashes the whole code?
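
A minimal sketch of the capping workaround in standard Fortran, assuming the default real is IEEE single precision; the module and function names (exp_underflow_guard, exp_underflow_limit, exp_neg_capped) are hypothetical and not part of FTN95 or mt.dll. It also computes the point where exp(-A) drops below the smallest normal single-precision number, which is about 87.3 under that assumption. It does not explain the access violation; it only shows the cap and the arithmetic.

```fortran
module exp_underflow_guard
  implicit none
contains
  ! Largest A for which exp(-A) is still a normal (non-denormal)
  ! default real: -log(tiny(1.0)), about 87.3 for IEEE single precision.
  real function exp_underflow_limit ()
    exp_underflow_limit = -log (tiny (1.0))
  end function exp_underflow_limit

  ! exp(-A) with the argument capped at a_max, mirroring the
  ! A = min(A,50.) restriction described in the post.
  real function exp_neg_capped (a, a_max)
    real, intent(in) :: a, a_max
    exp_neg_capped = exp (-min (a, a_max))
  end function exp_neg_capped
end module exp_underflow_guard
```

With a cap of 50. the result never goes below about 1.9e-22, far from the underflow range, and at that size the capped value is negligible next to terms of order one.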

15 May 2013 12:06 (Edited: 16 May 2013 7:50) #12217

Paul, yes, I made a simple demo code, and it seems the reason is as suspected above. Here is that old roach which has been hiding under a rock for 25 years, crashing this compiler and even FTN77. The code freezes the threads or not depending on which of the two marked lines is used. This has sometimes crashed even regular single-threaded code, costing me weeks - see this older post

https://forums.silverfrost.com/Forum/Topic/2007&highlight=freezes

Compilation

FTN95 mt_cwin.f95
slink mt_cwin.obj mt.dll

You will need Jalih's mt.dll from the links above. I kept Jalih's original Fortran text purely as a multithreading example, so the DO loops used here for speed testing are not needed. I also reduced the single-threaded DO range by a factor of 10, which is likewise irrelevant to demonstrating this bug. The code takes exp(-A) one value at a time until A becomes too large, around 70, and exp(-A) too small compared to the real*4 machine zero, which is somewhere below 10**(-37). In this case the co-processor must issue an underflow interrupt, and potentially the hardware logic that handles it is either slow or buggy (and in the IBM PC XT that my friends and I built on our knees in the 80s out of copycat TTL parts, including processors, from some European country, it was absent altogether, so any FP error or underflow always crashed the computer... LOL), or it can only handle one core, and when two co-processors issue interrupts the whole thing crashes.

module test
  INCLUDE <windows.ins>
  STDCALL attach_thread 'attach_thread' (REF, VAL):integer*4
  STDCALL wait_object 'wait_object' (VAL):integer*4
  STDCALL check_object 'check_object' (VAL):integer*4
  STDCALL close_handle 'close_handle' (VAL):integer*4
  STDCALL create_mutex 'create_mutex' (VAL):integer*4
  STDCALL release_mutex 'release_mutex' (VAL):integer*4

  
  integer :: hMutex
  integer :: values(8) = (/1,2,3,4,5,6,7,8/)

  real AAA(100)

  contains
    subroutine thread(ptr)
      integer :: ptr, i
      real d


      i = wait_object(hMutex)
      write(*,*) 'Starting calculation in thread', ptr
      i=release_mutex(hMutex)
      
      do i=1,100
        underexp = aaa(i)             !  <--- bug
!       underexp = min(50.,aaa(i))    !  <--- works
        d = exp(-underexp)
      enddo
      d = 2.22
      do i=i,200000000/8,1
        d=alog(exp(d))
      end do
      
      call ExitThread(0)
    end subroutine thread

end module test

winapp
program mt
  use test
  implicit none

  integer :: i, x
  real :: start, finish, d
  integer :: thandle(8)


  do i=1,100
    AAA(i)=i+0.1
  enddo

  write(*,*) 'Single threaded test'
  call clock(start)

  d = 2.22
  do i=i,20000000 ! 0
    d=alog(exp(d))
  end do

  call clock(finish)
  write(*,*) 'Total time in seconds:', finish-start

  write(*,*) 'Multithreading test'
  
  hMutex = create_mutex(0)

  call clock(start)
  
  do i=1,8,1
    thandle(i) = attach_thread(thread,loc(values(i)))
  end do

  do i=1,8,1
    10 call temporary_yield@()
    x = check_object(thandle(i))
    if (x /= 0) goto 10
  end do

  call clock(finish)
  
  x = close_handle(hMutex)
  
  write(*,*) 'All done. Bye!'
  write(*,*) 'Total time in seconds:', finish-start
end program mt

16 May 2013 6:19 #12219

I have logged this for investigation.

17 May 2013 7:03 #12227

I have finally finished parallelizing my now even more bewitched code. The worst problems were the exponent, as I described above, then finding which variables are local and which are global, and debugging by printouts from the code. The tremendous difficulty of debugging parallel code without a debugger is mainly that errors which occur in one place (and you only learn about them through printouts, the Neanderthal method of debugging) will confuse you, because they usually get reported somewhere else. That goes on until you have planted literally hundreds of prints in the code. Then, after crazy weeks of such debugging, the wild horse is yours.

So this approach works. Thanks to Jalih, who has done a pretty good job: everything operates reliably. Now it's time to adopt it into FTN95 in the most logical way and raise it to the next level. For example:

  • merging some syntax with NET by replacing wait_object(ihMutex)/release_mutex(ihMutex) with LOCK/UNLOCK,
  • taking the C WinAPI definitions,
  • allowing debugging in threads,
  • having the debugger point to the exact place in the Fortran source code where a thread collision/conflict causes an access violation,
  • separating the debug windows for local and global variables (or marking them by color or some other way)

2 Jul 2013 11:03 #12549

Jalih

Is the code for attach_thread available for inspection? I confess that I am missing a trick here somewhere.

Paul

p.s. Problem sorted.
