Topic: Two multithreading programs in General

Matrix multiplications with large arrays are slower and carry their own overhead. Once the data no longer fits in the L1/L2/L3 caches, limited memory bandwidth dominates and may hide threading inefficiencies. I remember getting speedups with a large spread, 2.5 to 4, on 4 cores with Jalih's method. Now I have applied this method to another task, and although I get speedups of about 2 to 2.2, I keep dreaming about the crazy .NET speedups that the example above showed. That example runs completely inside the L1 cache.
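A minimal sketch of the cache effect described above, assuming a Fortran 2008 compiler such as gfortran (it uses the standard `system_clock` intrinsic rather than FTN95's `clock@`). It performs the same number of `y = y + x` element updates on a pair of arrays small enough to stay in L1 cache and on a pair far larger than a typical L3 cache. The array sizes, as well as the names `cache_vs_memory` and `sweep`, are illustrative assumptions. On most machines the large case should take noticeably longer for the same work, because it is limited by memory bandwidth rather than by arithmetic.

```fortran
program cache_vs_memory
  ! Sketch: equal work (element updates) over two working-set sizes.
  ! The sizes assume a "typical" CPU (>= 32 KB L1d, L3 well under 64 MB).
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n_small = 1024             ! 2 arrays x 8 KB: fits in L1
  integer, parameter :: n_large = 4*1024*1024      ! 2 arrays x 32 MB: exceeds a typical L3
  integer, parameter :: total   = 128*1024*1024    ! element updates per test
  real(dp), allocatable :: x(:), y(:)
  real(dp) :: chk_small, chk_large, t_small, t_large

  allocate(x(n_large), y(n_large))
  x = 1.0_dp

  call sweep(x(1:n_small), y(1:n_small), total/n_small, chk_small, t_small)
  call sweep(x, y, total/n_large, chk_large, t_large)

  print '(a,f8.3,a)', 'in-cache working set:     ', t_small, ' s'
  print '(a,f8.3,a)', 'out-of-cache working set: ', t_large, ' s'
  print *, 'checksums:', chk_small, chk_large

  ! Both runs add 1.0 exactly 2**27 times in total, so the sums are exact.
  if (chk_small /= real(total, dp) .or. chk_large /= real(total, dp)) &
    error stop 'checksum mismatch'

contains

  subroutine sweep(src, dst, reps, checksum, seconds)
    ! Repeats dst = dst + src reps times and measures the elapsed wall time.
    real(dp), intent(in)  :: src(:)
    real(dp), intent(out) :: dst(:)
    integer,  intent(in)  :: reps
    real(dp), intent(out) :: checksum, seconds
    integer(i8) :: c0, c1, rate
    integer :: r

    dst = 0.0_dp
    call system_clock(c0, rate)
    do r = 1, reps
      dst = dst + src                ! 2 loads + 1 store per element, one addition
    end do
    call system_clock(c1)
    seconds  = real(c1 - c0, dp) / real(rate, dp)
    checksum = sum(dst)              ! equals size(dst)*reps; keeps the result observable
  end subroutine sweep

end program cache_vs_memory
```

The `y = y + x` update is used instead of a running sum because a single accumulator forms a dependency chain that can be latency-bound even in cache, which would blur the bandwidth difference.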

We definitely need to do more testing. By the way, don't you see the same very large speedups in the .NET case on your computer?

Even this 10-line, non-threaded code extracted from the programs above runs in 7.05 seconds in .NET mode versus 9.01 seconds in regular x86 mode, a speedup of almost 30% (9.01/7.05 ≈ 1.28). Please check whether your mileage is the same.

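    program NETisWayFaster
    implicit none
    integer :: i, nEmployedThreads
    real    :: d, time_start, time_finish, time

    call clock@(time_start)        ! FTN95 library timing routine (seconds)
    nEmployedThreads = 1           ! single thread: the whole loop runs here
    d = 2.22
    do i = 1, 200000000/nEmployedThreads
      d = log(exp(d))              ! round trip keeps d close to 2.22
    enddo

    call clock@(time_finish)
    time = time_finish - time_start
    print*, 'Pure no-threaded case=', time
    print*, 'd=', d                ! use d so the loop result is observable
    end program NETisWayFaster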

I compiled this snippet in .NET mode with

    ftn95 NETisWayFaster.f95 /clr /link /multi_threaded

and in x86 mode with

    ftn95 NETisWayFaster.f95 /link /opt /P6
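To check the mileage with other compilers, or to cross-check the FTN95 timer, here is a sketch of the same loop timed only with standard Fortran intrinsics, assuming a Fortran 2003 compiler: `cpu_time` for processor time and `system_clock` for wall time. The program name is hypothetical. On some compilers the processor time of a multithreaded program may be summed over all its threads, so when judging a parallel speedup the wall time is the safer figure.

```fortran
program loop_timing_portable
  ! Sketch: the same single-threaded loop as the snippet above, standard intrinsics only.
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  integer :: i
  real :: d, cpu0, cpu1
  integer(i8) :: c0, c1, rate

  d = 2.22
  call cpu_time(cpu0)                ! processor time used by the program so far
  call system_clock(c0, rate)        ! wall-clock tick counter and ticks per second
  do i = 1, 200000000
    d = log(exp(d))                  ! same round trip as the FTN95 snippet
  end do
  call system_clock(c1)
  call cpu_time(cpu1)

  print *, 'CPU time  (s) =', cpu1 - cpu0
  print *, 'wall time (s) =', real(c1 - c0) / real(rate)
  print *, 'd =', d                  ! should stay close to 2.22
end program loop_timing_portable
```

For a single-threaded loop like this one the two figures should be close; a large gap would point at the timer or at other load on the machine.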

And everyone thought .NET was slow... .NET is a damn killing machine sleeping deep in the FTN95 internals 😃

Addition: I tried debugging the multithreaded code and everything seems to work fine (which is a great feature of this method). The only problem is that wait_for_thread@ causes the SDBG debugger to open an assembler window (which goes away without problems after pressing F8 a few times). Once fine-tuning and debugging of this method are complete, it would be nice if no assembler window appeared at all, or at least if it opened on top of the Fortran source window instead of closing it, as happens now.