forums.silverfrost.com

DanRRight · Posted: Thu Jul 04, 2013 3:51 pm Post subject:

Thanks, Paul, for the effort; this implementation makes parallelization as easy as 2x2.
My first observation: the parameter passed to subroutine threadFunc somehow changes by one between the call and the thread's use of it, as you can see in this example. This probably needs your attention; it is important that the thread receives exactly the value that was passed.

DanRRight · Posted: Thu Jul 04, 2013 4:18 pm Post subject:

Continuation of the .NET code

jalih · Joined: 30 Jul 2012 Posts: 196

DanRRight · Posted: Thu Jul 04, 2013 7:33 pm Post subject:

Yes, I remember that from your approach. But I hoped Paul had worked out a way to avoid using LOC. Maybe it's just me, but I find that using a pointer adds a mind-melting twist to what is generally a simple idea, or at least raises the question of why it is needed, which can put people off trying new things if it is not explained well.

P.S. Anyway, it looks like Paul's approach does not need LOC but still needs an array. Honestly, I need a fresh head to understand why. The modified code, which shows the threads correctly, is here.
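This is not the FTN95 listing mentioned above. It is a small Python sketch of one plausible reason an array is needed, under an assumption the thread does not confirm: the thread receives its argument by reference and may read it only after the launching loop has moved on. The names run_workers, shared and slots are hypothetical and are not part of the FTN95 library.

```python
import threading

def run_workers(n_threads, use_private_slots):
    """Start n_threads workers and return the argument value each one observed."""
    seen = [None] * n_threads
    go = threading.Event()              # holds workers back until the launch loop ends

    def worker(index, cell):
        go.wait()                       # simulate a thread that starts running late
        seen[index] = cell[0]           # read the argument through the reference

    shared = [0]                                     # one variable reused for every thread
    slots = [[k] for k in range(1, n_threads + 1)]   # one private cell per thread

    threads = []
    for k in range(1, n_threads + 1):
        shared[0] = k                   # the "loop variable" keeps changing
        cell = slots[k - 1] if use_private_slots else shared
        t = threading.Thread(target=worker, args=(k - 1, cell))
        t.start()
        threads.append(t)

    shared[0] = n_threads + 1           # a DO loop leaves its index one past the limit
    go.set()                            # now let every worker read its argument
    for t in threads:
        t.join()
    return seen

print(run_workers(4, use_private_slots=False))   # every worker sees the final value
print(run_workers(4, use_private_slots=True))    # each worker sees its own value
```

With one shared variable, every worker reads whatever the variable holds when it finally runs. With one array element per thread, the value a worker reads cannot be changed by later loop iterations.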

PaulLaidler · Posted: Thu Jul 04, 2013 7:34 pm Post subject:

I have tested Jalih's matrix multiplication program using his routines and his DLL and compared the results with those obtained using the new routines. The results are the same, and using two processors I get half the single-processor time, as expected.

There is very little to optimise. Start_thread@ has almost no overhead and just calls on CreateThread. Lock@ uses a Critical Section approach and, though this may not be optimal, it should have little effect on the performance.
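As a rough analogy only, in Python rather than FTN95: threading.Thread stands in for Start_thread@ (and the CreateThread call behind it), and threading.Lock stands in for the critical section behind Lock@. The function name parallel_sum and the way the work is split are illustrative assumptions, not the library's API.

```python
import threading

def parallel_sum(values, n_threads):
    """Sum values with n_threads workers, merging partial sums under a lock."""
    total = [0]
    lock = threading.Lock()

    def worker(chunk):
        partial = sum(chunk)            # the bulk of the work runs outside the lock
        with lock:                      # short critical section: one update per thread
            total[0] += partial

    # Give thread i every n_threads-th element, starting at index i.
    threads = [threading.Thread(target=worker, args=(values[i::n_threads],))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]

print(parallel_sum(list(range(1, 101)), 4))
```

Because each thread enters the lock only once, after finishing its own chunk, the cost of the lock is small compared with the work, which matches the expectation that the locking method should have little effect on performance. In standard CPython this shows the locking pattern only, not the speedup, because of the global interpreter lock.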

If there is clear evidence that .NET does much better then I will have to get inside the .NET code and find out what it is doing.

DanRRight · Posted: Thu Jul 04, 2013 7:43 pm Post subject:

Matrix multiplication slows down with large arrays and can have its own overheads from limited memory bandwidth once the data no longer fits in the L1/L2/L3 caches, so it may hide inefficiencies in the threading itself. I remember getting speedups with a wide spread, 2.5 to 4 on 4 cores, with Jalih's method. I have now implemented this method on another task, and although I get speedups of the order of 2 to 2.2, I still dream about the crazy speedups that the .NET example above showed. That example runs entirely inside the L1 cache.

We definitely have to do more testing. By the way, don't you see the same very large speedups in the .NET case on your computer?

Even this non-threaded 10-line code, extracted from the code above, runs in 7.05 seconds in .NET mode as opposed to 9.01 seconds in the regular x86 case, an almost 30% speedup. Please check whether your mileage is the same.
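For anyone comparing their own timings, the figures quoted in this post can be put on a common footing with a few lines of arithmetic. This is a small Python sketch; the function names are illustrative.

```python
def speedup(t_reference, t_new):
    """How many times faster the new run is than the reference run."""
    return t_reference / t_new

def time_saved(t_reference, t_new):
    """Fraction of the reference run time that the new run saves."""
    return 1.0 - t_new / t_reference

def parallel_efficiency(observed_speedup, cores):
    """Observed speedup as a fraction of the ideal speedup on that many cores."""
    return observed_speedup / cores

# Single-threaded timings quoted above: 9.01 s (x86) against 7.05 s (.NET).
print(round(speedup(9.01, 7.05), 2))       # about 1.28, i.e. roughly 28% faster
print(round(time_saved(9.01, 7.05), 2))    # about 0.22, i.e. roughly 22% less time

# Threaded speedups quoted above, on 4 cores.
for s in (2.0, 2.2, 2.5, 4.0):
    print(s, round(parallel_efficiency(s, 4), 2))
```

Note that "28% faster" and "22% less time" describe the same pair of timings, so it helps to say which of the two is meant when reporting results.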