Silverfrost Forums


Sparse Matrix Solution tools for matrix inversion

3 Apr 2016 4:39 #17379

John, is the skyline solver your primary consumer of CPU time? Try the LAIPE parallel skyline solver and compare it with your parallel method; the results should be interesting. One more thing: on AMD processors versus Intel ones, the LAIPE skyline solver did seemingly unexplainable miracles: even the cheapest 4-core AMD laptops were beating heavily overclocked Intel desktop monsters. Don't believe me? Go to the equation dot com website and download the skyline test program, which was compiled with IVF Fortran, supposedly to best suit Intel hardware. Run it on AMD and Intel machines and see for yourself.

4 Apr 2016 12:56 #17380

Hi Dan, the skyline solver is the one I use. I am aware of LAIPE, but prefer to develop my own solver and learn from the process. I have made good progress and now have a solver that compares well with the other solvers I have reviewed. The main outcomes of this study were:

  • use of OpenMP to enable multiple threads
  • use of vector instructions to speed up each thread
  • partitioning of the solution process to balance thread load
  • and, importantly, adopting a cache-usage strategy to mitigate the memory transfer bottleneck.

When I first started learning about OpenMP, I did not appreciate how much memory access speed, transferring information to and from main memory, limits performance. It is important to keep information in the cache and modify it there, minimising transfers between cache and memory. With 8 threads all accessing memory, this becomes the performance bottleneck. Even single-thread AVX vector instructions become constrained and only work effectively if the vectors are already in the cache. You can see this when running on processors with different clock rates: the performance ratios are dominated more by memory access speed and cache size than by clock rate. (Early on, I wrote tests that could not show any benefit of AVX over SSE, because of the memory speed problem.)

There are lots of calculations that are too complex to multi-thread (too much work!), or where the calculation packet is too small to overcome the multi-threading overheads (entering a !$OMP region can take about 20,000 processor cycles). I have found these can be improved more easily by running multiple single-thread processes that target vector instructions. I hope vector instructions will be included in FTN95, which would keep it useful as both a development and a production compiler.

John
