forums.silverfrost.com Forum Index
Welcome to the Silverfrost forums
 

Skyline solver accelerated 42 times on 48 procesdors
Goto page 1, 2, 3, 4  Next
 
DanRRight



Joined: 10 Mar 2008
Posts: 1645
Location: South Pole, Antarctica

Posted: Sat Apr 11, 2015 6:19 pm    Post subject: Skyline solver accelerated 42 times on 48 processors

Parallelization on PCs is becoming more and more mainstream. We have compared some parallel methods and libraries with conventional solvers here before. The latest news, of particular interest to John Campbell, is that a constant-bandwidth solution achieved a 42x speedup on a 48-core PC, shrinking waiting hours to minutes and minutes to seconds; see the site equation.com
John-Silver



Joined: 30 Jul 2013
Posts: 631

Posted: Sat Apr 11, 2015 7:32 pm

Interesting article; it shows that a 1:1 core-to-computing-time ratio can, against all the expected odds, (almost) be achieved ... with one important proviso, as stated:

" One thing is mentioned here. Parallel computing is not everything, either. Sometimes, parallel computing could be an extra burden, and cannot gain a benefit. It could happen when the problem size is small, having no much computations that are able to be distributed among cooperative cores. In such situation, parallel computing cannot gain a benefit, but in contrary may slow a computing"

which still leaves the user to decide whether or not to use it (assuming they have a 4x12=48-core machine ... which is not a PC, btw ... maybe it should be classified as a PC-ray? :O)
JohnCampbell



Joined: 16 Feb 2006
Posts: 1834
Location: Sydney

Posted: Sun Apr 12, 2015 4:36 am

Dan,

There certainly are some interesting articles on equation.com.
Over the last 6+ months, I have been modifying my skyline solver to utilise new features in processors and OpenMP. I have had some success and identified coding changes that utilise:
- SSE2 and AVX instructions
- !$OMP coding structures, and
- cache-sized data structures to limit cache conflicts
I have been surprised by the importance of cache issues, which become even more significant when !$OMP is being used, as memory access rates then matter a great deal.

I have found the equation.com build of gFortran better to use in my testing, as it has lower !$OMP startup overheads than the mingw version.

The article's discussion of the 48-core approach is interesting. I am not familiar with the PowerEdge hardware being discussed, but the performance quoted is not very impressive when compared to, say, a single i7-4770 using 8 threads, where 50,000 equations with a bandwidth of 7,000 would be expected to take significantly less than 330 seconds. (I just ran this on my 2-core/4-thread i5-2300 and it took 159 seconds to reduce this matrix!) I am looking to test 4-core and 6-core configurations to see how it scales up, but multiple chips would have more complex communication problems.
Multi-chip configurations would be more complex, although they appear to have developed an approach to controlling that hardware configuration. It would be more convincing if the result were impressive in comparison to simpler i5 or i7 configurations.

John
John-Silver



Joined: 30 Jul 2013
Posts: 631

Posted: Sun Apr 12, 2015 6:00 am

You make a good point, John, about it not being particularly quick relative to the examples you quote, but I think the article Dan cites wasn't intended as an assessment of the absolute processing capability of that hardware. It was more a demonstration that a near-1:1 efficiency of n processors/threads versus a single processor/thread is achievable (for the 'right' kind of problem, obviously).

Choosing between hardware is then another question, obviously related, because the total speed is a combination of the basic single-core speed and the efficiency of the multi-processing.

You'd need the single-core/thread figures for the hardware you mention to be able to compare its relative merits against the machine used for the equation.com example.

... and the comparative results might be different again for different problems on different machines !
JohnCampbell



Joined: 16 Feb 2006
Posts: 1834
Location: Sydney

Posted: Sun Apr 12, 2015 8:55 am

John,

I don't agree with what you are saying; it is still extremely slow.
While it might be good to demonstrate high efficiency with n processors (which my testing can't), the times they quote at November 2014 are nothing like what can be achieved on an i5 or i7 processor, which I would suggest are at the cheap end of multi-core processors.
When I read the article, I was struck by how slow one process was: 13,878 seconds is extremely slow. The matrix is 2.6 GB in size, so I can't easily do a single-thread test in FTN95 to demonstrate, but I would estimate the times at 600 seconds with SSE instructions and 1,200 seconds with x87. That is a long way from 13,000 seconds. (I will test when FTN95_64 is released.)

My testing to date on 4 cores/8 threads does not achieve a 50% reduction in run time compared to the i5. This is what I am trying to improve at this time. It is difficult to know what is the cause, as many causes are possible, including memory access times or cache clashes.

The observation of achieving near 100% efficiency for n threads would be better if the absolute times were comparable to other hardware. Near 100% efficiency would certainly challenge what I can achieve at this time and what I would expect achievable for a skyline solver.

One of the interesting outcomes of the project I have undertaken is discovering how much I don't know about modern processor architecture, and how difficult it is to find out. For example, how are the three levels of cache (L1, L2 and L3) shared among 4 cores and 8 threads? Or how do 2 threads share 1 core and achieve 100% efficiency using hyper-threading? Then this article comes along and claims 100% efficiency for a skyline solver with 48 threads. Hopefully I will look back on this post in 12 months and see where I am wrong, although you can hide a lot in 13,000 seconds!

John
DanRRight



Joined: 10 Mar 2008
Posts: 1645
Location: South Pole, Antarctica

Posted: Mon Apr 13, 2015 12:36 am

JohnCampbell,

Yes, there could be a lot of things involved; for example, large matrices are mostly memory-bandwidth bound, and then scalability depends only on the number of processor cores, their cache size, and the cache coherency of the solver, not on SSE or processor clock speed. Maybe this server chip was inefficient, but it would be better to ask the author for his test source and run it with different solvers, to make sure you compare apples to apples.

Earlier I checked many different AMD chips and all were broadly similar: they were far more efficient than Intel on this specific solver. Even the cheapest laptop A-series chips were faster than anything Intel makes. Here are the results of the tests using the IVF compiler (EXEs taken from the Equation site):

1) AMD, skyline on LAIPE: 6-year-old 6-core 2.8 GHz Phenom 1100T processor
2) Intel: latest 4-core/8-thread 4770K, overclocked to 4.5 GHz

AMD: Processors: 6 Elapsed Time (Seconds): 0.37
Intel: Processors: 8 Elapsed Time (Seconds): 0.50

And that is the same LAIPE library we tested here a year ago on dense matrix solvers:

Code:

i7 4770k 4.5GHz computer, with progress bar
 matrix size --> 1000    2000    3000    4000
 --------------------------------------------
 Dense/Block     2.22    30.4    127.    297.
 Dense/Block Tr. 0.20    2.06    7.36    17.5
 SSE             0.12    1.81    6.70    16.2
 LAIPE          0.09    0.75    2.44    5.90



The LAIPE author may have added SSE by now too, as I asked him to do so more than a year ago.

Intel has a lot of multi-core chips overclockable to 4.3 GHz, but AMD multi-core chips are way cheaper than Intel's.

John-Silver: it could be both; the scaling efficiency and the overall speed of the system are good, even though the processor clock was only 1.9 GHz.
John-Silver



Joined: 30 Jul 2013
Posts: 631

Posted: Mon Apr 13, 2015 11:36 am

Wow, you're both way too experienced for me on this!
I was only looking at it from the point of view of a novice who knows nothing of the nitty gritty.
Correct me if I'm wrong, but I assumed 'n' processors can't run faster than n times a single processor, so my comment was simply that the article was trying to indicate that the 'efficiency' in that respect was almost optimum (i.e. 1:1).
All the other factors John C. mentions then obviously plug into the equation of getting the fastest processing possible and the best machine for the job, i.e. the overall efficiency.
I guess I was trying to say what Dan put succinctly: apples v apples is needed for the comparison.
mecej4



Joined: 31 Oct 2006
Posts: 739

Posted: Mon Apr 13, 2015 1:34 pm

The role of cache (shared and private) in a multiprocessor/multicore system makes the speed-up calculation more complex. In fact, there is such a thing as superlinear speedup: http://en.wikipedia.org/wiki/Speedup#Super-linear_speedup
JohnCampbell



Joined: 16 Feb 2006
Posts: 1834
Location: Sydney

Posted: Mon Apr 13, 2015 1:52 pm

John,

The more I try to understand OpenMP, the less I realise I know. Version 4.0 of the specification is now out, but all I use is !$OMP PARALLEL DO !

I think you have the efficiency assumptions correct.
The main assumption in assessing parallel efficiency is that there is a fixed amount of computation to be performed. If you have N threads, and divide the computation evenly, then it takes 1/Nth the time + thread management overheads.

The reduced efficiency reflects both the significance of the thread management overhead and the inability to split the computation into N equal parts. You would have to share the overhead.
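Those assumptions can be written down as a small model. A Python sketch, purely illustrative: `serial_frac` and `overhead` are made-up parameters for demonstration, not measurements from any solver discussed in this thread:

```python
def parallel_time(t1, n, serial_frac=0.0, overhead=0.0):
    """Amdahl-style model: run time on n threads when a fraction of the
    work cannot be split, plus a fixed thread-management overhead."""
    return t1 * (serial_frac + (1.0 - serial_frac) / n) + overhead

def efficiency(t1, tn, n):
    """Parallel efficiency: serial time over (threads x parallel time)."""
    return t1 / (n * tn)

# Perfectly divisible work with no overhead gives 1/Nth the time:
print(parallel_time(512.0, 4))        # 128.0 seconds, i.e. 100% efficiency

# A 5% sequential part plus 10 s of overhead erodes that quickly:
t4 = parallel_time(512.0, 4, serial_frac=0.05, overhead=10.0)
print(round(efficiency(512.0, t4, 4), 3))    # 0.814
```

Even a small sequential fraction caps the achievable speedup, which is why near-100% efficiency on 48 threads is such a strong claim.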

The key problem for a skyline solver is that direct equation reduction is a very sequential process, and not all of the computation can be done independently.
You could split it up efficiently using a DO loop where the calculations change independent parts of the matrix.
The single-thread solver is basically 3 DO loops, so any efficient parallel solver must act on the outer loop and find N balanced threads, which is the challenge.
The tests I have done going from 4 threads to 8 or 12 don't show the efficiency I would like, so 48 threads looks a long way off. As you increase the number of threads, other problems start to affect the run time, especially memory access bandwidth. More issues to understand! Sharing memory between 48 cores looks to be a significant problem I don't know how to solve.
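To make the "3 DO loops" concrete: here is a minimal single-thread sketch, in Python rather than Fortran, and for a dense symmetric matrix rather than true skyline storage, of a left-looking Cholesky factorisation written as three explicit loops. The outer column loop is the sequential one that a parallel solver has to break into balanced, independent pieces:

```python
import numpy as np

def cholesky_three_loops(A):
    """Left-looking Cholesky factor L (A = L L^T) as three explicit loops."""
    n = A.shape[0]
    L = np.tril(A).astype(float)
    for j in range(n):                 # outer loop over columns: sequential
        for k in range(j):             # apply updates from earlier columns
            for i in range(j, n):
                L[i, j] -= L[i, k] * L[j, k]
        L[j, j] = L[j, j] ** 0.5       # pivot
        for i in range(j + 1, n):      # scale the column below the pivot
            L[i, j] /= L[j, j]
    return L

# quick check on a small symmetric positive definite matrix
rng = np.random.default_rng(0)
M = rng.random((6, 6))
A = M @ M.T + 6 * np.eye(6)            # SPD by construction
L = cholesky_three_loops(A)
print(np.allclose(L @ L.T, A))         # True
```

In a real skyline solver only the entries inside each column's profile are stored and updated, but the loop nest, and hence the problem of parallelising the outer loop, is the same.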

Then again, a successful algorithm needs a new approach, which is what that paper claims. The problem with the efficiency claim is that their single-thread baseline is swamped by some other source of overhead, which discredits the efficiency figures.
I've read the article a few times, and I understand there are 50,000 equations with a fixed band of 7,000, using real*8 arithmetic. I ran this size of problem on my single-thread skyline solver, which does not have cache blocking, and it took 698 seconds on my i5-2300. This was compiled with gFortran 4.9.1 (from equation.com) and included vector instructions. I am not sure how a Dell PowerEdge server compares to the Acer i5-2300 I bought for $800, but I think 13,000 seconds is not a reasonable baseline performance. If they had shown 1/48th of 700 seconds, say 15 seconds, rather than 333 seconds, then I would be impressed.
There are people out there with better algorithms than me; I only hope they publish how they do it so the rest of us can learn.

John
DanRRight



Joined: 10 Mar 2008
Posts: 1645
Location: South Pole, Antarctica

Posted: Mon Apr 13, 2015 7:37 pm

John, The source code for the test must be exactly the same.
JohnCampbell



Joined: 16 Feb 2006
Posts: 1834
Location: Sydney

Posted: Tue Apr 14, 2015 1:50 am

Dan,

I am not sure what you mean by the same code.
I recompiled my cache-management version of the skyline solver without -fopenmp, and it ran in 512 seconds.
Cache vs no cache is 512 vs 698; this shows that for a 2.6 GB matrix, cache management in 6 MB chunks does provide an improvement (27%).

4 threads vs 1 thread is 159 vs 512; this shows 512/(159*4) = 81% efficiency with 4 threads. 81% is a lot less than the 108% reported in the PowerEdge test. (As mecej4 posted, you can get better than 100% if the resulting calculation has better cache usage, although this would be unusual.)
I need to do the 4-core/8-thread test, but I would estimate an efficiency of less than 70%. (Improving these efficiencies is my latest challenge, and probably why I have so much to say in this thread.)
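The arithmetic behind those two comparisons, using the timings quoted above (698 s and 512 s single-thread, 159 s on 4 threads), checks out:

```python
# cache blocking, single thread: 698 s without vs 512 s with 6 MB blocks
cache_gain = 1.0 - 512.0 / 698.0
print(f"{cache_gain:.0%}")            # 27%

# 4 threads vs 1 thread: 159 s vs 512 s
efficiency_4t = 512.0 / (4 * 159.0)
print(f"{efficiency_4t:.0%}")         # 81%
```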

Results for matrix multiplication would be different, where the calculations are very independent. For this calculation, much higher efficiency can be achieved, although the cache management strategy can be more difficult and varies depending on the shape of the matrices.

Dan, I am sure your next question will be when will FTN95_64 support OpenMP. ( Actually providing openmp with "Release x64" would be a help )

John
DanRRight



Joined: 10 Mar 2008
Posts: 1645
Location: South Pole, Antarctica

Posted: Tue Apr 14, 2015 8:02 am

John,
I do not know what is wrong here, but common sense tells me that you and the LAIPE guys used different test sources. Maybe they made a typo or something; maybe they call the half-width the matrix width, or didn't transpose the matrix, or some other silly reason, I do not know. You should ask them for the source code of the test program. They also provide the whole library source for gFortran, by the way. You have seen in our own tests here that the library works well on dense matrices. For a clean experiment you have to use exactly the same initial conditions.
mecej4



Joined: 31 Oct 2006
Posts: 739

Posted: Wed Mar 22, 2017 2:58 pm

I hope that readers do not mind if I refresh this old thread. DanRRight and I have exchanged a few private messages this week regarding using the Laipe library with FTN95-64.

Laipe is a library for solving simultaneous linear equations. It comes in two versions, Laipe1 and Laipe2. Laipe2 is included in the downloads of GFortran/GCC that are provided by Equation.com and are used by many people, including myself. I had been under the impression that the library was functional only if one had purchased a license ($1800), but it turns out that if one figures out a few issues that are un-/under-documented, Laipe2 is quite usable as is, with Gfortran. Once you build a DLL out of the library, you can use the 32- and 64-bit Laipe libraries from FTN95-32 and FTN95-64.

DanRRight has reported in the past about the performance of Laipe1 and other linear equation solvers (in this thread). He used, as far as I know, only Laipe1.

My testing shows that Laipe1 and Laipe2 can be useful, BUT: (i) the documentation is badly in need of revision, and (ii) I have not seen Laipe do better than the commercial competition, namely, MKL/Lapack, and several sparse equation solvers (MKL-Pardiso, Pardiso-project.org, UMFPack, SuperLU, etc.).

Here are two short test programs and the results that I obtained. The first is a test program to use with Laipe2.
Code:

program tlaipe
implicit none
integer :: i,j,neq
real*8,allocatable :: A(:,:),x(:)
INTEGER NoGood
Integer count_0, count_1, count_rate, count_max, ncore
!
call laipe$getce(ncore)
call laipe$use(ncore)

do neq=500,3000,250
   allocate(A(neq,neq),x(neq))
   call random_number(A)
   call random_number(x)
   Call system_clock(count_0, count_rate, count_max)
   CALL laipe$Decompose_DAG_8 (A,nEq,NoGood)
   IF (NoGood /= 0) STOP 'LAIPE: Cannot decompose'
   CALL laipe$Substitute_DAG_8 (A,nEq,X)
   Call system_clock(count_1, count_rate, count_max)
   Write (*, '(1x,A,i6,A,2x,F8.3,A)') 'nEqu = ',nEq,' ', &
        dble(count_1-count_0)/count_rate, ' s'
   deallocate(A,x)
end do
call laipe$done()
end program

The second is the equivalent program for Lapack/MKL:
Code:

program tlapack
implicit none
integer :: i,j,neq,nrhs=1,lda,ldb, info
real*8,allocatable :: A(:,:),b(:)
integer, allocatable :: piv(:)
Integer count_0, count_1, count_rate, count_max
!
do neq=500,3000,250
   lda=neq; ldb=neq
   allocate(A(neq,neq),b(neq),piv(neq))
   call random_number(A)
   call random_number(b)
   Call system_clock(count_0, count_rate, count_max)
   CALL dgesv (nEq,nrhs,A,ldA,piv, b, ldb, info)
   Call system_clock(count_1, count_rate, count_max)
   Write (*, '(1x,A,i6,A,2x,F8.3,A)') 'nEqu = ',nEq,' ', &
        dble(count_1-count_0)/count_rate, ' s'
   deallocate(A,b,piv)
end do
end program
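For readers without a Fortran toolchain, the MKL test can be approximated in Python: numpy.linalg.solve also drives LAPACK's *gesv path, so the loop below mirrors the Fortran program. Absolute timings will of course depend on the NumPy/BLAS build installed:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
for neq in range(500, 3001, 250):
    A = rng.random((neq, neq))
    b = rng.random(neq)
    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)         # LAPACK dgesv under the hood
    dt = time.perf_counter() - t0
    print(f" nEqu = {neq:6d}   {dt:8.3f} s")
```

A residual check such as np.allclose(A @ x, b) confirms each solve.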

The timing results (in seconds) for these programs, on a 2006 PC with an Athlon 4200+ CPU, running Windows XP-SP3:
Code:

   N        Lapack(MKL)    Laipe1        Laipe2
  ----      -----         ------         -----
   500      0.547          0.080         0.609
   750      0.062          0.285         0.156
  1000      0.141          0.673         0.359
  1250      0.266          1.392         0.703
  1500      0.453          2.334         1.187
  1750      0.688          4.010         2.000
  2000      1.032          7.113         2.828
  2250      1.500          8.644         3.953
  2500      1.938         11.507         5.422
  2750      2.562         16.068         7.156
  3000      3.344         28.613         9.234

I may update this table with results on Windows-10 and GCC-64.


Last edited by mecej4 on Thu Mar 23, 2017 1:15 am; edited 1 time in total
mecej4



Joined: 31 Oct 2006
Posts: 739

Posted: Wed Mar 22, 2017 5:59 pm

Here are the timings on a four-core i7-2720MQ laptop running Windows 10-64. The MKL and Laipe2 columns are for 64-bit EXEs, but the Laipe1 column is only for the 32-bit library that I have access to.

Code:

   N        Lapack(MKL)   Laipe1        Laipe2
  ----      -----         ------        -----
   500      0.078         0.031         0.031
   750      0.015         0.079         0.047
  1000      0.016         0.172         0.125
  1250      0.031         0.328         0.203
  1500      0.047         0.515         0.328
  1750      0.062         0.837         0.516
  2000      0.109         1.188         0.766
  2250      0.156         1.750         1.062
  2500      0.234         2.297         1.500
  2750      0.265         3.218         1.844
  3000      0.344         3.922         2.313
DanRRight



Joined: 10 Mar 2008
Posts: 1645
Location: South Pole, Antarctica

Posted: Wed Mar 22, 2017 6:23 pm

Mecej4,

Great job.
It seems the author of Laipe still hasn't updated it to use SSE/AVX.
1) But was parallel Laipe not faster than serial SuperLU and UMFPack on 4 cores? That sounds strange.
2) Also, how about your previous assessment that Laipe2 is always slower than Laipe1, while this test shows the opposite?
3) How about using 64-bit Laipe with 64-bit FTN95?
4) Do you know if MKL has a block matrix solver for a symmetric matrix like the one below? The arrow shows the current block width. In Laipe this is Decompose_VAG_8.

Page 1 of 4

 