Silverfrost Forums

Salford FTN95 run time performance

2 Dec 2010 9:54 #7193

Hi John

Whilst I am not going to argue with your specific problem, or for that matter try to give you Fortran code, I was struck by this:

Quoted from JohnCampbell

Eddie, direct solution of linear equations is not suited to distributed processing, while iterative solvers (such as gauss-seidel and conjugate gradient methods) are suited to distributed processing. I don't have the time to understand how they work !!

I get the general point that, at a high level, iterative solvers have pieces of work that can be done in a distributed manner and can therefore run faster.

But I was wondering, in the abstract, why you say you can't distribute direct equation solvers, at least in principle?

Consider:

3 = 4x + 5y
6 = 7x + 8y

An algebraic direct solution would go something like this: eliminate y by multiplying the top equation through by (8/5) and subtracting the bottom equation:

(8/5)3 - 6 = (8/5)4x - 7x

Solve this for x and then substitute the answer back in. Job done; I will cease teaching the sucking of eggs 😄
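(I said I wouldn't give Fortran code, but just to pin the arithmetic down, here is a tiny free-format sketch of that same elimination; the variable names are invented purely for illustration.)

    program tiny_elimination
      implicit none
      ! The system above:  3 = 4x + 5y  and  6 = 7x + 8y
      real :: a1 = 4.0, b1 = 5.0, c1 = 3.0
      real :: a2 = 7.0, b2 = 8.0, c2 = 6.0
      real :: f, x, y
      f = b2 / b1                     ! the (8/5) factor that matches the y coefficients
      x = (f*c1 - c2) / (f*a1 - a2)   ! (8/5)*3 - 6 = ((8/5)*4 - 7)*x
      y = (c1 - a1*x) / b1            ! substitute x back into the top equation
      print *, 'x =', x, '  y =', y   ! gives x = 2, y = -1
    end program tiny_elimination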

However, in principle the distributed bit could be done down at the low level of the multiplications.

In the above we have to multiply each term by (8/5); each term could be assigned to a separate thread (processor) and the answer returned, which in principle could be quicker than the single-threaded process.

Obviously maintaining many threads has its own overheads and pitfalls. I wouldn't do it for such a trivial problem, but maybe if the problem becomes large enough such things become more reasonable?
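(Again, not real code, but in outline the idea would look something like the sketch below. The !$OMP lines are OpenMP directives, which is an assumption about the compiler rather than anything FTN95-specific, and the routine itself is purely illustrative. For a 2x2 system the thread start-up would of course swamp the two multiplications; it only pays off when the rows are long.)

    subroutine eliminate_row(a, n, row_i, row_j, pivot_col)
      ! Subtract a multiple of row_i from row_j so that a(row_j, pivot_col) becomes zero.
      ! Each column update is independent, so the loop can be shared between threads.
      implicit none
      integer, intent(in)    :: n, row_i, row_j, pivot_col
      real,    intent(inout) :: a(n, n)
      real    :: f
      integer :: k
      f = a(row_j, pivot_col) / a(row_i, pivot_col)
    !$OMP PARALLEL DO
      do k = 1, n
        a(row_j, k) = a(row_j, k) - f * a(row_i, k)
      end do
    !$OMP END PARALLEL DO
    end subroutine eliminate_row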

I haven't really come to terms with multi-core computers yet, but that's the way the world is going, and I have seen articles about how this sort of lower-level distribution is the way to take advantage of it.

If you have 55 cores kicking about doing nothing, I think we need to start rethinking how we write code!

Anyway, just a thought.

Carl

2 Dec 2010 9:27 #7198

Carl,

Of course you can distribute your equation solution to several threads.

I have made many trials of this for equation solution. As long as your matrix does not fit into memory, you will just heat your CPU to 100% and slow your process down: there is too much overhead.

Erwin

3 Dec 2010 1:29 #7199

Carl,

Direct equation solution is suited to multi-processor or parallel processing within the same PC, while I understand that iterative solvers are more suited to distributed processing using multiple PCs. I'm a bit outside my field of expertise, but the main reason is that iterative solvers can partition the problem, sharing a boundary, while the direct solvers that I use require fast memory access to the full matrix.

I should point out that the matrix I solve is symmetric, sparse and ordered to be banded. It does not need to be positive definite. (Erwin, I think there are versions of variable-bandwidth solvers for non-symmetric equations, but I am not familiar with their solution stability. Variable-band solutions often have an extreme maximum bandwidth but a good average bandwidth, as in a spoked-wheel problem.)

The equation solver I use is based on Crout LU decomposition rather than Gauss. Crout basically changes the order of the do loops so that each coefficient in the matrix is updated once (a single write to disk).

There are 3 key stages in an equation solution:

  1. LU decomposition (forward reduction) of the matrix to a banded triangular form, which uses dot products in the Crout approach.
  2. LU decomposition (forward reduction) of the load vector(s), which uses dot products.
  3. Backward substitution of the load vectors, which uses vector subtraction.

I can certainly see how VEC_SUB could easily be parallelised across multiple processors; however, Dot_Product is slightly more difficult, as the sum from each processor must be accumulated once all processors have finished their part of the dot product. There must be a minimum size (say 20) before this is worthwhile.
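As a sketch of the sort of thing I mean (using OpenMP reduction notation, which is an assumption about the compiler rather than something I have working in FTN95, and with an invented function name), each processor keeps a private partial sum that is only combined at the end:

    function par_dot(b, c, n) result(s)
      ! Threaded dot product: each thread accumulates a private partial sum,
      ! and the REDUCTION clause adds the partial sums together after the loop.
      implicit none
      integer, intent(in)          :: n
      double precision, intent(in) :: b(n), c(n)
      double precision             :: s
      integer :: i
      s = 0.0d0
    !$OMP PARALLEL DO REDUCTION(+:s)
      do i = 1, n
        s = s + b(i)*c(i)
      end do
    !$OMP END PARALLEL DO
    end function par_dot

The cost of starting the threads and combining the sums is fixed, which is why there has to be a minimum vector length before it is worthwhile.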

My large problems involve a matrix of say 500,000 equations with an average bandwidth of 2,000, that is, a matrix of 1 billion coefficients (8 GB in size). This requires 1 billion dot products to update the coefficients, or the initiation of 1 billion multi-processor dot_product calculations. I am planning to test this approach shortly, but my expectation is that on a quad-processor machine this would require 4 billion dot_product 'threads' of average length 500 in the same 8 GB of shared memory. This is not something I am familiar with!

By contrast, I understand that distributed processing has the problem partitioned so that the machines share the partition-boundary solution, each iterating in its local field between updates of the boundary.

I'm not sure where all this will take me, but if I can solve the problem more quickly, it will let me do more iterations in direct-integration solutions or in a non-linear solution based on iterative linear solutions.

My problem with all this effort is that this approach is the way I have addressed the problem for 20 years, and I suspect that iterative solvers now provide better ways of solving a problem when you already have a near-correct answer.

We keep learning !!

John

3 Dec 2010 10:45 #7200

John/Erwin

Sorry, I overlooked the distinction between multi-PC and multi-core.

Right, I am on message now. The big thing about multi-PC is of course that the bottleneck becomes the communication between the machines. This is something I have experience with.

You have to allow for some of the machines falling over, make sure the network doesn't fail, and so on. All in all, hard work.

Erwin, I know hard drives are slow, and this is the challenge: making FTN95 play with more memory than the 4 GB limit, which is really a 32-bit limit.

Obviously step one is to buy lots of RAM, which you are able to do; you can certainly get a lot more than 8 GB into a cheapish Dell PC.

Two things that I would like to try to get round this limit (presuming that 64-bit FTN95 is not coming any time soon), though I haven't tried them, are:

  1. Using a RAM disk approach, which opens up access to all that extra RAM (but it would look in code like file writes?). I think this was suggested here somewhere a few years ago; there is a sketch of what the file access might look like after these notes.

  2. A client/server configuration would also be worth considering: you have a server process written in 64-bit C++, which allows full memory access and could store all the data to be manipulated. You then add some hooks to manipulate the data from the client (i.e. the Fortran code), and, with some dots to be filled in, Bob's your uncle.

(It wouldn't be as quick as normal memory access, because data would have to be copied in and out, but it would be much quicker than writing to disk. The C++ program would need to be very clever with threads if a multi-threaded client were accessing it...)
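On the RAM disk idea (point 1 above), the code really would just look like ordinary file writes. Here is a minimal sketch using an unformatted direct-access file; the r:\ path is only an example of where a RAM disk might be mounted, and the record layout is purely illustrative. The point is that nothing in the Fortran changes when the file is moved between a real disk and a RAM disk.

    program block_store_demo
      ! Matrix columns stored in an unformatted direct-access file. If the file
      ! lives on a RAM disk, the "disk" traffic is really memory-to-memory copying.
      implicit none
      integer, parameter :: n = 1000
      double precision   :: column(n)
      integer            :: j, reclen
      inquire (iolength = reclen) column
      open (unit=10, file='r:\scratch\matrix.blk', access='direct', &
            recl=reclen, form='unformatted', status='replace')
      do j = 1, 5
        column = dble(j)              ! stand-in for real coefficient data
        write (10, rec=j) column      ! one record per column
      end do
      read (10, rec=3) column         ! random access back into the file
      print *, 'column 3, first entry =', column(1)
      close (10, status='delete')
    end program block_store_demo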

Carl

19 Jan 2011 11:43 #7563

Carl,

It may have been me that suggested the RAM Disk solution. John doesn't like it much. He also doesn't like the 'get a faster hard drive' solution. Both of these would speed up his processing, but not hugely.

My later suggestion for using lots of cheap computers was not 'distributed processing' where they all work on aspects of the same analysis, but letting each one go off and solve the whole problem; for example, each one does one 'load case', using its own disk as 'backing store', and each computer builds its own stiffness matrices etc. No computer actually solves the problem any quicker, but after the first solution comes in, the others follow thick and fast. Supposing you have 10 problems to solve. Using 1 computer, you maybe want them solved 10x faster to get them all done in the same time. Using 10 computers you get all 10 answers in the time it takes to solve 1, and that begins to look (on average) 10x faster.

My further argument is that if you can't get the answer in 8hrs (the working day), there is no point in getting a solution much faster than in 24 hrs (ready for start of work tomorrow). If you can get 10 solutions ready for the start of work tomorrow, that is about equal to doing them one after the other aiming to complete by close of work today, i.e. an effective work rate not just 10x faster, but more like 30x !

This doesn't help you if you only need one solution, or if Problem n depends on the answers to Problem n-1. It doesn't help much if the heat from 10 computers makes your office like a sauna.

It does help to have a neighbourhood shop selling second-user systems cheaply!

In my solution, the computers don't communicate with each other; they just communicate with a server (or NAS) when they have finished....

Eddie

20 Jan 2011 5:53 #7564

Eddie,

My understanding of Finite Element / Finite Difference analysis is that there are two main classes of solution.

  1. Direct solution, by setting up a large system of equations and generating a huge matrix. These approaches exploit reordering and sparsity to reduce matrix size and solution times; frontal solvers and skyline solvers are two classic examples. (My examples go up to an 8 GB matrix.) A key calculation in the inner loop is a dot_product calculation. I'm not aware of this calculation being shared between PCs, but it can be multi-threaded on a multi-core machine. (I've been trying to do this lately, without success as yet.) I have worked in this area for many years.
  2. Iterative finite element, finite difference or field-solution solvers. In these solutions a large set of simultaneous equations is not assembled; instead, the displacement-field estimate is improved iteratively. I have not done a lot of work in this area, apart from using some packages, such as the Examine-3D boundary element solution. These huge field-modelling approaches are suitable for multi-PC approaches, and there are a large number of Linux cluster approaches documented.

Certainly for my FE direct-solver approach, if there are large amounts of disk I/O, then faster disk access is a big plus. However, my recent interest has been to do an in-memory solution to eliminate the disk I/O and to try to utilise multiple CPUs.

I've had some success, as I now have a 64-bit version running. I've managed to apply 'vectorisation', which utilises the new graphics vector instructions in the CPU and which, as I understand it, runs 4 x real*4 or 2 x real*8 calculations in parallel. This has provided some 30% to 50% improvement in run times. It is the first instruction-set change in a long time that actually does something for my type of calculations; I've never observed any significant benefit from /P6, SSE/SSE2(?) or similar options. (Paul, vector instructions are worth investigating, even if only in Dot_Product, other vector-syntax instructions or PURE loops.)
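To show the shape of loop I mean (a sketch only; the function name is invented, and whether any particular compiler actually packs the pairs into SSE registers is up to the compiler):

    function dot8(b, c, n) result(s)
      ! Two independent partial sums break the dependence on a single accumulator,
      ! giving a vectorising compiler a chance to process pairs of real*8 values.
      implicit none
      integer, intent(in)          :: n
      double precision, intent(in) :: b(n), c(n)
      double precision             :: s, s1, s2
      integer :: i
      s1 = 0.0d0
      s2 = 0.0d0
      do i = 1, n - 1, 2
        s1 = s1 + b(i)  *c(i)
        s2 = s2 + b(i+1)*c(i+1)
      end do
      s = s1 + s2
      if (mod(n, 2) /= 0) s = s + b(n)*c(n)   ! pick up the odd element, if any
    end function dot8

(The answer can differ in the last bit or two from a strictly sequential sum, which is the price of any reordering.)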

Multi-threading is more elusive for me, as I have yet to fully understand it. There is a lot of new jargon to cope with here. My attempts to date have not worked!

I've recently run a model with a 7.7 GB stiffness matrix stored in (virtual) memory. Unfortunately my IT dept. has provided me with only 6 GB of physical memory, and my out-of-core 32-bit solver does a much better job of partitioning the matrix reduction than the Windows paging system does. There is more memory on order!

Running 'out of core' 32-bit programs on a 64-bit operating system also provides a significant improvement for moderately sized problems, as the extra memory provides much better disk I/O buffering than a 32-bit OS can.

Eddie, my apologies, but multiple PCs are not a solution option for me. 64-bit and more memory is where I am trying to go, rather than higher-speed (and higher-cost) disks.

There is still a lot to learn.

John

20 Jan 2011 9:44 #7565

Hi John,

I work in a University that believes the newest and highest performing computers should go to typists! I run my own stuff in a shed at the bottom of the garden, build my own computers with components bought with my own money, and therefore have a good idea what does what inside the tin box. For you to have an IT department means that you aren't a 'shed dweller' or 'SoHo man' like me - I assumed for some reason that you were.

When I was doing and programming FE, I never touched the iterative solvers, but paradoxically, when I did FD, I never did it by direct solution!

If FTN95 optimises only as far as it says it does, i.e. up to P6, then none of the latest SSE stuff is touched at all. There's a missed opportunity, and no wonder FTN95 gets outclassed on speed in the Polyhedron benchmarks. The shoe might be on the other foot if Polyhedron published compilation benchmarks.

My understanding of the dot_product calculation is that in practice it comes in two flavours: the 3x3 version for coordinate transformation that might get done zillions of times, and the row x column version that gets done less often, but has many more elements. All sorts of things come into play for execution speed: in the first of these the overhead of calling a subprogram all those times might be a factor, whereas in the latter (which I assume that you are doing), that overhead may be less of a factor than the time it takes to post results back to the accumulator, e.g. in:

      A = 0.0
      DO 10 I = 1, N
         A = A + B(I)*C(I)
   10 CONTINUE

i.e. the time to continually write to A, which is what Startz showed all those years ago by skipping that step and leaving the intermediate result on the 8087 stack. Once one has got rid of the gross overheads, one can capitalize on the higher speeds of the multiply instructions in the SSE'n' operations. Presumably the Intel compiler makes use of the new generations of CPU facilities. If so, the gains aren't all that big, are they (see the Polyhedron benchmarks again)?

Eddie

20 Jan 2011 10:40 #7569

This does raise some questions in my mind about your problem.

I entirely agree with the 8-hours thing. I would actually go further: in my experience of large-scale transport models, models taking longer than about an hour to run actually take about a week to run in total. The reason is that the models are so hard to set up correctly that it takes about 3 attempts to get the input data right. So if the run time is about an hour or so, a working day can produce a successful run. Once the run time starts hitting the overnight threshold, a successful run takes around a week, which tends to lead to unanalysed results and unhappy clients.

Anyway, John, have you done some analysis of which routines use the most memory and run time? When I have done this in the past I have always found I was wrong about where the problem actually was.

The comments about the compiler optimisation are interesting. Have you tried compiling to .NET? I have found that in some instances this can actually execute faster than native code (because the CLR runtime, which handles the final compilation step, targets the actual machine the code is executing on rather than a P6).

Some older code won't work, and if you are using third-party native DLL libraries then it just isn't worth the effort, but if you can just compile and it works, it's worth playing with (especially since it starts to make better use of 64-bit processors, etc.).

Carl

20 Jan 2011 12:23 #7575

Eddie,

I don't agree with your 8-hour analogy. I can rarely think of 8 problems to solve at one time. Usually I solve 1, which suggests 1 or 2 more, and then I gradually build to a solution. As for improving solution speed, faster solutions allow more solutions to be run sequentially, i.e. better speed for the 'sledge-hammer' iterative non-linear approach. Your comment about FE and FD is my point about the relative preference for direct and iterative solvers in each.

One thing I've always wanted to explore is using a direct solver for a first-order FE solution, then an iterative solver for the non-linear iterative steps, although it does not take much for the previous iteration's solution to stop being a near answer to the next iteration, which is what fast convergence needs. There is a lot that could be done to reduce the number of iterations with smarter prediction of the solution, which I guess is the basis of the new generation of iterative solvers, few of which are public domain. These days I don't have the time to get up to speed on iterative solvers. I let the IT dept keep updating me with better and faster machines, and see what I can do with the new hardware. Unfortunately, over the last 20 years, most of the speed improvement has come from hardware and not from changes to my program algorithms. My program changes have been more about adapting to the changing hardware.

A good example of this is my hidden-line-removal graphics for FE models. I divide the model into faces and use XYZ for every pixel on each face to get the colour of every pixel in a virtual screen, before dumping the result to the real screen. I have some uses with up to 10 million faces and am always amazed how quickly it can be processed. It's a simple algorithm which is improved by being able to divide the model surface into so many elements and process them so quickly. 20 years ago it was not an option; now I'm painting 20 frames a second (for smaller but real problems).
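A toy outline of the idea looks like this (the rectangular faces, depths and colours below are invented purely for the sketch; the real code interpolates XYZ over arbitrary faces and uses proper shading rather than a colour index):

    program zbuffer_demo
      ! Toy software z-buffer: paint three rectangular "faces" at different depths
      ! into a small virtual screen; the nearest face wins each pixel, and the
      ! finished virtual screen is dumped in one go at the end.
      implicit none
      integer, parameter :: nx = 20, ny = 10, nface = 3
      real    :: zbuf(nx, ny)
      integer :: screen(nx, ny)
      integer :: ext(4, nface)     ! ix1, ix2, iy1, iy2 for each face
      real    :: depth(nface)
      integer :: colour(nface)
      integer :: iface, ix, iy

      ext    = reshape( (/ 2,12,2,8,  6,18,4,10,  1,20,6,7 /), (/ 4, nface /) )
      depth  = (/ 5.0, 3.0, 8.0 /)
      colour = (/ 1, 2, 3 /)

      zbuf   = huge(1.0)           ! nothing seen yet at any pixel
      screen = 0                   ! background colour

      do iface = 1, nface
        do iy = ext(3, iface), ext(4, iface)
          do ix = ext(1, iface), ext(2, iface)
            if (depth(iface) < zbuf(ix, iy)) then   ! nearer than what is already there
              zbuf(ix, iy)   = depth(iface)
              screen(ix, iy) = colour(iface)
            end if
          end do
        end do
      end do

      do iy = ny, 1, -1            ! crude dump of the virtual screen
        print '(20i1)', screen(:, iy)
      end do
    end program zbuffer_demo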

My recent attempts at multi-threaded (parallel) runs have been very interesting. I actually got it to work today, but the run times were not impressive. I'm told the minimum vector size for a multi-threaded dot_product to be effective is of the order of 10,000, whereas I need it to work at 100 to 1,000. Utilising the graphics vector instruction set looks easier to apply. Salford should consider this, say as a /V4 optimisation.

We can only keep trying. Eddie and Carl, thanks for your thoughts.

John

22 Jan 2011 8:26 #7608

John,

  1. BTW, this technique you use in graphics, called z-buffering, is done automatically in OpenGL and is very fast.

  2. Are you sure slow dot-product speed is the reason? Even if it pages to the hard disk, so does the matrix solver. CPU time for a dot product scales as n^2 with matrix size n, while CPU time for matrix solvers scales as n^3. For simplicity I assume n is the size of a square block submatrix; the total time for a block matrix is approximately n**3 times the number of blocks. I am sure you have to try the parallel solvers Decompose_VAG_D and Substitute_VAG_D (or even the single-precision Decompose_VAG_S) for block matrices if the matrix fits into the 32-bit limit (great if you can use much more than that on a 64-bit OS, but my parallel libraries from Equation dot com are for a 32-bit OS; 64-bit Intel IVF also has a 2 GB limit with static arrays, while allocatable ones can be much larger according to their documentation, but this probably needs recompilation of the libraries for 64 bits).

  3. Yes, I also use a RAM disk for Windows; it works fine for many different purposes. 8-16 GB of DDR3 SDRAM costs $100-200 today. That solves the problem of paging to the hard disk on a 64-bit OS. Or just fill a whole server motherboard with a lot of cheap memory.

23 Jan 2011 3:55 #7611

Hi John,

It just goes to show that there is a huge diversity of needs, and, really, faster computers would address all of them. It isn't just what we do, it's also how we like to work.

My thoughts were driven by a problem I had to solve in the 1970’s – it was a component that suffered tension fractures in service, and it was clear that the angle between 2 faces on this component and the radius between them affected the stress concentration. The problem was to do different combinations to find the optimum. The 32k word computer (ICL 1900 with an 8 Mb disk) I was using would allow me a maximum of 60 isoparametric elements, and I was using just about the maximum number of nodes. I could get one run overnight. Unlike Carl, I could debug the data errors in my deck of cards on a second (but slower) 32k word computer (Elliot 4120 with tapes) in a few minutes – I just couldn’t run the analysis (in this case because the OS would stop every 30 minutes to ask if I wanted to continue, and I couldn’t face that for several days; plus BACKSPACE was unreliable, as it stretched the tapes!). I was allowed one (overnight) run every day on the 1900. I knew I wanted at least 6 combinations, and it took over a week to get them done. I wanted enough “stress concentration” values, one from each run, to draw a graph because I didn’t much trust the absolute values for the individual results but I was happy to accept that there should be an underlying trend from which I could “pick a winner”. The real proof came when the components were manufactured (it worked!).

That was a case where multiple computers would have shortened the elapsed time. I agree that sometimes one has to lurch towards the solution step at a time, and then multiple computers don’t help.

The next time I did a similar job, not only was it on a VAX that solved it in a few minutes, but I could look at the results on a Tektronix screen then and there instead of drawing it all out on a slow drum plotter.

I was intrigued by 'Dr Tip' Carl's 3 failed runs before he got one that worked. I found a long time ago that I had 3 kinds of errors: punch-card mistakes (today's typos); analysing the wrong problem; and giving the program a set of data it would choke on. I never completely avoided the first until I moved away from creating data files in a text editor, but I avoided the second by making sure I had a pictorial representation of whatever I was analysing well before embarking on a lengthy calculation process. Experience with Clearwin-style interfaces and Windows generally makes both of these much simpler, and avoids the halting run-edit-run that I grew up on. The last category sends you back into programming, sometimes for days ...

I now run something on my PC (not FE) that I first developed on the Elliot 4120 in 1973. (I had an Algol-60 version before that, but it wouldn’t run on any other computer – that’s the beauty of Fortran!). The run times now are simply not measurable in useful units (i.e. <<1sec). However, in 1973, the run time started out as 3 hours. I got that down to 45 minutes by dint of careful re-coding. However the run times dropped to less than 6 sec on a CDC6400, which taught me a lesson about the value of fast hardware. I was astonished later to discover that a very basic early PC was getting on for as fast as a VAX (or at least the percentage of it one was allowed), and now they are significantly faster than the CDC. Perhaps it is only a matter of waiting for the right machine to come along. The best part of 40 years might do it, if you can wait!

I hadn't included the time-wasting that comes from wrong data, but I was pleased that Carl noticed a pattern related to the one I theorized about, but in his own work. Time passes in quanta, and they are of variable size and variable impact. As a result, getting 10% faster (at whatever cost) is excellent for some users and pointless for others. Ten times faster is a desirable goal for all.

Dan mentions Z-buffering in OpenGL being a standard facility. I'm sure it is, although I've never had the need, nor been able to fathom out the details of OpenGL. Several users on this forum swear by it. However, John, you did say that wasn't your problem, and even your code for that was fast. I agree with Dan that it is very easy to misunderstand where the time is taken. It never occurred to me in the past that it took so long (relatively) to work the line printer, and while I printed as I progressed, the program would always be slow. However, you know that the solution is faster, as we used to say, "in-core", and that is a cheap gain if all it takes is being able to address RAM in a machine you've already got, without changing your Fortran code (much).

RAMDISK is a way of using the code you’ve got on the machine you’ve got, substituting the faster RAM that you aren’t using for the slower hard disk. For only the cost of the RAMDISK software, that also seems to me to be a cheap gain, whatever it is.

Eddie

23 Jan 2011 8:20 #7612

'Dan mentions Z-buffering in OpenGL being a standard facility'

  • The beauty of that is that it is all done in hardware by the graphics card, fast enough for dynamic displays but requiring only relatively minimal programming effort. Something I really appreciate since, like Eddie, I also wrote my first graphics program using a line printer and an ICL 1900 before discovering the 'wonders' of the Tektronix terminal connected to a VAX!

23 Jan 2011 11:41 #7613

I've got work deadlines, but two quick replies,

Dan, with skyline solvers the run times are not related to n^3 but to about n^2.3. However, storage is related to about n^2, which can quickly eat into the 64-bit size increase.

Eddie, your tale of the 70's with isoparametric elements interests me. Were they 8-node? This was about the time we were learning about modelling crack singularities by moving the mid-side nodes. Without graphics screens, understanding the displacement fields in the elements certainly would have been challenging. Even when I did have a Tektronix, being able to correctly display the distorted fields wasn't much easier. There were a lot of bad rules about distorted elements, based on misunderstandings of the internal element fields.

I have (had) a standard benchmark run that I used for a long time. It first ran in 6 days on a Pr1me 400 with a fixed-band solver. I got it down to 30 minutes on a Pr1me 750 with a skyline solver, with 95% of the time being the equation solution. It now runs in less than 0.5 seconds, with the equation solver only 0.03 seconds and result printing about 80%. Times have changed! (Couldn't resist a third comment.)

John

24 Jan 2011 8:26 #7616

As so often, I got chopped. I don't need to say that John H swears by it (OpenGL), but it wasn't John C's problem. I can't fathom the instructions for OpenGL, notwithstanding that I have a £1500 Quadro card in my PC that no doubt would run it beautifully.

My 60-element model was run linear elastic with 8-node isoparametric elements with 2x2 Gaussian integration, and I was looking for relative stress concentrations, not modelling plasticity or cracks (or even 'no-tension' which would have been OK for the material involved). Eventually, the result was confirmed by manufacturing test specimens. All the graphics were done on a single-pen drum plotter! I modelled the whole component initially, then a part of it in more detail.

Eddie
