
Salford FTN95 run time performance
 
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Tue Nov 30, 2010 9:01 am

Eddie and Erwin,

Thank you very much for your responses. Your points are spot on.

I agree that the quest for faster run times is not entirely logical.
It is certainly blinkered: for my programs, once the equation matrix no longer fits into memory, disk I/O dominates the run time.
However, problems that require disk I/O are best avoided; when the matrix is fully in memory the processor speed is the limit and iterations are much faster. Hence the need for 64-bit.

Eddie, direct solution of linear equations is not well suited to distributed processing, while iterative solvers (such as Gauss-Seidel and conjugate gradient methods) are. I don't have the time to understand how they work!

Erwin, my equation solver is a skyline solver, which is better suited to variable-bandwidth problems. It dates back to the late 70's and draws on papers by Powell and Taylor from UCB. You basically have a profile array holding the bandwidth of each equation of a symmetric banded matrix; the matrix is stored in a vector, with the profile array pointing to the position of each diagonal. It can never do worse than a fixed-band solver, and I have always preferred its simplicity to the frontal solver.
The associated bandwidth optimisers are based on papers by Hoit and Sloan (80's).
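
To make the addressing concrete, here is a minimal sketch of the profile storage just described; the names (GETKIJ, A, IDIAG) are made up for the example and are not from my solver:

Code:
C     Sketch of profile (skyline) storage for the upper triangle of
C     a symmetric matrix.  IDIAG(J) points at the diagonal of column
C     J inside the packed vector A, so the stored height of column J
C     is IDIAG(J) - IDIAG(J-1).  Entries above the skyline are zero.
      REAL*8 FUNCTION GETKIJ (A, IDIAG, I, J)
      REAL*8  A(*)
      INTEGER IDIAG(*), I, J, II, JJ, IH
      II = MIN(I,J)
      JJ = MAX(I,J)
      IH = JJ - II
      IF (JJ.GT.1 .AND. IH.GE.IDIAG(JJ)-IDIAG(JJ-1)) THEN
         GETKIJ = 0.0D0
      ELSE
         GETKIJ = A(IDIAG(JJ) - IH)
      END IF
      END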

For the linear equation solution, the matrix reduction is based on dot_product.
The "load vector" is forward reduced by dot_product, but backward substitution uses vec_sub, which is where I see all the processor unpredictability.
It would be interesting to see how equation.com could be used to apply multiple processors to dot_product; it could give up to a 2x or 4x improvement. Dan may know about this, although I'm not sure of the cost of the software. I am going to see if I can produce a multi-processor version of dot_product to run on dual- and quad-core machines.
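
For what it's worth, a multi-processor dot_product is just a set of independent partial sums that are added at the end. The sketch below expresses that with OpenMP directives, which FTN95 treats as comments, so it is an illustration of the idea rather than working FTN95 code:

Code:
C     Sketch only: each thread forms a partial sum over its share of
C     the vectors and the REDUCTION clause adds the parts together.
C     The !$OMP lines are OpenMP directives (ignored by FTN95).
      REAL*8 FUNCTION PARDOT (B, C, N)
      INTEGER N, I
      REAL*8  B(N), C(N), S
      S = 0.0D0
!$OMP PARALLEL DO REDUCTION(+:S)
      DO I = 1, N
         S = S + B(I)*C(I)
      END DO
!$OMP END PARALLEL DO
      PARDOT = S
      END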

The thread running through my latest posts is that the way modern processors optimise code such as dot_product makes the old optimisation approaches based on operation counts and assembler less significant; I think I am right in saying that the processor itself reorders and overlaps calculations to improve performance. My latest test on the Xeon may have indicated this.

We'll keep trying !!
Sebastian



Joined: 20 Feb 2008
Posts: 177

Posted: Tue Nov 30, 2010 9:03 am

Quote:
I have removed the paging file from the system, but I am not sure whether that is sufficient.

Disabling the swap files is enough.


Quote:
Some time ago I saw a Microsoft Windows program that visualizes all process activity graphically, like an enhanced Task Manager.

Probably Process Explorer and more from
http://technet.microsoft.com/de-de/sysinternals/default.aspx
DanRRight



Joined: 10 Mar 2008
Posts: 2818
Location: South Pole, Antarctica

Posted: Wed Dec 01, 2010 3:57 am

Is anyone interested in such libraries? I also have some simple test benches using these LIB files with FTN95. Let me know by PM.

My interest here is the following. I have two such libraries: one was built for Microsoft Fortran around 10 years ago (maybe I even keep versions for some other compilers somewhere) and it is a bit slower. The latest one, about a year old, is faster, but it does not always work with FTN95, since it was built with Intel Fortran, which is not fully compatible with FTN95. For example, the dense matrix solvers work fine, but some sparse ones don't, etc.

But that is easy to fix for anyone who knows Intel Visual Fortran and how to build a DLL with it. I even have detailed instructions personally from Intel's Steve Lionel, but I do not have the latest version of IVF and do not have time to devote to all that; the older version of the library works fine and delivers a speedup I was almost happy with. Using the newer library/DLL would be nice, though, since it is built for modern processors.

If anyone can transform this library into a DLL, that would be a great help to me. The author refused to make DLLs himself for unknown reasons, although it would be a great universal solution - his one single DLL would work across all Windows compilers, instead of five different LIBs, one per compiler. But you all know how hard it is to convince people to change something substantial.

Also, on the other matter in this thread, I agree that 64-bit FTN95 is urgently needed. But it seems that does not depend on us much if we cannot vote with our money, and the developers do not clarify the question. I am really puzzled about where the difficulty lies in moving to 64 bits; many compilers have already done it.
EKruck



Joined: 09 Jan 2010
Posts: 224
Location: Aalen, Germany

Posted: Wed Dec 01, 2010 4:13 pm

Thanks, Sebastian.

There are nice tools on that page.

But I did not find out how to disable the swap file from within a Fortran program. Is that possible?

Erwin
Sebastian



Joined: 20 Feb 2008
Posts: 177

Posted: Wed Dec 01, 2010 9:14 pm

I'm not aware that this is possible from an application; it can only be done through the operating system configuration (which requires a reboot).

By the way, be sure to check the English (en-us) Sysinternals page as well, since it links to even more programs.
DrTip



Joined: 01 Aug 2006
Posts: 74
Location: Manchester

Posted: Thu Dec 02, 2010 10:54 am

Hi John

Whilst I am not going to argue with your specific problem, or for that matter try to give you Fortran code, I was struck by this:

JohnCampbell wrote:


Eddie, direct solution of linear equations is not suited to distributed processing, while iterative solvers (such as gauss-seidel and conjugate gradient methods) are suited to distributed processing. I don't have the time to understand how they work !!



I get the general point that, at a high level, iterative solvers have pieces of work that can be done in a distributed manner and therefore run faster.

But I was wondering, in the abstract, why you say you can't distribute direct equation solvers, at least in principle?

consider

3 = 4*x + 5*y
6 = 7*x + 8*y

An algebraic direct solution would eliminate y by multiplying the top equation through by (8/5) and subtracting the bottom equation:

(8/5)*3 - 6 = (8/5)*4*x - 7*x


Solve this for x and then substitute the answer back in. Job done; I will cease to teach the sucking of eggs.
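
Just to tie the arithmetic down, the same elimination in a few lines of Fortran (a throwaway check, nothing more):

Code:
C     The 2x2 example worked exactly as above: eliminate y with the
C     (8/5) multiplier, solve for x, then substitute back for y.
C     Prints x = 2.0 and y = -1.0, with 7*x + 8*y = 6 as a check.
      PROGRAM ELIM2
      REAL*8 X, Y
      X = ((8.0D0/5.0D0)*3.0D0 - 6.0D0) /
     &    ((8.0D0/5.0D0)*4.0D0 - 7.0D0)
      Y = (3.0D0 - 4.0D0*X) / 5.0D0
      WRITE(*,*) 'x =', X, '  y =', Y
      WRITE(*,*) 'check: 7*x + 8*y =', 7.0D0*X + 8.0D0*Y
      END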

However, in principle the distributed bit could be done down at the low level of the multiplications.

In the above we have to multiply each term by (8/5); each term could be assigned a separate thread (processor) and the answer returned.

In principle that could be quicker than the single-thread process.

Obviously maintaining many threads has its own overheads and pitfalls. I wouldn't do it for such a trivial problem, but maybe if the problem becomes large enough such things become more reasonable?

I haven't really come to terms with multi-core computers yet, but that's the way the world is going, and I have seen articles about how this sort of lower-level distribution is the way to take advantage of it.

If you have 55 cores kicking about doing nothing, I think we need to start rethinking how we write code!


Anyway, just a thought.

Carl
EKruck



Joined: 09 Jan 2010
Posts: 224
Location: Aalen, Germany

Posted: Thu Dec 02, 2010 10:27 pm

Carl,

Of course you can distribute your equation solution over several threads.

I have made many trials of this for equation solution. As long as your matrix does not fit into memory, you will just heat your CPU at 100% and slow your process down. There is too much overhead.

Erwin
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Fri Dec 03, 2010 2:29 am

Carl,

Direct equation solution is suited to multi-processor or parallel processing within the same PC, while I understand that iterative solvers are better suited to distributed processing across multiple PCs. I'm a bit outside my field of expertise, but the main reason is that iterative solvers can partition the problem, sharing only a boundary, while the direct solvers I use require fast memory access to the full matrix.
I should point out that the matrix I solve is symmetric, sparse and ordered to be banded; it does not need to be positive definite. (Erwin, I think there are versions of variable-bandwidth solvers for non-symmetric equations, but I am not familiar with their solution stability. Variable-band solutions often have an extreme maximum bandwidth but a good average bandwidth, as in a spoked-wheel problem.)
The equation solver I use is based on Crout LU decomposition rather than Gauss. Crout basically changes the order of the DO loops so that each coefficient in the matrix is updated once (a single write to disk).

There are 3 key stages of an equation solution.
1) LU decomposition (forward reduction) of the matrix to a banded triangular form, which uses dot_product in the Crout approach.
2) Forward reduction of the load vector(s), which also uses dot_product.
3) Backward substitution of the load vectors, which uses vector subtraction.

I can certainly see how VEC_SUB could easily be parallelised across multiple processors; dot_product is slightly more difficult, as the partial sum from each processor must be accumulated once all processors have finished their part of the dot_product. There must be a minimum size (say 20) before this is worthwhile.
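
For reference, the vector subtraction in stage 3 has this simple shape (an illustrative routine with made-up names, not my actual VEC_SUB); every element of the update is independent, which is why it splits across processors so easily:

Code:
C     Shape of the back-substitution update: once X(J) is known, its
C     contribution U(I)*XJ is removed from the remaining entries of
C     the right-hand side.  Each element is independent.
      SUBROUTINE VECSUB (X, U, XJ, M)
      INTEGER M, I
      REAL*8  X(M), U(M), XJ
      DO I = 1, M
         X(I) = X(I) - U(I)*XJ
      END DO
      END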

My large problems are a matrix of, say, 500,000 equations with an average bandwidth of 2,000, that is, a matrix of 1 billion coefficients (8 GB in size). This requires 1 billion dot_products to update the coefficients, or the initiation of 1 billion multi-processor dot_product calculations. I plan to test this approach shortly, but my expectation is that on a quad-core machine this would mean 4 billion dot_product "threads" of average length 500 working in the same 8 GB of shared memory. This is not something I am familiar with!

By contrast, I understand that distributed processing partitions the problem so that the parts share only the partition-boundary solution, each iterating in its local field between updates of the boundary.

I'm not sure where all this will take me, but if I can solve the problem more quickly, it will allow more iterations in direct integration solutions or in a non-linear solution based on repeated linear solutions.

My problem with all this effort is that this approach is the way I have addressed the problem for 20 years, and I suspect iterative solvers now provide better ways of solving a problem when you already have a nearly correct answer.

We keep learning !!

John
DrTip



Joined: 01 Aug 2006
Posts: 74
Location: Manchester

Posted: Fri Dec 03, 2010 11:45 am

John/Erwin

Sorry, I overlooked the distinction between multi-PC and multi-core.

Right, I am on message now. The big thing about multi-PC is, of course, that the bottleneck becomes the communication between the machines. This is something I have experience with.

You have to allow for some of the machines falling over, make sure the network doesn't fail, and so on. All in all, hard work.

Erwin, I know hard drives are slow, and the challenge is making FTN95 work with more memory than the 4 GB limit, which is really a 32-bit limit.

Obviously step one is to buy lots of RAM, which you are able to do; you can certainly get a lot more than 8 GB into a cheapish Dell PC.

Two things I would like to try to get round this limit, though I haven't tried them yet (presuming that 64-bit FTN95 is not coming any time soon):

1. Use a RAM disk, which opens up access to all that extra RAM (but would look like file writes in the code?). I think this was suggested here somewhere a few years ago.

2. A client/server configuration would also be worth considering: a server process written in 64-bit C++, which allows full memory access, could store all the data to be manipulated. You then add some hooks to manipulate the data from the client (i.e. the Fortran code) and, with some dots to be filled in, Bob's your uncle.

(It wouldn't be as quick as normal memory access, because data would have to be copied in and out, but it would be much quicker than writing to disk. The C++ program would need to be very clever with threads if a multi-threaded client were accessing it...)




Carl
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Thu Jan 20, 2011 12:43 am

Carl,

It may have been me that suggested the RAM Disk solution. John doesn't like it much. He also doesn't like the "get a faster hard drive" solution. Both of these would speed up his processing, but not hugely.

My later suggestion for using lots of cheap computers was not "distributed processing", where they all work on aspects of the same analysis, but letting each one go off and solve the whole problem: for example, each one does one "load case", using its own disk as "backing store", and each computer builds its own stiffness matrices etc. No computer actually solves the problem any quicker, but after the first solution comes in, the others follow thick and fast. Suppose you have 10 problems to solve. Using 1 computer, you would want them solved 10x faster to get them all done in the same time. Using 10 computers you get all 10 answers in the time it takes to solve 1, and that begins to look (on average) 10x faster.

My further argument is that if you can't get the answer in 8 hours (the working day), there is no point in getting a solution much faster than in 24 hours (ready for the start of work tomorrow). If you can get 10 solutions ready for the start of work tomorrow, that is about equal to doing them one after the other and completing by close of work today, i.e. an effective work rate not just 10x faster but more like 30x!

This doesn't help you if you only need one solution, or if Problem n depends on the answers to Problem n-1. It doesn't help much if the heat from 10 computers makes your office like a sauna.

It does help to have a neighbourhood shop selling second-user systems cheaply!

In my solution, the computers don't communicate with each other; they just report to a server (or NAS) when they have finished...

Eddie
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Thu Jan 20, 2011 6:53 am

Eddie,

My understanding of finite element / finite difference analysis is that there are two main classes of solution.
1) Direct solution, by setting up a large system of equations and generating a huge matrix. These approaches exploit reordering and sparsity to reduce matrix size and solution times; frontal solvers and skyline solvers are two classic examples. (My examples go up to an 8 GB matrix.) The key calculation in the inner loop is a dot_product. I'm not aware of this calculation being shared between PCs, but it can be multi-threaded on a multi-core machine (I've been trying to do this lately, without success as yet). I have worked in this area for many years.
2) Iterative finite element, finite difference or field solvers. In these solutions a large set of simultaneous equations is not assembled; instead the displacement field estimate is improved iteratively. I have not done a lot of work in this area, apart from using some packages, such as the Examine-3D boundary element solution. These huge field-modelling approaches are well suited to multi-PC use, and there are a large number of Linux cluster implementations documented.

Certainly for my FE direct solver approach, if there are large amounts of disk I/O, then faster disk access is a big plus.
However, my recent interest has been an in-memory solution, to eliminate the disk I/O and to try to utilise multiple CPUs.

I've had some success, as I now have a 64-bit version running.
I've managed to apply "vectorisation", which uses the newer SSE "graphics" vector instructions in the CPU and, as I understand it, processes 4 x real*4 or 2 x real*8 calculations in parallel. This has provided some 30% to 50% improvement in run times, and it is the first instruction-set change in a long time that actually does something for my type of calculations. I've never observed any significant benefit from /P6, SSE2/SSE(?) or similar options. (Paul, vector instructions are worth investigating, even if only in dot_product, other vector-syntax operations or PURE loops.)
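
To illustrate, these are the kinds of statements that map naturally onto such vector instructions, because they expose a straight run of independent multiply-adds to the compiler. A sketch only, with made-up names; I am not assuming anything about FTN95's own option names:

Code:
C     Whole-array syntax and the DOT_PRODUCT intrinsic are the
C     natural candidates for SSE-style vectorisation.
      SUBROUTINE VECSTEP (X, U, XJ, B, C, S, M, N)
      INTEGER M, N
      REAL*8  X(M), U(M), XJ, B(N), C(N), S
      X = X - U*XJ
      S = DOT_PRODUCT(B, C)
      END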

Multi-threading is more elusive for me, as I have yet to fully understand it. There is a lot of new jargon to cope with here. My attempts to date have not worked!

I've recently run a model with a 7.7 GB stiffness matrix stored in (virtual) memory. Unfortunately my IT dept. has provided me with only 6 GB of physical memory, and my out-of-core 32-bit solver does a much better job of partitioning the matrix reduction than the Windows paging system does. There is more memory on order!

Running "out of core" 32-bit programs on a 64-bit operating system also provides significant improvement for moderate sizes problems, as the extra memory provides much better disk I/O buffering, compared to a 32-bit OS.

Eddie, my apologies, but multiple PCs are not a solution option for me.
My present approach of 64-bit and more memory is where I am trying to go, rather than higher-speed (and higher-cost) disks.

There is still a lot to learn.

John
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Thu Jan 20, 2011 10:44 am

Hi John,

I work in a University that believes the newest and highest performing computers should go to typists! I run my own stuff in a shed at the bottom of the garden, build my own computers with components bought with my own money, and therefore have a good idea what does what inside the tin box. For you to have an IT department means that you aren't a "shed dweller" or "SoHo man" like me - I assumed for some reason that you were.

When I was doing and programming FE, I never touched the iterative solvers, but paradoxically, when I did FD, I never did it by direct solution!

Assuming that FTN95 optimises only as far as it says it does, i.e. up to P6, none of the latest SSE capability is touched at all. That is a missed opportunity, and no wonder FTN95 gets outclassed on speed in the Polyhedron benchmarks. The shoe might be on the other foot if Polyhedron published compilation-speed benchmarks.

My understanding of the dot_product calculation is that in practice it comes in two flavours: the 3x3 version for coordinate transformation that might get done zillions of times, and the row x column version that gets done less often, but has many more elements. All sorts of things come into play for execution speed: in the first of these the overhead of calling a subprogram all those times might be a factor, whereas in the latter (which I assume that you are doing), that overhead may be less of a factor than the time it takes to post results back to the accumulator, e.g. in:

Code:
      A = 0.0
      DO 10 I = 1, N
         A = A + B(I)*C(I)
   10 CONTINUE

i.e. the time to continually write to A, which is what Startz showed all those years ago by skipping that step and leaving the intermediate result on the 8087 stack. Once one has got rid of the gross overheads, one can capitalize on the higher speed of the multiply instructions in the SSE"n" operations. Presumably the Intel compiler makes use of the newer generations of CPU facility; if so, the gains aren't all that big, are they (see the Polyhedron benchmarks again)?
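
One common way of cutting that write-back cost is to keep several independent partial sums and only combine them at the end, so the multiplies can overlap instead of queueing behind a single accumulator. A sketch only, assuming for simplicity that N is a multiple of 4:

Code:
      A1 = 0.0
      A2 = 0.0
      A3 = 0.0
      A4 = 0.0
      DO 20 I = 1, N, 4
         A1 = A1 + B(I  )*C(I  )
         A2 = A2 + B(I+1)*C(I+1)
         A3 = A3 + B(I+2)*C(I+2)
         A4 = A4 + B(I+3)*C(I+3)
   20 CONTINUE
      A = A1 + A2 + A3 + A4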

Eddie
DrTip



Joined: 01 Aug 2006
Posts: 74
Location: Manchester

Posted: Thu Jan 20, 2011 11:40 am

This does raise some questions in my mind about your problem.

I entirely agree with the 8-hours thing. I would actually go further: in my experience of large-scale transport models, a model that takes longer than about an hour to run actually takes about a week to run in total. The reason is that the models are so hard to set up correctly that it takes about three attempts to get the input data right. So if the run time is about an hour or so, a working day can contain a successful run; once the run time starts hitting the overnight threshold, a successful run takes around a week, which tends to lead to unanalysed results and unhappy clients.

Anyway, John, have you done any analysis of which routines use the most memory and run time? When I have done this in the past, I have always found I was wrong about where the problem actually is.

The comments about compiler optimisation are interesting. Have you tried compiling to .NET? I have found that in some instances this can actually execute faster than native code (because the CLR, which handles the final compilation step, targets the actual machine the code is executing on rather than a P6).

Some older code won't work, and if you are using third-party native DLL libraries then it just isn't worth the effort, but if you can simply recompile and it just works, it's worth playing with (especially since it starts to make better use of 64-bit processors etc.).



Carl
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Thu Jan 20, 2011 1:23 pm

Eddie,

I don't agree with your 8-hour analogy. I can rarely think of 8 problems to solve at one time; usually I solve one, which suggests one or two more, and I gradually build towards a solution.
As for improving solution speed, faster solutions allow more solutions to be done sequentially, i.e. better speed for the "sledgehammer" iterative non-linear approach.
Your comment about FE and FD is exactly my point about the relative preference for direct and iterative solvers in each case.
One thing I've always wanted to explore is using a direct solver for a first-order FE solution, then an iterative solver for the non-linear iterative steps, although it does not take much for the previous iteration's solution to stop being the near answer that the next iteration needs for fast convergence.
There is a lot that could be done to reduce the number of iterations with smarter prediction of the solution, which I guess is the basis of the new generation of iterative solvers, few of which are public domain. These days I don't have the time to get up to speed on them.
I let the IT dept keep updating me with better and faster machines, and see what I can do with the new hardware. Unfortunately, over the last 20 years most of the speed improvement has come from hardware rather than from changes to my program algorithms; my program changes have been more to adapt to the changing hardware.
A good example of this is my hidden-line-removal graphics for FE models. I divide the model into faces and use XYZ for every pixel on each face to get the colour of every pixel in a virtual screen, before dumping the result to the real screen. I have some uses with up to 10 million faces and am always amazed how quickly it can be processed. It's a simple algorithm that works because the model surface can be divided into so many elements and processed so quickly; 20 years ago it was not an option, and now I'm painting 20 frames a second (for smaller but real problems).
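
In case it helps picture it, the core of the method is just a per-pixel depth test against the virtual screen. The sketch below uses made-up names (PUTPIX, ZBUF, SCRN) and is not my actual routine:

Code:
C     Keep the nearest depth seen so far for each pixel (ZBUF) and
C     the colour that goes with it (SCRN).  A face's pixel is only
C     plotted if it lies closer than whatever is already there.
      SUBROUTINE PUTPIX (IX, IY, Z, ICOL, ZBUF, SCRN, NX, NY)
      INTEGER NX, NY, IX, IY, ICOL, SCRN(NX,NY)
      REAL*8  Z, ZBUF(NX,NY)
      IF (Z .LT. ZBUF(IX,IY)) THEN
         ZBUF(IX,IY) = Z
         SCRN(IX,IY) = ICOL
      END IF
      END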

My recent attempts at multi-threaded (parallel) runs have been very interesting. I actually got it to work today, but the run times were not impressive. I'm told the minimum vector size for a multi-threaded dot_product to be effective is of the order of 10,000, whereas mine are 100 to 1,000. Utilising the graphics vector instruction set looks easier to apply; Salford should consider this, say as a /V4 optimisation option.

We can only keep trying. Eddie and Carl, thanks for your thoughts.

John
DanRRight



Joined: 10 Mar 2008
Posts: 2818
Location: South Pole, Antarctica

Posted: Sat Jan 22, 2011 9:26 am

John,

1) BTW, this technique you use in graphics, called z-buffering, is done automatically in OpenGL and is very fast.

2) Are you sure the slow speed of the dot product is the reason? Even if it pages to the hard disk, so does the matrix solver. CPU time for a dot product scales as n**2 with matrix size n, while CPU time for matrix solvers scales as n**3 (for simplicity I assume n is the size of a square block submatrix; the total time for a block matrix is approximately n**3 times the number of blocks). I am sure you have to try the parallel solvers Decompose_VAG_D and Substitute_VAG_D (or even the single-precision Decompose_VAG_S) for block matrices if the matrix fits within the 32-bit limit (it would be great if you could use much more than that on a 64-bit OS, but my parallel libraries from equation.com are for a 32-bit OS; 64-bit Intel IVF also has a 2 GB limit for static arrays, while allocatable ones can be much larger according to their documentation, but this probably needs the libraries to be recompiled for 64 bits).

3) Yes, I also use a RAM disk for Windows; it works fine for many different purposes. 8-16 GB of DDR3 SDRAM costs $100-200 today, which solves the problem of paging to the hard disk on a 64-bit OS. Or just fill a whole server motherboard with a lot of cheap memory.