Silverfrost Forums


Parallelization with FTN95

13 Mar 2008 6:06 (Edited: 19 May 2008 3:53) #2917

I just got a 4-core Q6600 and found that the parallel libraries from Equation dot com do not give me the speedup I had with previous Intel processors. In fact, I get a maximum speedup of less than a factor of 2 (just 1.3-1.4 with 2 cores deployed), and with 3 and 4 cores the speedup even starts to decrease!

Can anyone who has an Intel/AMD dual-core processor, or the latest AMD native quad-core processor, test the same simple Fortran code and check whether you get proportional speedup on systems of linear equations?

Send me your email.

14 Mar 2008 10:48 #2919

Out of interest, what sort of speed increases did you see on other systems?

14 Mar 2008 9:17 #2922

Other cases can sometimes be pretty good even with the Q6600. For example:

Number of equations: 2000000, Half bandwidth: 8

   Processors   Elapsed Time (Seconds)
       1                7.14
       2                3.53
       3                2.47
       4                2.00
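In other words, speedups of 7.14/3.53 = 2.02, 7.14/2.47 = 2.89 and 7.14/2.00 = 3.57, i.e. about 89% parallel efficiency on four cores.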

It is also fun to play with parallelization itself. It definitely has a future.

15 Mar 2008 10:05 #2924

Have you obtained these improvements using Salford FTN95? I would be very interested in seeing this work with FTN95.

John

17 Mar 2008 5:26 #2934

All the improvements are in the libraries linked to FTN95. You just call a subroutine and link the library with any compiler.

There also exists a library called MTASK (which, for some reason, I do not see the author at equation dot com advertising). It is a simple parallel language in which you arrange the code by dividing it into pieces: for example, divide a DO loop into N independent ranges, one for each of N processors, and if divided correctly they will all finish N times faster (processors that finish earlier wait for the others to finish). Everything works the same way as ClearWin, Winteracter, any graphics package, or any other library external to Fortran: just call the subroutines/functions from your Fortran source and link them with SLINK. A nice toy to play with.
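The chunking idea is easy to sketch (a rough illustration of my own, not MTASK's actual API; all names are invented): the master computes contiguous chunk bounds and hands one range to each worker, for example

   chunk = (N + nproc - 1) / nproc   ! ceiling division: chunk size per worker
   do p = 1, nproc
      i1 = (p - 1)*chunk + 1         ! first index for worker p
      i2 = min(p*chunk, N)           ! last index, clipped at N
      ! worker p executes:  do i = i1, i2 ... enddo
   enddo
   ! the master then waits until every worker has finished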

28 Mar 2008 11:34 #2959

The comparison between AMD and Intel processors looks interesting. Above you have seen the benchmark of the 2.4GHz Intel quad-core Q6600.

Here is the score for the almost identically clocked 2.31GHz AMD Phenom 9600 (thanks to John Horspool):

Number of equations: 2000000, Half bandwidth: 8

   Processors   Elapsed Time (Seconds)
       1                3.05
       2                1.52
       3                1.06
       4                0.84
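That is speedups of 3.05/1.52 = 2.01, 3.05/1.06 = 2.88 and 3.05/0.84 = 3.63, i.e. about 91% parallel efficiency on four cores: near-perfect scaling.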

Surprise, surprise! AMD is 2.5 (!!!) times faster! We would have to overclock the Intel to 5.5-6.0 GHz to get this result, which is not really easy to achieve. In fact, I have only succeeded in overclocking the new 45nm processors like the E8400 to 4.5GHz on air, where it is not entirely stable (4.2 GHz is OK), and besides, that is just a dual-core processor.

This is consistent with SPEC, which also shows higher scores for AMD (by 60% or so). AMD loses only in games (by 10-15%, who cares?) due to slower integer and a couple of floating-point multimedia-extension routines, but it has no cache coherency problems like those I showed for Intel in the tests in my first post above.

By the way, dual-core Intel processors, despite being slower than AMD, nevertheless scale perfectly with the number of cores, because the cores are on the same die and there is no slowdown from bus transfers and cache incoherency. So hopefully the Intel Nehalem processors will be better at parallelization (I mean, they will scale better for a broader variety of tasks).

The good news is also that the company promises to make parallel libraries for the Salford compilers. Right now I use ones built for other compilers, compiling with the /IMPORT_LIB switch and then SLINK as usual.

19 May 2008 3:49 #3218

I think that, to make Salford FTN95 into a parallel language, 99.9% of the work is already done; let me know if I'm wrong.

First, multithreading is already done with winio@. Now, to provide simple parallel functions, it is only necessary to implement thread-safe output (like print*) into separate screen units, as is done right now when you define OUTunit1 and OUTunit2:

   i = winio@('%pv%120.10cw[hscroll,vscroll]&', OUTunit1)
   i = winio@('%pv%120.10cw[hscroll,vscroll]&', OUTunit2)
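If I remember the %cw usage correctly, each call returns an ordinary Fortran unit number in OUTunit1/OUTunit2, so each task could then report progress with a normal WRITE, for example:

   write(OUTunit1,*) 'progress from task 1'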

and to do a little coding for two or three more winio@ functions. Suppose you need to parallelize a DO loop on a dual-core CPU.

You arrange the loop

   do i = 1, N
      .............
   enddo

into two functions: loop1 with

   do i = 1, N/2
      ................
   enddo

and loop2 with

   do i = N/2+1, N
      ................
   enddo

and would call (the %xx names are arbitrary, just for demonstration):

   i = winio@('%np&', n_processors)  ! find the number of processors
   i = winio@('%em&', 2)             ! employ just two of them if you have more than 2
   i = winio@('%lt&', 1, loop1)      ! launch the first task on the first processor
   i = winio@('%lt&', 2, loop2)      ! launch the second task on the second processor
   i = winio@('%we')                 ! wait for both tasks to end

<do your other job here>

Both threads will print on screen in the separate text windows OUTunit1 and OUTunit2. That's all we need. It is exactly how the basic, simple MTASK language of www.equation.com works. Very simple and effective.

20 May 2008 7:03 #3219

winio@ processes the Windows message queue on a single thread. There is no built-in multi-threading in FTN95 under Win32.

.NET does have its own multi-threading but under Win32 you will need to access the Windows API threading functions directly.

In some respects, with a single processor, multi-threading is not much different from multi-processing, because different processes take over the CPU for intervals of time. However, the threading functions provide ways of synchronising the various threads and of sharing and locking common data.
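For what it's worth, here is a minimal sketch of calling two real Win32 routines, CreateThread and WaitForSingleObject, from Fortran. The interface blocks are my own hand-written assumptions using ISO_C_BINDING, and they assume a 64-bit build where the C and stdcall calling conventions coincide; FTN95 itself would need its own STDCALL declarations instead.

   module worker_mod
      use iso_c_binding
      implicit none
   contains
      ! Thread entry point: Win32 expects DWORD WINAPI proc(LPVOID)
      function worker(arg) bind(c) result(ret)
         type(c_ptr), value :: arg
         integer(c_long)    :: ret
         integer, pointer   :: id
         call c_f_pointer(arg, id)       ! recover the integer tag passed in
         print *, 'hello from worker ', id   ! I/O from threads needs care
         ret = 0
      end function worker
   end module worker_mod

   program two_threads
      use iso_c_binding
      use worker_mod
      implicit none
      interface
         function CreateThread(attr, stack, start, param, flags, tid) &
                     bind(c, name='CreateThread')
            import :: c_ptr, c_funptr, c_size_t, c_long
            type(c_ptr), value       :: attr, param, tid
            integer(c_size_t), value :: stack
            type(c_funptr), value    :: start
            integer(c_long), value   :: flags
            type(c_ptr)              :: CreateThread   ! HANDLE
         end function CreateThread
         function WaitForSingleObject(h, ms) bind(c, name='WaitForSingleObject')
            import :: c_ptr, c_long
            type(c_ptr), value     :: h
            integer(c_long), value :: ms
            integer(c_long)        :: WaitForSingleObject
         end function WaitForSingleObject
      end interface
      integer, target :: id1 = 1, id2 = 2
      type(c_ptr)     :: h1, h2
      integer(c_long) :: rc
      ! launch the two workers, then wait for both to finish
      h1 = CreateThread(c_null_ptr, 0_c_size_t, c_funloc(worker), c_loc(id1), 0_c_long, c_null_ptr)
      h2 = CreateThread(c_null_ptr, 0_c_size_t, c_funloc(worker), c_loc(id2), 0_c_long, c_null_ptr)
      rc = WaitForSingleObject(h1, -1_c_long)   ! -1 = INFINITE
      rc = WaitForSingleObject(h2, -1_c_long)
   end program two_threads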

25 May 2008 4:02 #3254

I ran your Threads.f95 example under Win32, and I would be glad to get even that kind of parallelism if you implemented it in a form like the one above (winio@ or a similarly clear and simple language). Of course, the freedom to employ a specific number of processors instead of all of them would be better. The ultimate wish would be the ability to pin specific threads to specific CPU cores, for tasks and threads that need access to the same cache for coherency and ultimate speed (as with linear algebra).

'Salford Fortran. Built-in multithreaded parallelism for modern multicore processors'

or something similar would sound good to me 😃

21 Jun 2008 6:02 #3396

Three months ago I wrote to the above-mentioned company and convinced their programmers to take a look at Salford FTN95 and make a native parallel library for this compiler, the same way they make libraries for GFortran, Lahey, Intel, Absoft, you name it. They worked on it all this time. And guess what? They failed! It seems that, unlike all the other compilers (most of which are faster than Salford, sometimes just a bit, sometimes very substantially), Salford FTN95 reorders statements when its optimization is switched on. As you understand, this is deadly for parallelization. Without optimization, Salford is 3 times slower; slower than the Microsoft Fortran discontinued ten years ago. Fortran is not just logic, simplicity, reliability, rich libraries and development speed; Fortran is also ultimate execution speed.

What is left now is the fast new Intel Fortran library (faster, by a factor of 1.5, than my old Microsoft Fortran one, which is compatible with Salford). It sometimes works, though the linker complains about some missing symbols (__intel_f2int, _fltused ...), but mostly it fails. Let's look at the problem from another point of view: if changing the compiler is not viable or is difficult for some reason, is there any way to make the library work with Salford using compatible wrappers, DLLs, etc.?

21 Jun 2008 7:12 #3398

You are obviously not a fan of FTN95 but I wonder why you are making unsubstantiated statements like 'the Salford FTN95 reorders statements when its optimization is switched on'. If you can produce any evidence that FTN95 does this (when it is not appropriate) then please let us have the details so that we can fix the problem.

This forum is provided and maintained by Silverfrost for the benefit of FTN95 users. You are welcome not to use FTN95 if it does not serve your purposes but it would be better if you left your critical remarks in another place.

21 Jun 2008 11:31 #3400

Working on 64-bit XP with pure number-crunching source code (no graphics and no ClearWin), I found that a default compile with 32-bit FTN95 produced an exe that ran substantially faster than one produced using a 64-bit version of the gfortran compiler!

22 Jun 2008 1:59 (Edited: 22 Jun 2008 2:34) #3401

http://www.polyhedron.com/benchamdwin

Of course, your mileage may vary. By tuning, or by using external libraries (as in this case with the parallel algebra libraries; parallelization is our unavoidable future), you can get the best out of the best.

Here is where the strength of Salford lies: the developer's debugging time, compile time, cleaner code:

http://www.polyhedron.com/pb05-win32-diagnose0html

And of course ClearWin, Virtual Common, .NET, etc...

22 Jun 2008 2:08 #3402

Quoted from PaulLaidler: 'You are obviously not a fan of FTN95 but I wonder why you are making unsubstantiated statements like "the Salford FTN95 reorders statements when its optimization is switched on".'

Paul, I wrote one important word in front of that sentence: 'Seems...', and then what you quoted is correct. That means: I guess, or we guess.

In return, I have to note the unsubstantiated statement 'You are obviously not a fan of FTN95...'. I am really sorry if you understood me that way. I have used only Salford/Silverfrost since probably 1988, two decades, and I like it more than any other compiler. I used the great DOS/DBOS version of FTN77, went through hell with the buggy FTN90, and have been mostly happy with FTN95, recommending it to everyone. But I would like it to be even better, which is why I point not only to its strengths but also to its weaknesses. That helps in making substantial workarounds and being 100% happy with FTN95.

22 Jun 2008 10:05 #3405

I'm confused. Doesn't optimisation reorder statements? I thought that you needed an assembler to get an exact translation, so even non-optimised compilation must reorder statements to some degree.

I think Dan is after a Holy Grail: ClearWin+ and the clear advantages of fast compilation, excellent diagnostics, and some of the excellent add-ons of FTN95 (which it has), together with 'best of the pack' execution speed (with benchmarks to prove it), 64-bit code (to allow the use of more than 3GB of RAM), and use of all CPU cores through multi-threading (which it doesn't have). Multi-threading, I will remind us, was present in DBOS FTN77, although it was a fat lot of good (i.e., to translate into US English: not much use) with a single-core CPU, and DBOS FTN77 was certainly one of the compilers of that time that produced the fastest runtimes.

Eddie

23 Jun 2008 7:15 #3407

As I understand it, optimisation does not reorder Fortran statements as such, but it does optimise the way in which a given Fortran statement is represented in assembly code. Optimisations can include removing repeated expressions and holding certain intermediate values in registers rather than writing them back to memory, but only in ways that do not change or reorder the expressed intention of the programmer.
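A small illustration of the kind of thing meant (my own example, not from the FTN95 documentation): given

   x = a*b + c
   y = a*b - d

the compiler may evaluate a*b once, hold it in a register and reuse it (effectively t = a*b; x = t + c; y = t - d), without reordering the two assignments or changing their results.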

23 Jun 2008 8:42 #3408

This thread includes something of interest to me.

While I am a strong supporter of FTN95, and acknowledge its strengths in CheckMate, debugging and ClearWin+, there are some aspects of run-time performance which could be improved, if run-time benchmarks are a true indicator.

I would like an option where array operations could be implemented using automatic optimisation. I don't like the way dot_product is implemented as in-line code, where performance can change depending on the compiler options. I typically compile with /debug and avoid /opt, due to problems with it in many past compilers. My past experience is that general optimisation does not always work best, but neither do selective optimisation levels. I am waiting for the results of the work on memory management for /3gb, and hope this addresses some of the performance problems with real*8 calculations. As with some of DanRRight's comments, a lot of our bad impressions are based on past experience, which may no longer be correct for the current compiler.

I saw some of the results from the equation.com test procedures that drive multiple processors. It certainly would be interesting if this approach could be applied to some basic (large) vector operations. Dan may be right that 'parallelization is our unavoidable future'. It's worth watching.

regards John

23 Jun 2008 1:33 #3409

There are 48 optimisations for which we have internal documentation. I will investigate to see if this documentation might be released in some form.

/INHIBIT_OPTIMISATION <n>

inhibits a given optimisation; number 41 is documented as 'dot product detection'.

Please note that many optimisations are applied even when /OPT does not appear on the command line.
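So, presumably (the exact command-line form here is my guess from the switch description above), in-line dot product detection could be switched off with something like:

   ftn95 solver.f95 /OPT /INHIBIT_OPTIMISATION 41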

23 Jun 2008 11:00 #3410

Quoted: 'While I am a strong supporter of FTN95, and acknowledge its strengths in CheckMate, debugging and ClearWin+, there are some aspects of run-time performance which could be improved, if run-time benchmarks are a true indicator.'

Indeed, while I do not doubt that many compilers produce faster runtime code, the performance difference with real-world code may be somewhat different from what you find in benchmarks that have been around for a long time. The performance of individual codes is, of course, highly dependent on a range of factors.

The holy grail of compilers is hard to reach: the compilers producing the fastest binaries are mostly those with the weakest diagnostic capabilities. There are some that do generally well on both performance and diagnostics but, from past impressions, their compilation times can be extremely slow.

Some compilers fit better than others into requirements, depending on where the focus of development lies. Horses for courses.

25 Jun 2008 4:07 #3413

I have, for a long time, been trying to identify how I can improve the calculation performance of the equation solver in my finite element program. I checked my past emails to Salford, and a lot of the problems I identified were reported in 2002, so I can't confirm they are still the case. There is a vague indication that other compilers have better performance in this area, but I don't have any definite proof. Certainly in 2002 I was getting results where the run-time performance of 'dot_product' could vary by a factor of 2, and my knowledge at the time assumed that real*8 arithmetic should be a substantial part of the compute time. I was asking myself what was happening in this extra processing time, as the mathematical computation part does not change. My conclusion was that it was associated with unnecessary transfer of data between memory and the processor, and with the more confusing movement of data between memory, the secondary cache, and the processor. For the last few years I have not been able to run benchmarks that reliably indicate performance, or that show performance improvements relating to the programming strategies of the 70s and 80s. I put this down to the vagaries of Intel cache management.

The problem now gets more complicated with larger problem sizes. I have been trying to improve performance where the active matrix size is in the range of 1GB to 3GB. Any disk I/O now carries a huge performance penalty, which can be compounded by virtual memory mapping, even where there is adequate physical memory.

The equation solver I use is a skyline solver for large sets of (symmetric) linear simultaneous equations, which was a preferred direct solver from the 70s through the 90s. It has two basic array operations: Dot_Product, and Vector_A = Vector_A - beta * Vector_B. These vectors are typically 0-20,000 elements long.
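For reference, in plain Fortran the two kernels are just the following (my rendering, not John's actual code; a, b, va and vb stand for real*8 vectors):

   ! kernel 1: dot product
   s = 0.0d0
   do i = 1, n
      s = s + a(i)*b(i)
   enddo

   ! kernel 2: vector update (a DAXPY-like operation)
   do i = 1, n
      va(i) = va(i) - beta*vb(i)
   enddo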

My holy grail is to get a procedure for these two, which optimises performance. The three areas I have identified as problems are:

  1. unnecessary variable shifts (as in 2002)
  2. not utilising multiple processors
  3. unnecessary disk transfers

To me, the basic mission is this: Dot_product gets the starting addresses and byte strides of two vectors in memory, then produces the a.b answer. What puzzles me is why it is so difficult to optimise.

I look forward to the improvements to memory management, especially in SLINK, when the /3gb switch is addressed.

Keep up the good work.

John

PS: I wonder what I would do next if this problem had a solution?
