DanRRight
Joined: 10 Mar 2008 Posts: 2818 Location: South Pole, Antarctica
Posted: Thu Mar 13, 2008 7:06 pm Post subject: Parallelization with FTN95
I just got a 4-core Q6600 and found that my parallel libraries made by Equation dot com do not give me the speedup I had with previous Intel processors. In fact, the maximum speedup is less than a factor of 2: just 1.3-1.4 with 2 cores deployed, and with 3 and 4 cores the speedup even starts to decrease!
Can anyone who has Intel/AMD dual-core processors or the latest AMD native quad-core processor test the same simple Fortran code and check whether you get proportional speedup on systems of linear equations?
Send me your email.
Last edited by DanRRight on Mon May 19, 2008 4:53 pm; edited 1 time in total

Robert
Joined: 29 Nov 2006 Posts: 445 Location: Manchester
Posted: Fri Mar 14, 2008 11:48 am
Out of interest, what sort of speed increases did you see on other systems?

DanRRight
Posted: Fri Mar 14, 2008 10:17 pm
Other cases can sometimes scale pretty well even with the Q6600.
For example:
Number of equations: 2000000
Half bandwidth: 8
Processor: 1
Elapsed Time (Seconds): 7.14
Processors: 2
Elapsed Time (Seconds): 3.53
Processors: 3
Elapsed Time (Seconds): 2.47
Processors: 4
Elapsed Time (Seconds): 2.00
It is also fun to play with parallelization itself.
It definitely has a future.
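For reference, the scaling in these timings can be checked with a few lines of Fortran. This is only a sketch: the names scaling, speedup and efficiency are made up for illustration, and the numbers are simply the elapsed times posted above. They work out to a speedup of roughly 2.0, 2.9 and 3.6 on 2, 3 and 4 processors.

```fortran
! Sketch only: speedup and parallel efficiency from measured times.
! Module and function names are illustrative, not from any library.
module scaling
  implicit none
contains
  ! Speedup relative to the single-processor run: t1 / tp.
  pure function speedup(t1, tp) result(s)
    double precision, intent(in) :: t1, tp
    double precision :: s
    s = t1 / tp
  end function speedup

  ! Efficiency: speedup divided by the number of processors p.
  pure function efficiency(t1, tp, p) result(e)
    double precision, intent(in) :: t1, tp
    integer, intent(in) :: p
    double precision :: e
    e = t1 / (tp * dble(p))
  end function efficiency
end module scaling

program q6600_scaling
  use scaling
  implicit none
  ! Elapsed times in seconds for 1 to 4 processors, as posted above.
  double precision, parameter :: t(4) = (/ 7.14d0, 3.53d0, 2.47d0, 2.00d0 /)
  integer :: p
  do p = 1, 4
    print '(i2, 2f8.2)', p, speedup(t(1), t(p)), efficiency(t(1), t(p), p)
  end do
end program q6600_scaling
```

An efficiency close to 1 means the extra processors are almost fully used; values well below 1 are what the first post describes.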

JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
Posted: Sat Mar 15, 2008 11:05 am
Have you obtained these improvements using Salford FTN95?
I would be very interested in seeing this work with FTN95.
John

DanRRight
Posted: Mon Mar 17, 2008 6:26 pm
All the improvements are in the libraries linked to FTN95.
You just call a subroutine and link the library with any compiler.
There is also a library, MTASK (which, for some reason, I do not see the author of equation dot com advertising), that provides a simple parallel language in which you can divide the code into pieces. For example, divide a DO loop into N independent ranges, one for each of N processors, and if it is divided correctly all of them will finish N times faster. Processors that finish earlier wait until the others have finished. It all works the same way as ClearWin, Winteracter, any graphics, or any other library external to Fortran: just call subroutines/functions in the Fortran source and link them with SLINK. A nice toy to play with.
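The idea of dividing a DO loop into N independent ranges can be sketched in plain Fortran. This is not the MTASK interface, which is not shown in this thread; chunk_bounds is a hypothetical helper that only computes which iterations each of N workers would take.

```fortran
! Sketch only: block partition of iterations 1..n among nparts workers.
! chunk_bounds is a hypothetical helper, not part of MTASK or FTN95.
module partition
  implicit none
contains
  ! Return the first (lo) and last (hi) iteration for worker k (1-based).
  ! The first mod(n, nparts) workers get one extra iteration each.
  subroutine chunk_bounds(n, nparts, k, lo, hi)
    integer, intent(in)  :: n, nparts, k
    integer, intent(out) :: lo, hi
    integer :: base, extra
    base  = n / nparts
    extra = mod(n, nparts)
    lo = (k - 1) * base + min(k - 1, extra) + 1
    hi = lo + base - 1
    if (k <= extra) hi = hi + 1
  end subroutine chunk_bounds
end module partition

program show_chunks
  use partition
  implicit none
  integer :: k, lo, hi
  ! Split 2000000 iterations (the problem size quoted above) four ways.
  do k = 1, 4
    call chunk_bounds(2000000, 4, k, lo, hi)
    print '(a, i2, a, i8, a, i8)', 'worker', k, ': ', lo, ' to ', hi
  end do
end program show_chunks
```

Each worker would then run "do i = lo, hi" on its own range. If there are more workers than iterations, the surplus workers get lo greater than hi, which a DO loop treats as zero iterations.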

DanRRight
Posted: Sat Mar 29, 2008 12:34 am
The comparison between AMD and Intel processors looks interesting.
Above you have seen the benchmark of the 2.4GHz Intel quad-core processor Q6600.
Here is the score for the 2.31GHz AMD Phenom 9600, at almost the same clock (thanks to John Horspool):
Number of equations: 2000000
Half bandwidth: 8
Processor: 1
Elapsed Time (Seconds): 3.05
Processors: 2
Elapsed Time (Seconds): 1.52
Processors: 3
Elapsed Time (Seconds): 1.06
Processors: 4
Elapsed Time (Seconds): 0.84
Surprise, surprise! AMD is roughly 2.3-2.4 (!!!) times faster!
We would have to overclock the Intel to 5.5-6.0 GHz to get this result, which is not really easy to achieve. In fact I have only succeeded in overclocking the new 45nm processors like the E8400 to 4.5GHz on air, where it is not perfectly stable (4.2 GHz is OK), and besides, that is just a dual-core processor.
This agrees with SPEC, which also shows higher scores for AMD (by 60% or so). AMD loses only in games (by 10-15%, who cares?) due to slow integer performance and a couple of floating-point multimedia extension subroutines, but it has no cache coherency problems like those in the Intel tests I showed in my first post above.
By the way, dual-core Intel processors, despite being slower than AMD, nevertheless scale perfectly with the number of cores, because the cores are on the same die and there is no slowdown from bus transfers and cache incoherency.
So hopefully Intel Nehalem processors will be better at parallelization (I mean they will scale better for a broader variety of tasks).
The good news is also that the company promises to make parallel libraries for Salford compilers. Right now I use the ones built for other compilers, compile with the /IMPORT_LIB switch and then link with SLINK as usual.

DanRRight
Posted: Mon May 19, 2008 4:49 pm
I think 99.9% of what is needed to make Salford FTN95 a parallel language is already done; let me know if I'm wrong.
First, multithreading is already done with winio@. Now, to provide simple parallel functions, it is only necessary to implement thread-safe output (like print*) into separate screen units, as is done right now when you define OUTunit1, OUTunit2
i = winio@('%pv%120.10cw[hscroll,vscroll]&', OUTunit1)
i = winio@('%pv%120.10cw[hscroll,vscroll]&', OUTunit2)
and to do a little coding for two or three more winio@ functions. Suppose you need to parallelize a DO loop on a dual-core CPU.
You split the loop
do i=1,N
.............
enddo
into two functions, loop1 with
do i=1,N/2
................
enddo
and loop2 with
do i=N/2+1, N
................
enddo
and would call (the %xx names are arbitrary, just for demonstration)
i=winio@('%np&',n_processors) ! find the number of processors
i=winio@('%em&',2) ! employ just two of them if you have more than 2
i=winio@('%lt&',1,loop1) ! launch the first task on the first processor
i=winio@('%lt&',2,loop2) ! launch the second task on the second processor
i=winio@('%we') ! wait for both tasks to finish
<do your other job here>
Both threads will print on screen in separate text windows, OUTunit1 and OUTunit2. That's all we need. It is basically how the simple MTASK language of www.equation.com works.
Very simple and effective.
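For comparison, the same split-and-wait pattern exists in standard OpenMP on compilers that support it. The sketch below is not the winio@ interface proposed above (those %xx codes are hypothetical), and it does not assume that FTN95 accepts these directives; it assumes an OpenMP-capable compiler such as gfortran with -fopenmp. The names loop_split and sum_squares are made up for illustration.

```fortran
! Sketch only: one DO loop shared between threads via OpenMP.
! Assumes an OpenMP-capable compiler (e.g. gfortran -fopenmp). Without
! OpenMP the !$omp lines are ordinary comments and the loop runs serially.
module loop_split
  implicit none
contains
  ! Sum of i*i for i = 1..n, computed by up to nthreads threads.
  function sum_squares(n, nthreads) result(total)
    integer, intent(in) :: n, nthreads
    double precision :: total
    integer :: i
    total = 0.0d0
    ! schedule(static) gives each thread one contiguous block of
    ! iterations, so two threads take roughly half of the range each.
    !$omp parallel do num_threads(nthreads) schedule(static) reduction(+:total)
    do i = 1, n
      total = total + dble(i) * dble(i)
    end do
    !$omp end parallel do
    ! The end of the parallel loop is an implicit barrier ("wait end").
  end function sum_squares
end module loop_split

program demo
  use loop_split
  implicit none
  ! Two threads, as in the dual-core example above.
  print *, sum_squares(1000, 2)
end program demo
```

The reduction clause gives each thread its own partial sum and combines them at the end, which avoids the shared-data locking that hand-written threads would need.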

PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7927 Location: Salford, UK
Posted: Tue May 20, 2008 8:03 am
winio@ processes the Windows message queue on a single thread. There is no built-in multi-threading under Win32.
.NET does have its own multi-threading but under Win32 you will need to access the Windows API threading functions directly.
In some respects, with a single processor, multi-threading is not much different from multi-processing because different processes take over the CPU for intervals of time. However, the threading functions provide ways of synchronising the various threads and sharing and locking the common data.

DanRRight
Posted: Sun May 25, 2008 5:02 pm
I ran your Threads.f95 example under Win32, and I can tell you that I'd like to get even that kind of parallelism if you would implement it in a form like the one above (winio@ or a similarly clear and simple language). Of course, the freedom to employ a specific number of processors instead of all of them would be better. The ultimate wish would be the ability to assign specific threads to specific cores on the CPU. This is for tasks and threads that need access to the same cache for coherency and ultimate speed (as in linear algebra).
"Salford Fortran. Built-in multithreaded parallelism for modern multicore processors"
or something similar. Would sound good to me.

DanRRight
Posted: Sat Jun 21, 2008 7:02 pm
Three months ago I wrote to the company mentioned above and convinced their programmers to take a look at Salford FTN95, to make a native parallel library for this compiler the same way as they make libraries for GFortran, Lahey, Intel, Absoft, you name it. They worked on it all this time. And guess what? They failed! Seems that, unlike all other compilers (most of which are faster than Salford, sometimes just a bit, sometimes very substantially), the Salford FTN95 reorders statements when its optimization is switched on. As you can understand, this is fatal for parallelization. Without optimization Salford is 3 times slower. Slower than Microsoft Fortran, which was discontinued ten years ago. Fortran is not just logic, simplicity, reliability, rich libraries and development speed; Fortran is also ultimate execution speed.
What is left now is the fast new Intel Fortran library (faster by a factor of 1.5 than my old Microsoft Fortran one, which is compatible with Salford). It sometimes works, though it complains about some missing symbols (__intel_f2int, _fltused ...), but mostly it fails. Let's look at the problem from another point of view. If changing the compiler is not a viable option, or is difficult for some reason, is there any way to make it work with Salford using compatible wrappers, DLLs, etc.?

PaulLaidler Site Admin
Posted: Sat Jun 21, 2008 8:12 pm
You are obviously not a fan of FTN95 but I wonder why you are making unsubstantiated statements like "the Salford FTN95 reorders statements when its optimization is switched on". If you can produce any evidence that FTN95 does this (when it is not appropriate) then please let us have the details so that we can fix the problem.
This forum is provided and maintained by Silverfrost for the benefit of FTN95 users. You are welcome not to use FTN95 if it does not serve your purposes but it would be better if you left your critical remarks in another place.

JohnHorspool
Joined: 26 Sep 2005 Posts: 270 Location: Gloucestershire UK
Posted: Sun Jun 22, 2008 12:31 am
Working on a 64bit XP OS with a pure number crunching source code (no graphics and no ClearWin), I found that a default compile with 32bit FTN95 produced an exe that ran substantially faster than one produced using a 64bit version of the gfortran compiler!

DanRRight
Posted: Sun Jun 22, 2008 2:59 am
http://www.polyhedron.com/benchamdwin
Of course your mileage may vary. By tuning, or by using external libraries (as in this case with parallel algebra libraries; parallelization is our unavoidable future), you can get the very best results.
Here is the strength of Salford: developer's debugging time, compile time, cleaner code
http://www.polyhedron.com/pb05-win32-diagnose0html
And of course Clearwin, Virtual Common, .NET etc...
Last edited by DanRRight on Sun Jun 22, 2008 3:34 am; edited 4 times in total

DanRRight
Posted: Sun Jun 22, 2008 3:08 am Post subject: Re:
PaulLaidler wrote: You are obviously not a fan of FTN95 but I wonder why you are making unsubstantiated statements like "the Salford FTN95 reorders statements when its optimization is switched on"
Paul, I wrote one important word in front of this sentence: "Seems...", and in that light what you wrote is correct. It means "I guess" or "we guess".
In return, I have to note the unsubstantiated statement "You are obviously not a fan of FTN95...". I am really sorry if you understood me only this way. I have used ***only*** Salford/Silverfrost since probably 1988, two decades, and like it more than any other compiler. I used the great DOS/DBOS version FTN77, went through hell with the buggy FTN90, and have been mostly happy with FTN95, recommending it to anyone. But I would like it to be even better, by pointing out not only its strengths but also its weaknesses. That helps to find substantial workarounds and be 100% happy with FTN95.

LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Sun Jun 22, 2008 11:05 pm
I'm confused. Doesn't optimisation reorder statements? I thought that you needed an Assembler to have exact translation, so even non-optimised compilation must reorder statements to a degree.
I think Dan is after a Holy Grail - ClearWin+ and the clear advantages of fast compilation, excellent diagnostics, and some of the excellent add-ons of FTN95 (which it has), together with "best of the pack" execution speed (with benchmarks), 64-bit code (to allow the use of >3Gb RAM) and making use of all cpu cores through multi-threading - which it doesn't. Multi-threading, I will remind us, was present in the DBOS FTN77 - although it was a fat lot of good (i.e. to translate into US English: not much use) with a single core cpu, and DBOS FTN77 was certainly one of the compilers at that time that produced the fastest runtime.
Eddie