forums.silverfrost.com

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

I have run them with FTN95 and an old Lahey compiler on my Core i5 to produce the following results:

DanRRight · Posted: Mon May 14, 2012 6:41 pm Post subject:

Can anyone run this on other compilers?

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

My laptop is a bit slower than John's Core i5, so I needed to re-run with Silverfrost FTN95 to provide a reference case.

I used the same options for FTN95 as John. For the other compilers I compiled at -O2.

I haven't had time to investigate using OpenMP yet.

Silverfrost FTN95

10146 equations
1292 average profile
13116399 coefficients
100.07 storage (mb)

Method 1 CPU_time = 41.527 ops/sec = 213.57E+06 Original DO Loop
Method 2 CPU_time = 41.309 ops/sec = 214.70E+06 F90 syntax
Method 3 CPU_time = 17.659 ops/sec = 502.23E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 18.112 ops/sec = 489.69E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 41.371 ops/sec = 214.38E+06 Paul option
Method 6 CPU_time = 40.919 ops/sec = 216.75E+06 alternate Paul option

13085960 Number of dot_product calls
8869050198 Number of itterations
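
The ops/sec column in these listings appears to be the reported iteration count divided by the CPU time. The sketch below only reproduces that arithmetic; the module and function names are hypothetical and are not taken from the benchmark program, which is not shown here.

```fortran
module ops_rate_sketch
   implicit none
   integer, parameter :: dp = selected_real_kind(15, 307)  ! double precision
   integer, parameter :: i8 = selected_int_kind(18)        ! 64-bit integer for the counts
contains
   ! Iterations per second of CPU time (assumed definition of "ops/sec").
   pure function ops_per_sec(iterations, cpu_seconds) result(rate)
      integer(i8), intent(in) :: iterations
      real(dp),    intent(in) :: cpu_seconds
      real(dp) :: rate
      rate = real(iterations, dp) / cpu_seconds
   end function ops_per_sec
end module ops_rate_sketch
```

For the FTN95 Method 1 line, ops_per_sec(8869050198_i8, 41.527_dp) gives about 213.57E+06, which matches the value printed above.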

NAG Fortran Compiler Windows

10164 equations
1290 average profile
13117181 coefficients
100.08 storage (mb)

Method 1 CPU_time = 25.725 ops/sec = 344.00E+06 Original DO Loop
Method 2 CPU_time = 19.141 ops/sec = 462.31E+06 F90 syntax
Method 3 CPU_time = 19.017 ops/sec = 465.35E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 19.079 ops/sec = 463.83E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 21.747 ops/sec = 406.93E+06 Paul option
Method 6 CPU_time = 19.843 ops/sec = 445.96E+06 alternate Paul option
13086688 Number of dot_product calls
8849310852 Number of itterations

Intel ifort Linux

10220 equations
1283 average profile
13116753 coefficients
100.07 storage (mb)

Method 1 CPU_time = 14.473 ops/sec = 607.65E+06 Original DO Loop
Method 2 CPU_time = 14.485 ops/sec = 607.15E+06 F90 syntax
Method 3 CPU_time = 14.301 ops/sec = 614.96E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 14.301 ops/sec = 614.96E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 14.417 ops/sec = 610.01E+06 Paul option
Method 6 CPU_time = 14.441 ops/sec = 609.00E+06 alternate Paul option
13086093 Number of dot_product calls
8794467678 Number of itterations

gfortran linux

10107 equations
1297 average profile
13116884 coefficients
100.07 storage (mb)

Method 1 CPU_time = 16.669 ops/sec = 534.48E+06 Original DO Loop
Method 2 CPU_time = 16.597 ops/sec = 536.80E+06 F90 syntax
Method 3 CPU_time = 16.613 ops/sec = 536.29E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 16.613 ops/sec = 536.29E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 16.605 ops/sec = 536.54E+06 Paul option
Method 6 CPU_time = 16.637 ops/sec = 535.51E+06 alternate Paul option
13086562 Number of dot_product calls
8909344714 Number of itterations

open64 linux

10187 equations
1287 average profile
13116561 coefficients
100.07 storage (mb)

Method 1 CPU_time = 16.649 ops/sec = 530.03E+06 Original DO Loop
Method 2 CPU_time = 16.501 ops/sec = 534.79E+06 F90 syntax
Method 3 CPU_time = 16.633 ops/sec = 530.54E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 16.613 ops/sec = 531.18E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 16.537 ops/sec = 533.62E+06 Paul option
Method 6 CPU_time = 17.929 ops/sec = 492.19E+06 alternate Paul option
13085999 Number of dot_product calls
8824525977 Number of itterations
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

and these two:

oracle studio linux

10101 equations
1298 average profile
13116850 coefficients
100.07 storage (mb)

Method 1 CPU_time = 41.175 ops/sec = 216.37E+06 Original DO Loop
Method 2 CPU_time = 40.757 ops/sec = 218.58E+06 F90 syntax
Method 3 CPU_time = 40.993 ops/sec = 217.33E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 40.874 ops/sec = 217.96E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 41.270 ops/sec = 215.87E+06 Paul option
Method 6 CPU_time = 41.051 ops/sec = 217.02E+06 alternate Paul option
13086546 Number of dot_product calls
8908901037 Number of itterations

Absoft Windows

10248 equations
1279 average profile
13115873 coefficients
100.07 storage (mb)

Method 1 CPU_time = 28.205 ops/sec = 310.75E+06 Original DO Loop
Method 2 CPU_time = 28.642 ops/sec = 306.01E+06 F90 syntax
Method 3 CPU_time = 33.805 ops/sec = 259.27E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 33.509 ops/sec = 261.56E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 35.740 ops/sec = 245.24E+06 Paul option
Method 6 CPU_time = 37.612 ops/sec = 233.03E+06 alternate Paul option
13085128 Number of dot_product calls
8764684555 Number of itterations

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

David,

Thanks for your results.
I have tested on a Xeon W3505, which is an old processor, using Salford, Lahey and Intel. Salford and Lahey are similar. Some of the Salford optimised methods fail and non-optimised fail badly.
However for Intel, I get a contra result with the F77 wrapper to the DO loop. This is a worry, as this is my typical (80's) coding style of using libraries of basic routines when writing code.
I have also tested on a Core i5-2540M, which supports AVX; however, in a skyline solver 50% of arguments are not 16-byte aligned. These 3 options (/o1 /o2 and /QxAVX) show the benefit of vector and AVX instructions.
The most reliable method appears to be 4 : wrapper to Dot-Product, as both Salford and Lahey fail on array sections.
Going back to the 80's approach to optimisation, which was to minimise the number of floating point operations, it is amazing that different options produce such a large spread of run times in comparison to the time required for the floating point operations (assuming this to be about 13 seconds on the Xeon). The options that fail (> 20 seconds) are puzzling.
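
For readers without the test program, which is not reproduced in this thread, the following is a rough guess at what the first four styles look like, based only on the labels in the results ("Original DO Loop", "F90 syntax", "F77 wrapper for DO Loop", "F77 wrapper for Dot_Product"). All names are hypothetical, and the real benchmark works on skyline columns rather than the plain vectors assumed here.

```fortran
module dot_variants_sketch
   implicit none
   integer, parameter :: dp = selected_real_kind(15, 307)
contains

   ! Method 3 style (assumed): F77-style wrapper around an explicit DO loop.
   ! The caller passes the first element of each run of n values.
   function vec_sum_loop(a, b, n) result(s)
      integer,  intent(in) :: n
      real(dp), intent(in) :: a(*), b(*)
      real(dp) :: s
      integer :: i
      s = 0.0_dp
      do i = 1, n
         s = s + a(i)*b(i)
      end do
   end function vec_sum_loop

   ! Method 4 style (assumed): F77-style wrapper around the DOT_PRODUCT intrinsic.
   function vec_sum_intrinsic(a, b, n) result(s)
      integer,  intent(in) :: n
      real(dp), intent(in) :: a(n), b(n)
      real(dp) :: s
      s = dot_product(a, b)
   end function vec_sum_intrinsic

   ! Evaluates the same n-term product four ways, starting at a(ia) and b(ib).
   subroutine compare_styles(a, b, m, ia, ib, n, results)
      integer,  intent(in)  :: m, ia, ib, n
      real(dp), intent(in)  :: a(m), b(m)
      real(dp), intent(out) :: results(4)
      integer :: i
      ! Method 1 style: DO loop written inline at the call site
      results(1) = 0.0_dp
      do i = 0, n - 1
         results(1) = results(1) + a(ia+i)*b(ib+i)
      end do
      ! Method 2 style: F90 array sections passed to the intrinsic
      results(2) = dot_product(a(ia:ia+n-1), b(ib:ib+n-1))
      ! Methods 3 and 4: pass the first element and a length to a wrapper
      results(3) = vec_sum_loop(a(ia), b(ib), n)
      results(4) = vec_sum_intrinsic(a(ia), b(ib), n)
   end subroutine compare_styles
end module dot_variants_sketch
```

All four entries of results should agree; the timings above suggest that compilers differ mainly in how well they handle the array-section form versus the wrapper forms.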

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

The other results

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

It is interesting to look at the different trends between options and compilers. A number of interesting ones are:
Intel's poor performance for method 3 worries me as this is my preferred style of programming, with libraries of simple routines.
Absoft method 2 vs Method 4 is unusual.

What the code is doing in the "> 30 seconds" cases would be worth understanding.
Is it a poor instruction set or just "optimisation" that is not suited to the processor?
FTN95 without /opt is a long way off the pace. Is this due to an old x86 instruction set, which hasn't changed since /P6?

I've been looking at this for years, but feel I don't really understand.
Look forward to any other ideas.

John

Wilfried Linder · Posted: Tue May 15, 2012 12:54 pm Post subject:

Last Friday, Dan wrote in this thread "... you can do some parallelization with FTN95 and do that NATIVELY what no other compiler can do". Now I noticed something interesting:

I start the task manager and use the fourth tab ("Performance"? / in German "Leistung"). Here I can see the usage of my 4 processor cores in percent. Before I start my FTN95 program, all of them show nearly zero %. Then I start my program, and immediately all cores are working.

Does this really mean that FTN95 performs an automatic parallelisation?

Regards - Wilfried

DanRRight · Posted: Tue May 15, 2012 4:01 pm Post subject:

The general conclusion is that FTN95 is not optimized for some array options, because several other compilers clearly optimized all 6 of John's variants and deliver consistently similar speed.

Further optimization of FTN95 plus adding AVX instructions may bring a factor of 2 at least. Meanwhile, multithreading may give a factor of 4 on 4 cores (John - think how to divide the external loop by 4).

David - was that older Absoft compiler ?
Wilfried - you've probably made Silverfrost's day today - their dream came true :)

You know, a large spontaneous mutation may happen, according to the theory of evolution.

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

Dan,

Yes. It's a very old version, 9.0, of the Absoft compiler. I am sure the newer releases are much faster.

Wilfried,

What you are seeing in task manager is the operating system spreading the work of your program across multiple cores and hardware threads. But you don't get more than 100% of a core altogether this way. With 2 cores, you may get 50% on one and 50% on the other.

With a parallel program it is possible to get 100% on each core, or 200% in total. (Actually, you never get exactly 200% for the full run time of the program, but it can be 199% depending on the program.)

General

I don't fully understand the reason for the difference in speed with FTN95 for methods which are essentially the same. The other compilers show more consistent performance. Paul is having a look at it so we should wait for him to do that.

For John's code, it seems that there is not enough work in the dot product to make parallelisation worthwhile. I agree that the best strategy would be to vectorise the dot product somehow. With FTN95 the only way to do this is to use the inline assembler and write SSE2 or AVX machine code. Or use a different compiler that can vectorise the Fortran code automatically. This should buy you a factor of 2 to 4.

It might be possible to parallelise the outer loop and get another factor of 4 (on a 4 core machine), but I have not studied the code enough to know if this is possible.
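
To illustrate what parallelising an outer loop of dot products means, here is a minimal OpenMP sketch. It is not John's solver: it assumes every iteration is independent (each one writes only its own result), which has not been established for the skyline reduction discussed here. The routine name and the dense-matrix layout are made up for the example.

```fortran
module omp_outer_sketch
   implicit none
   integer, parameter :: dp = selected_real_kind(15, 307)
contains
   ! y(j) = dot_product(a(:,j), x) for every column j.
   ! Each iteration reads shared data and writes only y(j), so the
   ! iterations can be shared out between threads without conflicts.
   subroutine column_dots(a, x, y)
      real(dp), intent(in)  :: a(:,:), x(:)
      real(dp), intent(out) :: y(:)
      integer :: j
      !$omp parallel do default(none) shared(a, x, y) private(j) schedule(static)
      do j = 1, size(a, 2)
         y(j) = dot_product(a(:, j), x)
      end do
      !$omp end parallel do
   end subroutine column_dots
end module omp_outer_sketch
```

Without an OpenMP flag (for example -fopenmp with gfortran) the directives are treated as comments and the loop runs serially, so the same source can be timed both ways.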

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

Dan and David,

The outer loop is basically:

DanRRight · Posted: Thu May 17, 2012 2:37 am Post subject:

John,
Are the threads completely independent in your example above? Do they never have to access the same matrix element at the same time?

For your task with bandwidth 1500, the most suitable would be the latest NVIDIA Kepler (see today's announcement), which has almost exactly that number of CUDA cores. Your dot product will take less time than it takes you to press Enter :)

BTW, I hope Silverfrost will add SSE and AVX as an option. Or maybe CUDA too, to make a true killing machine?

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

Or possibly a library interface of some basic array procedures as a first step.
Dot_Product and Vector_A = Vector_A + const * Vector_B would get my vote.

Also matrix multiplication [A] x [B] or [A]transpose x [B] would be a good start.
And put severe limitations on the arguments, to suit the optimisation, say limiting to REAL*8 arguments and not 2D array sections.

Once one of these could be generated, the others might be able to follow, say as a public domain list of compatible routines for SSE2 or AVX.

Even proving one function works with FTN95 would be a significant step.

John
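
As a rough picture of the interface suggested above, the two basic routines could be declared as follows in plain Fortran. The names vec_dot and vec_axpy are hypothetical, and the bodies are ordinary loops standing in for whatever SSE2 or AVX implementation might sit behind them. The restrictions John mentions are expressed by taking only REAL*8 data and explicit-shape 1D arguments of length n.

```fortran
module vec_lib_sketch
   implicit none
   integer, parameter :: dp = kind(1.0d0)   ! REAL*8 only, as suggested above
contains
   ! Dot product of two contiguous vectors of length n.
   function vec_dot(a, b, n) result(s)
      integer,  intent(in) :: n
      real(dp), intent(in) :: a(n), b(n)
      real(dp) :: s
      integer :: i
      s = 0.0_dp
      do i = 1, n
         s = s + a(i)*b(i)
      end do
   end function vec_dot

   ! Vector_A = Vector_A + const * Vector_B for contiguous vectors of length n.
   subroutine vec_axpy(a, b, const, n)
      integer,  intent(in)    :: n
      real(dp), intent(inout) :: a(n)
      real(dp), intent(in)    :: b(n), const
      integer :: i
      do i = 1, n
         a(i) = a(i) + const*b(i)
      end do
   end subroutine vec_axpy
end module vec_lib_sketch
```

These play the same role as the unit-stride cases of the BLAS routines DDOT and DAXPY, so an optimised version could be checked against an existing BLAS.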

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

I have written a dot product using the Silverfrost FTN95 inline assembler, which uses SSE2 instructions to vectorise the contribution from successive sets of four values from each array.

I am still learning and there are probably some optimisations that can be done.

Note that the code only works for arrays whose sizes are exact multiples of 4. This is an interim step. I will modify it soon, to account for more general array sizes.

I don't know how fast the code is yet - I haven't timed it.

The code uses single precision and SSE2 but I can easily modify it for double precision later.
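
The four-at-a-time idea can be pictured in plain Fortran. This is only an illustration, not the assembler routine: the name dot4 is made up, a four-element array stands in for one 128-bit register holding four single precision values, and a trailing loop is added as one possible way of handling sizes that are not multiples of 4.

```fortran
module dot4_sketch
   implicit none
   integer, parameter :: sp = kind(1.0)   ! single precision
contains
   function dot4(a, b, n) result(s)
      integer,  intent(in) :: n
      real(sp), intent(in) :: a(n), b(n)
      real(sp) :: s
      real(sp) :: acc(4)       ! four partial sums, like one packed register
      integer :: i, n4
      n4 = n - mod(n, 4)       ! largest multiple of 4 not exceeding n
      acc = 0.0_sp
      do i = 1, n4, 4
         acc = acc + a(i:i+3)*b(i:i+3)   ! four products accumulated per step
      end do
      s = sum(acc)             ! horizontal add of the partial sums
      do i = n4 + 1, n         ! leftover 0 to 3 elements
         s = s + a(i)*b(i)
      end do
   end function dot4
end module dot4_sketch
```

Because the partial sums are added in a different order from a simple loop, the result can differ from DOT_PRODUCT in the last few bits, which matters when comparing timings and answers between versions.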

Unfortunately I don't have a machine which supports AVX.

I debugged this with FTN95's debugger. The only issue I had is that I can't view any of the SSE registers. (I had to "poke" values into Fortran variables to see them ;) )

Don't forget it needs to be compiled in win32 mode. (Not dot net).

Anyway here is the code.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

David,

Based on your earlier post, coding of the following form appears to work: