forums.silverfrost.com Forum Index forums.silverfrost.com
Welcome to the Silverfrost forums
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Fortran 2003/2008
Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8, 9, 10  Next
 
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> General
View previous topic :: View next topic  
Author Message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2079
Location: Sydney

PostPosted: Sun May 13, 2012 12:15 pm    Post subject: Reply with quote

I have run them with FTN95 and an old Lahey compiler on my Core i5 to produce the following results:
Code:
Lahey/Fujitsu Fortran 95 Compiler Release 5.50j
Copyright (C) 1994-2000 Lahey Computer Systems.  All rights reserved.
Copyright (C) 1998-2000 FUJITSU LIMITED. All rights reserved.

Options: -o1 -tpp -ntrace -nzero

    10241 equations
     1280 average profile
 13116540 coefficients
   100.07 storage (mb)

Method 1  CPU_time =   12.886  ops/sec =  680.39E+06  Original DO Loop             
Method 2  CPU_time =   29.999  ops/sec =  292.25E+06  F90 syntax                   
Method 3  CPU_time =   10.530  ops/sec =  832.59E+06  F77 wrapper for DO Loop       
Method 4  CPU_time =   10.592  ops/sec =  827.69E+06  F77 wrapper for Dot_Product   
Method 5  CPU_time =   12.636  ops/sec =  693.83E+06  Paul option                   
Method 6  CPU_time =   12.558  ops/sec =  698.14E+06  alternate Paul option         
13085816  Number of dot_product calls
8767279190  Number of itterations


[FTN95/Win32 Ver. 6.10.0 Copyright (c) Silverfrost Ltd 1993-2011]

 /opt /p6 /lgo

Program entered
     10146 equations
      1292 average profile
  13116399 coefficients
    100.07 storage (mb)

 Method 1  CPU_time =   29.469  ops/sec =  300.97E+06  Original DO Loop             
 Method 2  CPU_time =   29.687  ops/sec =  298.75E+06  F90 syntax                   
 Method 3  CPU_time =   10.561  ops/sec =  839.77E+06  F77 wrapper for DO Loop       
 Method 4  CPU_time =   10.561  ops/sec =  839.77E+06  F77 wrapper for Dot_Product   
 Method 5  CPU_time =   29.110  ops/sec =  304.68E+06  Paul option                   
 Method 6  CPU_time =   28.923  ops/sec =  306.65E+06  alternate Paul option         
              13085960 Number of dot_product calls
            8869050198 Number of itterations


The best that either do is 840 million itterations per second, for the F77 wrapper.
Both are a single processor, without vector instructions.
Lahey manages to improve the performance of the other DO loops, but both fail with array sections.
Based on this test only, Salford only appears to be able to optimise simple DO loops.
Again, the difference between Method 4 and Method 2 for both compilers is difficult to understand, especially by old rules of floating point operations counts. There is a lot of nothing going on with Method 2.
I'll need to see what performance can be derived from vector or OpenMP instructions. (Any ideas davidb ?)

John
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 1967
Location: South Pole, Antarctica

PostPosted: Mon May 14, 2012 6:41 pm    Post subject: Reply with quote

Can anyone run this on other compilers?
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 556
Location: UK

PostPosted: Mon May 14, 2012 9:02 pm    Post subject: Reply with quote

My laptop is a bit slower than John's core I5 so I need to re-run using Silverfrost FTN95 run to provide a reference case.

I used the same options for FTN95 as John. For the other compilers I compiled at -O2.

I haven't had time to investigate using OpenMP yet.

Silverfrost FTN95

10146 equations
1292 average profile
13116399 coefficients
100.07 storage (mb)

Method 1 CPU_time = 41.527 ops/sec = 213.57E+06 Original DO Loop

Method 2 CPU_time = 41.309 ops/sec = 214.70E+06 F90 syntax

Method 3 CPU_time = 17.659 ops/sec = 502.23E+06 F77 wrapper for DO Loop

Method 4 CPU_time = 18.112 ops/sec = 489.69E+06 F77 wrapper for Dot_Produ
ct
Method 5 CPU_time = 41.371 ops/sec = 214.38E+06 Paul option

Method 6 CPU_time = 40.919 ops/sec = 216.75E+06 alternate Paul option

13085960 Number of dot_product calls
8869050198 Number of itterations

NAG Fortran Compiler Windows

10164 equations
1290 average profile
13117181 coefficients
100.08 storage (mb)

Method 1 CPU_time = 25.725 ops/sec = 344.00E+06 Original DO Loop
Method 2 CPU_time = 19.141 ops/sec = 462.31E+06 F90 syntax
Method 3 CPU_time = 19.017 ops/sec = 465.35E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 19.079 ops/sec = 463.83E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 21.747 ops/sec = 406.93E+06 Paul option
Method 6 CPU_time = 19.843 ops/sec = 445.96E+06 alternate Paul option
13086688 Number of dot_product calls
8849310852 Number of itterations

Intel ifort Linux

10220 equations
1283 average profile
13116753 coefficients
100.07 storage (mb)

Method 1 CPU_time = 14.473 ops/sec = 607.65E+06 Original DO Loop
Method 2 CPU_time = 14.485 ops/sec = 607.15E+06 F90 syntax
Method 3 CPU_time = 14.301 ops/sec = 614.96E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 14.301 ops/sec = 614.96E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 14.417 ops/sec = 610.01E+06 Paul option
Method 6 CPU_time = 14.441 ops/sec = 609.00E+06 alternate Paul option
13086093 Number of dot_product calls
8794467678 Number of itterations

gfortran linux

10107 equations
1297 average profile
13116884 coefficients
100.07 storage (mb)

Method 1 CPU_time = 16.669 ops/sec = 534.48E+06 Original DO Loop
Method 2 CPU_time = 16.597 ops/sec = 536.80E+06 F90 syntax
Method 3 CPU_time = 16.613 ops/sec = 536.29E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 16.613 ops/sec = 536.29E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 16.605 ops/sec = 536.54E+06 Paul option
Method 6 CPU_time = 16.637 ops/sec = 535.51E+06 alternate Paul option
13086562 Number of dot_product calls
8909344714 Number of itterations

open64 linux

10187 equations
1287 average profile
13116561 coefficients
100.07 storage (mb)

Method 1 CPU_time = 16.649 ops/sec = 530.03E+06 Original DO Loop
Method 2 CPU_time = 16.501 ops/sec = 534.79E+06 F90 syntax
Method 3 CPU_time = 16.633 ops/sec = 530.54E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 16.613 ops/sec = 531.18E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 16.537 ops/sec = 533.62E+06 Paul option
Method 6 CPU_time = 17.929 ops/sec = 492.19E+06 alternate Paul option
13085999 Number of dot_product calls
8824525977 Number of itterations
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl


Last edited by davidb on Mon May 14, 2012 9:04 pm; edited 1 time in total
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 556
Location: UK

PostPosted: Mon May 14, 2012 9:04 pm    Post subject: Reply with quote

and these two:

oracle studio linux

10101 equations
1298 average profile
13116850 coefficients
100.07 storage (mb)

Method 1 CPU_time = 41.175 ops/sec = 216.37E+06 Original DO Loop
Method 2 CPU_time = 40.757 ops/sec = 218.58E+06 F90 syntax
Method 3 CPU_time = 40.993 ops/sec = 217.33E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 40.874 ops/sec = 217.96E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 41.270 ops/sec = 215.87E+06 Paul option
Method 6 CPU_time = 41.051 ops/sec = 217.02E+06 alternate Paul option
13086546 Number of dot_product calls
8908901037 Number of itterations

Absoft Windows

10248 equations
1279 average profile
13115873 coefficients
100.07 storage (mb)

Method 1 CPU_time = 28.205 ops/sec = 310.75E+06 Original DO Loop
Method 2 CPU_time = 28.642 ops/sec = 306.01E+06 F90 syntax
Method 3 CPU_time = 33.805 ops/sec = 259.27E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 33.509 ops/sec = 261.56E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 35.740 ops/sec = 245.24E+06 Paul option
Method 6 CPU_time = 37.612 ops/sec = 233.03E+06 alternate Paul option
13085128 Number of dot_product calls
8764684555 Number of itterations
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2079
Location: Sydney

PostPosted: Tue May 15, 2012 1:52 am    Post subject: Reply with quote

David,

Thanks for your results.
I have tested on a Xeon W3505, which is an old processor, using Salford, Lahey and Intel. Salford and Lahey are similar. Some of the Salford optimised methods fail and non-optimised fail badly.
However for Intel, I get a contra result with the F77 wrapper to the DO loop. This is a worry, as this is my typical (80's) coding style of using libraries of basic routines when writing code.
I have also tested on a Core i5-2540M which supports AVX, however in a skyline solver 50% of arguments are not 16-byte alligned. These 3 options (/o1 /o2 and /QxAVX) show the benefit of vector and AVX instructions.
The most reliable method appears to be 4 : wrapper to Dot-Product, as both Salford and Lahey fail on array sections.
Going back to the 80's approach to optimisation, which was to minimise the number of floating point operations, it is amazing that different options produce such a large spead of run times, in comparison to the time required for the floating point operations ( assuming this to be about 13 seconds on Xeon). The options that fail ( > 20 seconds ) are puzzling.
Code:
The following have been run on a Xeon W3505

  It is now Monday, 14 May 2012 at 11:00:50.973
[FTN95/Win32 Ver. 6.10.0 Copyright (c) Silverfrost Ltd 1993-2011]
ftn95 col_test.f95 /opt /lgo
ftn95 col_test.f95 /opt /p6 /lgo
ftn95 col_test.f95 /opt /p6 /SINGLE_Threaded /lgo
ftn95 col_test.f95 /opt /pe /lgo
     10146 equations
      1292 average profile
  13116399 coefficients
    100.07 storage (mb)

 Method 1  CPU_time =   37.391  ops/sec =  237.20E+06  Original DO Loop             
 Method 2  CPU_time =   37.219  ops/sec =  238.30E+06  F90 syntax                   
 Method 3  CPU_time =   13.125  ops/sec =  675.74E+06  F77 wrapper for DO Loop       
 Method 4  CPU_time =   13.141  ops/sec =  674.93E+06  F77 wrapper for Dot_Product   
 Method 5  CPU_time =   37.188  ops/sec =  238.50E+06  Paul option                   
 Method 6  CPU_time =   36.797  ops/sec =  241.03E+06  alternate Paul option         
              13085960 Number of dot_product calls
            8869050198 Number of itterations

ftn95 col_test.f95 /p6 /SINGLE_Threaded /lgo
ftn95 col_test.f95 /p6 /lgo
ftn95 col_test.f95 /lgo
     10146 equations
      1292 average profile
  13116399 coefficients
    100.07 storage (mb)

 Method 1  CPU_time =   52.172  ops/sec =  170.00E+06  Original DO Loop             
 Method 2  CPU_time =   55.125  ops/sec =  160.89E+06  F90 syntax                   
 Method 3  CPU_time =   37.484  ops/sec =  236.61E+06  F77 wrapper for DO Loop       
 Method 4  CPU_time =   37.469  ops/sec =  236.71E+06  F77 wrapper for Dot_Product   
 Method 5  CPU_time =   52.641  ops/sec =  168.48E+06  Paul option                   
 Method 6  CPU_time =   44.688  ops/sec =  198.47E+06  alternate Paul option         
              13085960 Number of dot_product calls
            8869050198 Number of itterations

ifort 64 Ver 11.1 Build 20100414
ifort /source:col_test.f95 /free /o1 /Qvec-
Microsoft (R) Incremental Linker Version 9.00.21022.08
     10220 equations
      1283 average profile
  13116753 coefficients
    100.07 storage (mb)
 
 Method 1  CPU_time =   13.031  ops/sec =  674.88E+06  Original DO Loop             
 Method 2  CPU_time =   12.984  ops/sec =  677.31E+06  F90 syntax                   
 Method 3  CPU_time =   12.781  ops/sec =  688.08E+06  F77 wrapper for DO Loop       
 Method 4  CPU_time =   12.797  ops/sec =  687.24E+06  F77 wrapper for Dot_Product   
 Method 5  CPU_time =   12.906  ops/sec =  681.41E+06  Paul option                   
 Method 6  CPU_time =   14.594  ops/sec =  602.62E+06  alternate Paul option         
              13086093  Number of dot_product calls
            8794467678  Number of itterations


Last edited by JohnCampbell on Tue May 15, 2012 2:11 am; edited 1 time in total
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2079
Location: Sydney

PostPosted: Tue May 15, 2012 2:00 am    Post subject: Reply with quote

The other results
Code:
ifort /source:col_test.f95 /free /o2
     10220 equations
      1283 average profile
  13116753 coefficients
    100.07 storage (mb)
 
 Method 1  CPU_time =    8.781  ops/sec =    1.00E+09  Original DO Loop             
 Method 2  CPU_time =    8.578  ops/sec =    1.03E+09  F90 syntax                   
 Method 3  CPU_time =   12.812  ops/sec =  686.40E+06  F77 wrapper for DO Loop       
 Method 4  CPU_time =    8.547  ops/sec =    1.03E+09  F77 wrapper for Dot_Product   
 Method 5  CPU_time =    8.578  ops/sec =    1.03E+09  Paul option                   
 Method 6  CPU_time =    8.641  ops/sec =    1.02E+09  alternate Paul option         
              13086093  Number of dot_product calls
            8794467678  Number of itterations

lf95 -o1 -tpp -ntrace -nzero col_test.f95
Lahey/Fujitsu Fortran 95 Compiler Release 5.50j   
    10241 equations
     1280 average profile
 13116540 coefficients
   100.07 storage (mb)

Method 1  CPU_time =   16.859  ops/sec =  520.02E+06  Original DO Loop             
Method 2  CPU_time =   38.328  ops/sec =  228.74E+06  F90 syntax                   
Method 3  CPU_time =   13.203  ops/sec =  664.03E+06  F77 wrapper for DO Loop       
Method 4  CPU_time =   13.219  ops/sec =  663.25E+06  F77 wrapper for Dot_Product   
Method 5  CPU_time =   16.672  ops/sec =  525.87E+06  Paul option                   
Method 6  CPU_time =   16.656  ops/sec =  526.37E+06  alternate Paul option         
13085816  Number of dot_product calls
8767279190  Number of itterations

Results on Core i5-2540M showing the benefit of vector instructions.
Code:
/o1 /Qvec-
     10220 equations
      1283 average profile
  13116753 coefficients
    100.07 storage (mb)
 
 Method 1  CPU_time =   10.655  ops/sec =  825.39E+06  Original DO Loop             
 Method 2  CPU_time =   10.436  ops/sec =  842.67E+06  F90 syntax                   
 Method 3  CPU_time =   10.483  ops/sec =  838.91E+06  F77 wrapper for DO Loop       
 Method 4  CPU_time =   10.405  ops/sec =  845.19E+06  F77 wrapper for Dot_Product   
 Method 5  CPU_time =   10.312  ops/sec =  852.87E+06  Paul option                   
 Method 6  CPU_time =   10.764  ops/sec =  817.02E+06  alternate Paul option         
              13086093  Number of dot_product calls
            8794467678  Number of itterations

/o2
     10220 equations
      1283 average profile
  13116753 coefficients
    100.07 storage (mb)
 
 Method 1  CPU_time =    6.412  ops/sec =    1.37E+09  Original DO Loop             
 Method 2  CPU_time =    6.380  ops/sec =    1.38E+09  F90 syntax                   
 Method 3  CPU_time =   10.343  ops/sec =  850.29E+06  F77 wrapper for DO Loop       
 Method 4  CPU_time =    6.380  ops/sec =    1.38E+09  F77 wrapper for Dot_Product   
 Method 5  CPU_time =    6.380  ops/sec =    1.38E+09  Paul option                   
 Method 6  CPU_time =    6.427  ops/sec =    1.37E+09  alternate Paul option         
              13086093  Number of dot_product calls
            8794467678  Number of itterations

/o2 /QxAVX
     10220 equations
      1283 average profile
  13116753 coefficients
    100.07 storage (mb)
 
 Method 1  CPU_time =    6.037  ops/sec =    1.46E+09  Original DO Loop             
 Method 2  CPU_time =    5.944  ops/sec =    1.48E+09  F90 syntax                   
 Method 3  CPU_time =   14.711  ops/sec =  597.82E+06  F77 wrapper for DO Loop       
 Method 4  CPU_time =    5.944  ops/sec =    1.48E+09  F77 wrapper for Dot_Product   
 Method 5  CPU_time =    5.881  ops/sec =    1.50E+09  Paul option                   
 Method 6  CPU_time =    6.068  ops/sec =    1.45E+09  alternate Paul option         
              13086093  Number of dot_product calls
            8794467678  Number of itterations
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2079
Location: Sydney

PostPosted: Tue May 15, 2012 5:21 am    Post subject: Reply with quote

It is interesting to look at the different trends between options and compilers. A number of interesting ones are:
Intel's poor performance for method 3 worries me as this is my preferred style of programing; with libraries of simple routines.
Absoft method 2 vs Method 4 is unusual.

What is the code doing for "> 30 seconds" would be worth understanding.
Is it a poor instruction set or just not suited to the processor "optimisation"?
FTN95 without /opt is a long way off the pace. Is this due to an old x86 instruction set, which hasn't changed since /P6.

I've been looking at this for years, but feel I don't realy understand.
Look forward to any other ideas.

John
Back to top
View user's profile Send private message
Wilfried Linder



Joined: 14 Nov 2007
Posts: 314
Location: Düsseldorf, Germany

PostPosted: Tue May 15, 2012 12:54 pm    Post subject: Reply with quote

Last Friday, Dan wrote in this thread "... you can do some parallelization with FTN95 and do that NATIVELY what no other compiler can do". Now I noticed something interesting:

I start the task manager and use the fourth tab ("Performance"? / in German "Leistung"). Here I can see the usage of my 4 processor cores in percent. Before I start my FTN95 program, all of them show nearly zero %. Then I start my program, and immediately all cores are working.

Does this really mean that FTN95 make an automatic parallelisation?

Regards - Wilfried
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 1967
Location: South Pole, Antarctica

PostPosted: Tue May 15, 2012 4:01 pm    Post subject: Reply with quote

General conclusion is that FTN95 is not optimized for some array options. Because several other compilers clearly optimized all 6 John's variants and bring consistently similar speed.

Further optimization of FTN95 plus adding AVX instructions may bring factor of 2 at least. Meantime multithreading may give factor of 4 on 4 cores (John - think how to divide the external loop on 4 )

David - was that older Absoft compiler ?
Wilfried - you've probably made Silverfrost's day today - their dream came through Smile You know, large spontaneous mutation may happen, according to theory of evolution.


Last edited by DanRRight on Wed May 16, 2012 12:41 am; edited 3 times in total
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 556
Location: UK

PostPosted: Tue May 15, 2012 5:29 pm    Post subject: Reply with quote

Dan,

Yes. Its a very old version 9.0 of the Absoft compiler. I am sure the newer releases are much faster.

Wilfried,

What you are seeing in task manager is that the operating system is spreading the task of your program across multiple cores and hardware threads. But you don't get more than 100% of a core all together this way. With 2 cores, you may get 50% on one, 50% on another.

With a parallel program it is possible to get 100% on each core, or 200% in total. (Actually, you never get exactly 200% for the full run time of the program, but it can be 199% depending on the program.)

General

I don't fully understand the reason for the difference in speed with FTN95 for methods which are essentially the same. The other compilers show more consistent performance. Paul is having a look at it so we should wait for him to do that.

For John's code, it seems that there is not enough work in the dot product to make parallelisation worthwhile. I agree that the best strategy would be to vectorise the dot product somehow. With FTN95 the only way to do this is to use the inline assembler Confused and write SSE2 or AVX machine code. Or use a different compiler that can vectorise the Fortran code automatically. This should buy you a factor of 2 to 4.

It might be possible to parallelise the outer loop and get another factor of 4 (on a 4 core machine), but I have not studied the code enough to know if this is possible.
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2079
Location: Sydney

PostPosted: Thu May 17, 2012 2:20 am    Post subject: Reply with quote

Dan and David,

The outer loop is basically:
Code:
      A_ptr_0 = 1-IEQ_bot                       ! virtual pointer to A(0)
!
      DO J = JB,JT                              ! loop over range of equations
         {calculate JBAND A_ptr_b and B_ptr_b}
         c = Dot_Product (A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) )
         A(A_ptr_0+J) = A(A_ptr_0+J) - c
      END DO

As A varies with each ittertion and the updated value of A is used in the next itteration, there is a limit to being able to multi-thread.
It might be able to be done by doing:
DO J = JB,JT,4
{perform 4 blocks simultaneously, then the lower 4x4 triangle last}
This is similar to the approach for a multi-block storage requirement.

The most interesting idea is from davidb and "With FTN95 the only way to do this is to use the inline assembler (for Vec_Sum or Dot_Product) and write SSE2 or AVX machine code."
I am not capable of doing this or know the limitations as to how to test if the approach is supported by the processor ?
I'm also trying to understand the issues associated with memory alignment and if this can be better managed. This may also need to be addressed.

Thanks for the ideas. I hope we can all learn something from this exercise.

John
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 1967
Location: South Pole, Antarctica

PostPosted: Thu May 17, 2012 2:37 am    Post subject: Reply with quote

John,
Are threads completely independent in your example above? They do not have to access the same matrix element at the same time

For your task with bandwidth 1500 the most suitable would be using latest NVIDIA Kepler (see today announcement) which has almost exactly such amount of CUDA cores. Your dot product will take less time then you push Enter to complete Smile.

BTW, I hope Silverfrost will add SSE and AVX as an option. Or may be CUDA too to make a true killing machine?
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2079
Location: Sydney

PostPosted: Thu May 17, 2012 2:56 am    Post subject: Reply with quote

Or possibly a library interface of some basic array procedures as a first step.
Dot_Product and Vector_A = Vector_A + const * Vector_B would get my vote.

Also matrix multiplication [A] x [B] or [A]transpose x [B] would be a good start.
And put severe limitations on the arguments, to suit the optimisation, say limiting to REAL*8 arguments and not 2D array sections.

Once one of these could be generated, the others might be able to follow, say as a public domain list of compatioble routines for SSE2 or AXV.

Even proving one function works with FTN95 would be a significant step.

John
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 556
Location: UK

PostPosted: Thu May 17, 2012 10:17 pm    Post subject: Reply with quote

I have written dot product using the Silverfrost FTN95 inline assembler which uses sse2 instructions to vectorise the contribution from successive sets of four values from each array.

I am still learning and there are probably some optimisations that can be done.

Note that the code only works for arrays whose sizes are exact multiples of 4. This is an interim step. I will modify it soon, to account for more general array sizes.

I don't know how fast the code is yet - I haven't timed it.

The code uses single precision and SSE2 but I can easily modify it for double precision later.

Unfortunately I don't have a machine which supports AVX.

I debugged this with FTN95s debugger. The only issue I had is I can't view any of the SSE registers. (I had to "poke" values into Fortran variables to see them Wink )

Don't forget it needs to be compiled in win32 mode. (Not dot net).

Anyway here is the code.

Code:


function asm_dotprod(v1,v2,n)
   integer n
   real v1(*), v2(*), v(4)
   real asm_dotprod

   integer m
   
   m = 4*n

   ! Assembly code is between code, edoc lines
   code
      xorps xmm0%,xmm0%     ; set xmm0 to zero
   
      mov eax%, =v1        ; address of v1 argument
      mov ecx%, =v2        ; address of v2 argument

      mov edx%, 0          ; Initialise loop counter
   
   10 movups xmm1%, [eax%+edx%]  ; move next 4 reals in v1 into xmm1
      movups xmm2%, [ecx%+edx%]  ; move next 4 reals in v2 into xmm2
      mulps xmm1%, xmm2%         ; multiply values and store in xmm1
      addps xmm0%, xmm1%         ; accumulate sum in xmm0
      add edx%, 16                ; increment counter
     
      cmp edx%, m
      jne $10                    ; conditional branch back.
   
      ! Can't get final reduction to work, so will do this in Fortran for now
      movups v, xmm0%            ; move xmm0 to v array
   edoc
   
   ! Final reduction, the result is the sum of the four values in v
   asm_dotprod = sum(v)

end function asm_dotprod


program anon

   ! Note. asm_dotprod only works with arrays of size exact multiples of 4.
   ! Will fix this soon.
   real a(12), b(12)
   
   a = (/ 2.0, 2.0, 3.0, 4.0, 1.0, 3.0, 2.0, 4.0, 1.0, 2.0, 3.0, 1.5 /)
   b = (/ 2.5, 1.5, 2.4, 3.2, 2.0, 1.5, 2.0, 3.0, 1.5, 2.0, 1.5, 3.0 /)

   print *, "Here are the test vectors"
   print *, "a = "
   print *, a
   print *, "b = "
   print *, b

   print *
   print *, "This is the correct value for the dot product, calculated using Fortran"
   print *, dot_product(a,b)

   ! Calculate dot product using assembler and sse2
   s = asm_dotprod(a,b,12)

   print *
   print *, "Here is the value of the dot product calculated using assembly language and sse2"
   print *, s

end program anon


The output is:

Here are the test vectors
a =
2.00000 2.00000 3.00000 4.00000 1.00000
3.00000 2.00000
4.00000 1.00000 2.00000 3.00000 1.50000

b =
2.50000 1.50000 2.40000 3.20000 2.00000
1.50000 2.00000
3.00000 1.50000 2.00000 1.50000 3.00000


This is the correct value for the dot product, calculated using Fortran
65.0000

Here is the value of the dot product calculated using assembly language and sse2
65.0000
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2079
Location: Sydney

PostPosted: Fri May 18, 2012 8:28 am    Post subject: Reply with quote

David,

Based on your earlier post, coding of the following form appears to work:
Code:
      real*8 function vec_sum_4 (a,b,n)
!
      real*8    a(*), b(*)
      integer*4 n
!
      integer*4 i, nn
      real*8    c
!
      nn = (n/4)*4
      c  = 0
      do i = 1,nn,4
         c = c + sum ( a(i:i+3) * b(i:i+3) )
      end do
!
      if (i==n) c = c + c(n)*b(n)
      if (i<n)  c = c + sum ( a(i:n) * b(i:n) )
!
      vec_sum_4 = c
      end

I've tested that it will run, but not yet tested the results or suitability for AVX

John
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> General All times are GMT + 1 Hour
Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8, 9, 10  Next
Page 4 of 10

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group