|
forums.silverfrost.com Welcome to the Silverfrost forums
|
View previous topic :: View next topic |
Author |
Message |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Sun May 13, 2012 12:15 pm Post subject: |
|
|
I have run them with FTN95 and an old Lahey compiler on my Core i5 to produce the following results: Code: | Lahey/Fujitsu Fortran 95 Compiler Release 5.50j
Copyright (C) 1994-2000 Lahey Computer Systems. All rights reserved.
Copyright (C) 1998-2000 FUJITSU LIMITED. All rights reserved.
Options: -o1 -tpp -ntrace -nzero
10241 equations
1280 average profile
13116540 coefficients
100.07 storage (mb)
Method 1 CPU_time = 12.886 ops/sec = 680.39E+06 Original DO Loop
Method 2 CPU_time = 29.999 ops/sec = 292.25E+06 F90 syntax
Method 3 CPU_time = 10.530 ops/sec = 832.59E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 10.592 ops/sec = 827.69E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 12.636 ops/sec = 693.83E+06 Paul option
Method 6 CPU_time = 12.558 ops/sec = 698.14E+06 alternate Paul option
13085816 Number of dot_product calls
8767279190 Number of itterations
[FTN95/Win32 Ver. 6.10.0 Copyright (c) Silverfrost Ltd 1993-2011]
/opt /p6 /lgo
Program entered
10146 equations
1292 average profile
13116399 coefficients
100.07 storage (mb)
Method 1 CPU_time = 29.469 ops/sec = 300.97E+06 Original DO Loop
Method 2 CPU_time = 29.687 ops/sec = 298.75E+06 F90 syntax
Method 3 CPU_time = 10.561 ops/sec = 839.77E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 10.561 ops/sec = 839.77E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 29.110 ops/sec = 304.68E+06 Paul option
Method 6 CPU_time = 28.923 ops/sec = 306.65E+06 alternate Paul option
13085960 Number of dot_product calls
8869050198 Number of itterations
|
The best that either do is 840 million itterations per second, for the F77 wrapper.
Both are a single processor, without vector instructions.
Lahey manages to improve the performance of the other DO loops, but both fail with array sections.
Based on this test only, Salford only appears to be able to optimise simple DO loops.
Again, the difference between Method 4 and Method 2 for both compilers is difficult to understand, especially by old rules of floating point operations counts. There is a lot of nothing going on with Method 2.
I'll need to see what performance can be derived from vector or OpenMP instructions. (Any ideas davidb ?)
John |
|
Back to top |
|
|
DanRRight
Joined: 10 Mar 2008 Posts: 2816 Location: South Pole, Antarctica
|
Posted: Mon May 14, 2012 6:41 pm Post subject: |
|
|
Can anyone run this on other compilers? |
|
Back to top |
|
|
davidb
Joined: 17 Jul 2009 Posts: 560 Location: UK
|
Posted: Mon May 14, 2012 9:02 pm Post subject: |
|
|
My laptop is a bit slower than John's core I5 so I need to re-run using Silverfrost FTN95 run to provide a reference case.
I used the same options for FTN95 as John. For the other compilers I compiled at -O2.
I haven't had time to investigate using OpenMP yet.
Silverfrost FTN95
10146 equations
1292 average profile
13116399 coefficients
100.07 storage (mb)
Method 1 CPU_time = 41.527 ops/sec = 213.57E+06 Original DO Loop
Method 2 CPU_time = 41.309 ops/sec = 214.70E+06 F90 syntax
Method 3 CPU_time = 17.659 ops/sec = 502.23E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 18.112 ops/sec = 489.69E+06 F77 wrapper for Dot_Produ
ct
Method 5 CPU_time = 41.371 ops/sec = 214.38E+06 Paul option
Method 6 CPU_time = 40.919 ops/sec = 216.75E+06 alternate Paul option
13085960 Number of dot_product calls
8869050198 Number of itterations
NAG Fortran Compiler Windows
10164 equations
1290 average profile
13117181 coefficients
100.08 storage (mb)
Method 1 CPU_time = 25.725 ops/sec = 344.00E+06 Original DO Loop
Method 2 CPU_time = 19.141 ops/sec = 462.31E+06 F90 syntax
Method 3 CPU_time = 19.017 ops/sec = 465.35E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 19.079 ops/sec = 463.83E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 21.747 ops/sec = 406.93E+06 Paul option
Method 6 CPU_time = 19.843 ops/sec = 445.96E+06 alternate Paul option
13086688 Number of dot_product calls
8849310852 Number of itterations
Intel ifort Linux
10220 equations
1283 average profile
13116753 coefficients
100.07 storage (mb)
Method 1 CPU_time = 14.473 ops/sec = 607.65E+06 Original DO Loop
Method 2 CPU_time = 14.485 ops/sec = 607.15E+06 F90 syntax
Method 3 CPU_time = 14.301 ops/sec = 614.96E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 14.301 ops/sec = 614.96E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 14.417 ops/sec = 610.01E+06 Paul option
Method 6 CPU_time = 14.441 ops/sec = 609.00E+06 alternate Paul option
13086093 Number of dot_product calls
8794467678 Number of itterations
gfortran linux
10107 equations
1297 average profile
13116884 coefficients
100.07 storage (mb)
Method 1 CPU_time = 16.669 ops/sec = 534.48E+06 Original DO Loop
Method 2 CPU_time = 16.597 ops/sec = 536.80E+06 F90 syntax
Method 3 CPU_time = 16.613 ops/sec = 536.29E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 16.613 ops/sec = 536.29E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 16.605 ops/sec = 536.54E+06 Paul option
Method 6 CPU_time = 16.637 ops/sec = 535.51E+06 alternate Paul option
13086562 Number of dot_product calls
8909344714 Number of itterations
open64 linux
10187 equations
1287 average profile
13116561 coefficients
100.07 storage (mb)
Method 1 CPU_time = 16.649 ops/sec = 530.03E+06 Original DO Loop
Method 2 CPU_time = 16.501 ops/sec = 534.79E+06 F90 syntax
Method 3 CPU_time = 16.633 ops/sec = 530.54E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 16.613 ops/sec = 531.18E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 16.537 ops/sec = 533.62E+06 Paul option
Method 6 CPU_time = 17.929 ops/sec = 492.19E+06 alternate Paul option
13085999 Number of dot_product calls
8824525977 Number of itterations _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
Last edited by davidb on Mon May 14, 2012 9:04 pm; edited 1 time in total |
|
Back to top |
|
|
davidb
Joined: 17 Jul 2009 Posts: 560 Location: UK
|
Posted: Mon May 14, 2012 9:04 pm Post subject: |
|
|
and these two:
oracle studio linux
10101 equations
1298 average profile
13116850 coefficients
100.07 storage (mb)
Method 1 CPU_time = 41.175 ops/sec = 216.37E+06 Original DO Loop
Method 2 CPU_time = 40.757 ops/sec = 218.58E+06 F90 syntax
Method 3 CPU_time = 40.993 ops/sec = 217.33E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 40.874 ops/sec = 217.96E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 41.270 ops/sec = 215.87E+06 Paul option
Method 6 CPU_time = 41.051 ops/sec = 217.02E+06 alternate Paul option
13086546 Number of dot_product calls
8908901037 Number of itterations
Absoft Windows
10248 equations
1279 average profile
13115873 coefficients
100.07 storage (mb)
Method 1 CPU_time = 28.205 ops/sec = 310.75E+06 Original DO Loop
Method 2 CPU_time = 28.642 ops/sec = 306.01E+06 F90 syntax
Method 3 CPU_time = 33.805 ops/sec = 259.27E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 33.509 ops/sec = 261.56E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 35.740 ops/sec = 245.24E+06 Paul option
Method 6 CPU_time = 37.612 ops/sec = 233.03E+06 alternate Paul option
13085128 Number of dot_product calls
8764684555 Number of itterations _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl |
|
Back to top |
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Tue May 15, 2012 1:52 am Post subject: |
|
|
David,
Thanks for your results.
I have tested on a Xeon W3505, which is an old processor, using Salford, Lahey and Intel. Salford and Lahey are similar. Some of the Salford optimised methods fail and non-optimised fail badly.
However for Intel, I get a contra result with the F77 wrapper to the DO loop. This is a worry, as this is my typical (80's) coding style of using libraries of basic routines when writing code.
I have also tested on a Core i5-2540M which supports AVX, however in a skyline solver 50% of arguments are not 16-byte alligned. These 3 options (/o1 /o2 and /QxAVX) show the benefit of vector and AVX instructions.
The most reliable method appears to be 4 : wrapper to Dot-Product, as both Salford and Lahey fail on array sections.
Going back to the 80's approach to optimisation, which was to minimise the number of floating point operations, it is amazing that different options produce such a large spead of run times, in comparison to the time required for the floating point operations ( assuming this to be about 13 seconds on Xeon). The options that fail ( > 20 seconds ) are puzzling.
Code: | The following have been run on a Xeon W3505
It is now Monday, 14 May 2012 at 11:00:50.973
[FTN95/Win32 Ver. 6.10.0 Copyright (c) Silverfrost Ltd 1993-2011]
ftn95 col_test.f95 /opt /lgo
ftn95 col_test.f95 /opt /p6 /lgo
ftn95 col_test.f95 /opt /p6 /SINGLE_Threaded /lgo
ftn95 col_test.f95 /opt /pe /lgo
10146 equations
1292 average profile
13116399 coefficients
100.07 storage (mb)
Method 1 CPU_time = 37.391 ops/sec = 237.20E+06 Original DO Loop
Method 2 CPU_time = 37.219 ops/sec = 238.30E+06 F90 syntax
Method 3 CPU_time = 13.125 ops/sec = 675.74E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 13.141 ops/sec = 674.93E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 37.188 ops/sec = 238.50E+06 Paul option
Method 6 CPU_time = 36.797 ops/sec = 241.03E+06 alternate Paul option
13085960 Number of dot_product calls
8869050198 Number of itterations
ftn95 col_test.f95 /p6 /SINGLE_Threaded /lgo
ftn95 col_test.f95 /p6 /lgo
ftn95 col_test.f95 /lgo
10146 equations
1292 average profile
13116399 coefficients
100.07 storage (mb)
Method 1 CPU_time = 52.172 ops/sec = 170.00E+06 Original DO Loop
Method 2 CPU_time = 55.125 ops/sec = 160.89E+06 F90 syntax
Method 3 CPU_time = 37.484 ops/sec = 236.61E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 37.469 ops/sec = 236.71E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 52.641 ops/sec = 168.48E+06 Paul option
Method 6 CPU_time = 44.688 ops/sec = 198.47E+06 alternate Paul option
13085960 Number of dot_product calls
8869050198 Number of itterations
ifort 64 Ver 11.1 Build 20100414
ifort /source:col_test.f95 /free /o1 /Qvec-
Microsoft (R) Incremental Linker Version 9.00.21022.08
10220 equations
1283 average profile
13116753 coefficients
100.07 storage (mb)
Method 1 CPU_time = 13.031 ops/sec = 674.88E+06 Original DO Loop
Method 2 CPU_time = 12.984 ops/sec = 677.31E+06 F90 syntax
Method 3 CPU_time = 12.781 ops/sec = 688.08E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 12.797 ops/sec = 687.24E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 12.906 ops/sec = 681.41E+06 Paul option
Method 6 CPU_time = 14.594 ops/sec = 602.62E+06 alternate Paul option
13086093 Number of dot_product calls
8794467678 Number of itterations
|
Last edited by JohnCampbell on Tue May 15, 2012 2:11 am; edited 1 time in total |
|
Back to top |
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Tue May 15, 2012 2:00 am Post subject: |
|
|
The other results Code: | ifort /source:col_test.f95 /free /o2
10220 equations
1283 average profile
13116753 coefficients
100.07 storage (mb)
Method 1 CPU_time = 8.781 ops/sec = 1.00E+09 Original DO Loop
Method 2 CPU_time = 8.578 ops/sec = 1.03E+09 F90 syntax
Method 3 CPU_time = 12.812 ops/sec = 686.40E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 8.547 ops/sec = 1.03E+09 F77 wrapper for Dot_Product
Method 5 CPU_time = 8.578 ops/sec = 1.03E+09 Paul option
Method 6 CPU_time = 8.641 ops/sec = 1.02E+09 alternate Paul option
13086093 Number of dot_product calls
8794467678 Number of itterations
lf95 -o1 -tpp -ntrace -nzero col_test.f95
Lahey/Fujitsu Fortran 95 Compiler Release 5.50j
10241 equations
1280 average profile
13116540 coefficients
100.07 storage (mb)
Method 1 CPU_time = 16.859 ops/sec = 520.02E+06 Original DO Loop
Method 2 CPU_time = 38.328 ops/sec = 228.74E+06 F90 syntax
Method 3 CPU_time = 13.203 ops/sec = 664.03E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 13.219 ops/sec = 663.25E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 16.672 ops/sec = 525.87E+06 Paul option
Method 6 CPU_time = 16.656 ops/sec = 526.37E+06 alternate Paul option
13085816 Number of dot_product calls
8767279190 Number of itterations
|
Results on Core i5-2540M showing the benefit of vector instructions.
Code: | /o1 /Qvec-
10220 equations
1283 average profile
13116753 coefficients
100.07 storage (mb)
Method 1 CPU_time = 10.655 ops/sec = 825.39E+06 Original DO Loop
Method 2 CPU_time = 10.436 ops/sec = 842.67E+06 F90 syntax
Method 3 CPU_time = 10.483 ops/sec = 838.91E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 10.405 ops/sec = 845.19E+06 F77 wrapper for Dot_Product
Method 5 CPU_time = 10.312 ops/sec = 852.87E+06 Paul option
Method 6 CPU_time = 10.764 ops/sec = 817.02E+06 alternate Paul option
13086093 Number of dot_product calls
8794467678 Number of itterations
/o2
10220 equations
1283 average profile
13116753 coefficients
100.07 storage (mb)
Method 1 CPU_time = 6.412 ops/sec = 1.37E+09 Original DO Loop
Method 2 CPU_time = 6.380 ops/sec = 1.38E+09 F90 syntax
Method 3 CPU_time = 10.343 ops/sec = 850.29E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 6.380 ops/sec = 1.38E+09 F77 wrapper for Dot_Product
Method 5 CPU_time = 6.380 ops/sec = 1.38E+09 Paul option
Method 6 CPU_time = 6.427 ops/sec = 1.37E+09 alternate Paul option
13086093 Number of dot_product calls
8794467678 Number of itterations
/o2 /QxAVX
10220 equations
1283 average profile
13116753 coefficients
100.07 storage (mb)
Method 1 CPU_time = 6.037 ops/sec = 1.46E+09 Original DO Loop
Method 2 CPU_time = 5.944 ops/sec = 1.48E+09 F90 syntax
Method 3 CPU_time = 14.711 ops/sec = 597.82E+06 F77 wrapper for DO Loop
Method 4 CPU_time = 5.944 ops/sec = 1.48E+09 F77 wrapper for Dot_Product
Method 5 CPU_time = 5.881 ops/sec = 1.50E+09 Paul option
Method 6 CPU_time = 6.068 ops/sec = 1.45E+09 alternate Paul option
13086093 Number of dot_product calls
8794467678 Number of itterations
|
|
|
Back to top |
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Tue May 15, 2012 5:21 am Post subject: |
|
|
It is interesting to look at the different trends between options and compilers. A number of interesting ones are:
Intel's poor performance for method 3 worries me as this is my preferred style of programing; with libraries of simple routines.
Absoft method 2 vs Method 4 is unusual.
What is the code doing for "> 30 seconds" would be worth understanding.
Is it a poor instruction set or just not suited to the processor "optimisation"?
FTN95 without /opt is a long way off the pace. Is this due to an old x86 instruction set, which hasn't changed since /P6.
I've been looking at this for years, but feel I don't realy understand.
Look forward to any other ideas.
John |
|
Back to top |
|
|
Wilfried Linder
Joined: 14 Nov 2007 Posts: 314 Location: Düsseldorf, Germany
|
Posted: Tue May 15, 2012 12:54 pm Post subject: |
|
|
Last Friday, Dan wrote in this thread "... you can do some parallelization with FTN95 and do that NATIVELY what no other compiler can do". Now I noticed something interesting:
I start the task manager and use the fourth tab ("Performance"? / in German "Leistung"). Here I can see the usage of my 4 processor cores in percent. Before I start my FTN95 program, all of them show nearly zero %. Then I start my program, and immediately all cores are working.
Does this really mean that FTN95 make an automatic parallelisation?
Regards - Wilfried |
|
Back to top |
|
|
DanRRight
Joined: 10 Mar 2008 Posts: 2816 Location: South Pole, Antarctica
|
Posted: Tue May 15, 2012 4:01 pm Post subject: |
|
|
General conclusion is that FTN95 is not optimized for some array options. Because several other compilers clearly optimized all 6 John's variants and bring consistently similar speed.
Further optimization of FTN95 plus adding AVX instructions may bring factor of 2 at least. Meantime multithreading may give factor of 4 on 4 cores (John - think how to divide the external loop on 4 )
David - was that older Absoft compiler ?
Wilfried - you've probably made Silverfrost's day today - their dream came through You know, large spontaneous mutation may happen, according to theory of evolution.
Last edited by DanRRight on Wed May 16, 2012 12:41 am; edited 3 times in total |
|
Back to top |
|
|
davidb
Joined: 17 Jul 2009 Posts: 560 Location: UK
|
Posted: Tue May 15, 2012 5:29 pm Post subject: |
|
|
Dan,
Yes. Its a very old version 9.0 of the Absoft compiler. I am sure the newer releases are much faster.
Wilfried,
What you are seeing in task manager is that the operating system is spreading the task of your program across multiple cores and hardware threads. But you don't get more than 100% of a core all together this way. With 2 cores, you may get 50% on one, 50% on another.
With a parallel program it is possible to get 100% on each core, or 200% in total. (Actually, you never get exactly 200% for the full run time of the program, but it can be 199% depending on the program.)
General
I don't fully understand the reason for the difference in speed with FTN95 for methods which are essentially the same. The other compilers show more consistent performance. Paul is having a look at it so we should wait for him to do that.
For John's code, it seems that there is not enough work in the dot product to make parallelisation worthwhile. I agree that the best strategy would be to vectorise the dot product somehow. With FTN95 the only way to do this is to use the inline assembler and write SSE2 or AVX machine code. Or use a different compiler that can vectorise the Fortran code automatically. This should buy you a factor of 2 to 4.
It might be possible to parallelise the outer loop and get another factor of 4 (on a 4 core machine), but I have not studied the code enough to know if this is possible. _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl |
|
Back to top |
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Thu May 17, 2012 2:20 am Post subject: |
|
|
Dan and David,
The outer loop is basically:
Code: | A_ptr_0 = 1-IEQ_bot ! virtual pointer to A(0)
!
DO J = JB,JT ! loop over range of equations
{calculate JBAND A_ptr_b and B_ptr_b}
c = Dot_Product (A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) )
A(A_ptr_0+J) = A(A_ptr_0+J) - c
END DO
|
As A varies with each ittertion and the updated value of A is used in the next itteration, there is a limit to being able to multi-thread.
It might be able to be done by doing:
DO J = JB,JT,4
{perform 4 blocks simultaneously, then the lower 4x4 triangle last}
This is similar to the approach for a multi-block storage requirement.
The most interesting idea is from davidb and "With FTN95 the only way to do this is to use the inline assembler (for Vec_Sum or Dot_Product) and write SSE2 or AVX machine code."
I am not capable of doing this or know the limitations as to how to test if the approach is supported by the processor ?
I'm also trying to understand the issues associated with memory alignment and if this can be better managed. This may also need to be addressed.
Thanks for the ideas. I hope we can all learn something from this exercise.
John |
|
Back to top |
|
|
DanRRight
Joined: 10 Mar 2008 Posts: 2816 Location: South Pole, Antarctica
|
Posted: Thu May 17, 2012 2:37 am Post subject: |
|
|
John,
Are threads completely independent in your example above? They do not have to access the same matrix element at the same time
For your task with bandwidth 1500 the most suitable would be using latest NVIDIA Kepler (see today announcement) which has almost exactly such amount of CUDA cores. Your dot product will take less time then you push Enter to complete .
BTW, I hope Silverfrost will add SSE and AVX as an option. Or may be CUDA too to make a true killing machine? |
|
Back to top |
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Thu May 17, 2012 2:56 am Post subject: |
|
|
Or possibly a library interface of some basic array procedures as a first step.
Dot_Product and Vector_A = Vector_A + const * Vector_B would get my vote.
Also matrix multiplication [A] x [B] or [A]transpose x [B] would be a good start.
And put severe limitations on the arguments, to suit the optimisation, say limiting to REAL*8 arguments and not 2D array sections.
Once one of these could be generated, the others might be able to follow, say as a public domain list of compatioble routines for SSE2 or AXV.
Even proving one function works with FTN95 would be a significant step.
John |
|
Back to top |
|
|
davidb
Joined: 17 Jul 2009 Posts: 560 Location: UK
|
Posted: Thu May 17, 2012 10:17 pm Post subject: |
|
|
I have written dot product using the Silverfrost FTN95 inline assembler which uses sse2 instructions to vectorise the contribution from successive sets of four values from each array.
I am still learning and there are probably some optimisations that can be done.
Note that the code only works for arrays whose sizes are exact multiples of 4. This is an interim step. I will modify it soon, to account for more general array sizes.
I don't know how fast the code is yet - I haven't timed it.
The code uses single precision and SSE2 but I can easily modify it for double precision later.
Unfortunately I don't have a machine which supports AVX.
I debugged this with FTN95s debugger. The only issue I had is I can't view any of the SSE registers. (I had to "poke" values into Fortran variables to see them )
Don't forget it needs to be compiled in win32 mode. (Not dot net).
Anyway here is the code.
Code: |
function asm_dotprod(v1,v2,n)
integer n
real v1(*), v2(*), v(4)
real asm_dotprod
integer m
m = 4*n
! Assembly code is between code, edoc lines
code
xorps xmm0%,xmm0% ; set xmm0 to zero
mov eax%, =v1 ; address of v1 argument
mov ecx%, =v2 ; address of v2 argument
mov edx%, 0 ; Initialise loop counter
10 movups xmm1%, [eax%+edx%] ; move next 4 reals in v1 into xmm1
movups xmm2%, [ecx%+edx%] ; move next 4 reals in v2 into xmm2
mulps xmm1%, xmm2% ; multiply values and store in xmm1
addps xmm0%, xmm1% ; accumulate sum in xmm0
add edx%, 16 ; increment counter
cmp edx%, m
jne $10 ; conditional branch back.
! Can't get final reduction to work, so will do this in Fortran for now
movups v, xmm0% ; move xmm0 to v array
edoc
! Final reduction, the result is the sum of the four values in v
asm_dotprod = sum(v)
end function asm_dotprod
program anon
! Note. asm_dotprod only works with arrays of size exact multiples of 4.
! Will fix this soon.
real a(12), b(12)
a = (/ 2.0, 2.0, 3.0, 4.0, 1.0, 3.0, 2.0, 4.0, 1.0, 2.0, 3.0, 1.5 /)
b = (/ 2.5, 1.5, 2.4, 3.2, 2.0, 1.5, 2.0, 3.0, 1.5, 2.0, 1.5, 3.0 /)
print *, "Here are the test vectors"
print *, "a = "
print *, a
print *, "b = "
print *, b
print *
print *, "This is the correct value for the dot product, calculated using Fortran"
print *, dot_product(a,b)
! Calculate dot product using assembler and sse2
s = asm_dotprod(a,b,12)
print *
print *, "Here is the value of the dot product calculated using assembly language and sse2"
print *, s
end program anon
|
The output is:
Here are the test vectors
a =
2.00000 2.00000 3.00000 4.00000 1.00000
3.00000 2.00000
4.00000 1.00000 2.00000 3.00000 1.50000
b =
2.50000 1.50000 2.40000 3.20000 2.00000
1.50000 2.00000
3.00000 1.50000 2.00000 1.50000 3.00000
This is the correct value for the dot product, calculated using Fortran
65.0000
Here is the value of the dot product calculated using assembly language and sse2
65.0000 _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl |
|
Back to top |
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Fri May 18, 2012 8:28 am Post subject: |
|
|
David,
Based on your earlier post, coding of the following form appears to work: Code: | real*8 function vec_sum_4 (a,b,n)
!
real*8 a(*), b(*)
integer*4 n
!
integer*4 i, nn
real*8 c
!
nn = (n/4)*4
c = 0
do i = 1,nn,4
c = c + sum ( a(i:i+3) * b(i:i+3) )
end do
!
if (i==n) c = c + c(n)*b(n)
if (i<n) c = c + sum ( a(i:n) * b(i:n) )
!
vec_sum_4 = c
end |
I've tested that it will run, but not yet tested the results or suitability for AVX
John |
|
Back to top |
|
|
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
Powered by phpBB © 2001, 2005 phpBB Group
|