
forums.silverfrost.com Welcome to the Silverfrost forums

View previous topic :: View next topic 
Author 
Message 
davidb
Joined: 17 Jul 2009 Posts: 557 Location: UK

Posted: Thu May 31, 2012 11:36 pm Post subject: 


Here is the fast version of the dot product for double precision arrays.
Features: (1) finds alignments and uses fast movapd, mulpd when possible, (2) decreases loop counters, (3) unwraps vectorised loops, (4) out of order execution.
Code: 
!** DOUBLE PRECISION VERSION
function fast_asm_ddotprod(v1,v2,n)
integer n
real*8 v1(*), v2(*)
real*8 fast_asm_ddotprod
integer nn, m
real*8 :: v(2)
! Number of sets of 4
nn = n/4
! Number of residuals
m = n  4*nn
! Assembly code is between code, edoc lines
code
xorps xmm0%, xmm0% ; set xmm0 to zero
xorps xmm1%, xmm1% ; set xmm1 to zero
mov eax%, =v1 ; address of v1
mov ecx%, =v2 ; address of v2
mov edx%, nn ; initialise loop counter for vectorised loop
mov edi%, m ; initialise loop counter for scalar loop
cmp edx%, 0
jz $60 ; no terms in vectorized loop
! Check for alignment of v1(1) and v2(1) on 16 byte boundaries
and eax%, 15 ; low nibble of v1 address
and ecx%, 15 ; low nibble of v2 address
cmp eax%, 0
jnz $10 ; v1(1) is unaligned, check for alignment of v1(2)
cmp ecx%, 0
jnz $10 ; v2(1) is unaligned, check for alignment of v2(2)
! v1(1) and v2(1) are aligned on 16 byte boundaries
mov eax%, =v1 ; address of v1
mov ecx%, =v2 ; address of v2
jmp $40 ; branch to aligned vectorized loop
! Check for alignment of v1(2) and v2(2) on 16 byte boundaries
10 xor eax%, 8
jnz $20 ; branch to unaligned vectorized loop
xor ecx%, 8
jnz $20 ; branch to unaligned vectorized loop
! v1(2) and v2(2) are aligned on 16 byte boundaries
mov eax%, =v1 ; address of v1 argument
mov ecx%, =v2 ; address of v2 argument
movsd xmm0%, [eax%]
mulsd xmm0%, [ecx%] ; initialise with product v1(1)*v2(1)
lea eax%, [eax%+8] ; address of v1(2)
lea ecx%, [ecx%+8] ; address of v2(2)
dec edi% ; decrease loop counter for scalar loop
jge $40 ; branch to aligned vectorized loop
mov edi%, 3 ; adjust loop counter for scalar loop
dec edx% ; adjust loop counter for vectorised loop
jz $60 ; no terms in vectorized loop
jmp $40 ; branch to aligned vectorized loop

_________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
Last edited by davidb on Fri Jun 01, 2012 12:04 am; edited 4 times in total 

Back to top 


davidb
Joined: 17 Jul 2009 Posts: 557 Location: UK

Posted: Thu May 31, 2012 11:41 pm Post subject: 


continued ....
Code: 
! Initialise unaligned vectorised loop
20 mov eax%, =v1 ; address of v1 argument
mov ecx%, =v2 ; address of v2 argument
! Unaligned vectorised loop to accumulate dot product of elements 1 to nn
30 movupd xmm2%, [eax%] ; move next 2 doubles in v1 into xmm2
movupd xmm4%, [eax%+16] ; move next 2 doubles in v1 into xmm4
movupd xmm3%, [ecx%] ; move next 2 doubles in v2 into xmm3
movupd xmm5%, [ecx%+16] ; move next 2 doubles in v2 into xmm5
mulpd xmm2%, xmm3% ; multiply values and store in xmm2
mulpd xmm4%, xmm5% ; multiply values and store in xmm4
addpd xmm0%, xmm2% ; accumulate sum in xmm0
addpd xmm1%, xmm4% ; accumulate sum in xmm1
lea eax%, [eax%+32]
lea ecx%, [ecx%+32]
dec edx% ; decrement counter
jnz $30 ; conditional branch back
jmp $50
! Aligned vectorised loop to accumulate dot product of elements 1 to nn
40 movapd xmm2%, [eax%] ; move next 2 doubles in v1 into xmm2
movapd xmm4%, [eax%+16] ; move next 2 doubles in v1 into xmm4
mulpd xmm2%, [ecx%] ; multiply values and store in xmm2
mulpd xmm4%, [ecx%+16] ; multiply values and store in xmm4
addpd xmm0%, xmm2% ; accumulate sum in xmm0
addpd xmm1%, xmm4% ; accumulate sum in xmm1
lea eax%, [eax%+32]
lea ecx%, [ecx%+32]
dec edx% ; decrement counter
jnz $40 ; conditional branch back
50 addpd xmm0%, xmm1% ; add xmm0 and xmm1
! Check for entries in the scalar loop
60 cmp edi%, 0
jz $80 ; conditional skip loop, no terms in scalar loop
! Scalar loop to accumulate dot product of last nn+1 to n elements in low 64 bits of xmm0
70 movsd xmm1%, [eax%]
mulsd xmm1%, [ecx%]
addsd xmm0%, xmm1%
lea eax%, [eax%+8]
lea ecx%, [ecx%+8]
dec edi%
jnz $70
! Can't get final reduction to work, so will do this in Fortran for now
80 movupd v, xmm0% ; move xmm0 to v array
edoc
! Final reduction, the result is the sum of the two values in v
fast_asm_ddotprod = sum(v)
end function fast_asm_ddotprod

_________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 2172 Location: Sydney

Posted: Fri Jun 01, 2012 8:26 am Post subject: 


David,
I have tested your revised example and confirmed that it has better performance. I have now tested 3 of the options you proposed. I am including the run test for 4 different approaches, using /opt compilation.
1: Original Vec_Sum function, using a DO loop
2: Modified loop, using your 4 group suggestion
3: asm_ddotprod
4: fast_asm_ddotprod
Code:  ! 4 Group coding
nn = mod (n, 4)
if (nn < 1) c = 0.
if (nn == 1) c = a(1)*b(1)
if (nn > 1) c = sum (a(1:nn)*b(1:nn))
!
do k = nn+1,n,4
c = c + sum ( a(K:K+3) * b(K:K+3) )
end do 
I have run all 4 in my FEA program, using the problem I have been reporting.
The run times are:
Code:  option 1 Vec_sum loop
CPU TIMES FOR EQUATION REDUCTION
Description Num Cpu I/O Avg
1 READ BLOCK A 4 0.97 0.04 0.252
3 MOVE BLOCK B 3 0.44 0.01 0.142
4 REDUCE BLOCK A 4 296.87 0.69 74.389
5 APPLY BLOCK B 3 5.63 0.05 1.892
6 WRITE BLOCK A 4 1.23 0.00 0.309
99 TOTAL REDUCTION 4 305.14 0.77 76.477
option 2 dotprod_4
CPU TIMES FOR EQUATION REDUCTION
Description Num Cpu I/O Avg
1 READ BLOCK A 4 1.01 0.00 0.254
3 MOVE BLOCK B 3 0.47 0.02 0.148
4 REDUCE BLOCK A 4 330.86 2.16 83.255
5 APPLY BLOCK B 3 6.32 0.12 2.146
6 WRITE BLOCK A 4 1.19 0.03 0.303
99 TOTAL REDUCTION 4 339.86 2.27 85.533
option 3 asm_ddotprod
CPU TIMES FOR EQUATION REDUCTION
Description Num Cpu I/O Avg
1 READ BLOCK A 4 0.76 0.01 0.193
3 MOVE BLOCK B 3 0.42 0.01 0.142
4 REDUCE BLOCK A 4 204.36 0.89 51.314
5 APPLY BLOCK B 3 3.82 0.01 1.278
6 WRITE BLOCK A 4 1.05 0.00 0.260
99 TOTAL REDUCTION 4 210.41 0.92 52.833
option 4 fast_asm_ddotprod
CPU TIMES FOR EQUATION REDUCTION
Description Num Cpu I/O Avg
1 READ BLOCK A 4 0.76 0.01 0.194
3 MOVE BLOCK B 3 0.42 0.00 0.140
4 REDUCE BLOCK A 4 188.43 0.50 47.233
5 APPLY BLOCK B 3 3.56 0.01 1.183
6 WRITE BLOCK A 4 0.98 0.00 0.247
99 TOTAL REDUCTION 4 194.17 0.49 48.667

where:
Num is the number of blocks used
Cpu is found from GetProcessTimes
I/O is elapse time difference between QueryPerform and Cpu
These results demonstrate that your approach has significant performance improvement and that SSE vector instructions can be used in FTN95 to improve performance.
John
ps : the other main loop of the equation solver is in the back substitution where :
Code:  SUBROUTINE VECSUB (A, B, FAC, N)
!
! Performs the vector opperation [A] = [A]  [B] * FAC
!
integer*4 n, i
real*8 a(*), b(*), fac, c
!
c = fac
do i = n,1,1
a(i) = a(i) + b(i) * c
end do
return
!
END



Back to top 


davidb
Joined: 17 Jul 2009 Posts: 557 Location: UK

Posted: Fri Jun 01, 2012 4:11 pm Post subject: 


I will post up some sse2 code to do the back substitution bit at the weekend. If anything, it is easier than the dotproduct.
Don't forget that you probably aren't getting full benefit from alignment of data, and you might get some more performance once ALIGN(128) is fixed, but this wouldn't get released until towards the end of the year. In the short term, you might experiment with ALIGN(64) in your declaration in the big array module. This should give alignment on 16 byte boundaries half of the time.
Then longer term, you need to think about how to use all of the cores on your computer, because I still think there's another factor of 24 improved performance to obtain by parallelising the outer loop somehow.
DanRRight has shown one way this might be possible if you're sticking with FTN95. I will have a little play with this method at the weekend to see if I can make it work with your problem. _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl 

Back to top 


davidb
Joined: 17 Jul 2009 Posts: 557 Location: UK

Posted: Fri Jun 01, 2012 4:31 pm Post subject: Re: 


JohnCampbell wrote: 
Code:  ! 4 Group coding
nn = mod (n, 4)
if (nn < 1) c = 0.
if (nn == 1) c = a(1)*b(1)
if (nn > 1) c = sum (a(1:nn)*b(1:nn))
!
do k = nn+1,n,4
c = c + sum ( a(K:K+3) * b(K:K+3) )
end do 

You can save the expense of the three tests at the start because sum() works when there is 1 or 0 terms in the array:
Code:  ! 4 Group coding
nn = mod (n, 4)
! sum intrinsic returns 0.0 when array is empty
c = sum (a(1:nn)*b(1:nn))
do k = nn+1,n,4
c = c + sum ( a(K:K+3) * b(K:K+3) )
end do

In Fortran 90, empty arrays are always defined, and it is legal to multiply two together to create another empty array. And the sum of the elements in an empty array is zero.
Its academic anyway, since this approach isn't very fast. The unrolling is basically what is done in fast_asm_ddotprod but there I only unroll 2 times. I tried to do 4 but ran out of registers. _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl 

Back to top 


LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2105 Location: Yateley, Hants, UK

Posted: Fri Jun 01, 2012 5:05 pm Post subject: 


Once again, a huge thank you to David for his efforts. This isn't just a working routine, it is a clear demonstration of the speed gains possible with SSE. However, this is a very obvious application. Are there others?
I note that the additional SSE4 registers are only available in 64 bit mode, and with those available, operations such as this could become even faster ...
Eddie 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 2172 Location: Sydney

Posted: Sat Jun 02, 2012 1:59 am Post subject: 


Eddie,
I don’t understand why SSE4 registers are not available. Aren’t they in the processor? I would have thought they were available to 32bit application using a 64bit capable processor, such as my core i5 and similar.
I should point out that this test example is a problem I have been testing since 2007. (5 years !)
The stiffness matrix is 2.3gb and in 2007 it took 25 minutes to reduce the stiffness matrix, with a significant proportion of time for disk I/O (7 minutes). The following is the reduce stats from a run at that time. (Not sure of the hardware, but probably only 2gb of memory).
Code:  CPU TIMES FOR EQUATION REDUCTION
Description Num Cpu I/O Avg
1 READ BLOCK A 7 9.06 150.09 22.737
3 MOVE BLOCK B 6 2.73 0.51 0.541
4 REDUCE BLOCK A 7 1006.38 0.05 143.774
5 APPLY BLOCK B 6 76.75 0.04 12.798
6 WRITE BLOCK A 7 10.39 278.42 41.259
7 DIAGONAL ANALYSIS 2 0.03 0.44 0.237
99 TOTAL REDUCTION 7 1105.34 429.55 219.271 
My response back then was to look at a 64bit option to remove the disk I/O.
While that has worked, the combination of more memory and Win7 buffered I/O means that I can get similar results on 32bit.
It is interesting to understand this as:
64bit solutions only works well if the memory required is less than the physical memory, otherwise the wheels fall of with windows virtual memory.
32bit solutions that use disk I/O now have good disk buffering if the file transfer requirement is less than the physical memory.
With Windows 7, the result is that 32bit applications where the numerical algorithm has an optimised outofcore approach can perform well compared to the 64bit alternative.
What I have learnt recently is that although 64bit will be the way of the future, hardware and O/S improvements have extended the life of 32bit applications.
If I look back to 2007, I thought the improvements would have come from reducing the disk I/O, but the main reduction is in computation time.
Back then I also was looking to parallelising the code. However we are seeing significant gains (for my direct solver using dot_product) in SSE2 and more is possible with SSE4/AVX.
Thanks very much David. You have shown that there is potential to significantly improve the runtime performance of FTN95.
I am attaching another simple example of using fast_asm_ddotprod for matrix multiplication.
John 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 2172 Location: Sydney

Posted: Sat Jun 02, 2012 2:07 am Post subject: 


I looked at another application of dot_product and wrote a simple test example, which I compiled with default and /opt compilation modes.
The relative performance of each option on my core i5 are dramatic.
Code:  module big_arrays
integer*4, parameter :: L = 1000
integer*4, parameter :: M = 1000
integer*4, parameter :: N = 1000
!
real*8 A(l,m), B(m,n), C(l,n)
end module big_arrays
use big_arrays
real*8 t(6), s(5)
integer*4 i,k
!
call random_number (A)
call random_number (B)
!
do k = 1,2
!
call cpu_time (t(1))
C = MATMUL (A,B)
s(1) = sum(C)
!
call cpu_time (t(2))
call MAT_MUL (A,B,C, l,m,n)
s(2) = sum(C)
!
call cpu_time (t(3))
call MAT_MUL_d (A,B,C, l,m,n)
s(3) = sum(C)
!
call cpu_time (t(4))
call MAT_MUL_c (A,B,C, l,m,n)
s(4) = sum(C)
!
call cpu_time (t(5))
call MAT_MUL_a (A,B,C, l,m,n)
s(5) = sum(C)
!
call cpu_time (t(6))
write (*,'(6f10.4)') (t(i+1)t(i), i=1,5)
write (*,'(6es10.2)') (s(i)s(1), i=1,5)
end do
end
subroutine MAT_MUL (A,B,C, l,m,n)
!
integer*4 l,m,n
real*8 A(l,m), B(m,n), C(l,n)
!
integer*4 i,j,k
real*8 s
!
do i = 1,l
do j = 1,n
s = 0
do k = 1,m
s = s + a(i,k)*b(k,j)
end do
c(i,j) = s
end do
end do
end
subroutine MAT_MUL_d (A,B,C, l,m,n)
!
integer*4 l,m,n
real*8 A(l,m), B(m,n), C(l,n)
!
integer*4 i,j
!
do i = 1,l
do j = 1,n
c(i,j) = dot_product (a(i,:), b(:,j))
end do
end do
end
subroutine MAT_MUL_c (A,B,C, l,m,n)
!
integer*4 l,m,n
real*8 A(l,m), B(m,n), C(l,n)
!
integer*4 i,j
real*8 column(m)
!
do i = 1,l
column = a(i,:)
do j = 1,n
c(i,j) = dot_product (column, b(:,j))
end do
end do
end
subroutine MAT_MUL_a (A,B,C, l,m,n)
!
integer*4 l,m,n
real*8 A(l,m), B(m,n), C(l,n)
!
integer*4 i,j
real*8 fast_asm_ddotprod, column(m)
external fast_asm_ddotprod
!
do i = 1,l
column = a(i,:)
do j = 1,n
c(i,j) = fast_asm_ddotprod (column, b(1,j), m)
end do
end do
end

Run performance I obtained is:
[code:1:8d5838c631][FTN95/Win32 Ver. 6.10.0 Copyright (c) Silverfrost Ltd 19932011]
PROCESSING MODULE [<BIG_ARRAYS> FTN95/Win32 v6.10.0]
NO ERRORS [<BIG_ARRAYS> FTN95/Win32 v6.10.0]
NO ERRORS [<main program> FTN95/Win32 v6.10.0]
NO ERRORS [<MAT_MUL> FTN95/Win32 v6.10.0]
NO ERRORS [<MAT_MUL_D> FTN95/Win32 v6.10.0]
NO ERRORS [<MAT_MUL_C> FTN95/Win32 v6.10.0]
NO ERRORS [<MAT_MUL_A> FTN95/Win32 v6.10.0]
NO ERRORS [<FAST_ASM_DDOTPROD> FTN95/Win32 v6.10.0]
Creating executable: c:\TEMP\FTN95_test\lgotemp@.exe
Program entered
11.7001 7.6596 7.6440 1.1388 0.7020
0.00E+00 0.00E+00 0.00E+00 0.00E+00 0.00E+00 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 2172 Location: Sydney

Posted: Sat Jun 02, 2012 2:13 am Post subject: 


That's a 17 x improvement using David's new dot_product ! 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 2172 Location: Sydney

Posted: Sat Jun 02, 2012 2:37 am Post subject: 


David,
You said,
Quote:  I will post up some sse2 code to do the back substitution bit at the weekend. If anything, it is easier than the dotproduct. 
I have spent more time trying to understand why this simple loop performs so badly in FTN95 than any other code I have.
You will note the back stepping do; this improves performance on some processors, but in general the instruction sequence that FTN95 produces does not appear to suit the optimisation approach in most modern processors.
I have never been able to understand why.
I look forward to what you may be able to produce.
John 

Back to top 


davidb
Joined: 17 Jul 2009 Posts: 557 Location: UK

Posted: Sat Jun 02, 2012 2:37 pm Post subject: Re: 


LitusSaxonicum wrote:  However, this is a very obvious application. Are there others?
Eddie

Basic Linear Algebra (BLAS) is the most obvious application of vectorisation/SPMD techniques.
However, one of my favorite applications is to use SPMD for the fast evaluation of polynomials using Estrin's method. (also see Knuth's "The Art of Computer Programming Vol 2", if you have it).
Given (say) a 7th order polynomial:
Code: 
y = a0 + a1*x + a2*x**2 + a3*x**3 + a4*x**4 + a5*x**5 + a6*x**6 + a7*x**7

Estrin's method is:
Code: 
y = (a0 + a1*x) + (a2 + a3*x)*x**2 + ((a4 + a5*x) + (a6 + a7*x)*x**2)*x**4

How can this be vectorised?
What I do is store the coefficients in an array c(:) in the following order:
Code: 
! Store coefficients for vectorised Estrin's method
c = (/ a0, a4, a2, a6, a1, a5, a3, a7 /)

Often the coefficients are constants, and I can set up c at compile time using a parameter array.
Then Estrin's method can be reduced to a series of SAXPY operations, each of which (apart from the last one which is a scalar operation) can be vectorised to make use of SSE2/AVX.
Using an array V(0:3) to store intermediate results, the vectorised Estrin algorithm is:
Code: 
! Evaluate 7th order polynomial, coefficients in C(:) array
x2 = x*x
x4 = x2*x2
v(0:3) = c(0:3) + c(4:7)*x ! This can be vectorised
v(0:1) = v(0:1) + v(2:3)*x2 ! This can be vectorised
y = v(0) + v(1)*x4

The scope for vectorisation goes up as the power of the polynomial goes up. Note that C and V arrays are declared as DIMENSION(0:3) for convenience.
Of course, it isn't very practical to hand code this in assembler, and I don't do that. I generally just use the Array notation in Fortran (90, 95, 2003) and let a vectorising compiler take care of the details. This is what the array notation is for and why it was put in Fortran 90 in the first place. It guarantees the elements can be calculated in any order so vectorisation can be used.
The approach can be generalised to multiple polynomials, rational polynomial approximations etc. I find that vectorising all my inner loops and array expressions, and using multiple threads for outer loops and high level tasks (usually with OpenMP) really works.
Note that some work and thinking is always needed to find the correct form, a vectorising compiler won't do the transformation for you. If the polynomial had been evaluated using the canonical form (above) or the traditional Horner's method, automatic vectorisation would not occur. _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
Last edited by davidb on Sat Jun 02, 2012 6:05 pm; edited 3 times in total 

Back to top 


davidb
Joined: 17 Jul 2009 Posts: 557 Location: UK

Posted: Sat Jun 02, 2012 5:05 pm Post subject: 


John,
Here is a SAXPY double precision algorithm using SSE2. The subroutine overwrites Y with Y + A*X where X and Y are two arrays and A is a scalar.
To do backsubstitution, instead of doing:
CALL VECSUB (A, B, FAC, N)
You should do the following with the minus sign in the FAC argument:
CALL fast_asm_dsaxpy(A, B, N, FAC)
Code: 
subroutine fast_asm_dsaxpy(y, x, n, a)
integer n
real*8 x(*), y(*), a
integer nn, m
real*8 v(2)
!
! Does SAXPY update Y = Y + A*X
!
! On exit X is unchanged, Y is overwritten with Y + A*X
!
! Number of sets of 4
nn = n/4
! Number of residuals
m = n  4*nn
! Vector containing (/a, a/)
v = a
! Assembly code is between code, edoc lines
code
movupd xmm7%, v ; move v array to xmm7
mov eax%, =x ; address of x
mov ecx%, =y ; address of y
mov edx%, nn ; initialise loop counter for vectorised loop
mov edi%, m ; initialise loop counter for scalar loop
cmp edx%, 0
jz $60 ; no terms in vectorized loop
! Check for alignment of x(1) and y(1) on 16 byte boundaries
and eax%, 15 ; low nibble of x address
and ecx%, 15 ; low nibble of y address
cmp eax%, 0
jnz $10 ; x(1) is unaligned, check for alignment of x(2)
cmp ecx%, 0
jnz $10 ; y(1) is unaligned, check for alignment of y(2)
! x(1) and y(1) are aligned on 16 byte boundaries
mov eax%, =x ; address of x
mov ecx%, =y ; address of y
jmp $40 ; branch to aligned vectorized loop
! Check for alignment of x(2) and y(2) on 16 byte boundaries
10 xor eax%, 8
jnz $20 ; branch to unaligned vectorized loop
xor ecx%, 8
jnz $20 ; branch to unaligned vectorized loop
! x(2) and y(2) are aligned on 16 byte boundaries
mov eax%, =x ; address of x argument
mov ecx%, =y ; address of y argument
movsd xmm0%, [eax%]
movsd xmm2%, [ecx%]
mulsd xmm0%, xmm7%
addsd xmm0%, xmm2%
movsd [ecx%], xmm0% ; form y(1) = y(1) + a*x(1)
lea eax%, [eax%+8] ; address of x(2)
lea ecx%, [ecx%+8] ; address of y(2)
dec edi% ; decrease loop counter for scalar loop
jge $40 ; branch to aligned vectorized loop
mov edi%, 3 ; adjust loop counter for scalar loop
dec edx% ; adjust loop counter for vectorised loop
jz $60 ; no terms in vectorized loop
jmp $40 ; branch to aigned vectorized loop

_________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
Last edited by davidb on Sat Jun 02, 2012 5:18 pm; edited 6 times in total 

Back to top 


davidb
Joined: 17 Jul 2009 Posts: 557 Location: UK

Posted: Sat Jun 02, 2012 5:07 pm Post subject: 


continued ...
Code: 
! Initialise unaligned vectorised loop
20 mov eax%, =x ; address of x argument
mov ecx%, =y ; address of y argument
! Unaligned vectorised loop to calculate y = y + a*x for element sets 1 to nn
30 movupd xmm0%, [eax%] ; move next 2 doubles in x into xmm0
movupd xmm1%, [eax%+16] ; move next 2 doubles in x into xmm1
movupd xmm2%, [ecx%] ; move next 2 doubles in y into xmm2
movupd xmm3%, [ecx%+16] ; move next 2 doubles in y into xmm3
mulpd xmm0%, xmm7% ; multiply x vector by scalar
mulpd xmm1%, xmm7% ; multiply x vector by scalar
addpd xmm0%, xmm2% ; form y + a*x vector
addpd xmm1%, xmm3% ; form y + a*x vector
movupd [ecx%], xmm0% ; move xmm0 into next 2 doubles in y
movupd [ecx%+16], xmm1% ; move xmm1 into next 2 doubles in y
lea eax%, [eax%+32]
lea ecx%, [ecx%+32]
dec edx% ; decrement counter
jnz $30 ; conditional branch back
jmp $60
! Aligned vectorised loop to calculate y = y + a*x for element sets 1 to nn
40 movapd xmm0%, [eax%] ; move next 2 doubles in x into xmm0
movapd xmm1%, [eax%+16] ; move next 2 doubles in x into xmm1
movapd xmm2%, [ecx%] ; move next 2 doubles in y into xmm2
movapd xmm3%, [ecx%+16] ; move next 2 doubles in y into xmm3
mulpd xmm0%, xmm7% ; multiply x vector by scalar
mulpd xmm1%, xmm7% ; multiply x vector by scalar
addpd xmm0%, xmm2% ; form y + a*x vector
addpd xmm1%, xmm3% ; form y + a*x vector
movapd [ecx%], xmm0% ; move xmm0 into next 2 doubles in y
movapd [ecx%+16], xmm1% ; move xmm1 into next 2 doubles in y
lea eax%, [eax%+32]
lea ecx%, [ecx%+32]
dec edx% ; decrement counter
jnz $40 ; conditional branch back
! Check for entries in the scalar loop
60 cmp edi%, 0
jz $80 ; conditional skip loop, no terms in scalar loop
! Scalar loop to calculate y = y + a*x for the last nn+1 to n elements
70 movsd xmm0%, [eax%]
movsd xmm2%, [ecx%]
mulsd xmm0%, xmm7%
addsd xmm0%, xmm2%
movsd [ecx%], xmm0%
lea eax%, [eax%+8]
lea ecx%, [ecx%+8]
dec edi%
jnz $70
! End of assembly
80 nop
edoc
end subroutine fast_asm_dsaxpy

I hope its helpful. I'm off the play with my Raspberry Pi. _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
Last edited by davidb on Sat Jun 02, 2012 11:08 pm; edited 1 time in total 

Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 2188 Location: South Pole, Antarctica

Posted: Sat Jun 02, 2012 9:41 pm Post subject: 


Glad users and developers closely look at the options for speeding up and optimizing FTN95. I started to see in latest versions substantial speedup of the order of 1.5 when i use /opt. Before /opt just crashed my code. But there still exist a lot of amazing possibilities as David just showed.
BTW, I have never parallel library created with IVF for linear algebra tasks which unfortunately is only in the LIB form. I have only older CVF which does not make DLLs. With FTN95 it is working with only some routines like dense matrices and some others but is not with the block matrix (not sure why, the author who tried to make native LIB for FTN95 failed by some unknown reasons only he and FTN95 developers know, library complains about missing some system routines). It is 1.6 times faster then older Microsoft library. Anyone here interested to convert it to DLL to use with any compiler? PM me. Intel says how to do that
>Can IVF convert existing lib files into dll ?
Yes, you can, with some limitations. If the library consists of routines only,
and there is no shared data such as COMMON blocks that are expected to be used
by the "end user", then you can do this.
Making a DLL is the best approach I can suggest. In general, objects/libraries from one Fortran compiler are not linkable with another Fortran compiler. The runtime libraries are different and sometimes the calling ABI is different. The issue you are seeing is that the IVFcompiled code has calls to the Intel runtime libraries which are not present with the other compiler. You could add the Intel libraries to the link, but that may create conflicts.
Wrapping your code in a DLL isolates the runtime library issues, though you need to make sure that you don't try to do things such as I/O or ALLOCATE/DEALLOCATE across compilers. There should not be a performance impact of using the DLL approach.
You will need to create a ".DEF"
file which lists all the routines to be exported from the DLL. It is a text
file that has a line for each routine to be exported like this:
EXPORT routine1
EXPORT routine2
EXPORT routine3
You may have to experiment with the names  the case (upper/lower/mixed) must
match and there may be prefixes or suffixes that you have to consider.
Name this file with a .DEF (or .def) file type. This is an input to the
linker, so you can, from the command line, say:
link /dll mylib.lib mydef.def
This will require that the runtime libraries of the other compiler be
available and you may have to add them to the link command.

Steve Lionel
Developer Products Division
Intel Corporation
Nashua, NH
On the other hand i am interested if anyone used BLAS for parallel algebra. No time to look at that. Specifically i need only one subroutine which solves block matrix equations (with blocks residing near the diagonal like Substitute_VAG_S or Substitute_VAG_D in my library).
Last edited by DanRRight on Tue Jun 05, 2012 8:58 am; edited 1 time in total 

Back to top 


LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2105 Location: Yateley, Hants, UK

Posted: Sun Jun 03, 2012 6:15 pm Post subject: 


For John,
I was only parrotting the entry in Wikipedia:
"SSE originally added eight new 128bit registers known as XMM0 through XMM7. The AMD64 extensions from AMD (originally called x8664) added a further eight registers XMM8 through XMM15, and this extension is duplicated in the Intel 64 architecture. There is also a new 32bit control/status register, MXCSR. The registers XMM8 through XMM15 are accessible only in 64bit operating mode."
If true, then a 64bit OS opens up the opportunity to do more operations per cycle.
For David,
My question " are there any more useful applications of this" or words to that effect was a far from complete question, with the simple answer "Yes". One might divide the answers into two:
Applications where a high level programmer might readily detect that there was an opportunity to simply call a more efficient subroutine like David's dot product, and
Applications where it is by no means obvious as above, but after a certain amount of syntactical analysis in the compiler, it becomes evident that there is a faster approach to generating the executable code.
In the former case, the solution is a library (which could easily be supported in whole or part by the user community or by 3rd party software suppliers), in the latter case it is part of the optimisation done by the compiler and the user community has no role. Just reading the Polyhedron benchmarks makes it seem likely that some of the other compilers do precisely this.
A third approach might be to make it clear which language constructs in Fortran and FTN95 compile to faster code than others. For example suppose A has dimensions 1000x1000, then is
Code:  DO 20 I=1,1000
DO 10 J=1,1000
A(I,J) = 0.0
10 CONTINUE
20 CONTINUE 
or
Code:  DO 20 J=1,1000
DO 10 I=1,1000
A(I,J) = 0.0
10 CONTINUE
20 CONTINUE 
or is
or indeed, would EQUIVALENCING to B(1000000) and initialising B in a single loop be better (assuming it needs to be done multiple times)? [This is only an example]. Presumably, one has an opportunity to both help or hinder the optimiser in FTN95. I almost never dabble with /opt, as my applications seem fast enough without, and it used to make them crash, but I may well take comfort from Dan's finding that /opt works better nowadays.
Eddie
EDIT: Re opt, it stops some of my programs working, because they respond differently to mouse clicks. So perhaps the comfort I took was a false comfort. They did seem to work faster before deleting half the dataset. Perhaps sometimes one can be too fast .... 

Back to top 




You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum

Powered by phpBB © 2001, 2005 phpBB Group
