forums.silverfrost.com Forum Index forums.silverfrost.com
Welcome to the Silverfrost forums
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Fortran 2003/2008
Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8, 9, 10  Next
 
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> General
View previous topic :: View next topic  
Author Message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Thu May 31, 2012 11:36 pm    Post subject: Reply with quote

Here is the fast version of the dot product for double precision arrays.

Features: (1) finds alignments and uses fast movapd, mulpd when possible, (2) decreases loop counters, (3) unwraps vectorised loops, (4) out of order execution.

Code:

!** DOUBLE PRECISION VERSION
function fast_asm_ddotprod(v1,v2,n)
   integer n
   real*8 v1(*), v2(*)
   real*8 fast_asm_ddotprod

   integer nn, m
   real*8 :: v(2)
     
   ! Number of sets of 4
   nn = n/4

   ! Number of residuals
   m = n - 4*nn

   ! Assembly code is between code, edoc lines
   code
      xorps xmm0%, xmm0%         ; set xmm0 to zero
      xorps xmm1%, xmm1%         ; set xmm1 to zero

      mov eax%, =v1              ; address of v1
      mov ecx%, =v2              ; address of v2

      mov edx%, nn               ; initialise loop counter for vectorised loop
      mov edi%, m                ; initialise loop counter for scalar loop
      cmp edx%, 0
      jz $60                     ; no terms in vectorized loop

      ! Check for alignment of v1(1) and v2(1) on 16 byte boundaries
      and eax%, 15               ; low nibble of v1 address
      and ecx%, 15               ; low nibble of v2 address
      cmp eax%, 0
      jnz $10                    ; v1(1) is unaligned, check for alignment of v1(2)
      cmp ecx%, 0
      jnz $10                    ; v2(1) is unaligned, check for alignment of v2(2)

      ! v1(1) and v2(1) are aligned on 16 byte boundaries
      mov eax%, =v1              ; address of v1
      mov ecx%, =v2              ; address of v2
      jmp $40                    ; branch to aligned vectorized loop

      ! Check for alignment of v1(2) and v2(2) on 16 byte boundaries
  10  xor eax%, 8
      jnz $20                    ; branch to unaligned vectorized loop
      xor ecx%, 8
      jnz $20                    ; branch to unaligned vectorized loop
     
      ! v1(2) and v2(2) are aligned on 16 byte boundaries
      mov eax%, =v1              ; address of v1 argument
      mov ecx%, =v2              ; address of v2 argument
      movsd xmm0%, [eax%]
      mulsd xmm0%, [ecx%]        ; initialise with product v1(1)*v2(1)
      lea eax%, [eax%+8]         ; address of v1(2)
      lea ecx%, [ecx%+8]         ; address of v2(2)
      dec edi%                   ; decrease loop counter for scalar loop
      jge $40                    ; branch to aligned vectorized loop
      mov edi%, 3                ; adjust loop counter for scalar loop
      dec edx%                   ; adjust loop counter for vectorised loop
      jz $60                     ; no terms in vectorized loop
      jmp $40                    ; branch to aligned vectorized loop


_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl


Last edited by davidb on Fri Jun 01, 2012 12:04 am; edited 4 times in total
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Thu May 31, 2012 11:41 pm    Post subject: Reply with quote

continued ....

Code:

      ! Initialise unaligned vectorised loop
  20  mov eax%, =v1              ; address of v1 argument
      mov ecx%, =v2              ; address of v2 argument
   
      ! Unaligned vectorised loop to accumulate dot product of elements 1 to nn
  30  movupd xmm2%, [eax%]       ; move next 2 doubles in v1 into xmm2
      movupd xmm4%, [eax%+16]    ; move next 2 doubles in v1 into xmm4
      movupd xmm3%, [ecx%]       ; move next 2 doubles in v2 into xmm3
      movupd xmm5%, [ecx%+16]    ; move next 2 doubles in v2 into xmm5
      mulpd xmm2%, xmm3%         ; multiply values and store in xmm2
      mulpd xmm4%, xmm5%         ; multiply values and store in xmm4
      addpd xmm0%, xmm2%         ; accumulate sum in xmm0
      addpd xmm1%, xmm4%         ; accumulate sum in xmm1
      lea eax%, [eax%+32]
      lea ecx%, [ecx%+32]
      dec edx%                   ; decrement counter
      jnz $30                    ; conditional branch back
      jmp $50

      ! Aligned vectorised loop to accumulate dot product of elements 1 to nn
  40  movapd xmm2%, [eax%]       ; move next 2 doubles in v1 into xmm2
      movapd xmm4%, [eax%+16]    ; move next 2 doubles in v1 into xmm4   
      mulpd xmm2%, [ecx%]        ; multiply values and store in xmm2
      mulpd xmm4%, [ecx%+16]     ; multiply values and store in xmm4
      addpd xmm0%, xmm2%         ; accumulate sum in xmm0
      addpd xmm1%, xmm4%         ; accumulate sum in xmm1
      lea eax%, [eax%+32]
      lea ecx%, [ecx%+32]
      dec edx%                   ; decrement counter
      jnz $40                    ; conditional branch back

  50  addpd xmm0%, xmm1%         ; add xmm0 and xmm1

      ! Check for entries in the scalar loop
  60  cmp edi%, 0
      jz $80                     ; conditional skip loop, no terms in scalar loop

      ! Scalar loop to accumulate dot product of last nn+1 to n elements in low 64 bits of xmm0
  70  movsd xmm1%, [eax%]
      mulsd xmm1%, [ecx%]
      addsd xmm0%, xmm1%
      lea eax%, [eax%+8]
      lea ecx%, [ecx%+8]
      dec edi%
      jnz $70

      ! Can't get final reduction to work, so will do this in Fortran for now
  80  movupd v, xmm0%            ; move xmm0 to v array
   edoc
   
   ! Final reduction, the result is the sum of the two values in v
   fast_asm_ddotprod = sum(v)

end function fast_asm_ddotprod

_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2144
Location: Sydney

PostPosted: Fri Jun 01, 2012 8:26 am    Post subject: Reply with quote

David,

I have tested your revised example and confirmed that it has better performance. I have now tested 3 of the options you proposed. I am including the run test for 4 different approaches, using /opt compilation.
1: Original Vec_Sum function, using a DO loop
2: Modified loop, using your 4 group suggestion
3: asm_ddotprod
4: fast_asm_ddotprod
Code:
!  4 Group coding
         nn = mod (n, 4)
         if (nn <  1) c = 0.
         if (nn == 1) c = a(1)*b(1)
         if (nn >  1) c = sum (a(1:nn)*b(1:nn))
!
         do k = nn+1,n,4
            c = c + sum ( a(K:K+3) * b(K:K+3) )
         end do

I have run all 4 in my FEA program, using the problem I have been reporting.
The run times are:
Code:
option 1  Vec_sum loop
  CPU TIMES FOR EQUATION REDUCTION
     Description            Num     Cpu     I/O     Avg
  1  READ BLOCK A             4    0.97    0.04   0.252
  3  MOVE BLOCK B             3    0.44   -0.01   0.142
  4  REDUCE BLOCK A           4  296.87    0.69  74.389
  5  APPLY BLOCK B            3    5.63    0.05   1.892
  6  WRITE BLOCK A            4    1.23    0.00   0.309
 99  TOTAL REDUCTION          4  305.14    0.77  76.477

option 2  dotprod_4
  CPU TIMES FOR EQUATION REDUCTION
     Description            Num     Cpu     I/O     Avg
  1  READ BLOCK A             4    1.01    0.00   0.254
  3  MOVE BLOCK B             3    0.47   -0.02   0.148
  4  REDUCE BLOCK A           4  330.86    2.16  83.255
  5  APPLY BLOCK B            3    6.32    0.12   2.146
  6  WRITE BLOCK A            4    1.19    0.03   0.303
 99  TOTAL REDUCTION          4  339.86    2.27  85.533

option 3  asm_ddotprod
  CPU TIMES FOR EQUATION REDUCTION
     Description            Num     Cpu     I/O     Avg
  1  READ BLOCK A             4    0.76    0.01   0.193
  3  MOVE BLOCK B             3    0.42    0.01   0.142
  4  REDUCE BLOCK A           4  204.36    0.89  51.314
  5  APPLY BLOCK B            3    3.82    0.01   1.278
  6  WRITE BLOCK A            4    1.05    0.00   0.260
 99  TOTAL REDUCTION          4  210.41    0.92  52.833

option 4  fast_asm_ddotprod
  CPU TIMES FOR EQUATION REDUCTION
     Description            Num     Cpu     I/O     Avg
  1  READ BLOCK A             4    0.76    0.01   0.194
  3  MOVE BLOCK B             3    0.42    0.00   0.140
  4  REDUCE BLOCK A           4  188.43    0.50  47.233
  5  APPLY BLOCK B            3    3.56   -0.01   1.183
  6  WRITE BLOCK A            4    0.98    0.00   0.247
 99  TOTAL REDUCTION          4  194.17    0.49  48.667

where:
Num is the number of blocks used
Cpu is found from GetProcessTimes
I/O is elapse time difference between QueryPerform and Cpu

These results demonstrate that your approach has significant performance improvement and that SSE vector instructions can be used in FTN95 to improve performance.

John

ps : the other main loop of the equation solver is in the back substitution where :
Code:
    SUBROUTINE VECSUB (A, B, FAC, N)
!
!   Performs the vector opperation  [A] = [A] - [B] * FAC
!
    integer*4 n, i
    real*8    a(*), b(*), fac, c
!
    c = -fac
    do i = n,1,-1
       a(i) = a(i) + b(i) * c
    end do
    return
!
    END
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Fri Jun 01, 2012 4:11 pm    Post subject: Reply with quote

I will post up some sse2 code to do the back substitution bit at the weekend. If anything, it is easier than the dot-product.

Don't forget that you probably aren't getting full benefit from alignment of data, and you might get some more performance once ALIGN(128) is fixed, but this wouldn't get released until towards the end of the year. In the short term, you might experiment with ALIGN(64) in your declaration in the big array module. This should give alignment on 16 byte boundaries half of the time.

Then longer term, you need to think about how to use all of the cores on your computer, because I still think there's another factor of 2-4 improved performance to obtain by parallelising the outer loop somehow. Smile

DanRRight has shown one way this might be possible if you're sticking with FTN95. I will have a little play with this method at the weekend to see if I can make it work with your problem.
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Fri Jun 01, 2012 4:31 pm    Post subject: Re: Reply with quote

JohnCampbell wrote:

Code:
!  4 Group coding
         nn = mod (n, 4)
         if (nn <  1) c = 0.
         if (nn == 1) c = a(1)*b(1)
         if (nn >  1) c = sum (a(1:nn)*b(1:nn))
!
         do k = nn+1,n,4
            c = c + sum ( a(K:K+3) * b(K:K+3) )
         end do



You can save the expense of the three tests at the start because sum() works when there is 1 or 0 terms in the array:

Code:
!  4 Group coding
         nn = mod (n, 4)
         ! sum intrinsic returns 0.0 when array is empty
         c = sum (a(1:nn)*b(1:nn))
         do k = nn+1,n,4
            c = c + sum ( a(K:K+3) * b(K:K+3) )
         end do


In Fortran 90, empty arrays are always defined, and it is legal to multiply two together to create another empty array. And the sum of the elements in an empty array is zero.

Its academic anyway, since this approach isn't very fast. The unrolling is basically what is done in fast_asm_ddotprod but there I only unroll 2 times. I tried to do 4 but ran out of registers.
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
Back to top
View user's profile Send private message
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2049
Location: Yateley, Hants, UK

PostPosted: Fri Jun 01, 2012 5:05 pm    Post subject: Reply with quote

Once again, a huge thank you to David for his efforts. This isn't just a working routine, it is a clear demonstration of the speed gains possible with SSE. However, this is a very obvious application. Are there others?

I note that the additional SSE4 registers are only available in 64 bit mode, and with those available, operations such as this could become even faster ...

Eddie
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2144
Location: Sydney

PostPosted: Sat Jun 02, 2012 1:59 am    Post subject: Reply with quote

Eddie,
I donít understand why SSE4 registers are not available. Arenít they in the processor? I would have thought they were available to 32-bit application using a 64-bit capable processor, such as my core i5 and similar.
I should point out that this test example is a problem I have been testing since 2007. (5 years !)
The stiffness matrix is 2.3gb and in 2007 it took 25 minutes to reduce the stiffness matrix, with a significant proportion of time for disk I/O (7 minutes). The following is the reduce stats from a run at that time. (Not sure of the hardware, but probably only 2gb of memory).
Code:
  CPU TIMES FOR EQUATION REDUCTION
     Description            Num     Cpu     I/O     Avg
  1  READ BLOCK A             7    9.06  150.09  22.737
  3  MOVE BLOCK B             6    2.73    0.51   0.541
  4  REDUCE BLOCK A           7 1006.38    0.05 143.774
  5  APPLY BLOCK B            6   76.75    0.04  12.798
  6  WRITE BLOCK A            7   10.39  278.42  41.259
  7  DIAGONAL ANALYSIS        2    0.03    0.44   0.237
 99  TOTAL REDUCTION          7 1105.34  429.55 219.271

My response back then was to look at a 64-bit option to remove the disk I/O.
While that has worked, the combination of more memory and Win7 buffered I/O means that I can get similar results on 32-bit.
It is interesting to understand this as:
64-bit solutions only works well if the memory required is less than the physical memory, otherwise the wheels fall of with windows virtual memory.
32-bit solutions that use disk I/O now have good disk buffering if the file transfer requirement is less than the physical memory.
With Windows 7, the result is that 32-bit applications where the numerical algorithm has an optimised out-of-core approach can perform well compared to the 64-bit alternative.
What I have learnt recently is that although 64-bit will be the way of the future, hardware and O/S improvements have extended the life of 32-bit applications.
If I look back to 2007, I thought the improvements would have come from reducing the disk I/O, but the main reduction is in computation time.
Back then I also was looking to parallelising the code. However we are seeing significant gains (for my direct solver using dot_product) in SSE2 and more is possible with SSE4/AVX.
Thanks very much David. You have shown that there is potential to significantly improve the run-time performance of FTN95.
I am attaching another simple example of using fast_asm_ddotprod for matrix multiplication.
John
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2144
Location: Sydney

PostPosted: Sat Jun 02, 2012 2:07 am    Post subject: Reply with quote

I looked at another application of dot_product and wrote a simple test example, which I compiled with default and /opt compilation modes.
The relative performance of each option on my core i5 are dramatic.
Code:
module big_arrays
      integer*4, parameter :: L = 1000
      integer*4, parameter :: M = 1000
      integer*4, parameter :: N = 1000
!
      real*8    A(l,m), B(m,n), C(l,n)
end module big_arrays

use big_arrays
      real*8    t(6), s(5)
      integer*4 i,k
!
      call random_number (A)
      call random_number (B)
!
      do k = 1,2
!
           call cpu_time (t(1))
         C = MATMUL (A,B)
           s(1) = sum(C)
!
           call cpu_time (t(2))
         call MAT_MUL (A,B,C, l,m,n)
           s(2) = sum(C)
!
           call cpu_time (t(3))
         call MAT_MUL_d (A,B,C, l,m,n)
           s(3) = sum(C)
!
           call cpu_time (t(4))
         call MAT_MUL_c (A,B,C, l,m,n)
           s(4) = sum(C)
!
           call cpu_time (t(5))
         call MAT_MUL_a (A,B,C, l,m,n)
           s(5) = sum(C)
!
           call cpu_time (t(6))
         write (*,'(6f10.4)')  (t(i+1)-t(i), i=1,5)
         write (*,'(6es10.2)') (s(i)-s(1),   i=1,5)
      end do
      end

      subroutine MAT_MUL (A,B,C, l,m,n)
!
      integer*4 l,m,n
      real*8    A(l,m), B(m,n), C(l,n)
!
      integer*4 i,j,k
      real*8    s
!
      do i = 1,l
        do j = 1,n
          s = 0
          do k = 1,m
            s = s + a(i,k)*b(k,j)
          end do
          c(i,j) = s
        end do
      end do
      end

      subroutine MAT_MUL_d (A,B,C, l,m,n)
!
      integer*4 l,m,n
      real*8    A(l,m), B(m,n), C(l,n)
!
      integer*4 i,j
!
      do i = 1,l
        do j = 1,n
          c(i,j) = dot_product (a(i,:), b(:,j))
        end do
      end do
      end

      subroutine MAT_MUL_c (A,B,C, l,m,n)
!
      integer*4 l,m,n
      real*8    A(l,m), B(m,n), C(l,n)
!
      integer*4 i,j
      real*8    column(m)
!
      do i = 1,l
        column = a(i,:)
        do j = 1,n
          c(i,j) = dot_product (column, b(:,j))
        end do
      end do
      end

      subroutine MAT_MUL_a (A,B,C, l,m,n)
!
      integer*4 l,m,n
      real*8    A(l,m), B(m,n), C(l,n)
!
      integer*4 i,j
      real*8    fast_asm_ddotprod, column(m)
      external  fast_asm_ddotprod
!
      do i = 1,l
        column = a(i,:)
        do j = 1,n
          c(i,j) = fast_asm_ddotprod (column, b(1,j), m)
        end do
      end do
      end

Run performance I obtained is:
[code:1:8d5838c631][FTN95/Win32 Ver. 6.10.0 Copyright (c) Silverfrost Ltd 1993-2011]

PROCESSING MODULE [<BIG_ARRAYS> FTN95/Win32 v6.10.0]
NO ERRORS [<BIG_ARRAYS> FTN95/Win32 v6.10.0]
NO ERRORS [<main program> FTN95/Win32 v6.10.0]
NO ERRORS [<MAT_MUL> FTN95/Win32 v6.10.0]
NO ERRORS [<MAT_MUL_D> FTN95/Win32 v6.10.0]
NO ERRORS [<MAT_MUL_C> FTN95/Win32 v6.10.0]
NO ERRORS [<MAT_MUL_A> FTN95/Win32 v6.10.0]
NO ERRORS [<FAST_ASM_DDOTPROD> FTN95/Win32 v6.10.0]
Creating executable: c:\TEMP\FTN95_test\lgotemp@.exe
Program entered
11.7001 7.6596 7.6440 1.1388 0.7020
0.00E+00 0.00E+00 0.00E+00 0.00E+00 0.00E+00
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2144
Location: Sydney

PostPosted: Sat Jun 02, 2012 2:13 am    Post subject: Reply with quote

That's a 17 x improvement using David's new dot_product !
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2144
Location: Sydney

PostPosted: Sat Jun 02, 2012 2:37 am    Post subject: Reply with quote

David,
You said,
Quote:
I will post up some sse2 code to do the back substitution bit at the weekend. If anything, it is easier than the dot-product.

I have spent more time trying to understand why this simple loop performs so badly in FTN95 than any other code I have.
You will note the back stepping do; this improves performance on some processors, but in general the instruction sequence that FTN95 produces does not appear to suit the optimisation approach in most modern processors.
I have never been able to understand why.

I look forward to what you may be able to produce.

John
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Sat Jun 02, 2012 2:37 pm    Post subject: Re: Reply with quote

LitusSaxonicum wrote:
However, this is a very obvious application. Are there others?

Eddie

Basic Linear Algebra (BLAS) is the most obvious application of vectorisation/SPMD techniques.

However, one of my favorite applications is to use SPMD for the fast evaluation of polynomials using Estrin's method. (also see Knuth's "The Art of Computer Programming Vol 2", if you have it).

Given (say) a 7th order polynomial:
Code:

y = a0 + a1*x + a2*x**2 + a3*x**3 + a4*x**4 + a5*x**5 + a6*x**6 + a7*x**7

Estrin's method is:
Code:

y = (a0 + a1*x) + (a2 + a3*x)*x**2 + ((a4 + a5*x) + (a6 + a7*x)*x**2)*x**4

How can this be vectorised?

What I do is store the coefficients in an array c(:) in the following order:
Code:

! Store coefficients for vectorised Estrin's method
c = (/ a0, a4, a2, a6, a1, a5, a3, a7 /)

Often the coefficients are constants, and I can set up c at compile time using a parameter array.

Then Estrin's method can be reduced to a series of SAXPY operations, each of which (apart from the last one which is a scalar operation) can be vectorised to make use of SSE2/AVX.

Using an array V(0:3) to store intermediate results, the vectorised Estrin algorithm is:
Code:

! Evaluate 7th order polynomial, coefficients in C(:) array
x2 = x*x
x4 = x2*x2
v(0:3) = c(0:3) + c(4:7)*x    ! This can be vectorised
v(0:1) = v(0:1) + v(2:3)*x2   ! This can be vectorised
y = v(0) + v(1)*x4

The scope for vectorisation goes up as the power of the polynomial goes up. Note that C and V arrays are declared as DIMENSION(0:3) for convenience.

Of course, it isn't very practical to hand code this in assembler, and I don't do that. I generally just use the Array notation in Fortran (90, 95, 2003) and let a vectorising compiler take care of the details. This is what the array notation is for and why it was put in Fortran 90 in the first place. It guarantees the elements can be calculated in any order so vectorisation can be used.

The approach can be generalised to multiple polynomials, rational polynomial approximations etc. I find that vectorising all my inner loops and array expressions, and using multiple threads for outer loops and high level tasks (usually with OpenMP) really works.

Note that some work and thinking is always needed to find the correct form, a vectorising compiler won't do the transformation for you. If the polynomial had been evaluated using the canonical form (above) or the traditional Horner's method, automatic vectorisation would not occur.
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl


Last edited by davidb on Sat Jun 02, 2012 6:05 pm; edited 3 times in total
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Sat Jun 02, 2012 5:05 pm    Post subject: Reply with quote

John,

Here is a SAXPY double precision algorithm using SSE2. The subroutine overwrites Y with Y + A*X where X and Y are two arrays and A is a scalar.

To do backsubstitution, instead of doing:

CALL VECSUB (A, B, FAC, N)

You should do the following with the minus sign in the FAC argument:

CALL fast_asm_dsaxpy(A, B, N, -FAC)

Code:

subroutine fast_asm_dsaxpy(y, x, n, a)
   integer n
   real*8 x(*), y(*), a

   integer nn, m
   real*8 v(2)
   
!
! Does SAXPY update Y = Y + A*X
!
! On exit X is unchanged, Y is overwritten with Y + A*X
!

   ! Number of sets of 4
   nn = n/4

   ! Number of residuals
   m = n - 4*nn

   ! Vector containing (/a, a/)
   v = a

   ! Assembly code is between code, edoc lines
   code

      movupd xmm7%, v            ; move v array to xmm7
     
      mov eax%, =x               ; address of x
      mov ecx%, =y               ; address of y

      mov edx%, nn               ; initialise loop counter for vectorised loop
      mov edi%, m                ; initialise loop counter for scalar loop
      cmp edx%, 0
      jz $60                     ; no terms in vectorized loop

      ! Check for alignment of x(1) and y(1) on 16 byte boundaries
      and eax%, 15               ; low nibble of x address
      and ecx%, 15               ; low nibble of y address
      cmp eax%, 0
      jnz $10                    ; x(1) is unaligned, check for alignment of x(2)
      cmp ecx%, 0
      jnz $10                    ; y(1) is unaligned, check for alignment of y(2)

      ! x(1) and y(1) are aligned on 16 byte boundaries
      mov eax%, =x               ; address of x
      mov ecx%, =y               ; address of y
      jmp $40                    ; branch to aligned vectorized loop

      ! Check for alignment of x(2) and y(2) on 16 byte boundaries
  10  xor eax%, 8
      jnz $20                    ; branch to unaligned vectorized loop
      xor ecx%, 8
      jnz $20                    ; branch to unaligned vectorized loop

      ! x(2) and y(2) are aligned on 16 byte boundaries
      mov eax%, =x               ; address of x argument
      mov ecx%, =y               ; address of y argument
      movsd xmm0%, [eax%]
      movsd xmm2%, [ecx%]
      mulsd xmm0%, xmm7%
      addsd xmm0%, xmm2%
      movsd [ecx%], xmm0%        ; form y(1) = y(1) + a*x(1)
      lea eax%, [eax%+8]         ; address of x(2)
      lea ecx%, [ecx%+8]         ; address of y(2)
      dec edi%                   ; decrease loop counter for scalar loop
      jge $40                    ; branch to aligned vectorized loop
      mov edi%, 3                ; adjust loop counter for scalar loop
      dec edx%                   ; adjust loop counter for vectorised loop
      jz $60                     ; no terms in vectorized loop
      jmp $40                    ; branch to aigned vectorized loop


_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl


Last edited by davidb on Sat Jun 02, 2012 5:18 pm; edited 6 times in total
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Sat Jun 02, 2012 5:07 pm    Post subject: Reply with quote

continued ...

Code:

      ! Initialise unaligned vectorised loop
  20  mov eax%, =x               ; address of x argument
      mov ecx%, =y               ; address of y argument
   
      ! Unaligned vectorised loop to calculate y = y + a*x for element sets 1 to nn
  30  movupd xmm0%, [eax%]       ; move next 2 doubles in x into xmm0
      movupd xmm1%, [eax%+16]    ; move next 2 doubles in x into xmm1
      movupd xmm2%, [ecx%]       ; move next 2 doubles in y into xmm2
      movupd xmm3%, [ecx%+16]    ; move next 2 doubles in y into xmm3
      mulpd xmm0%, xmm7%         ; multiply x vector by scalar
      mulpd xmm1%, xmm7%         ; multiply x vector by scalar
      addpd xmm0%, xmm2%         ; form y + a*x vector
      addpd xmm1%, xmm3%         ; form y + a*x vector
      movupd [ecx%], xmm0%       ; move xmm0 into next 2 doubles in y
      movupd [ecx%+16], xmm1%    ; move xmm1 into next 2 doubles in y
      lea eax%, [eax%+32]
      lea ecx%, [ecx%+32]
      dec edx%                   ; decrement counter
      jnz $30                    ; conditional branch back
      jmp $60

      ! Aligned vectorised loop to calculate y = y + a*x for element sets 1 to nn
  40  movapd xmm0%, [eax%]       ; move next 2 doubles in x into xmm0
      movapd xmm1%, [eax%+16]    ; move next 2 doubles in x into xmm1
      movapd xmm2%, [ecx%]       ; move next 2 doubles in y into xmm2
      movapd xmm3%, [ecx%+16]    ; move next 2 doubles in y into xmm3
      mulpd xmm0%, xmm7%         ; multiply x vector by scalar
      mulpd xmm1%, xmm7%         ; multiply x vector by scalar
      addpd xmm0%, xmm2%         ; form y + a*x vector
      addpd xmm1%, xmm3%         ; form y + a*x vector
      movapd [ecx%], xmm0%       ; move xmm0 into next 2 doubles in y
      movapd [ecx%+16], xmm1%    ; move xmm1 into next 2 doubles in y   
      lea eax%, [eax%+32]
      lea ecx%, [ecx%+32]
      dec edx%                   ; decrement counter
      jnz $40                    ; conditional branch back

      ! Check for entries in the scalar loop
  60  cmp edi%, 0
      jz $80                     ; conditional skip loop, no terms in scalar loop

      ! Scalar loop to calculate y = y + a*x for the last nn+1 to n elements
  70  movsd xmm0%, [eax%]
      movsd xmm2%, [ecx%]
      mulsd xmm0%, xmm7%
      addsd xmm0%, xmm2%
      movsd [ecx%], xmm0%
      lea eax%, [eax%+8]
      lea ecx%, [ecx%+8]
      dec edi%
      jnz $70

      ! End of assembly
  80  nop

   edoc

end subroutine fast_asm_dsaxpy


I hope its helpful. I'm off the play with my Raspberry Pi.
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl


Last edited by davidb on Sat Jun 02, 2012 11:08 pm; edited 1 time in total
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2106
Location: South Pole, Antarctica

PostPosted: Sat Jun 02, 2012 9:41 pm    Post subject: Reply with quote

Glad users and developers closely look at the options for speeding up and optimizing FTN95. I started to see in latest versions substantial speedup of the order of 1.5 when i use /opt. Before /opt just crashed my code. But there still exist a lot of amazing possibilities as David just showed.

BTW, I have never parallel library created with IVF for linear algebra tasks which unfortunately is only in the LIB form. I have only older CVF which does not make DLLs. With FTN95 it is working with only some routines like dense matrices and some others but is not with the block matrix (not sure why, the author who tried to make native LIB for FTN95 failed by some unknown reasons only he and FTN95 developers know, library complains about missing some system routines). It is 1.6 times faster then older Microsoft library. Anyone here interested to convert it to DLL to use with any compiler? PM me. Intel says how to do that

>Can IVF convert existing lib files into dll ?

Yes, you can, with some limitations. If the library consists of routines only,
and there is no shared data such as COMMON blocks that are expected to be used
by the "end user", then you can do this.

Making a DLL is the best approach I can suggest. In general, objects/libraries from one Fortran compiler are not linkable with another Fortran compiler. The run-time libraries are different and sometimes the calling ABI is different. The issue you are seeing is that the IVF-compiled code has calls to the Intel run-time libraries which are not present with the other compiler. You could add the Intel libraries to the link, but that may create conflicts.

Wrapping your code in a DLL isolates the run-time library issues, though you need to make sure that you don't try to do things such as I/O or ALLOCATE/DEALLOCATE across compilers. There should not be a performance impact of using the DLL approach.

You will need to create a ".DEF"
file which lists all the routines to be exported from the DLL. It is a text
file that has a line for each routine to be exported like this:

EXPORT routine1
EXPORT routine2
EXPORT routine3

You may have to experiment with the names - the case (upper/lower/mixed) must
match and there may be prefixes or suffixes that you have to consider.

Name this file with a .DEF (or .def) file type. This is an input to the
linker, so you can, from the command line, say:

link /dll mylib.lib mydef.def

This will require that the run-time libraries of the other compiler be
available and you may have to add them to the link command.

--
Steve Lionel
Developer Products Division
Intel Corporation
Nashua, NH

On the other hand i am interested if anyone used BLAS for parallel algebra. No time to look at that. Specifically i need only one subroutine which solves block matrix equations (with blocks residing near the diagonal like Substitute_VAG_S or Substitute_VAG_D in my library).


Last edited by DanRRight on Tue Jun 05, 2012 8:58 am; edited 1 time in total
Back to top
View user's profile Send private message
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2049
Location: Yateley, Hants, UK

PostPosted: Sun Jun 03, 2012 6:15 pm    Post subject: Reply with quote

For John,

I was only parrotting the entry in Wikipedia:

"SSE originally added eight new 128-bit registers known as XMM0 through XMM7. The AMD64 extensions from AMD (originally called x86-64) added a further eight registers XMM8 through XMM15, and this extension is duplicated in the Intel 64 architecture. There is also a new 32-bit control/status register, MXCSR. The registers XMM8 through XMM15 are accessible only in 64-bit operating mode."

If true, then a 64-bit OS opens up the opportunity to do more operations per cycle.

For David,

My question " are there any more useful applications of this" or words to that effect was a far from complete question, with the simple answer "Yes". One might divide the answers into two:

Applications where a high level programmer might readily detect that there was an opportunity to simply call a more efficient subroutine like David's dot product, and

Applications where it is by no means obvious as above, but after a certain amount of syntactical analysis in the compiler, it becomes evident that there is a faster approach to generating the executable code.

In the former case, the solution is a library (which could easily be supported in whole or part by the user community or by 3rd party software suppliers), in the latter case it is part of the optimisation done by the compiler and the user community has no role. Just reading the Polyhedron benchmarks makes it seem likely that some of the other compilers do precisely this.

A third approach might be to make it clear which language constructs in Fortran and FTN95 compile to faster code than others. For example suppose A has dimensions 1000x1000, then is

Code:
      DO 20 I=1,1000
      DO 10 J=1,1000
      A(I,J) = 0.0
   10 CONTINUE
   20 CONTINUE


or

Code:
      DO 20 J=1,1000
      DO 10 I=1,1000
      A(I,J) = 0.0
   10 CONTINUE
   20 CONTINUE


or is

Code:
      A = 0.0


or indeed, would EQUIVALENCING to B(1000000) and initialising B in a single loop be better (assuming it needs to be done multiple times)? [This is only an example]. Presumably, one has an opportunity to both help or hinder the optimiser in FTN95. I almost never dabble with /opt, as my applications seem fast enough without, and it used to make them crash, but I may well take comfort from Dan's finding that /opt works better nowadays.

Eddie

EDIT: Re -opt, it stops some of my programs working, because they respond differently to mouse clicks. So perhaps the comfort I took was a false comfort. They did seem to work faster before deleting half the dataset. Perhaps sometimes one can be too fast ....
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> General All times are GMT + 1 Hour
Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8, 9, 10  Next
Page 6 of 10

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group