Silverfrost Forums

Fortran 2003/2008

1 Jun 2012 4:05 #10254

Once again, a huge thank you to David for his efforts. This isn't just a working routine, it is a clear demonstration of the speed gains possible with SSE. However, this is a very obvious application. Are there others?

I note that the additional SSE4 registers are only available in 64-bit mode, and with those available, operations such as this could become even faster ...

Eddie

2 Jun 2012 12:59 #10256

Eddie, I don't understand why the SSE4 registers would not be available. Aren't they in the processor? I would have thought they were available to a 32-bit application using a 64-bit capable processor, such as my Core i5 and similar. I should point out that this test example is a problem I have been testing since 2007 (5 years!). The stiffness matrix is 2.3 GB and in 2007 it took 25 minutes to reduce it, with a significant proportion of that time spent on disk I/O (7 minutes). The following are the reduction stats from a run at that time (not sure of the hardware, but probably only 2 GB of memory).

  CPU TIMES FOR EQUATION REDUCTION
     Description            Num     Cpu     I/O     Avg
  1  READ BLOCK A             7    9.06  150.09  22.737
  3  MOVE BLOCK B             6    2.73    0.51   0.541
  4  REDUCE BLOCK A           7 1006.38    0.05 143.774
  5  APPLY BLOCK B            6   76.75    0.04  12.798
  6  WRITE BLOCK A            7   10.39  278.42  41.259
  7  DIAGONAL ANALYSIS        2    0.03    0.44   0.237
 99  TOTAL REDUCTION          7 1105.34  429.55 219.271 

My response back then was to look at a 64-bit option to remove the disk I/O. While that has worked, the combination of more memory and Win7 buffered I/O means that I can get similar results on 32-bit. It is interesting to understand this as: 64-bit solutions only work well if the memory required is less than the physical memory, otherwise the wheels fall off with Windows virtual memory; 32-bit solutions that use disk I/O now get good disk buffering if the file transfer requirement is less than the physical memory. With Windows 7, the result is that 32-bit applications where the numerical algorithm has an optimised out-of-core approach can perform well compared to the 64-bit alternative.

What I have learnt recently is that although 64-bit will be the way of the future, hardware and O/S improvements have extended the life of 32-bit applications. Looking back to 2007, I thought the improvements would come from reducing the disk I/O, but the main reduction has been in computation time. Back then I was also looking at parallelising the code. However, we are now seeing significant gains (for my direct solver using dot_product) from SSE2, and more is possible with SSE4/AVX.

Thanks very much David. You have shown that there is potential to significantly improve the run-time performance of FTN95. I am attaching another simple example of using fast_asm_ddotprod for matrix multiplication.

John

2 Jun 2012 1:07 #10257

I looked at another application of dot_product and wrote a simple test example, which I compiled in both default and /opt compilation modes. The relative performance differences between the options on my Core i5 are dramatic.

module big_arrays
      integer*4, parameter :: L = 1000
      integer*4, parameter :: M = 1000
      integer*4, parameter :: N = 1000
!
      real*8    A(l,m), B(m,n), C(l,n)
end module big_arrays

use big_arrays
      real*8    t(6), s(5)
      integer*4 i,k
!
      call random_number (A)
      call random_number (B)
!
      do k = 1,2
!
           call cpu_time (t(1))
         C = MATMUL (A,B)
           s(1) = sum(C)
!
           call cpu_time (t(2))
         call MAT_MUL (A,B,C, l,m,n)
           s(2) = sum(C)
!
           call cpu_time (t(3))
         call MAT_MUL_d (A,B,C, l,m,n)
           s(3) = sum(C)
!
           call cpu_time (t(4))
         call MAT_MUL_c (A,B,C, l,m,n)
           s(4) = sum(C)
!
           call cpu_time (t(5))
         call MAT_MUL_a (A,B,C, l,m,n)
           s(5) = sum(C)
!
           call cpu_time (t(6))
         write (*,'(6f10.4)')  (t(i+1)-t(i), i=1,5)
         write (*,'(6es10.2)') (s(i)-s(1),   i=1,5)
      end do
      end

      subroutine MAT_MUL (A,B,C, l,m,n)
!     Naive triple-DO-loop multiply: C = A x B (A accessed by rows)
      integer*4 l,m,n
      real*8    A(l,m), B(m,n), C(l,n)
!
      integer*4 i,j,k
      real*8    s
!
      do i = 1,l
        do j = 1,n
          s = 0
          do k = 1,m
            s = s + a(i,k)*b(k,j)
          end do
          c(i,j) = s
        end do
      end do
      end

      subroutine MAT_MUL_d (A,B,C, l,m,n)
!     Multiply using the DOT_PRODUCT intrinsic on strided rows of A
      integer*4 l,m,n
      real*8    A(l,m), B(m,n), C(l,n)
!
      integer*4 i,j
!
      do i = 1,l
        do j = 1,n
          c(i,j) = dot_product (a(i,:), b(:,j))
        end do
      end do
      end

      subroutine MAT_MUL_c (A,B,C, l,m,n)
!     As MAT_MUL_d, but copying each row of A to a contiguous vector first
      integer*4 l,m,n
      real*8    A(l,m), B(m,n), C(l,n)
!
      integer*4 i,j
      real*8    column(m)
!
      do i = 1,l
        column = a(i,:)
        do j = 1,n
          c(i,j) = dot_product (column, b(:,j))
        end do
      end do
      end

      subroutine MAT_MUL_a (A,B,C, l,m,n)
!     As MAT_MUL_c, but calling David's SSE2 fast_asm_ddotprod routine
      integer*4 l,m,n
      real*8    A(l,m), B(m,n), C(l,n)
!
      integer*4 i,j
      real*8    fast_asm_ddotprod, column(m)
      external  fast_asm_ddotprod
!
      do i = 1,l
        column = a(i,:)
        do j = 1,n
          c(i,j) = fast_asm_ddotprod (column, b(1,j), m)
        end do
      end do
      end

The run performance I obtained is:

[FTN95/Win32 Ver. 6.10.0 Copyright (c) Silverfrost Ltd 1993-2011]

    PROCESSING MODULE  [<BIG_ARRAYS> FTN95/Win32 v6.10.0]
    NO ERRORS  [<BIG_ARRAYS> FTN95/Win32 v6.10.0]
    NO ERRORS  [<main program> FTN95/Win32 v6.10.0]
    NO ERRORS  [<MAT_MUL> FTN95/Win32 v6.10.0]
    NO ERRORS  [<MAT_MUL_D> FTN95/Win32 v6.10.0]
    NO ERRORS  [<MAT_MUL_C> FTN95/Win32 v6.10.0]
    NO ERRORS  [<MAT_MUL_A> FTN95/Win32 v6.10.0]
    NO ERRORS  [<FAST_ASM_DDOTPROD> FTN95/Win32 v6.10.0]
Creating executable: c:\TEMP\FTN95_test\lgotemp@.exe
Program entered
   11.7001    7.6596    7.6440    1.1388    0.7020
  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00

That's a 17x improvement using David's new dot_product approach.

2 Jun 2012 1:13 #10258

That's a 17x improvement using David's new dot_product!

2 Jun 2012 1:37 #10259

David, You said,

I will post up some sse2 code to do the back substitution bit at the weekend. If anything, it is easier than the dot-product.

I have spent more time trying to understand why this simple loop performs so badly in FTN95 than on any other code I have. You will note the backward-stepping DO; this improves performance on some processors, but in general the instruction sequence that FTN95 produces does not appear to suit the optimisation approach of most modern processors. I have never been able to understand why.

I look forward to what you may be able to produce.

John

2 Jun 2012 1:37 (Edited: 2 Jun 2012 5:05) #10260

Quoted from LitusSaxonicum: However, this is a very obvious application. Are there others?

Eddie

Basic Linear Algebra Subprograms (BLAS) are the most obvious application of vectorisation/SPMD techniques.

However, one of my favorite applications is to use SPMD for the fast evaluation of polynomials using Estrin's method. (also see Knuth's 'The Art of Computer Programming Vol 2', if you have it).

Given (say) a 7th order polynomial:

y = a0 + a1*x + a2*x**2 + a3*x**3 + a4*x**4 + a5*x**5 + a6*x**6 + a7*x**7

Estrin's method is:

y = (a0 + a1*x) + (a2 + a3*x)*x**2 + ((a4 + a5*x) + (a6 + a7*x)*x**2)*x**4

How can this be vectorised?

What I do is store the coefficients in an array c(:) in the following order:

! Store coefficients for vectorised Estrin's method
c = (/ a0, a4, a2, a6, a1, a5, a3, a7 /)

Often the coefficients are constants, and I can set up c at compile time using a parameter array.

Then Estrin's method can be reduced to a series of SAXPY operations, each of which (apart from the last one which is a scalar operation) can be vectorised to make use of SSE2/AVX.

Using an array V(0:3) to store intermediate results, the vectorised Estrin algorithm is:

! Evaluate 7th order polynomial, coefficients in C(:) array
x2 = x*x
x4 = x2*x2
v(0:3) = c(0:3) + c(4:7)*x    ! This can be vectorised
v(0:1) = v(0:1) + v(2:3)*x2   ! This can be vectorised
y = v(0) + v(1)*x4

The scope for vectorisation goes up as the order of the polynomial goes up. Note that, for convenient indexing, V is declared as DIMENSION(0:3) and C as DIMENSION(0:7).

Of course, it isn't very practical to hand code this in assembler, and I don't do that. I generally just use the array notation in Fortran (90, 95, 2003) and let a vectorising compiler take care of the details. This is what array notation is for and why it was put into Fortran 90 in the first place: it guarantees the elements can be calculated in any order, so vectorisation can be used.

The approach can be generalised to multiple polynomials, rational polynomial approximations etc. I find that vectorising all my inner loops and array expressions, and using multiple threads for outer loops and high level tasks (usually with OpenMP) really works.

Note that some work and thinking is always needed to find the correct form; a vectorising compiler won't do the transformation for you. If the polynomial were evaluated using the canonical form (above) or the traditional Horner's method, automatic vectorisation would not occur.
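
If anyone wants to experiment with this, here is a minimal self-contained sketch of the scheme (my own arrangement of the description above; the coefficient values are arbitrary). It evaluates the same 7th-order polynomial by the vectorised Estrin form and by Horner's method, so the two results can be checked against each other:

   program estrin_demo
      implicit none
      real*8, parameter :: a0=1d0, a1=2d0, a2=3d0, a3=4d0, &
                           a4=5d0, a5=6d0, a6=7d0, a7=8d0
      ! Coefficients stored in the interleaved order described above
      real*8, parameter :: c(0:7) = (/ a0, a4, a2, a6, a1, a5, a3, a7 /)
      real*8 x, x2, x4, v(0:3), y_estrin, y_horner

      x  = 1.3d0
      x2 = x*x
      x4 = x2*x2

      v(0:3) = c(0:3) + c(4:7)*x    ! vectorisable: 4 independent updates
      v(0:1) = v(0:1) + v(2:3)*x2   ! vectorisable: 2 independent updates
      y_estrin = v(0) + v(1)*x4     ! final scalar step

      ! Horner's method for comparison (strictly sequential, so a
      ! compiler cannot vectorise it)
      y_horner = ((((((a7*x + a6)*x + a5)*x + a4)*x + a3)*x + a2)*x + a1)*x + a0

      write (*,'(2f20.12)') y_estrin, y_horner
   end program estrin_demo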

2 Jun 2012 4:05 (Edited: 2 Jun 2012 4:18) #10261

John,

Here is a SAXPY double precision algorithm using SSE2. The subroutine overwrites Y with Y + A*X where X and Y are two arrays and A is a scalar.

To do back-substitution, instead of doing:

CALL VECSUB (A, B, FAC, N)

you should do the following, noting the minus sign on the FAC argument:

CALL fast_asm_dsaxpy(A, B, N, -FAC)
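
To make the intended usage concrete, here is a minimal back-substitution sketch built around that call (an illustration of mine, not code from this thread): it solves U*x = b for an upper triangular U, folding each solved unknown back into b with one SAXPY call per column.

subroutine back_sub (U, b, x, n)
!
! Solves U*x = b for upper triangular U. Each fast_asm_dsaxpy call
! performs b(1:j-1) = b(1:j-1) - x(j)*U(1:j-1,j).
!
   integer n, j
   real*8 U(n,n), b(n), x(n)
   external fast_asm_dsaxpy

   do j = n, 1, -1
      x(j) = b(j)/U(j,j)
      if (j > 1) call fast_asm_dsaxpy (b, U(1,j), j-1, -x(j))
   end do
end subroutine back_sub

The fast_asm_dsaxpy routine itself follows.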

subroutine fast_asm_dsaxpy(y, x, n, a)
   integer n
   real*8 x(*), y(*), a

   integer nn, m
   real*8 v(2)
   
!
! Does SAXPY update Y = Y + A*X
!
! On exit X is unchanged, Y is overwritten with Y + A*X
!

   ! Number of sets of 4
   nn = n/4

   ! Number of residuals
   m = n - 4*nn

   ! Vector containing (/a, a/)
   v = a

   ! Assembly code is between code, edoc lines
   code

      movupd xmm7%, v            ; move v array to xmm7
      
      mov eax%, =x               ; address of x
      mov ecx%, =y               ; address of y

      mov edx%, nn               ; initialise loop counter for vectorised loop
      mov edi%, m                ; initialise loop counter for scalar loop
      cmp edx%, 0
      jz $60                     ; no terms in vectorized loop

      ! Check for alignment of x(1) and y(1) on 16 byte boundaries
      and eax%, 15               ; low nibble of x address
      and ecx%, 15               ; low nibble of y address
      cmp eax%, 0
      jnz $10                    ; x(1) is unaligned, check for alignment of x(2)
      cmp ecx%, 0
      jnz $10                    ; y(1) is unaligned, check for alignment of y(2)

      ! x(1) and y(1) are aligned on 16 byte boundaries
      mov eax%, =x               ; address of x
      mov ecx%, =y               ; address of y
      jmp $40                    ; branch to aligned vectorized loop

      ! Check for alignment of x(2) and y(2) on 16 byte boundaries
  10  xor eax%, 8
      jnz $20                    ; branch to unaligned vectorized loop
      xor ecx%, 8
      jnz $20                    ; branch to unaligned vectorized loop

      ! x(2) and y(2) are aligned on 16 byte boundaries
      mov eax%, =x               ; address of x argument
      mov ecx%, =y               ; address of y argument
      movsd xmm0%, [eax%]
      movsd xmm2%, [ecx%]
      mulsd xmm0%, xmm7%
      addsd xmm0%, xmm2%
      movsd [ecx%], xmm0%        ; form y(1) = y(1) + a*x(1)
      lea eax%, [eax%+8]         ; address of x(2)
      lea ecx%, [ecx%+8]         ; address of y(2)
      dec edi%                   ; decrease loop counter for scalar loop
      jge $40                    ; branch to aligned vectorized loop
      mov edi%, 3                ; adjust loop counter for scalar loop
      dec edx%                   ; adjust loop counter for vectorised loop
      jz $60                     ; no terms in vectorized loop
      jmp $40                    ; branch to aligned vectorized loop

2 Jun 2012 4:07 (Edited: 2 Jun 2012 10:08) #10262

continued ...

      ! Initialise unaligned vectorised loop
  20  mov eax%, =x               ; address of x argument
      mov ecx%, =y               ; address of y argument
   
      ! Unaligned vectorised loop to calculate y = y + a*x for element sets 1 to nn
  30  movupd xmm0%, [eax%]       ; move next 2 doubles in x into xmm0
      movupd xmm1%, [eax%+16]    ; move next 2 doubles in x into xmm1
      movupd xmm2%, [ecx%]       ; move next 2 doubles in y into xmm2
      movupd xmm3%, [ecx%+16]    ; move next 2 doubles in y into xmm3
      mulpd xmm0%, xmm7%         ; multiply x vector by scalar
      mulpd xmm1%, xmm7%         ; multiply x vector by scalar
      addpd xmm0%, xmm2%         ; form y + a*x vector
      addpd xmm1%, xmm3%         ; form y + a*x vector
      movupd [ecx%], xmm0%       ; move xmm0 into next 2 doubles in y
      movupd [ecx%+16], xmm1%    ; move xmm1 into next 2 doubles in y
      lea eax%, [eax%+32]
      lea ecx%, [ecx%+32]
      dec edx%                   ; decrement counter
      jnz $30                    ; conditional branch back
      jmp $60

      ! Aligned vectorised loop to calculate y = y + a*x for element sets 1 to nn
  40  movapd xmm0%, [eax%]       ; move next 2 doubles in x into xmm0
      movapd xmm1%, [eax%+16]    ; move next 2 doubles in x into xmm1
      movapd xmm2%, [ecx%]       ; move next 2 doubles in y into xmm2
      movapd xmm3%, [ecx%+16]    ; move next 2 doubles in y into xmm3
      mulpd xmm0%, xmm7%         ; multiply x vector by scalar
      mulpd xmm1%, xmm7%         ; multiply x vector by scalar
      addpd xmm0%, xmm2%         ; form y + a*x vector
      addpd xmm1%, xmm3%         ; form y + a*x vector
      movapd [ecx%], xmm0%       ; move xmm0 into next 2 doubles in y
      movapd [ecx%+16], xmm1%    ; move xmm1 into next 2 doubles in y    
      lea eax%, [eax%+32]
      lea ecx%, [ecx%+32]
      dec edx%                   ; decrement counter
      jnz $40                    ; conditional branch back

      ! Check for entries in the scalar loop
  60  cmp edi%, 0
      jz $80                     ; conditional skip loop, no terms in scalar loop

      ! Scalar loop to calculate y = y + a*x for the last nn+1 to n elements
  70  movsd xmm0%, [eax%]
      movsd xmm2%, [ecx%]
      mulsd xmm0%, xmm7%
      addsd xmm0%, xmm2%
      movsd [ecx%], xmm0%
      lea eax%, [eax%+8]
      lea ecx%, [ecx%+8]
      dec edi%
      jnz $70

      ! End of assembly
  80  nop

   edoc

end subroutine fast_asm_dsaxpy

I hope it's helpful. I'm off to play with my Raspberry Pi.

2 Jun 2012 8:41 (Edited: 5 Jun 2012 7:58) #10263

I'm glad users and developers are looking closely at the options for speeding up and optimising FTN95. In the latest versions I have started to see substantial speedups, of the order of 1.5x, when I use /opt; before, /opt just crashed my code. But there still exist a lot of amazing possibilities, as David has just shown.

BTW, I have a newer parallel library for linear algebra tasks, created with IVF, which unfortunately exists only in LIB form; I have only the older CVF, which does not make DLLs. With FTN95 it works for only some routines, such as the dense matrix ones, but not for the block matrix ones (not sure why; the author who tried to make a native LIB for FTN95 failed for reasons only he and the FTN95 developers know, with the library complaining about some missing system routines). It is 1.6 times faster than the older Microsoft library. Anyone here interested in converting it to a DLL for use with any compiler? PM me. Intel explains how to do that:

Can IVF convert existing lib files into dll ?

Yes, you can, with some limitations. If the library consists of routines only, and there is no shared data such as COMMON blocks that are expected to be used by the 'end user', then you can do this.

Making a DLL is the best approach I can suggest. In general, objects/libraries from one Fortran compiler are not linkable with another Fortran compiler. The run-time libraries are different and sometimes the calling ABI is different. The issue you are seeing is that the IVF-compiled code has calls to the Intel run-time libraries which are not present with the other compiler. You could add the Intel libraries to the link, but that may create conflicts.

Wrapping your code in a DLL isolates the run-time library issues, though you need to make sure that you don't try to do things such as I/O or ALLOCATE/DEALLOCATE across compilers. There should not be a performance impact of using the DLL approach.

You will need to create a '.DEF' file which lists all the routines to be exported from the DLL. It is a text file that has a line for each routine to be exported like this:

EXPORT routine1
EXPORT routine2
EXPORT routine3

You may have to experiment with the names - the case (upper/lower/mixed) must match and there may be prefixes or suffixes that you have to consider.

Name this file with a .DEF (or .def) file type. This is an input to the linker, so you can, from the command line, say:

link /dll mylib.lib mydef.def

This will require that the run-time libraries of the other compiler be available and you may have to add them to the link command.

-- Steve Lionel Developer Products Division Intel Corporation Nashua, NH

On the other hand, I am interested in whether anyone has used BLAS for parallel algebra. No time to look at that myself. Specifically, I need only one subroutine which solves block matrix equations (with the blocks residing near the diagonal, like Substitute_VAG_S or Substitute_VAG_D in my library).

3 Jun 2012 5:15 #10265

For John,

I was only parroting the entry in Wikipedia:

'SSE originally added eight new 128-bit registers known as XMM0 through XMM7. The AMD64 extensions from AMD (originally called x86-64) added a further eight registers XMM8 through XMM15, and this extension is duplicated in the Intel 64 architecture. There is also a new 32-bit control/status register, MXCSR. The registers XMM8 through XMM15 are accessible only in 64-bit operating mode.'

If true, then a 64-bit OS opens up the opportunity to do more operations per cycle.

For David,

My question 'are there any more useful applications of this?' (or words to that effect) was far from a complete question, and it has the simple answer 'Yes'. One might divide the answers into two:

Applications where a high level programmer might readily detect that there was an opportunity to simply call a more efficient subroutine like David's dot product, and

Applications where it is by no means obvious as above, but after a certain amount of syntactical analysis in the compiler, it becomes evident that there is a faster approach to generating the executable code.

In the former case, the solution is a library (which could easily be supported in whole or part by the user community or by 3rd party software suppliers), in the latter case it is part of the optimisation done by the compiler and the user community has no role. Just reading the Polyhedron benchmarks makes it seem likely that some of the other compilers do precisely this.

A third approach might be to make it clear which language constructs in Fortran and FTN95 compile to faster code than others. For example, suppose A has dimensions 1000x1000; then is

      DO 20 I=1,1000
      DO 10 J=1,1000
      A(I,J) = 0.0
   10 CONTINUE
   20 CONTINUE

or

      DO 20 J=1,1000
      DO 10 I=1,1000
      A(I,J) = 0.0
   10 CONTINUE
   20 CONTINUE

or is

      A = 0.0

or indeed, would EQUIVALENCEing A to B(1000000) and initialising B in a single loop be better (assuming it needs to be done multiple times)? [This is only an example.] Presumably, one has the opportunity both to help and to hinder the optimiser in FTN95. I almost never dabble with /opt, as my applications seem fast enough without it, and it used to make them crash, but I may well take comfort from Dan's finding that /opt works better nowadays.
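
A minimal timing harness for the three variants (an illustrative sketch of mine, not code from this thread): in column-major Fortran the J-outer loop and the whole-array assignment run through memory with stride 1, so they would normally be expected to win.

      program zero_test
      integer, parameter :: nrep = 100
      real a(1000,1000)
      real t0, t1, t2, t3
      integer i, j, k

      call cpu_time (t0)
      do k = 1,nrep              ! I-outer: jumps 1000 reals per access
        do i = 1,1000
          do j = 1,1000
            a(i,j) = 0.0
          end do
        end do
      end do
      call cpu_time (t1)
      do k = 1,nrep              ! J-outer: stride-1 through memory
        do j = 1,1000
          do i = 1,1000
            a(i,j) = 0.0
          end do
        end do
      end do
      call cpu_time (t2)
      do k = 1,nrep              ! whole-array assignment
        a = 0.0
      end do
      call cpu_time (t3)

      write (*,'(3f10.4)') t1-t0, t2-t1, t3-t2
      write (*,*) a(1,1)         ! reference a so the loops are not optimised away
      end program zero_test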

Eddie

EDIT: Re /opt, it stops some of my programs working, because they respond differently to mouse clicks. So perhaps the comfort I took was a false comfort. They did seem to work faster before deleting half the dataset. Perhaps sometimes one can be too fast ....

4 Jun 2012 1:38 #10266

This has been a rewarding collaborative effort, as I think we are getting some useful results thanks to David's coding efforts. Thanks to Dan's prompting, I have now written 4 different test programs to test Dot_Product performance and the impact of alignment. I would be pleased to forward these to anyone who wants them, if they provide me with email addresses via messages. The simplest is the MATMUL test, which I have now run using both Salford Ver 6.10 and Intel Ver 11.1 on my old Xeon processor. The performance times for the 5/6 options I have tested are:

Test                              Salford    Intel
MATMUL Intrinsic                    14.12     0.90
Simple 3 DO loops                   10.09    10.06
2 DO + Dot_Product                  10.09    10.06
Column transpose + Dot_Product       1.42     0.85
Column transpose + Vec_Sum           1.43     1.46
Column transpose + Asm_Sum           0.93

The Intel test was compiled with /o2, which provided the SSE vector instructions. A Core i5 gave similar performance, without selecting AVX instructions.

This demonstrates that Intel's MATMUL combines the temporary transposed column with an optimised Dot_Product to achieve its performance. For Salford, by using a temporary column and David's SSE2 assembler code, the run times are comparable to Intel's performance.

This demonstrates (in a simple example) that the performance being achieved by other Fortran compilers can be replicated with Salford by selective use of SSE2 instructions in key inner loops. From my limited experience, I have found that SSE and now AVX instructions are a relatively easy path to performance improvement, certainly easier than parallel coding approaches.

I know that for my Finite Element package there are probably only a few key loops that are critical to performance. David has now produced examples for the 2 key loops I have. The Profile_Test example I posted on 13th May gave 6 similar DO loop coding approaches which produced significantly different performance times with FTN95. If /opt could address some of these inefficiencies, and potentially identify where SSE2 instructions could be used in the inner loop, then FTN95 could achieve the relative benchmark performance that FTN77 once achieved.

A useful result for others might be to publish and document a few simpler loop approaches, so that users could utilise SSE2 instructions with FTN95.

Thanks for all the help,

John

11 Oct 2013 1:24 (Edited: 11 Oct 2013 3:43) #13133

Building on the example from Davidb, would it be possible to add some support for SSE2, SSE4 or AVX instructions in FTN95 via a maths library? By placing this in a library, the changes required to FTN95 could be limited to making the instructions available in a CODE / EDOC block.

I have identified two main vector calculations in past posts which I would like supported. It would be good to have these available, as they could significantly improve run times when used.

REAL*8 FUNCTION Dot_Product_8 ( A, B, n )
 Real*8 A(*), B(*), acum
 Integer*4 n, i

 acum = 0
 do i = 1,n
   acum = acum + A(i)*B(i)
 end do
 Dot_Product_8 = acum
END FUNCTION Dot_Product_8

REAL*8 FUNCTION Dot_Product_SSE2
REAL*8 FUNCTION Dot_Product_SSE4
REAL*8 FUNCTION Dot_Product_AVX

SUBROUTINE Vec_Add_8 ( A, B, Const, n )
 Real*8 A(*), B(*), Const
 Integer*4 n, i

 do i = 1,n
   A(i) = A(i) + Const * B(i)
 end do
END SUBROUTINE Vec_Add_8

SUBROUTINE Vec_Add_SSE2
SUBROUTINE Vec_Add_SSE4
SUBROUTINE Vec_Add_AVX

(I have kept the argument lists in F77 style to maintain simplicity and avoid the complexity of array slices or F90 interface syntax.)

Is it possible to investigate this enhancement and see whether it is feasible?

Even supplying it as source code, as David has posted, would signal support for vector instructions in FTN95. Vector instructions can be a simpler path to significant performance improvement with FTN95, without the complexity of OpenMP multi-threading.

I'm sure there are a lot of other vector or matrix calculations that could be supported, but a demonstration of the above two would be a significant start. Certainly a restricted form of MATMUL ( A x B ), or preferably ( A' x B ) since A' makes for a simpler calculation based on Dot_Product, as sketched below.
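
As an illustration of why the A' x B form is simpler, here is a minimal sketch using the Dot_Product_8 interface proposed above (my own illustration): with A transposed, both operands of every dot product are contiguous columns, so a stride-1 SSE kernel applies directly, with no temporary copy of a row.

SUBROUTINE MatMul_T ( A, B, C, m, l, n )
! Computes C = A' x B, where A is m x l, B is m x n and C is l x n.
! A(1:m,i) and B(1:m,j) are both contiguous columns, so every element
! of C is a stride-1 dot product, ideal for an SSE/AVX kernel.
 Real*8 A(m,*), B(m,*), C(l,*)
 Integer*4 m, l, n, i, j
 Real*8 Dot_Product_8
 External Dot_Product_8

 do j = 1,n
   do i = 1,l
     C(i,j) = Dot_Product_8 ( A(1,i), B(1,j), m )
   end do
 end do
END SUBROUTINE MatMul_T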

This approach could also provide an opportunity for performance improvement while using FTN95. I would certainly like to understand the opportunities SSE4 and AVX instructions could provide for improving FTN95. Issues associated with alignment could also be investigated, although I understand newer processors have addressed this issue with AVX. It would be good to be able to test these options!

John

I note there are even newer FMA3 instructions to support the calculation 'd = a + b x c', which would suit the Vec_Add calculation. Can these be accessed via CODE/EDOC blocks?

11 Oct 2013 3:39 #13134

Paul,

I also note there is a compiler option /SSE2 listed by FTN95 /?. It is not listed in FTN95.CHM. Is this a new or an obsolete option?

John

11 Oct 2013 7:03 #13135

It is old and does not appear to do anything so I have now removed it from the list.

15 Oct 2013 11:40 #13138

Paul,

I sent an email with an example for some of the real*8 maths library, which uses the SSE routines that Davidb provided. I am not sure if they are SSE2 or SSE4 instructions. I would like to see this library developed and either provided with FTN95 or included in the coding examples. This would provide us users of FTN95 with access to faster computing, by way of vector instructions for a few basic types of array calculation, using CODE/EDOC structures. If these routines could be documented, this could provide a vehicle for enlarging the set. My understanding of the change required to FTN95 is that more of the SSE4, AVX and possibly FMA3 instructions need to be supported by CODE/EDOC. Is this the case?

I am at present using the 2 routines that David provided, for [a].[b] (dot_product) and [a] = [a] + const*[b].

There are two main areas of computing performance change that are leaving FTN95 behind: multiple processors and larger vector registers. My experience from testing both vector instructions and OpenMP is that vector instructions can be a much easier way of achieving a 200% to 300% run time improvement, by targeting the key inner loops of the computation.

I'd be interested in your comments and those from others who might find this of use.

John

16 Oct 2013 6:55 #13139

John

Thanks for the enquiry and information. I have logged it for investigation.

Paul

16 Oct 2013 8:30 #13141

John, post the simplest possible benchmarks here demonstrating the effect on multiple common tasks. Dot product or matrix inversion are OK, but it would be great to have at least a few applications like LU factorisation, or Gauss and band matrix solvers for AX=B equations. In the bigger picture, with SSE support added and multithreading now working (buy one of the latest 16-core Intel server chips, admittedly very expensive, and you may see tremendous speedups), this compiler will mostly be missing only 64-bit support.

17 Oct 2013 7:52 #13144

Dan,

I have written a sample program and will try to save it to dropbox.com, so that it can be downloaded.

I have run a number of tests of matrix multiplication, being:

  1. using FTN95 MATMUL for C(l,n) = A(l,m) x B(m,n)
  2. using vector addition using Vec_Add_SSE written by Davidb
  3. using dot_product using Vec_Sum_SSE, by providing A'

The test results, in seconds, are:

Test 1   Test 2    Test 3     Description
 20.38     2.08      1.77     uniform values of  l,m,n = 1600 1000 1400
  2.09     1.10      1.12     test with small m  l,m,n = 4500 100 3500
  3.82     1.89      1.35     test with large l  l,m,n = 12000 100 1500
 12.06     1.21      1.19     test with large m  l,m,n = 550 10000 250
  0.62     0.22      0.72     test with large n  l,m,n = 2500 10 12000

These results are with /opt compilation and show how SSE instructions can provide a significant improvement.

I'll set up the drop box from home tonight and attach the link here.

John

https://dl.dropboxusercontent.com/s/qvftgj1ezddqaqw/veclib.f90?token_hash=AAHcRRJYzeUZNOTECeJ8nVeaTOPHSULo_TAb8x3R2_1D5w&dl=1

This link appears to contain the test program I used to generate the above results. The program runs the 5 test cases of matrix multiplication above, each computed by direct multiplication, by multiplication using A transpose, and then by the MATMUL intrinsic. The optimised tests are based on the 2 routines that Davidb provided. There is also a timer based on CPU_clock@, which is an RDTSC timer that cycles at the processor clock rate (2.66 GHz).

Give it a run and let me know if it works in a similar way for you. It would be good if this could be updated to include AVX routines and could be shared with all FTN95 users who want faster real*8 vector computations.

17 Oct 2013 5:37 #13145

It would be good if the alignment issue in FTN95 could be fixed. Then we could at least write some user contributed routines.

In the meantime, my preference has been just to use FTN95 for development, testing and debugging, and compile with gfortran (or ifort) when I want efficient release code. This is easier now that Clearwin+ is supported on other compilers (though I don't use it much personally).

This way I can take advantage of OpenMP, SSE, AVX etc.

22 Oct 2013 1:35 (Edited: 23 Oct 2013 8:26) #13183

John, davidb,

The speedups I see don't tell me much, because it is not clear to me right now whether SSE will actually be applied to the bottlenecks of my tasks. We need more tests on general purpose cases. But it would be good to ask the authors of my parallel algebra library whether they use SSE and, if not, ask them to try it on the various matrix equation solvers they have in there.

In the meantime, want an exercise in implementing SSE on an A*X = B linear equation solver? Here is the simplest Gauss elimination program, modified to solve the usual square dense matrix systems of equations as well as block matrix ones. The text explains it all. Play with it first (it is intentionally written a bit verbosely, for others who have not yet started using Clearwin+), and then try to modify the short 18-line Gauss solver, adding an SSE case to the last subroutine, which I left empty. Run the comparison. That way we will see the difference on at least some of the most commonly used general purpose programs.

!...compilation : FTN95 Bench.f95 /opt /link 
   module MajorDeclarations
   use clrwin	

!.... matrix related 
   real*8, dimension(:,:), allocatable :: A    
   real*8, dimension(:),   allocatable :: B, X 
   integer nEquat
   integer*4, dimension(:),allocatable :: IJmax
 
!...  GUI 
   real*8 ElapsedTime, Progress
   integer keySSEon	

   CONTAINS

!......................................
   integer function cbRun ()
   real*4 tStart, tFinish
   real*8 random, SEED    ! , DATE_TIME_SEED@
   integer li(100)

   CALL PERMIT_ANOTHER_callback@()

!...Allocating arrays
   allocate ( A(nEquat,nEquat), stat = li(50))
   allocate ( B(nEquat),        stat = li(51))
   allocate ( X(nEquat),        stat = li(52))
   allocate ( IJmax(nEquat),    stat = li(53))

!...Randomly fill the arrays
!  call DATE_TIME_SEED@
   SEED = 1.234
   call SET_SEED@(SEED)

   do i=1,nEquat
     B(i) = random()
     do j=1,nEquat
       A(i,j) = random()
     enddo
   enddo
   IJmax(:) = nEquat

   ElapsedTime = 0
!...Do the test
   CALL CPU_TIME(tStart)
     if(keySSEon.eq.0) then
       call GAUSS_Square_Block
     else
       call SSE_BlockSolver
     endif		
   CALL CPU_TIME(tFinish)
!...

   ElapsedTime	= tFinish - tStart
   write(*,'(a,i1, i10,a,f10.3)') 'SSE=',keySSEon, nEquat, ':', ElapsedTime
!   call window_update@(ElapsedTime)
!...Deallocating
      deallocate ( A )
      deallocate ( B )
      deallocate ( X )
      deallocate ( IJmax )
!...zeroising progress bar 
      Progress = 0	
      call window_update@(Progress)

   cbRun=1
   end function	

   end module MajorDeclarations
!=================================================================
   Program Bench
   winapp
   use clrwin	
   use MajorDeclarations

   nEquat = 1000
   i=winio@('%ww%ca[A*X=B Matrix Benchmark]%1tl&', 21)
   i=winio@('Number of equations%ta%il%dd%6rd%`il%ff&',1,10000,100,nEquat)
   i=winio@('SSE Method %ta%rb[SSE]%ff %ff&',keySSEon)
   i=winio@('%cn%bc%^bt[Start]%ff %ff&', rgb@(237,177,11), cbRun)
   i=winio@('%ac[Enter]&',cbRun)	
   i=winio@('%ac[esc]&','exit')	
   i=winio@('%cn%30br%ff&', Progress, RGB@(255,0,0))
!  i=winio@('Elapsed time %ta%6rf%ff&', ElapsedTime)
   i=winio@('%30.15cw', 0)

   end	
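
For anyone who wants to run the benchmark as posted, note that the excerpt above ends before the solver routines. A minimal version of the dense solver the GUI calls might look like the following (a hypothetical reconstruction for illustration, not Dan's original 18-line routine): plain row-oriented Gauss elimination on the module arrays, written for clarity rather than speed, followed by back substitution.

   subroutine GAUSS_Square_Block
!...Hypothetical reconstruction of the missing dense solver: simple
!...Gauss elimination with no pivoting (adequate for this random test
!...matrix, not for general use), followed by back substitution
   use MajorDeclarations
   integer i, k
   real*8 fac

   do k = 1, nEquat-1                 ! forward elimination
      do i = k+1, nEquat
         fac = A(i,k)/A(k,k)
         A(i,k:nEquat) = A(i,k:nEquat) - fac*A(k,k:nEquat)
         B(i) = B(i) - fac*B(k)
      end do
   end do

   do i = nEquat, 1, -1               ! back substitution
      X(i) = (B(i) - dot_product(A(i,i+1:nEquat), X(i+1:nEquat)))/A(i,i)
   end do
   end subroutine GAUSS_Square_Block

The SSE variant (SSE_BlockSolver) is the part Dan left as the exercise.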