Silverfrost Forums

FTN 95 8.10 Personal Edition

12 Mar 2017 10:12 #19081

Of course it is relevant to FTN95, specifically to a future FTN XX which should be parallel: have you noticed that AMD released an 8-core/16-thread processor last week, beating Intel on price and performance? For $300+. Everyone should run to the shops and start thinking 'parallel'.

So all of the above is much more than just 'fancy', as in the forum description here: 'General. General discussions on FTN95, Fortran, Third Party tools...basically anything that takes your fancy!'

12 Mar 2017 11:13 #19082

Dan,

Adapting from Shakespeare: 'methinks you doth protest too much'

I am not going to pursue the Laipe approach. It just doesn't make sense, given the performance times they are quoting.

Not sure of some of your comments, but for context:

the operation count for the calculation is 3,688 GFlop (billions of floating-point operations)

my matrix multiply basically uses DAXPY and partitions the matrices to focus on smaller packets.

memory usage is 1.8 GB, so there are lots of memory-to-cache transfers, which are the significant bottleneck, especially when many threads are operating. This is why a cache blocking strategy is so important.
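(As a consistency check, my arithmetic rather than anything stated above: for an order-n matrix multiply in real*4, the flop count is 2n^3 and three matrices need 3 x 4 x n^2 bytes, and both quoted figures point to roughly the same matrix order.)

    2 * n**3      =  3.688e12   =>   n ~ 12,260
    3 * 4 * n**2  =  1.8e9      =>   n ~ 12,250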

To explain the testing I have done:

I re-did my test using real*4 arrays and got interesting results. (I can send you the test program if you wish.)

On my (very old) i5-2300, which is 4-core / 4-thread (i.e. no hyper-threading) but supports SSE instructions:
the intrinsic MATMUL takes 660 seconds
the single-thread cache strategy takes 517 seconds
the 4-thread cache strategy takes 145 seconds, which is equivalent to 25.4 gflops

Compare this to the quoted Intel Xeon L7555 performance of 5,678 seconds for a single thread and 204 seconds for 32 threads. How can this be so slow?

On my (now old) i7-4790K, which is 4-core / 8-thread, using 1600 MHz memory and AVX instructions:
the intrinsic MATMUL takes 507 seconds
the single-thread cache strategy takes 286 seconds
the 8-thread cache strategy takes 66.5 seconds, which is equivalent to 55.5 gflops

Compare this to the quoted AMD Opteron 6168 performance of 3,494 seconds for a single thread and 88.6 seconds for 48 threads.

Perhaps these multi-core processors are not suited to this type of calculation. I would expect the Xeon to support SSE/AVX instructions? They appear to be incredibly slow, with neither as fast as a basic 4th-gen Intel 4-core processor. Strange result!

The Laipe single-thread times start from such slow performance that, while they may demonstrate good efficiency of the threads, they don't demonstrate overcoming some of the important problems associated with multi-threading, such as the memory-to-cache bottleneck and keeping data in cache to enable AVX instructions.
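To put numbers on that thread efficiency (my arithmetic from the times above):

    517  / 145  =  3.6x on  4 threads  (i5-2300, 4 cores)
    286  / 66.5 =  4.3x on  8 threads  (i7-4790K, 4 cores + HT)
    5678 / 204  = 27.8x on 32 threads  (quoted Xeon L7555)
    3494 / 88.6 = 39.4x on 48 threads  (quoted Opteron 6168)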

I should point out that in these matrix multiply tests the cache strategy works very well, and so AVX performance on the i7 is working very well. Most other multi-threaded calculations I have do not perform this well. A larger cache and faster memory should make this better, but I have yet to test this.

John

13 Mar 2017 12:05 #19083

Oh my...more words...I have to take some bootleg ('palen'aya') Stoli... 😃 I have noticed recently that I cannot explain elementary things to anyone. These gflops are different gflops: they were not obtained in a controlled environment on a similar setup and hardware. And they are not really gflops either, because there is hardly any FP work in there, mostly memory transfers.

14 Mar 2017 12:02 #19107

Dan,

Rather than more words, here is the test:

To give some FTN95 relevance to my tests, I converted the test program to FTN95 and ran it with 1 thread using FTN95 /64. (The conversion mainly involved limiting the multi-thread options, changing the non-standard timer routines, and including the AXPY4@ routines for vector instructions.) These tests use real*4 arrays.

The results are not good, especially for MATMUL!

There are 4 different matrix multiplication approaches being tested in the linked program (results on my i7-4790K):
FTN95 /64 using MATMUL achieves 0.2 gflops
FTN95 /64 using array syntax in the inner loop achieves 0.6 gflops
FTN95 /64 using AXPY4@ in the inner loop achieves 6 gflops
FTN95 /64 using caching and AXPY4@ achieves 11 gflops

MATMUL performance with /64 is very poor. Any performance below 1 gflops is not good, and it shows the penalty for not using SSE/AVX instructions where they are available.

The following links provide the test program and the batch files I have used (you may want to stop at test:3!):
https://www.dropbox.com/s/j1avyv18kvfko4p/laipe4_sf.f90?dl=0
https://www.dropbox.com/s/3w32uns3fihh9rf/do_sf.bat?dl=0
https://www.dropbox.com/s/e1kyhuv598tckjf/run_laipe_sf.bat?dl=0

do_sf.bat is used to perform the tests.

MATMUL is called at line 226
array syntax is stream_matmul_dp: lines 288:303
AXPY4@ in the inner loop is laipe_matmul_dp: lines 305:323
cached + AXPY4@ is laipe_matmul_cache: lines 325:356

I tried FTN95 /64 /opt, but this made little change to MATMUL or array syntax performance.

I would recommend the use of laipe_matmul_dp for 'small' arrays, although the extension for caching is not a large overhead.

The code includes !$OMP OpenMP directives for use where OpenMP is available, and is an example of its use for matrix multiplication. Matrix multiplication is one of the easiest applications of OpenMP, with little overhead; a minimal sketch follows. FTN95 ignores this syntax.
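To illustrate the pattern (a sketch only, not the posted test code: FTN95 treats the directive as a comment, while gFortran with -fopenmp spreads the j loop across threads):

   subroutine omp_matmul (a, b, c, nra, nca, ncb)
!    minimal OpenMP matrix multiply: each column of C is independent
     integer*4 nra, nca, ncb, j, k
     real*4    a(nra,nca), b(nca,ncb), c(nra,ncb)
!$OMP PARALLEL DO shared (a,b,c,nra,nca,ncb) private (j,k)
     do j = 1,ncb
       c(:,j) = 0
       do k = 1,nca
         c(:,j) = c(:,j) + a(:,k) * b(k,j)   ! vectorisable AXPY-style update
       end do
     end do
!$OMP END PARALLEL DO
   end subroutine omp_matmul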

John

14 Mar 2017 1:50 #19108

I tried to download and view the contents of the files from your Dropbox on my phone, because I am away from my computer, but the phone complains that it cannot open the files. Let me ask you now, so as not to lose the whole day to the time difference with Australia before you or I go to sleep -- in what form did you get LAIPE? As a source file, a LIB, or a DLL?

If as a source file, then the performance is obviously not expected to be good, at least until FTN95 is fully optimized and parallelized. The only way I have used LAIPE so far was to link FTN95 OBJ files with a LIB compiled by the fastest compiler. The author has a bunch of different LIBs, but the fastest approximately 8 years ago, my current laipe.lib library, was the IVF lib. The difference can reach a few times between libraries made with different compilers, and even between the 32-bit and 64-bit libraries of the same compiler; see the benchmarks on his site.

The question is whether gFortran can make a 64-bit, or at least a 32-bit, DLL (or maybe you or its author can generate a 64-bit DLL with Intel Fortran; the author promised but still hasn't done that). Then it would be compatible with FTN95 or any other compiler, and this is how it should be used.

14 Mar 2017 2:38 #19109

Dan,

I don't use Laipe. I have been quoting the Laipe performance reported on the equation.com web site. I find their quoted single-thread performance to be incredibly slow. If you start from such a slow base, the improvements with multiple threads are not that significant, and they ignore the real multi-thread problems that occur when combining AVX and !$OMP. To achieve the single-thread results they quote, they must have turned off vector instructions, not used -ffast-math, and possibly introduced other delays. Why?

I can use AVX calculations and !$OMP on an i7-4790K and get better performance than the best that equation.com quotes for these many-core / many-thread processors they have used, so why use Laipe?

With FTN95, the latest results I posted show that axpy4@ (or axpy8@) gives good vector performance. I would certainly recommend this approach where possible. I also demonstrated that a cache-smart approach is important for good AVX performance. I have not used multi-threaded libraries with FTN95 or FTN95 /64; I am not sure how robust this would be.

John

14 Mar 2017 8:55 #19110

Quoted from JohnCampbell: 'Dan, I don't use Laipe. I have been quoting the Laipe performance reported on the equation.com web site.'

then there are just even more words

15 Mar 2017 1:34 #19118

Paul and Robert,

I think what Dan may be asking is: could a third-party .dll be linked into an FTN95 executable, either 32-bit or 64-bit?

The .dll being proposed is a multi-threaded computation, generated from either gFortran or ifort, that has !$OMP capabilities. Could the following code (or a subset) be compiled in gFortran with the options -O3 -mavx -ffast-math -fopenmp and then linked into an FTN95 calling program? (A sketch of the proposed build step follows the code below.) Basically, we have seen the opposite direction work with clearwin64.

John

   subroutine laipe_matmul_cache (a,b,c,nra,nca,ncb)
     use precision
!    matrix multiplication : multi thread and caching strategy

       integer*4 nra,nca,ncb,  j,k, k1,k2
       real(dp) :: a(nra,nca), b(nca,ncb)
       real(dp) :: c(nra,ncb)
!
       integer*4 num_cache_columns, nk
       external  num_cache_columns
!
!   determine columns of A per pass
      nk = num_cache_columns (nra,nca)
!
      do k1 = 1,nca,nk
        k2 = min ( k1+nk-1, nca)
!
!$OMP PARALLEL DO shared (a,b,c,nra,nca,ncb,k1,k2) private (j,k)
        do j = 1,ncb
          if (k1==1) c(:,j) = 0
          do k = k1,k2
!!            c(1:nra,j) = c(1:nra,j) + a(1:nra,k) * b(k,j)
            call vec_add_dp ( c(1,j), a(1,k), b(k,j), nra )
          end do
        end do
!$OMP END PARALLEL DO
!
      end do   ! cache size passes of A
!
   end subroutine laipe_matmul_cache

   subroutine vec_add_dp ( y, x, a, n )
!  DAXPY interface routine
     use precision
     integer*4 :: n
     real(dp)  :: y(n), x(n), a
!
     INTEGER*8 :: n8 
     n8 = n
     call AXPY4@(y,x,n8,a)   ! FTN95 /64 routine
!
!       y = y + x * a    !  array syntax alternative
!
!       do i = 1,n         ! do loop alternative
!        y(i) = y(i) + x(i) * a
!       end do
   end subroutine vec_add_dp

   integer*4 function num_cache_columns (nra,nca)
!
!    matrix multiplication : multi thread and caching strategy
!    find the number of columns of A to store in each pass of multiplication
!    number is based on 
!       size of cache and 
!       number of cores (threads) in use
!
     use precision    !  byte_size
     use laipe_test   !  cache_size, use_cores,  nk, ncp

       integer*4 nra,     &    ! number of rows in A
                 nca           ! number of columns of A
!
!  Check that A is cached to 5mb
!     nk  = number of columns per cache pass
!     ncp = number of passes
!
!   Estimate number of columns for cache limit
      nk = (cache_size/byte_size) / nra - use_cores  !  allow 1 column for C for each thread
!
      if ( nk > nca ) then                           ! too many : no cache strategy required
        nk  = nca
        ncp = 1           
!
      else if ( nk <= use_cores ) then               ! too few : no smaller than 1 column per thread
        nk  = use_cores
        ncp = (nca+nk-1)/nk                          ! number of passes
!
      else
        ncp = (nca+nk-1)/nk                          ! number of passes
        nk  = (nca+ncp-1)/ncp                        ! even up columns per pass
        if ( use_cores > 1 )   &                     ! make sure multiple of use_threads
        nk  = ( (nk+use_cores-1)/use_cores ) * use_cores   ! round up to columns as multiple of cores 
!
      end if
!
      write (*,*) ' A is cached to',ncp,' passes of',nk,' for',nca,' columns'
!
      num_cache_columns = nk
!
   end function num_cache_columns
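For completeness, a minimal sketch of the build step being proposed (the file names are placeholders and the link step is untested; the gFortran options are those listed above):

   rem build the multi-threaded code as a 64-bit DLL with gFortran (hypothetical names)
   gfortran -O3 -mavx -ffast-math -fopenmp -shared laipe_matmul.f90 -o laipe_matmul.dll

The resulting DLL would then be linked against the FTN95-compiled calling program with SLINK64, with the gFortran OpenMP runtime (libgomp) available at run time.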

15 Mar 2017 7:09 #19119

I haven't tried this kind of connection. I guess that it depends on whether the routines are exported as extern "C". It would be worth a try.

15 Mar 2017 8:53 #19121

I know that the IVF-made LAIPE.LIB mostly works with 32-bit FTN95; FTN95 understands its format. It did not work with some subroutines, and the FTN95-generated EXE complained at run time about some missing system functions. But Intel's Steve Lionel wrote to me that the LIB file has to be replaced by a DLL, which gathers all the system functions used into the DLL file.

As to gFortran, anyone who has used it can check whether a DLL made in gFortran is compatible with FTN95. The parallelization job the DLL library does internally is gFortran's or IVF's business.

15 Mar 2017 11:19 #19126

Quoted from PaulLaidler: 'I haven't tried this kind of connection. I guess that it depends on whether the routines are exported as extern "C". It would be worth a try.'

I have used FTN95-64 with a couple of 64-bit DLLs intended for use with Intel Fortran or with Intel/MS C.

One of them is the Pardiso library (V4.12 and V5.00). The Pardiso DLLs depend on the Intel OpenMP DLL, but I have that. The FTN95-produced 64-bit EXE ran fine on several large symmetric matrices from the NIST Matrix Market.

On the other hand, the MKL library uses somewhat complicated modules to map simplified interface names of library entry points to highly decorated actual entry point names. I could see that making this work would take considerable effort, and there is a good chance that it would fail.

In short, if making FTN95-64 work with third-party libraries is important to you, it is worth trying out. If the third-party library is supplied only as a static library then, as DanRight said, first build a DLL from that library, and make the DLL export all the symbols that you wish to use from your FTN95-compiled program.
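A minimal sketch of that LIB-to-DLL step using the Microsoft linker (laipe.def is a hypothetical module-definition file listing the EXPORTS you need; any compiler runtime libraries the static LIB depends on must also appear on the link line):

   rem re-package a static library as a DLL that exports the required symbols
   link /DLL /MACHINE:X64 /DEF:laipe.def laipe.lib /OUT:laipe.dll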

16 Mar 2017 7:17 #19136

Thanks, mecej4, for the info about the Pardiso lib; it can be useful. Also for the large NIST Matrix Market collection, which is very interesting.

Now it would be great if you could check whether the gFortran you also use makes DLLs compatible with FTN95. It is not FTN95's job to worry about what parallelization the DLL is doing inside, or how. If it is compatible with FTN95, then I'd encourage you to try LAIPE (it is really simple) and compare it to other parallel algebra packages.

By the way, the manual for the Pardiso library says that Intel Fortran and MS Dev Studio have to be installed for it to work. The LAIPE parallel algebra library, whether LIB or DLL (it does not matter), does not need anything else: you just call its subroutines as usual in Fortran and link it with your other OBJ, LIB or DLL files using SLINK.

16 Mar 2017 11:05 #19142

I don't know what to make of Laipe. I have Gfortran/Gcc 6.2 from Equation.com, and it includes the Laipe libraries. I built the example at the end of Chapter II of the manual ( ftp://ftp.equation.com/laipe/document/laipe_eqsolver.pdf ). The program runs, but the 'decomposed' matrix is the same as the original matrix, and the 'solution' is the same as the input R.H.S. vector. I suspect that the library checks for a license key or file and does a short return when it finds none. The vendor has the right to require licensing, but giving a false impression of doing something fast is not good. I have no intention of buying a Laipe license.

16 Mar 2017 11:56 #19144

Mecej4, was this Laipe or Laipe2? I did not try Laipe2; it may also need the neuLoop DLL to be linked. The site says that it is free, unless I am missing something.

16 Mar 2017 12:04 #19145

I believe that it is Laipe2+Neuloop4. I found it in the GCC/Gfortran 6.2 distribution from equation.com. Only the static libraries are provided. Here is the example code. The build command is in the first line as a comment.

!     gfortran -fdollar-ok -g laibnd.f -llaipe2 -lneuloop4
      Program XLAIBND
      implicit none
! *** Example program ***
! define variables where the length of A is determined by equation (2.2)
!

      integer*4, parameter :: N = 7
      integer*4, parameter :: LowerBandwidth=2
      real*4 :: A((N-1)*LowerBandwidth+N), X(N)
      integer*4 :: NoGood
      DATA X/21.0,141.0,2.0,9.0,333.0,1.0,3.0/
!
! input the lower triangular part of [A]
!

      CALL Input(A,LowerBandwidth)

!
! decompose in parallel
!

      CALL laipe$decompose_CSP_4(A,N,LowerBandwidth, NoGood)

!
! stop if NoGood=1
!

      IF(NoGood.eq.1) STOP 'Cannot be decomposed'

!
! perform substitutions in parallel
!

      CALL laipe$substitute_CSP_4(A,N,LowerBandwidth,X)

!
! output decomposed matrix
!

      CALL Output(A,N,LowerBandwidth)

!
! output the solution
!

      Write(*,'('' Solution is as:'')')
      Write(*,*) X

!
! end of the program
!

      CALL laipe$done
      STOP
      END

      SUBROUTINE Input(A,LowerBandwidth)
!
!
! routine to demonstrate application of data storage scheme
! (A)FORTRAN CALL: CALL Input(A,LowerBandwidth)
! 1.A: <R4> profile of matrix [A], dimension(*)
! 2.LowerBandwidth: <I4> lower bandwidth
!

!
! dummy arguments
!
      INTEGER*4 :: LowerBandwidth
      REAL*4 :: A(LowerBandwidth,1)

!
! input
!

      A(1,1) = 1.0
      A(2,1) = 4.0
      A(3,1) = 2.0
      A(2,2) = 25.0
      A(3,2) = 29.0
      A(4,2) = 9.0
      A(3,3) = 88.0
      A(4,3) = 34.0
      A(5,3) = 3.0
      A(4,4) = 89.0
      A(5,4) = 23.0
      A(6,4) = 11.0
      A(5,5) = 45.0
      A(6,5) = 7.0
      A(7,5) = 3.0
      A(6,6) = 22.0
      A(7,6) = 2.0
      A(7,7) = 9.0

!
      RETURN
      END

      SUBROUTINE Output(A,N,LowerBandwidth)
!
!
! routine to output the decomposed matrix by data storage scheme
! (A)FORTRAN CALL: CALL Output(A,N,LowerBandwidth)
! 1.A: <R4> profile of matrix [A], dimension(*)
! 2.N: <I4> order of square matrix [A]
! 3.LowerBandwidth: <I4> lower bandwidth
!

!
! dummy arguments
!

      INTEGER*4 :: N,LowerBandwidth
      REAL*4 :: A(LowerBandwidth,1)

!
! local variables
!

      INTEGER*4 :: Column,Row
!
! output the coefficients of decomposed matrix
!

      WRITE(*,'('' Row Column Coefficient'')')
      DO Column = 1,N
         DO Row = Column, MIN0(Column+LowerBandwidth,N)
            WRITE(*,'(I4,I6,F9.3)') Row, Column, A(Row,Column)
         END DO
      END DO
!
      RETURN
      END

16 Mar 2017 4:24 #19150

Yea, there is something I do not like in this test... For example, there was no setting of the number of threads or cores. The matrix is in old array syntax, which may need the /oldarray option of FTN95. Better to take the dense matrix case first, and when that starts working, come back to this case again.
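If I remember the Laipe manual correctly (worth verifying there, so treat the routine name as my assumption), the core count is requested with a call like this before the decompose/substitute calls:

      CALL laipe$use(4)    ! assumed Laipe call: ask the library to use 4 cores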

31 Jul 2017 8:32 #19919

How do I identify the version of my current FTN95 package?

Dan

31 Jul 2017 8:48 #19920

Hm. Ftn95 /ver gives version 7.20 on the newly installed personal edition.

31 Jul 2017 9:15 #19921

Sorry, I get 8.10 now. Please ignore these messages. 😦
