Topic: Skyline solver accelerated 42 times on 48 procesdors in General

mecej4

Posts: 1912

Back to Top

23 Mar 2017 10:25 #19228

The 64-bit Laipe2 library that is included with Equation.com's recent GFortran distributions (6.2, 6.3) contains SSE2 instructions. Which version did you test?

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

23 Mar 2017 1:01 #19231

Mecej4, From the Intel's link you have mentioned above out of 4-5 different software packages I installed only Intel MKL 2017 update 2. Was really surprised that Intel offer all of them for free. May try different package version, saw that this specific update had problems linking for some people. Do not see MKL directory \Program Files (x86)\IntelSWTools in the System/Environment Variables/path, may be need a reboot, but I can not reboot now...

mecej4

Posts: 1912

Back to Top

23 Mar 2017 1:10 #19232

That MKL package is fine. Intel's releasing MKL in a 'community' edition is a recent development.

If you do not have some version of Visual Studio 2015 installed, you will not have the support libraries and DLLs needed to make MKL work. My usual suggestion is to

(i) install VS 2015 community edition, if needed

(ii) test that you can build some C programs using VC and only then

(iii) install MKL or Parallel Studio.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

23 Mar 2017 11:15 (Edited: 24 Mar 2017 1:19) #19236

Tried reinstall MKL, install DevStudio, uninstalled, took different version of MKL was not free and required license key, etc etc etc until I realized that nothing that was needed and I just manually put the path ...hell knows where was my damn error...Devilry. Anyway I returned to your initial bat file.

Also, huge mess is this Intel, different versions do things differently with environment variables, their own tests do not work complaining of missing this LIB, missing that DLL...

ftn95 tlapack.f90 /64 /debug /check /free /err /set_error_level error 298  /no_truncate /zeroise >a_FTN95___

slink64  tlapack.obj 'c:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.1.143\windows\redist\intel64\mkl\mkl_rt.dll' /file:tlapack.exe  >a_link___

Same DIR was added to the path in the System/Environment Variables

JohnCampbell

Posts: 2526 Sydney

Back to Top

23 Mar 2017 11:15 #19237

Quoted from mecej4 Which version did you test?

I am comparing equation.com's reported single thread performance of the Intel Xeon and AMD and comparing to what I can achieve on my Intel i5 and i7 processors using basic DO loop code or MATMUL intrinsic. The only way I could get that performance would be to exclude vector instructions. Laipe2 appears to be too slow.

mecej4

Posts: 1912

Back to Top

24 Mar 2017 10:57 (Edited: 11 Aug 2019 8:04) #19239

Quoted from JohnCampbell

I am comparing equation.com's reported single thread performance of the Intel Xeon and AMD and comparing to what I can achieve on my Intel i5 and i7 processors using basic DO loop code or MATMUL intrinsic. The only way I could get that performance would be to exclude vector instructions.

Would it be correct to conclude that you did not actually run programs using Laipe1 or Laipe2, but are estimating what times Laipe might yield on your computer(s)?

Laipe2-64 bit definitely uses SSE2 for floating point operations. Here is proof: The EXE produced by Gfortran 6.2-64-bit for Dan's dense random square matrix problem (Laipe2 static library linked) yields this:

s:\sparse\LAIPE>objdump -d a.exe | findstr /i fadd
  40fadd:       48 8b 84 24 a8 01 00    mov    0x1a8(%rsp),%rax
  42f5b0:       d8 05 a2 dd 01 00       fadds  0x1dda2(%rip)        # 0x44d358
  4305c0:       d8 05 92 cd 01 00       fadds  0x1cd92(%rip)        # 0x44d358

Surely, one cannot implement Gaussian elimination with just two FADD instructions? Furthermore, the benchmark is for double precision matrices, and what you see here are single precision FADDS instructions, probably from some RTL routine.

Here are more timing results, with Dan's timings added for comparison:

               T W O   C O R E I5 - 4200U              I7 - 4770K
       64-bit     32-bit       32-bit       64-bit     32-bit
  N     MKL        MKL         Laipe1       Laipe2     Laipe1 (DanR)
----   -----      -----        ------        -----     ----
 500   0.110      0.000         0.062        0.031
 750   0.000      0.016         0.141        0.110
1000   0.031      0.047         0.328        0.219     0.09
1250   0.047      0.078         0.610        0.407
1500   0.078      0.093         1.000        0.656
1750   0.078      0.125         1.594        1.015
2000   0.110      0.203         2.285        1.469     0.75
2250   0.172      0.250         3.401        2.125
2500   0.234      0.375         4.509        2.875
2750   0.344      0.484         6.547        3.735
3000   0.375      0.641         7.625        4.796     2.44
3250   0.468      0.828        10.283        6.328
3500   0.578      1.054        12.580        7.734
3750   0.703      1.223        15.933        9.609
4000   0.892      1.422        19.148       11.422     5.90

mecej4

Posts: 1912

Back to Top

24 Mar 2017 11:07 #19240

Quoted from DanRRight

Also, huge mess is this Intel, different versions do things differently with environment variables, their own tests do not work complaining of missing this LIB, missing that DLL...
ftn95 tlapack.f90 /64 /debug /check /free /err /set_error_level error 298  /no_truncate /zeroise >a_FTN95___

slink64  tlapack.obj 'c:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017.1.143\windows\redist\intel64\mkl\mkl_rt.dll' /file:tlapack.exe  >a_link___
Same DIR was added to the path in the System/Environment Variables

Dan, did you finally get the program built and did you run it? You redirected the error messages to files, and forgot to post the contents of those files!

For the purposes of this test, you do not need any of the compiler options that you used, especially /check and /zeroise.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

24 Mar 2017 11:59 #19241

Yes, dense matrix test works. Post these results? Settings like /check etc for compilation were of course not needed in this case, I just took some example of BAT file, I recommend to always use these settings when you debug, they catch some very tricky errors specifically if you don't use 'implicit none' like I mostly do

mecej4

Posts: 1912

Back to Top

24 Mar 2017 12:08 #19242

Quoted from DanRRight Yes, dense matrix test works. I recommend to always use these settings when you debug, they catch some very tricky errors specifically if you don't use 'implicit none' like I mostly do

Debug is fine for initial testing. For timing runs, however, in general you do not want checking enabled. In this particular test, since only the time spent in the library is measured, and the library routines do almost no checking, whether you specify /debug may not matter.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

24 Mar 2017 2:55 #19245

Mecej4, Dense matrix treating was literally killing. Do you by chance know, after getting through the forest of thousands of Matrix Market examples, which MKL subroutine is closest by functionality to Laipe's VAG ? Interesting how sparse matrix cases are handled by MKL.

The simplest case of sparse block matrix test would be with two matrix blocks say by 3000x3000 matrices (5999 equations, numbers are in black areas) like that

or in a bit more general way any size with the similar shape starting from smallest with just two 2x2 blocks (total 3 equations))

In the past the move from textbook linear equation solvers to parallel one like LAIPE gave one order of magnitude speed boost. And now move to MKL may give additional order of magnitude. Together with much faster 64bit FTN95 and input/output speed boosts we discussed few months back I may get unbelievable increase of code performance by two orders of magnitude from original. This newsgroup is amazing.

mecej4

Posts: 1912

Back to Top

24 Mar 2017 4:45 #19246

There are, as far as I know, no special routines in MKL for sparse matrices where the nonzeros are confined to recognizable blocks. From matrix theory, we know that many of the properties of matrices of real-number elements also apply to matrix (i.e., block) elements.

I have not had much experience with block matrices. Have you read the book 'Matrix Computations' by Golub and Van Loan? They cover such topics very well.

People who work with splines have developed solvers for ABD (Almost Block Diagonal) matrices.

MKL includes a version of Pardiso. It also contains a DSS interface to Pardiso. I think that it is worth trying out Pardiso on some of your block-sparse matrices before looking for special block solvers. If you send me a couple of big (say, 10K X 10K) matrices in COO or CSR format, I'll try out Pardiso on them and compare with Laipe timings. That should help you decide if Pardiso will suffice for your purposes.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

24 Mar 2017 5:58 #19247

I suggest to use for the matrix A and vector B the same approach like it was used with dense matrix case and fill them just with the random numbers. The code in this case is almost the same too. This will create two squares on the major diagonal like was plotted above

program BlockSparse
 implicit none 
 integer :: i,j,neq,nrhs=1,lda,ldb, info 
 real*8,allocatable :: A(:,:),b(:) 
 integer, allocatable :: piv(:) 
 Integer count_0, count_1, count_rate, count_max 
 ! 
 do neq=500,6000,500 
    lda=neq; ldb=neq 
    allocate(A(neq,neq),b(neq),piv(neq)) 

      A(:,:) = 0
      B(:) = 0

      do j=1,neq/2 	
       do i=1,neq/2
        A(i,j) =random()
       enddo
      enddo

      do j=neq/2, neq
       do i=neq/2,neq
        A(i,j) =random()
       enddo
      enddo
	
    call random_number(B) 

    Call system_clock(count_0, count_rate, count_max) 
    CALL ....
    Call system_clock(count_1, count_rate, count_max) 
    Write (*, '(1x,A,i6,A,2x,F8.3,A)') 'nEqu = ',nEq,' ', & 
         dble(count_1-count_0)/count_rate, ' s' 
    deallocate(A,b,piv) 
 end do 
 end program

mecej4

Posts: 1912

Back to Top

24 Mar 2017 10:25 #19250

With four threads on my i7-2720QM laptop, I get:

   n      run time (s)
-----    ---------
  500     0.125
 1000     0.140
 1500     0.312
 2000     0.547
 2500     0.828
 3000     1.203
 3500     1.640
 4000     2.260
 4500     2.860
 5000     3.618
 5500     4.691
 6000     5.384

Here is the MKL-Pardiso code

program tdan2sq
implicit none
integer :: i,j,k,m,n,nnz,count_0,count_1,count_max,count_rate
integer*8 pt(64)
integer iparm(64),idum,maxfct,mnum,mtype,phase,nrhs,error,msglvl
double precision, allocatable :: A(:),b(:),x(:)
integer, allocatable :: ia(:),ja(:)
double precision dum(1)
!
do n=500,6000,500
   nnz=(n+1)*(n+1)/2
   allocate (A(nnz),b(n),x(n))
   k=1
   call random_number(A)
   call random_number(b)
   allocate(ia(n+1),ja(nnz))
   do i=1,n/2
      ia(i)=k
      do j=1,n/2
         ja(k)=j; k=k+1
      end do
   end do
   k=k-1
   do i=n/2,n
      do j=n/2,n
         ja(k)=j; k=k+1
      end do
      ia(i+1)=k
   end do
   iparm(1:64)=0
   iparm(1:3)=[1, 2, 4]
   iparm(8)=9; iparm(10)=13; iparm(11)=1
   iparm(13)=1
   pt = 0; msglvl=0; mtype=11; maxfct=1
   mnum=1; nrhs=1; phase=13
   Call system_clock(count_0, count_rate, count_max)
   CALL pardiso (pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
       idum, nrhs, iparm, msglvl, b, x, error)
   phase = -1
   CALL pardiso (pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
       idum, nrhs, iparm, msglvl, b, x, error)
   Call system_clock(count_1, count_rate, count_max)
   Write (*, '(1x,I4,3x,F7.3)') n, dble(count_1-count_0)/count_rate
   deallocate(A,b,x,ia,ja)
end do
end program

JohnCampbell

Posts: 2526 Sydney

Back to Top

25 Mar 2017 12:07 #19251

Dan,

For your block matrix, I would expect that a skyline approach would be effective, as the operations count would be the same as for any block specific approach.

The disadvantage of a skyline solver is that they typically do not have any pivoting capability. The skyline solvers I am familiar with are for symmetric matrices, and apply artificial constraints to (near) zero diagonals, which is not a problem for the Finite Element matrices I am using.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

25 Mar 2017 5:30 (Edited: 25 Mar 2017 9:54) #19256

Broke again my favorite glasses with UV coating (too fragile), but good that this pushed me to find lost 3 months ago (!) spare ones without coating. The coating specifically made for monitors, it clearly changes tint, that means it starts screening even some far blue. I think it is important to have that coating if we sit 10+ hours in front of larger and larger screens, some of which emit dangerous UV spectrum transforming it to blue which may leak enough to do harm over large time (not clear how much monitor front glass leaks UV, better to do clean experiment and measure the spectrum of LED and OLED monitors)

Mecej4, many thanks again, you have done great job and I owe you now even more. Have you noticed that while MKL dense matrix solver consumes almost 100% of CPU time in Task Manager the sparse one grabs only ~40% ? LAIPE2 dense case grabs even less, why it is slower. Sparse LAIPE2 though get closer to 100%

Anyway, here are preliminary test results (in seconds) for the block sparse matrix. MKL wins on larger size matrices >3000 but loses on smaller ones (processor I7-4770K 4.5Ghz)

        MKLsparse  LAIPE2
  500     0.191     0.006
 1000     0.062     0.027
 1500     0.157     0.083
 2000     0.360     0.189
 2500     0.424     0.398
 3000     0.680     0.673
 3500     0.951     1.158
 4000     1.176     1.745
 4500     1.654     2.476
 5000     2.034     3.537
 5500     2.482     4.652
 6000     3.024     6.048

Dense matrix MKL solver despite of twice larger amount of matrix elements was performing 3-4x faster on larger matrix sizes and 10x faster on smaller then sparse matrix MKL. It also brutally raped LAIPE. Is there any other solver for sparse matrices?

         MKLdense  LAIPE2
 500      0.005     0.019
1000      0.005     0.075
1500      0.024     0.205
2000      0.037     0.442
2500      0.085     0.837
3000      0.112     1.389
3500      0.218     2.197
4000      0.301     3.281
4500      0.395     4.502
5000      0.614     6.121
5500      0.644     8.201
6000      0.865    10.516

John, it might be worth to check skyline solver but that needs extensive testing in real sparsity. My real matrix could consist of block components of different sizes with dozen of blocks as small as 2x2, 100x100 to few larger ones 1000x1000, 2000x2000 and 3000x3000 ( to the total up to 10000x10000 and sometimes 20000x20000).

mecej4

Posts: 1912

Back to Top

25 Mar 2017 9:45 #19257

Dan, I think you will find it interesting to do a log-log plot of solution time against N, perhaps after removing data that took less than 0.1 s to run.

For full matrices, one expects O(N^{3) for Gaussian elimination. Your Laipe2 results give about O(N}2.85), whereas MKL gives O(N^1.45). The difference in the exponents is striking. Perhaps that is not a fair comparison, because Pardiso has been asked to do equation reordering, but Laipe2 was not asked to do the same.

If your ultimate goal is to solve block matrix equations with N = 10^4 or more, Laipe2 may be too slow.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

25 Mar 2017 9:56 #19259

That scaling is indeed good.

       MKLsparse  LAIPE2
10000    9.3       31.1
20000   54.2      326.7

Which parameter controls reordering? I usually do not need that, physics helped here making largest elements on major diagonal. If killing reordering substantially increases speed i'd take that, usually not much. Good would be to get small matrices run faster, as they are used more often.

mecej4

Posts: 1912

Back to Top

25 Mar 2017 11:22 #19260

The value in iparm(2) controls reordering. You cannot turn reordering off in Pardiso -- you can choose which reordering algorithm to use.

When you factorise a banded matrix the factors have the same bandwidths as the original matrix, but the fill-in within the band is affected by reordering. You should probably try Metis reordering, and see if asking Laipe to reorder helps to increase its speed.

JohnCampbell

Posts: 2526 Sydney

Back to Top

26 Mar 2017 12:36 #19261

Dan,

If you are using MKL with reordering, you need to be careful defining the initial matrix, especially the % non-zero within the profile. It is interesting that MKL will attempt re-ordering of a general sparse matrix on a per equation basis. I would have thought that to be unproductive, but the MKL results are impressive. (How does it manage the unknown increase in storage required?) Lower % cpu can be an indicator that there is not good load sharing between the threads, which can be due to the specific sparsity characteristics of the test matrix. You may want to experiment with different %non-zero values within the profile. If you can define the profile of the test matrices, I could run them in my profile solver and report how they compare.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

26 Mar 2017 2:15 #19262

John, Just take simple 2-block matrix as I plotted above first (matrix A in the source below). Later you can make 3 or more such blocks and then takes them of different size. Here is the source for the Laipe VAG_8 I used in comparisons above. Compilation:

>ftn95 TestVAG2017a.f95 /64 /free /err /set_error_level error 298  /no_truncate 
>slink64  TestVAG2017a.obj laipe2_withVAG.dll /file:TestVAG2017a.exe  


module Kinex_Matrix
 real*8, dimension(:,:), allocatable :: A    
 real*8, dimension(:),   allocatable :: B, C, X 
 integer :: NoGood, i, j, neq 
 integer*4, dimension(:), allocatable :: Beginning, Label, Last
 Integer count_0, count_1, count_rate, count_max, ncore 
end module Kinex_Matrix
!..................................................

Program LAIPE_VAG8 
use Kinex_Matrix
include 'laipe2_withVAG.inc'
real*8, dimension(:),   allocatable :: A1

call laipe$getce(ncore) 
call laipe$use(ncore) 

DO neq=500,6000,500   ! DO neq=10000,20000,10000 

   iDimA1 = neq*neq
   allocate( b(neq), x(neq), beginning(neq), last(neq), label(neq)) 
   allocate ( A1(iDimA1), stat = iAllocA1)

   call Input (A1)

   Call system_clock(count_0, count_rate, count_max) 

   CALL laipe$Decompose_VAG_8(A1, Neq, Label, Last, NoGood)
   IF (NoGood /= 0) STOP 'LAIPE: Cannot decompose' 
   CALL laipe$Substitute_VAG_8 (A1, Neq, Label, Last, B)

   Call system_clock(count_1, count_rate, count_max) 
   Write (*, '(1x,A,i6,A,2x,F8.3,A)') 'nEqu = ',nEq,' ', & 
        dble(count_1-count_0)/count_rate, ' s' 
   deallocate(A1,B,X, beginning, last, label) 

end DO ! DO neq=500,6000,500 

call laipe$done() 

end program 

!........................................
Subroutine Input (A1)
use Kinex_Matrix
Real*8 A1(1,1)

   allocate(A(neq,neq))
   A(:,:) = 0 
   B(:) = 0 

   do j=1,neq/2     
     beginning(j) = 1
     last(j)      = neq/2
     do i=1,neq/2
       call random_number(A(i,j))
     enddo 
   enddo 

   do j=neq/2, neq 
     beginning(j) = neq/2
     last(j)      = neq
     do i=neq/2,neq 
       call random_number(A(i,j))
     enddo 
   enddo 
   beginning(neq/2) = 1

   Label(1) = 1
   do i=2,neq
     Label(i) = Label(i-1) + Last(i-1) - Beginning(i) + 1 
   enddo

   call random_number(B)

   call TwoDarray_Into1Darray (A1)
   deallocate(A) 

end Subroutine 

!..............................................
 Subroutine TwoDarray_Into1Darray (A1)
 use Kinex_Matrix
 Real*8 A1(1,1)

  Do j=1,Neq
    do ii = Beginning(j), Last(j)
      A1(ii,Label(j) ) = A(ii,j)
    enddo
  enddo

 End subroutine