replica nfl jerseysreplica nfl jerseyssoccer jerseyreplica nfl jerseys forums.silverfrost.com :: View topic - Skyline solver accelerated 42 times on 48 procesdors
forums.silverfrost.com Forum Index forums.silverfrost.com
Welcome to the Silverfrost forums
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Skyline solver accelerated 42 times on 48 procesdors
Goto page Previous  1, 2, 3, 4  Next
 
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> General
View previous topic :: View next topic  
Author Message
mecej4



Joined: 31 Oct 2006
Posts: 1899

PostPosted: Fri Mar 24, 2017 5:45 pm    Post subject: Reply with quote

There are, as far as I know, no special routines in MKL for sparse matrices where the nonzeros are confined to recognizable blocks. From matrix theory, we know that many of the properties of matrices of real-number elements also apply to matrix (i.e., block) elements.

I have not had much experience with block matrices. Have you read the book "Matrix Computations" by Golub and Van Loan? They cover such topics very well.

People who work with splines have developed solvers for ABD (Almost Block Diagonal) matrices.

MKL includes a version of Pardiso. It also contains a DSS interface to Pardiso. I think that it is worth trying out Pardiso on some of your block-sparse matrices before looking for special block solvers. If you send me a couple of big (say, 10K X 10K) matrices in COO or CSR format, I'll try out Pardiso on them and compare with Laipe timings. That should help you decide if Pardiso will suffice for your purposes.
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2923
Location: South Pole, Antarctica

PostPosted: Fri Mar 24, 2017 6:58 pm    Post subject: Reply with quote

I suggest to use for the matrix A and vector B the same approach like it was used with dense matrix case and fill them just with the random numbers. The code in this case is almost the same too. This will create two squares on the major diagonal like was plotted above

Code:
program BlockSparse
 implicit none
 integer :: i,j,neq,nrhs=1,lda,ldb, info
 real*8,allocatable :: A(:,:),b(:)
 integer, allocatable :: piv(:)
 Integer count_0, count_1, count_rate, count_max
 !
 do neq=500,6000,500
    lda=neq; ldb=neq
    allocate(A(neq,neq),b(neq),piv(neq))

      A(:,:) = 0
      B(:) = 0

      do j=1,neq/2    
       do i=1,neq/2
        A(i,j) =random()
       enddo
      enddo

      do j=neq/2, neq
       do i=neq/2,neq
        A(i,j) =random()
       enddo
      enddo
   
    call random_number(B)

    Call system_clock(count_0, count_rate, count_max)
    CALL ....
    Call system_clock(count_1, count_rate, count_max)
    Write (*, '(1x,A,i6,A,2x,F8.3,A)') 'nEqu = ',nEq,' ', &
         dble(count_1-count_0)/count_rate, ' s'
    deallocate(A,b,piv)
 end do
 end program
Back to top
View user's profile Send private message
mecej4



Joined: 31 Oct 2006
Posts: 1899

PostPosted: Fri Mar 24, 2017 11:25 pm    Post subject: Reply with quote

With four threads on my i7-2720QM laptop, I get:

Code:

   n      run time (s)
-----    ---------
  500     0.125
 1000     0.140
 1500     0.312
 2000     0.547
 2500     0.828
 3000     1.203
 3500     1.640
 4000     2.260
 4500     2.860
 5000     3.618
 5500     4.691
 6000     5.384

Here is the MKL-Pardiso code
Code:

program tdan2sq
implicit none
integer :: i,j,k,m,n,nnz,count_0,count_1,count_max,count_rate
integer*8 pt(64)
integer iparm(64),idum,maxfct,mnum,mtype,phase,nrhs,error,msglvl
double precision, allocatable :: A(:),b(:),x(:)
integer, allocatable :: ia(:),ja(:)
double precision dum(1)
!
do n=500,6000,500
   nnz=(n+1)*(n+1)/2
   allocate (A(nnz),b(n),x(n))
   k=1
   call random_number(A)
   call random_number(b)
   allocate(ia(n+1),ja(nnz))
   do i=1,n/2
      ia(i)=k
      do j=1,n/2
         ja(k)=j; k=k+1
      end do
   end do
   k=k-1
   do i=n/2,n
      do j=n/2,n
         ja(k)=j; k=k+1
      end do
      ia(i+1)=k
   end do
   iparm(1:64)=0
   iparm(1:3)=[1, 2, 4]
   iparm(8)=9; iparm(10)=13; iparm(11)=1
   iparm(13)=1
   pt = 0; msglvl=0; mtype=11; maxfct=1
   mnum=1; nrhs=1; phase=13
   Call system_clock(count_0, count_rate, count_max)
   CALL pardiso (pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
       idum, nrhs, iparm, msglvl, b, x, error)
   phase = -1
   CALL pardiso (pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
       idum, nrhs, iparm, msglvl, b, x, error)
   Call system_clock(count_1, count_rate, count_max)
   Write (*, '(1x,I4,3x,F7.3)') n, dble(count_1-count_0)/count_rate
   deallocate(A,b,x,ia,ja)
end do
end program
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2615
Location: Sydney

PostPosted: Sat Mar 25, 2017 1:07 am    Post subject: Reply with quote

Dan,

For your block matrix, I would expect that a skyline approach would be effective, as the operations count would be the same as for any block specific approach.

The disadvantage of a skyline solver is that they typically do not have any pivoting capability. The skyline solvers I am familiar with are for symmetric matrices, and apply artificial constraints to (near) zero diagonals, which is not a problem for the Finite Element matrices I am using.
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2923
Location: South Pole, Antarctica

PostPosted: Sat Mar 25, 2017 6:30 pm    Post subject: Reply with quote

Broke again my favorite glasses with UV coating (too fragile), but good that this pushed me to find lost 3 months ago (!) spare ones without coating. The coating specifically made for monitors, it clearly changes tint, that means it starts screening even some far blue. I think it is important to have that coating if we sit 10+ hours in front of larger and larger screens, some of which emit dangerous UV spectrum transforming it to blue which may leak enough to do harm over large time (not clear how much monitor front glass leaks UV, better to do clean experiment and measure the spectrum of LED and OLED monitors)

Mecej4, many thanks again, you have done great job and I owe you now even more. Have you noticed that while MKL dense matrix solver consumes almost 100% of CPU time in Task Manager the sparse one grabs only ~40% ? LAIPE2 dense case grabs even less, why it is slower. Sparse LAIPE2 though get closer to 100%

Anyway, here are preliminary test results (in seconds) for the block sparse matrix. MKL wins on larger size matrices >3000 but loses on smaller ones (processor I7-4770K 4.5Ghz)
Code:

        MKLsparse  LAIPE2
  500     0.191     0.006
 1000     0.062     0.027
 1500     0.157     0.083
 2000     0.360     0.189
 2500     0.424     0.398
 3000     0.680     0.673
 3500     0.951     1.158
 4000     1.176     1.745
 4500     1.654     2.476
 5000     2.034     3.537
 5500     2.482     4.652
 6000     3.024     6.048


Dense matrix MKL solver despite of twice larger amount of matrix elements was performing 3-4x faster on larger matrix sizes and 10x faster on smaller then sparse matrix MKL. It also brutally raped LAIPE. Is there any other solver for sparse matrices?

Code:
 
         MKLdense  LAIPE2
 500      0.005     0.019
1000      0.005     0.075
1500      0.024     0.205
2000      0.037     0.442
2500      0.085     0.837
3000      0.112     1.389
3500      0.218     2.197
4000      0.301     3.281
4500      0.395     4.502
5000      0.614     6.121
5500      0.644     8.201
6000      0.865    10.516


John, it might be worth to check skyline solver but that needs extensive testing in real sparsity. My real matrix could consist of block components of different sizes with dozen of blocks as small as 2x2, 100x100 to few larger ones 1000x1000, 2000x2000 and 3000x3000 ( to the total up to 10000x10000 and sometimes 20000x20000).


Last edited by DanRRight on Sat Mar 25, 2017 10:54 pm; edited 3 times in total
Back to top
View user's profile Send private message
mecej4



Joined: 31 Oct 2006
Posts: 1899

PostPosted: Sat Mar 25, 2017 10:45 pm    Post subject: Reply with quote

Dan, I think you will find it interesting to do a log-log plot of solution time against N, perhaps after removing data that took less than 0.1 s to run.

For full matrices, one expects O(N^3) for Gaussian elimination. Your Laipe2 results give about O(N^2.85), whereas MKL gives O(N^1.45). The difference in the exponents is striking. Perhaps that is not a fair comparison, because Pardiso has been asked to do equation reordering, but Laipe2 was not asked to do the same.

If your ultimate goal is to solve block matrix equations with N = 10^4 or more, Laipe2 may be too slow.
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2923
Location: South Pole, Antarctica

PostPosted: Sat Mar 25, 2017 10:56 pm    Post subject: Reply with quote

That scaling is indeed good.
Code:

       MKLsparse  LAIPE2
10000    9.3       31.1
20000   54.2      326.7


Which parameter controls reordering? I usually do not need that, physics helped here making largest elements on major diagonal. If killing reordering substantially increases speed i'd take that, usually not much. Good would be to get small matrices run faster, as they are used more often.
Back to top
View user's profile Send private message
mecej4



Joined: 31 Oct 2006
Posts: 1899

PostPosted: Sun Mar 26, 2017 12:22 am    Post subject: Reply with quote

The value in iparm(2) controls reordering. You cannot turn reordering off in Pardiso -- you can choose which reordering algorithm to use.

When you factorise a banded matrix the factors have the same bandwidths as the original matrix, but the fill-in within the band is affected by reordering. You should probably try Metis reordering, and see if asking Laipe to reorder helps to increase its speed.
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2615
Location: Sydney

PostPosted: Sun Mar 26, 2017 1:36 am    Post subject: Reply with quote

Dan,

If you are using MKL with reordering, you need to be careful defining the initial matrix, especially the % non-zero within the profile.
It is interesting that MKL will attempt re-ordering of a general sparse matrix on a per equation basis. I would have thought that to be unproductive, but the MKL results are impressive. (How does it manage the unknown increase in storage required?)
Lower % cpu can be an indicator that there is not good load sharing between the threads, which can be due to the specific sparsity characteristics of the test matrix. You may want to experiment with different %non-zero values within the profile.
If you can define the profile of the test matrices, I could run them in my profile solver and report how they compare.
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2923
Location: South Pole, Antarctica

PostPosted: Sun Mar 26, 2017 3:15 am    Post subject: Reply with quote

John, Just take simple 2-block matrix as I plotted above first (matrix A in the source below). Later you can make 3 or more such blocks and then takes them of different size.
Here is the source for the Laipe VAG_8 I used in comparisons above. Compilation:
Code:
>ftn95 TestVAG2017a.f95 /64 /free /err /set_error_level error 298  /no_truncate
>slink64  TestVAG2017a.obj laipe2_withVAG.dll /file:TestVAG2017a.exe 


Code:
module Kinex_Matrix
 real*8, dimension(:,:), allocatable :: A   
 real*8, dimension(:),   allocatable :: B, C, X
 integer :: NoGood, i, j, neq
 integer*4, dimension(:), allocatable :: Beginning, Label, Last
 Integer count_0, count_1, count_rate, count_max, ncore
end module Kinex_Matrix
!..................................................

Program LAIPE_VAG8
use Kinex_Matrix
include 'laipe2_withVAG.inc'
real*8, dimension(:),   allocatable :: A1

call laipe$getce(ncore)
call laipe$use(ncore)

DO neq=500,6000,500   ! DO neq=10000,20000,10000

   iDimA1 = neq*neq
   allocate( b(neq), x(neq), beginning(neq), last(neq), label(neq))
   allocate ( A1(iDimA1), stat = iAllocA1)

   call Input (A1)

   Call system_clock(count_0, count_rate, count_max)

   CALL laipe$Decompose_VAG_8(A1, Neq, Label, Last, NoGood)
   IF (NoGood /= 0) STOP 'LAIPE: Cannot decompose'
   CALL laipe$Substitute_VAG_8 (A1, Neq, Label, Last, B)

   Call system_clock(count_1, count_rate, count_max)
   Write (*, '(1x,A,i6,A,2x,F8.3,A)') 'nEqu = ',nEq,' ', &
        dble(count_1-count_0)/count_rate, ' s'
   deallocate(A1,B,X, beginning, last, label)

end DO ! DO neq=500,6000,500

call laipe$done()

end program

!........................................
Subroutine Input (A1)
use Kinex_Matrix
Real*8 A1(1,1)

   allocate(A(neq,neq))
   A(:,:) = 0
   B(:) = 0

   do j=1,neq/2     
     beginning(j) = 1
     last(j)      = neq/2
     do i=1,neq/2
       call random_number(A(i,j))
     enddo
   enddo

   do j=neq/2, neq
     beginning(j) = neq/2
     last(j)      = neq
     do i=neq/2,neq
       call random_number(A(i,j))
     enddo
   enddo
   beginning(neq/2) = 1

   Label(1) = 1
   do i=2,neq
     Label(i) = Label(i-1) + Last(i-1) - Beginning(i) + 1
   enddo

   call random_number(B)

   call TwoDarray_Into1Darray (A1)
   deallocate(A)

end Subroutine

!..............................................
 Subroutine TwoDarray_Into1Darray (A1)
 use Kinex_Matrix
 Real*8 A1(1,1)

  Do j=1,Neq
    do ii = Beginning(j), Last(j)
      A1(ii,Label(j) ) = A(ii,j)
    enddo
  enddo

 End subroutine
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2923
Location: South Pole, Antarctica

PostPosted: Sun Mar 26, 2017 3:55 am    Post subject: Reply with quote

John,
The code above is a bit messy, so I will probably repeat how the matrix was set by simpler way
Code:
 integer :: neq
  real*8,allocatable :: A(:,:), B (:)

  do neq=500,6000,500

     allocate(A(neq,neq),b(neq)

       A(:,:) = 0
       B(:) = 0

       do j=1,neq/2     
        do i=1,neq/2
         A(i,j) =random()
        enddo
       enddo

       do j=neq/2, neq
        do i=neq/2,neq
         A(i,j) =random()
        enddo
       enddo
     
     call random_number(B)
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2615
Location: Sydney

PostPosted: Sun Mar 26, 2017 4:14 am    Post subject: Reply with quote

Dan,

Trying to review the solver times, I stumbled at the first hurdle, as you have used laipe$Decompose_VAG_8, while I can only perform laipe$Decompose_VSG_8 solutions.
With Variable-Bandwidth and Asymmetric Systems, I presume the solver first checks that you have provided for non-zero infill during reduction.
I haven't used MKL, so I presume non-zero infill is managed there also. Does the MKL sparse input format require this allowance ?
In the 32-bit days, allocating sufficient temporary space could be a significant issue ?

Are all your tests assuming non-symmetric matrices ?
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2923
Location: South Pole, Antarctica

PostPosted: Sun Mar 26, 2017 5:39 am    Post subject: Reply with quote

Blocks inside the matrix are squares which go along the major diagonal, so the geometrically the shape of matrix is symmetric, but matrix elements all are different, so formally it is not symmetric (not equal to transposed as a definition of symmetric).

Dense 20000x20000 real*8 matrix contains 3GB of data. My block sparse one 1/2 of that
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2923
Location: South Pole, Antarctica

PostPosted: Sun Mar 26, 2017 5:48 pm    Post subject: Reply with quote

Here is the test with single precision using LAIPE block sparse solver. I substituted VAG_8 by VAG_4 and all arrays with real*4. It shows twice the speed of VAG_8 above
Code:
            VAG_4   VAG_8   MKL_8   MKL_4
   500      0.005   0.006   0.191   0.117
  1000      0.020   0.027   0.062   0.061
  1500      0.051   0.083   0.157   0.140
  2000      0.113   0.189   0.360   0.267
  2500      0.222   0.398   0.424   0.487
  3000      0.373   0.673   0.680   0.591
  3500      0.571   1.158   0.951   0.797
  4000      0.846   1.745   1.176   1.257
  4500      1.250   2.476   1.654   1.514
  5000      1.704   3.537   2.034   1.775
  5500      2.277   4.652   2.482   2.413
  6000      2.950   6.048   3.024   2.615
 10000      15.43   31.10   9.268   9.326
 20000      151.5   326.7   54.21   40.99


Freaking "fun" is that I can not create the same test with older LAIPE circa 1997 made by the Microsoft Fortran which I used all this time with no problems. It crashes and crashes from the start. Can I say again and again: "programming is full of @#%% devilry"? Just one step aside and you are burned for hours where you expected less. Attention to tiny details, good memory and good mood/health are literally a must for programmers or you will do and redo things 1000 times...Am I the only who are experiencing this type of hiccups with some ups but often downs all the way? I never heard any complaints from colleagues of all ages to the same extreme extent which bothers me, and I'd like to believe that am not the most stupid person, or completely not suitable for this kind of work which I do all life and like it...


Last edited by DanRRight on Tue Mar 28, 2017 7:47 am; edited 7 times in total
Back to top
View user's profile Send private message
mecej4



Joined: 31 Oct 2006
Posts: 1899

PostPosted: Mon Mar 27, 2017 2:49 am    Post subject: Reply with quote

If by "older LAIPE circa 1997 made by the Microsoft Fortran" you mean a library made for use with Fortran Powerstation (FPS) 4, to call routines in the library you need to use the STDCALL convention. In the caller, you may need to add declarations for this purpose, or use compiler options, depending on whether your code is compiled with IFort or FTN95.

Using the FPS-4 compiler, I ran the code that I posted in my post of March 22, 2017 after changing the subroutine names to LAIPE-1 names and providing interface blocks for the three LAIPE subroutines that are called. The run times were roughly the same as with Intel Fortran, because most of the run time is spent in the LAIPE-1 library.
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> General All times are GMT + 1 Hour
Goto page Previous  1, 2, 3, 4  Next
Page 3 of 4

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group