
forums.silverfrost.com Welcome to the Silverfrost forums

View previous topic :: View next topic 
Author 
Message 
DanRRight
Joined: 10 Mar 2008 Posts: 1482 Location: South Pole, Antarctica

Posted: Fri Mar 24, 2017 12:59 pm Post subject: 


Yes, dense matrix test works. Post these results?
Settings like /check etc for compilation were of course not needed in this case, I just took some example of BAT file, I recommend to always use these settings when you debug, they catch some very tricky errors specifically if you don't use "implicit none" like I mostly do 

Back to top 


mecej4
Joined: 31 Oct 2006 Posts: 656

Posted: Fri Mar 24, 2017 1:08 pm Post subject: Re: 


DanRRight wrote:  Yes, dense matrix test works.
I recommend to always use these settings when you debug, they catch some very tricky errors specifically if you don't use "implicit none" like I mostly do 
Debug is fine for initial testing. For timing runs, however, in general you do not want checking enabled.
In this particular test, since only the time spent in the library is measured, and the library routines do almost no checking, whether you specify /debug may not matter. 

Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 1482 Location: South Pole, Antarctica

Posted: Fri Mar 24, 2017 3:55 pm Post subject: 


Mecej4, Dense matrix treating was literally killing. Do you by chance know, after getting through the forest of thousands of Matrix Market examples, which MKL subroutine is closest by functionality to Laipe's VAG ? Interesting how sparse matrix cases are handled by MKL.
The simplest case of sparse block matrix test would be with two matrix blocks say by 3000x3000 matrices (5999 equations, numbers are in black areas) like that
or in a bit more general way any size with the similar shape starting from smallest with just two 2x2 blocks (total 3 equations))
In the past the move from textbook linear equation solvers to parallel one like LAIPE gave one order of magnitude speed boost. And now move to MKL may give additional order of magnitude. Together with much faster 64bit FTN95 and input/output speed boosts we discussed few months back I may get unbelievable increase of code performance by two orders of magnitude from original. This newsgroup is amazing. 

Back to top 


mecej4
Joined: 31 Oct 2006 Posts: 656

Posted: Fri Mar 24, 2017 5:45 pm Post subject: 


There are, as far as I know, no special routines in MKL for sparse matrices where the nonzeros are confined to recognizable blocks. From matrix theory, we know that many of the properties of matrices of realnumber elements also apply to matrix (i.e., block) elements.
I have not had much experience with block matrices. Have you read the book "Matrix Computations" by Golub and Van Loan? They cover such topics very well.
People who work with splines have developed solvers for ABD (Almost Block Diagonal) matrices.
MKL includes a version of Pardiso. It also contains a DSS interface to Pardiso. I think that it is worth trying out Pardiso on some of your blocksparse matrices before looking for special block solvers. If you send me a couple of big (say, 10K X 10K) matrices in COO or CSR format, I'll try out Pardiso on them and compare with Laipe timings. That should help you decide if Pardiso will suffice for your purposes. 

Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 1482 Location: South Pole, Antarctica

Posted: Fri Mar 24, 2017 6:58 pm Post subject: 


I suggest to use for the matrix A and vector B the same approach like it was used with dense matrix case and fill them just with the random numbers. The code in this case is almost the same too. This will create two squares on the major diagonal like was plotted above
Code:  program BlockSparse
implicit none
integer :: i,j,neq,nrhs=1,lda,ldb, info
real*8,allocatable :: A(:,:),b(:)
integer, allocatable :: piv(:)
Integer count_0, count_1, count_rate, count_max
!
do neq=500,6000,500
lda=neq; ldb=neq
allocate(A(neq,neq),b(neq),piv(neq))
A(:,:) = 0
B(:) = 0
do j=1,neq/2
do i=1,neq/2
A(i,j) =random()
enddo
enddo
do j=neq/2, neq
do i=neq/2,neq
A(i,j) =random()
enddo
enddo
call random_number(B)
Call system_clock(count_0, count_rate, count_max)
CALL ....
Call system_clock(count_1, count_rate, count_max)
Write (*, '(1x,A,i6,A,2x,F8.3,A)') 'nEqu = ',nEq,' ', &
dble(count_1count_0)/count_rate, ' s'
deallocate(A,b,piv)
end do
end program 


Back to top 


mecej4
Joined: 31 Oct 2006 Posts: 656

Posted: Fri Mar 24, 2017 11:25 pm Post subject: 


With four threads on my i72720QM laptop, I get:
Code: 
n run time (s)
 
500 0.125
1000 0.140
1500 0.312
2000 0.547
2500 0.828
3000 1.203
3500 1.640
4000 2.260
4500 2.860
5000 3.618
5500 4.691
6000 5.384

Here is the MKLPardiso code
Code: 
program tdan2sq
implicit none
integer :: i,j,k,m,n,nnz,count_0,count_1,count_max,count_rate
integer*8 pt(64)
integer iparm(64),idum,maxfct,mnum,mtype,phase,nrhs,error,msglvl
double precision, allocatable :: A(:),b(:),x(:)
integer, allocatable :: ia(:),ja(:)
double precision dum(1)
!
do n=500,6000,500
nnz=(n+1)*(n+1)/2
allocate (A(nnz),b(n),x(n))
k=1
call random_number(A)
call random_number(b)
allocate(ia(n+1),ja(nnz))
do i=1,n/2
ia(i)=k
do j=1,n/2
ja(k)=j; k=k+1
end do
end do
k=k1
do i=n/2,n
do j=n/2,n
ja(k)=j; k=k+1
end do
ia(i+1)=k
end do
iparm(1:64)=0
iparm(1:3)=[1, 2, 4]
iparm(8)=9; iparm(10)=13; iparm(11)=1
iparm(13)=1
pt = 0; msglvl=0; mtype=11; maxfct=1
mnum=1; nrhs=1; phase=13
Call system_clock(count_0, count_rate, count_max)
CALL pardiso (pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
idum, nrhs, iparm, msglvl, b, x, error)
phase = 1
CALL pardiso (pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
idum, nrhs, iparm, msglvl, b, x, error)
Call system_clock(count_1, count_rate, count_max)
Write (*, '(1x,I4,3x,F7.3)') n, dble(count_1count_0)/count_rate
deallocate(A,b,x,ia,ja)
end do
end program



Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 1731 Location: Sydney

Posted: Sat Mar 25, 2017 1:07 am Post subject: 


Dan,
For your block matrix, I would expect that a skyline approach would be effective, as the operations count would be the same as for any block specific approach.
The disadvantage of a skyline solver is that they typically do not have any pivoting capability. The skyline solvers I am familiar with are for symmetric matrices, and apply artificial constraints to (near) zero diagonals, which is not a problem for the Finite Element matrices I am using. 

Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 1482 Location: South Pole, Antarctica

Posted: Sat Mar 25, 2017 6:30 pm Post subject: 


Broke again my favorite glasses with UV coating (too fragile), but good that this pushed me to find lost 3 months ago (!) spare ones without coating. The coating specifically made for monitors, it clearly changes tint, that means it starts screening even some far blue. I think it is important to have that coating if we sit 10+ hours in front of larger and larger screens, some of which emit dangerous UV spectrum transforming it to blue which may leak enough to do harm over large time (not clear how much monitor front glass leaks UV, better to do clean experiment and measure the spectrum of LED and OLED monitors)
Mecej4, many thanks again, you have done great job and I owe you now even more. Have you noticed that while MKL dense matrix solver consumes almost 100% of CPU time in Task Manager the sparse one grabs only ~40% ? LAIPE2 dense case grabs even less, why it is slower. Sparse LAIPE2 though get closer to 100%
Anyway, here are preliminary test results (in seconds) for the block sparse matrix. MKL wins on larger size matrices >3000 but loses on smaller ones (processor I74770K 4.5Ghz)
Code: 
MKLsparse LAIPE2
500 0.191 0.006
1000 0.062 0.027
1500 0.157 0.083
2000 0.360 0.189
2500 0.424 0.398
3000 0.680 0.673
3500 0.951 1.158
4000 1.176 1.745
4500 1.654 2.476
5000 2.034 3.537
5500 2.482 4.652
6000 3.024 6.048

Dense matrix MKL solver despite of twice larger amount of matrix elements was performing 34x faster on larger matrix sizes and 10x faster on smaller then sparse matrix MKL. It also brutally raped LAIPE. Is there any other solver for sparse matrices?
Code: 
MKLdense LAIPE2
500 0.005 0.019
1000 0.005 0.075
1500 0.024 0.205
2000 0.037 0.442
2500 0.085 0.837
3000 0.112 1.389
3500 0.218 2.197
4000 0.301 3.281
4500 0.395 4.502
5000 0.614 6.121
5500 0.644 8.201
6000 0.865 10.516 
John, it might be worth to check skyline solver but that needs extensive testing in real sparsity. My real matrix could consist of block components of different sizes with dozen of blocks as small as 2x2, 100x100 to few larger ones 1000x1000, 2000x2000 and 3000x3000 ( to the total up to 10000x10000 and sometimes 20000x20000).
Last edited by DanRRight on Sat Mar 25, 2017 10:54 pm; edited 3 times in total 

Back to top 


mecej4
Joined: 31 Oct 2006 Posts: 656

Posted: Sat Mar 25, 2017 10:45 pm Post subject: 


Dan, I think you will find it interesting to do a loglog plot of solution time against N, perhaps after removing data that took less than 0.1 s to run.
For full matrices, one expects O(N^3) for Gaussian elimination. Your Laipe2 results give about O(N^2.85), whereas MKL gives O(N^1.45). The difference in the exponents is striking. Perhaps that is not a fair comparison, because Pardiso has been asked to do equation reordering, but Laipe2 was not asked to do the same.
If your ultimate goal is to solve block matrix equations with N = 10^4 or more, Laipe2 may be too slow. 

Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 1482 Location: South Pole, Antarctica

Posted: Sat Mar 25, 2017 10:56 pm Post subject: 


That scaling is indeed good.
Code: 
MKLsparse LAIPE2
10000 9.3 31.1
20000 54.2 326.7

Which parameter controls reordering? I usually do not need that, physics helped here making largest elements on major diagonal. If killing reordering substantially increases speed i'd take that, usually not much. Good would be to get small matrices run faster, as they are used more often. 

Back to top 


mecej4
Joined: 31 Oct 2006 Posts: 656

Posted: Sun Mar 26, 2017 12:22 am Post subject: 


The value in iparm(2) controls reordering. You cannot turn reordering off in Pardiso  you can choose which reordering algorithm to use.
When you factorise a banded matrix the factors have the same bandwidths as the original matrix, but the fillin within the band is affected by reordering. You should probably try Metis reordering, and see if asking Laipe to reorder helps to increase its speed. 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 1731 Location: Sydney

Posted: Sun Mar 26, 2017 1:36 am Post subject: 


Dan,
If you are using MKL with reordering, you need to be careful defining the initial matrix, especially the % nonzero within the profile.
It is interesting that MKL will attempt reordering of a general sparse matrix on a per equation basis. I would have thought that to be unproductive, but the MKL results are impressive. (How does it manage the unknown increase in storage required?)
Lower % cpu can be an indicator that there is not good load sharing between the threads, which can be due to the specific sparsity characteristics of the test matrix. You may want to experiment with different %nonzero values within the profile.
If you can define the profile of the test matrices, I could run them in my profile solver and report how they compare. 

Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 1482 Location: South Pole, Antarctica

Posted: Sun Mar 26, 2017 3:15 am Post subject: 


John, Just take simple 2block matrix as I plotted above first (matrix A in the source below). Later you can make 3 or more such blocks and then takes them of different size.
Here is the source for the Laipe VAG_8 I used in comparisons above. Compilation:
Code:  >ftn95 TestVAG2017a.f95 /64 /free /err /set_error_level error 298 /no_truncate
>slink64 TestVAG2017a.obj laipe2_withVAG.dll /file:TestVAG2017a.exe 
Code:  module Kinex_Matrix
real*8, dimension(:,:), allocatable :: A
real*8, dimension(:), allocatable :: B, C, X
integer :: NoGood, i, j, neq
integer*4, dimension(:), allocatable :: Beginning, Label, Last
Integer count_0, count_1, count_rate, count_max, ncore
end module Kinex_Matrix
!..................................................
Program LAIPE_VAG8
use Kinex_Matrix
include 'laipe2_withVAG.inc'
real*8, dimension(:), allocatable :: A1
call laipe$getce(ncore)
call laipe$use(ncore)
DO neq=500,6000,500 ! DO neq=10000,20000,10000
iDimA1 = neq*neq
allocate( b(neq), x(neq), beginning(neq), last(neq), label(neq))
allocate ( A1(iDimA1), stat = iAllocA1)
call Input (A1)
Call system_clock(count_0, count_rate, count_max)
CALL laipe$Decompose_VAG_8(A1, Neq, Label, Last, NoGood)
IF (NoGood /= 0) STOP 'LAIPE: Cannot decompose'
CALL laipe$Substitute_VAG_8 (A1, Neq, Label, Last, B)
Call system_clock(count_1, count_rate, count_max)
Write (*, '(1x,A,i6,A,2x,F8.3,A)') 'nEqu = ',nEq,' ', &
dble(count_1count_0)/count_rate, ' s'
deallocate(A1,B,X, beginning, last, label)
end DO ! DO neq=500,6000,500
call laipe$done()
end program
!........................................
Subroutine Input (A1)
use Kinex_Matrix
Real*8 A1(1,1)
allocate(A(neq,neq))
A(:,:) = 0
B(:) = 0
do j=1,neq/2
beginning(j) = 1
last(j) = neq/2
do i=1,neq/2
call random_number(A(i,j))
enddo
enddo
do j=neq/2, neq
beginning(j) = neq/2
last(j) = neq
do i=neq/2,neq
call random_number(A(i,j))
enddo
enddo
beginning(neq/2) = 1
Label(1) = 1
do i=2,neq
Label(i) = Label(i1) + Last(i1)  Beginning(i) + 1
enddo
call random_number(B)
call TwoDarray_Into1Darray (A1)
deallocate(A)
end Subroutine
!..............................................
Subroutine TwoDarray_Into1Darray (A1)
use Kinex_Matrix
Real*8 A1(1,1)
Do j=1,Neq
do ii = Beginning(j), Last(j)
A1(ii,Label(j) ) = A(ii,j)
enddo
enddo
End subroutine 


Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 1482 Location: South Pole, Antarctica

Posted: Sun Mar 26, 2017 3:55 am Post subject: 


John,
The code above is a bit messy, so I will probably repeat how the matrix was set by simpler way
Code:  integer :: neq
real*8,allocatable :: A(:,:), B (:)
do neq=500,6000,500
allocate(A(neq,neq),b(neq)
A(:,:) = 0
B(:) = 0
do j=1,neq/2
do i=1,neq/2
A(i,j) =random()
enddo
enddo
do j=neq/2, neq
do i=neq/2,neq
A(i,j) =random()
enddo
enddo
call random_number(B) 


Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 1731 Location: Sydney

Posted: Sun Mar 26, 2017 4:14 am Post subject: 


Dan,
Trying to review the solver times, I stumbled at the first hurdle, as you have used laipe$Decompose_VAG_8, while I can only perform laipe$Decompose_VSG_8 solutions.
With VariableBandwidth and Asymmetric Systems, I presume the solver first checks that you have provided for nonzero infill during reduction.
I haven't used MKL, so I presume nonzero infill is managed there also. Does the MKL sparse input format require this allowance ?
In the 32bit days, allocating sufficient temporary space could be a significant issue ?
Are all your tests assuming nonsymmetric matrices ? 

Back to top 




You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum

Powered by phpBB © 2001, 2005 phpBB Group
