forums.silverfrost.com Forum Index
Welcome to the Silverfrost forums

Skyline solver accelerated 42 times on 48 processors
 
forums.silverfrost.com Forum Index -> General
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

Posted: Sun Mar 26, 2017 5:39 am

Blocks inside the matrix are squares lying along the major diagonal, so geometrically the shape of the matrix is symmetric, but the matrix elements are all different, so formally it is not symmetric (not equal to its transpose, which is the definition of symmetric).

A dense 20000x20000 real*8 matrix contains about 3 GB of data. My block sparse one takes half of that.
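The 3 GB figure follows directly from the element count; a minimal sketch of the arithmetic (not from the thread, just a check):

```fortran
program dense_storage
!  quick check of the storage figure quoted above:
!  a dense n x n real*8 matrix holds n*n eight-byte elements
   implicit none
   integer, parameter :: n = 20000
   real*8 :: gib
   gib = real(n,8) * real(n,8) * 8.0d0 / 2.0d0**30
   write (*,'(a,f5.2,a)') ' dense storage = ', gib, ' GiB'   ! about 2.98 GiB
end program dense_storage
```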
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

Posted: Sun Mar 26, 2017 5:48 pm

Here is the test in single precision using the LAIPE block sparse solver. I substituted VAG_8 with VAG_4 and changed all arrays to real*4. It shows twice the speed of VAG_8 above.
Code:
     N      VAG_4   VAG_8   MKL_8   MKL_4    ! times in seconds
   500      0.005   0.006   0.191   0.117
  1000      0.020   0.027   0.062   0.061
  1500      0.051   0.083   0.157   0.140
  2000      0.113   0.189   0.360   0.267
  2500      0.222   0.398   0.424   0.487
  3000      0.373   0.673   0.680   0.591
  3500      0.571   1.158   0.951   0.797
  4000      0.846   1.745   1.176   1.257
  4500      1.250   2.476   1.654   1.514
  5000      1.704   3.537   2.034   1.775
  5500      2.277   4.652   2.482   2.413
  6000      2.950   6.048   3.024   2.615
 10000      15.43   31.10   9.268   9.326
 20000      151.5   326.7   54.21   40.99


Freaking "fun" is that I cannot create the same test with the older LAIPE, circa 1997, built with Microsoft Fortran, which I have used all this time with no problems. It crashes and crashes from the start. Can I say again and again: "programming is full of @#%% devilry"? Just one step aside and you get burned for hours where you expected it least. Attention to tiny details, good memory and good mood/health are literally a must for programmers, or you will do and redo things 1000 times... Am I the only one experiencing this type of hiccup, with some ups but often downs all the way? I have never heard complaints from colleagues of any age to the same extreme extent, which bothers me, and I'd like to believe that I am not the most stupid person, or completely unsuited for this kind of work, which I have done all my life and like...


Last edited by DanRRight on Tue Mar 28, 2017 7:47 am; edited 7 times in total
mecej4



Joined: 31 Oct 2006
Posts: 1885

Posted: Mon Mar 27, 2017 2:49 am

If by "older LAIPE circa 1997 made by the Microsoft Fortran" you mean a library made for use with Fortran Powerstation (FPS) 4, to call routines in the library you need to use the STDCALL convention. In the caller, you may need to add declarations for this purpose, or use compiler options, depending on whether your code is compiled with IFort or FTN95.

Using the FPS-4 compiler, I ran the code that I posted in my post of March 22, 2017 after changing the subroutine names to LAIPE-1 names and providing interface blocks for the three LAIPE subroutines that are called. The run times were roughly the same as with Intel Fortran, because most of the run time is spent in the LAIPE-1 library.
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Mon Mar 27, 2017 3:22 am

Dan,
Using Laipe’s laipe$decompose_VAG_x, I would recommend two utility routines to check your solution:
1) A routine that, given LAST and LABEL, checks the profile definition to confirm compliance with in-fill below the diagonal (say laipe$check_profile_VA_x) and,
2) A routine that, given the calculated solution {X}, calculates the error vector {E} = {B} – [A]{X} and reports on the maximum error (say laipe$check_error_VA_8)
If nothing else, these checks would help with understanding the data structure and accuracy of the solution.

The use of REAL*8 :: A(1,1) and A(i,LABEL(j)) is an interesting approach. I am sure a lot of Fortran inspectors would complain.
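To make that addressing concrete: with a declared shape of (1,1) the column stride is 1, so A(i,LABEL(j)) resolves to the flat location i + LABEL(j) - 1 of the underlying storage — the same offset arithmetic used in the routines below. A standard-conforming sketch of the idea, using an explicit flat array and a made-up LABEL:

```fortran
program flat_addressing
!  illustration only: how A(i,LABEL(j)) with REAL*8 :: A(1,1) maps to
!  flat storage; LABEL here is a hypothetical column-offset table
   implicit none
   integer, parameter :: n = 4
   real*8  :: a_flat(n*n)
   integer :: i, j, label(n)
!  each packed column j starts at flat offset label(j)
   do j = 1, n
      label(j) = (j-1)*n + 1
   end do
!  a_flat(i + label(j) - 1) plays the role of A(i,LABEL(j))
   do j = 1, n
      do i = 1, n
         a_flat(i + label(j) - 1) = 10*j + i
      end do
   end do
   write (*,*) a_flat(1), a_flat(n*n)   ! elements 1 and 16: 11.0 and 44.0
end program flat_addressing
```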
Code:
   Subroutine laipe$Check_Profile_VAG_x (N_i, Label_i, Last_i, NoGood_o)
!
!  routine to check profile structure for VA matrix
!    Beginning(I) must be in range (1:I)
!    Last(I)      must be in the range (I:N) and (Last(I-1):N)
!
     INTEGER*4 :: N_i             ! number of equations
     INTEGER*4 :: Label_i(N_i)    ! column storage offset in A(1,1)
     INTEGER*4 :: Last_i(N_i)     ! last active equation in column
     INTEGER*4 :: NoGood_o        ! count of errors ( =0 for OK )
!
     integer*4 :: Ieq, Begin_i
!
     NoGood_o = 0
!
!!   Beginning(I) = Label(I-1) + Last(I-1) - Label(I) + 1        (9.2)
     Begin_i  = 1
     DO Ieq = 1, N_i
       if ( Ieq > 1 ) then
         Begin_i = Last_i(Ieq-1) - ( Label_i(Ieq) - Label_i(Ieq-1) ) + 1
         if ( Last_i(Ieq) < Last_i(Ieq-1) ) NoGood_o = NoGood_o + 1
       end if
       if ( Begin_i     < 1   ) NoGood_o = NoGood_o + 1
       if ( Begin_i     > Ieq ) NoGood_o = NoGood_o + 1
       if ( Last_i(Ieq) < Ieq ) NoGood_o = NoGood_o + 1
       if ( Last_i(Ieq) > N_i ) NoGood_o = NoGood_o + 1
     END DO
!
     if ( NoGood_o > 0 ) Write (*,*) NoGood_o,' problems with VAG profile'

   End Subroutine laipe$Check_Profile_VAG_x
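A hypothetical driver fragment for the check above (everything except the subroutine name is an assumption about the caller's variables):

```fortran
!  hypothetical usage: N, LABEL and LAST as prepared for laipe$decompose_VAG_8
      integer*4 :: NoGood
!
      call laipe$Check_Profile_VAG_x ( N, LABEL, LAST, NoGood )
      if ( NoGood > 0 ) stop 'VAG profile definition is not usable'
```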
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Mon Mar 27, 2017 3:24 am

(continued)
Code:
   REAL*8 Function laipe$Max_Error_VAG_8 (A_i, B_i, X_i, N_i, Label_i, Last_i)
!
!  routine to calculate both Max error and RMS error for the solution of [A].{X} = {B}
!    [A] and {B} are original inputs
!    {X} is the solution obtained
!
     INTEGER*4 :: N_i             ! number of equations
     REAL*8    :: A_i(*)          ! original matrix [A], before being decomposed
     REAL*8    :: B_i(N_i)        ! {B} : original rhs
     REAL*8    :: X_i(N_i)        ! {X} : available solution to [A].{X} = {B}
     INTEGER*4 :: Label_i(N_i)    ! column storage offset in A(1,1)
     INTEGER*4 :: Last_i(N_i)     ! last active equation in column
!
     integer*4 :: Ieq, Begin_i, End_i, N_eq, Start_i
     real*8    :: err_i, max_err, rms_err
     REAL*8, allocatable :: B_new(:)
!
     ALLOCATE ( B_new(N_i) )
     B_new = 0
!
!    Begin_i is first active equation in column i
!    Start_i is storage location of this first coefficient in A(*) storage
!    last_i  is last active equation of column i
!
!        Begin_i = Last_i(Ieq-1) - ( Label_i(Ieq) - Label_i(Ieq-1) ) + 1
!        Start_i = Begin_i + (Label_i(ieq)-1)*1
!
     Begin_i  = 1
     DO Ieq = 1, N_i
       if ( Ieq > 1 ) then
         Begin_i = Last_i(Ieq-1) - ( Label_i(Ieq) - Label_i(Ieq-1) ) + 1
       end if
       End_i = Last_i(Ieq)
       N_eq = End_i - Begin_i + 1
!
!!     B_new(Begin_i:End_i) = B_new(Begin_i:End_i) + A_i(Begin_i:End_i,Label_i(Ieq)) * X_i(Ieq)
!
       Start_i = Begin_i + Label_i(Ieq) - 1
!!     standard BLAS daxpy: y = y + a*x, i.e. daxpy ( n, a, x, incx, y, incy )
       call daxpy ( N_eq, X_i(Ieq), A_i(Start_i), 1, B_new(Begin_i), 1 )
!
     END DO
!   
     max_err = 0
     rms_err = 0
     DO Ieq = 1, N_i
       err_i = abs ( B_new(Ieq) - B_i(Ieq) )
       if ( err_i > max_err ) max_err = err_i
       rms_err = rms_err + err_i**2
     END DO
     rms_err = sqrt ( rms_err / N_i )
!
     write (*,*) 'RMS error = ', rms_err
     write (*,*) 'Max error = ', max_err

     laipe$Max_Error_VAG_8 = max_err
!
   End Function laipe$Max_Error_VAG_8

I have not tested them, so hopefully the idea is there, even if there are a few errors !!
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

Posted: Mon Mar 27, 2017 4:06 am

Mecej4, there exists one more library besides the ones you have. It does not need anything special to be called from FTN95. And it worked for 20 years with no hiccup. Until today... I tried a few things but just have no time to find the reason.

Does this Pardiso subroutine above have a single-precision version?

John, I may play with checking the solution, but what does this extra checking give the user? Have you ever seen big problems here?
mecej4



Joined: 31 Oct 2006
Posts: 1885

Posted: Mon Mar 27, 2017 11:37 am

DanRRight wrote:
Does this pardiso subroutine above have single precision version?


Yes, and you really have to read the documentation before attempting to use that and other options. Unlike most of the Lapack routines in MKL, with Pardiso you specify single precision by setting one of the elements of the iparm() array, not by calling a different subroutine name. Similarly, Pardiso does different things based on control variables that you pass to it.
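A sketch of what that looks like; iparm(28) is the precision switch as I recall the MKL documentation, but check the manual for your MKL version before relying on it:

```fortran
!  sketch only: single-precision MKL Pardiso is selected through iparm,
!  not through a different routine name
      integer*4 :: iparm(64)
!
      iparm = 0
      iparm(1)  = 1     ! do not use the defaults for the whole iparm array
      iparm(28) = 1     ! 1 = input arrays a, b, x are single precision
                        !     (0 = double precision, the default)
```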
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Mon Mar 27, 2017 12:30 pm

Dan wrote:
John, I may play with checking of solution but what gives to the user this extra checking? Have you ever seen big problems here?

In my analyses I always do a calculation of the form Laipe$Max_Error_VAG_8.
Often [A].{x} can be written as SUM ( [Ai].{x} ) where [Ai] are much smaller matrices. This also confirms the assembly of [A].
The calculation of errors can help to confirm the correctness of the solution and also identifies possible round-off where the matrix/problem is poorly defined. I carry out this check wherever possible in FE analysis, especially in static analysis or eigenvector extraction.
Remember that pivoting has not been selected, which can be a problem for a skyline storage solution.
mecej4



Joined: 31 Oct 2006
Posts: 1885

Posted: Tue Mar 28, 2017 1:58 am

John, please note that an unsymmetric sparse matrix need not satisfy the tests
Code:

       if ( Begin_i     > Ieq ) NoGood_o = NoGood_o + 1
       if ( Last_i(Ieq) < Ieq ) NoGood_o = NoGood_o + 1

that you have in your laipe$Check_Profile_VAG_x() subroutine. For example, the matrix IMPCOL_B at the Matrix Market has five columns that begin below the diagonal, yet there is no problem obtaining a solution to A.x = b with that matrix using the SuperLU solver (the Laipe solver fails to factorize the matrix, at least with my driver code for obtaining the solution using the VAG subroutines).
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Tue Mar 28, 2017 3:15 am

Mecej4,

I was trying to identify some warnings for a Profile_VAG approach.
If a matrix fails those tests, then there is no point in applying a profile solver that does not do any pivoting. However, there are many practical equation systems where pivoting is not required.
I find it difficult to understand how you could use partial pivoting with a skyline/column storage model.
How do MKL and Pardiso work with the sparse matrix definition and manage the storage required during reduction? I would expect that the Profile_VAG approach requires minimal additional storage (until pivoting is attempted).
mecej4



Joined: 31 Oct 2006
Posts: 1885

Posted: Tue Mar 28, 2017 3:33 am

Pardiso does not provide for any special treatment of skyline profile or even banded matrices, and I suspect that for skyline profile matrices there may be other packages that do better.

Only non-zero off-diagonal elements need to be supplied. In that respect, I dislike the design decision taken by Laipe for unsymmetric matrices -- for the matrix Sherman2 from the Matrix Market, N = 1080 and nnz = 23,094, but after filling in all the zeros that Laipe expects, the matrix has 401,023 elements, i.e., roughly 17 times as many.

Pardiso is a modern routine, and allocates any storage that it needs without getting the user involved. A terminal call is needed after the solution has been obtained to tell Pardiso to release internally allocated storage.
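For reference, the terminal call is just another pardiso() call with phase = -1; the argument names below follow the conventional ones from the MKL Fortran interface, and the surrounding setup is assumed:

```fortran
!  sketch of the terminal call: phase = -1 tells Pardiso to release all
!  internally allocated memory; pt, iparm etc. are the same handles used
!  in the analysis/factorize/solve calls
      phase = -1
      call pardiso ( pt, maxfct, mnum, mtype, phase, n, a, ia, ja,  &
                     perm, nrhs, iparm, msglvl, b, x, error )
      if ( error /= 0 ) write (*,*) 'Pardiso release failed, error =', error
```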

Pivoting is handled internally, but you can choose between MMD and Metis reordering.


Last edited by mecej4 on Wed Mar 29, 2017 7:54 pm; edited 1 time in total
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

Posted: Tue Mar 28, 2017 7:54 am

mecej4 wrote:
If by "older LAIPE circa 1997 made by the Microsoft Fortran" you mean a library made for use with Fortran Powerstation (FPS) 4, to call routines in the library you need to use the STDCALL convention. In the caller, you may need to add declarations for this purpose, or use compiler options, depending on whether your code is compiled with IFort or FTN95.

Using the FPS-4 compiler, I ran the code that I posted in my post of March 22, 2017 after changing the subroutine names to LAIPE-1 names and providing interface blocks for the three LAIPE subroutines that are called. The run times were roughly the same as with Intel Fortran, because most of the run time is spent in the LAIPE-1 library.


There exists one more, older LAIPE library besides LAIPE1 circa 2008; let's call it LAIPE0 or LAIPE1997. Its LIB file is compatible with FTN95-32 because it was built with MS Fortran, so there is no need to add anything. But for some reason it does not work with this small test, despite working with my programs for 20 years.
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

Posted: Tue Mar 28, 2017 10:34 pm

"The road to hell is paved with good intentions".

I finally succeeded in reproducing the same test with the 1997 LAIPE, or LAIPE0.

Well, that was classical devilry, paved with lures and good intentions.

We have
- LAIPE0, circa 1997, made with MS Fortran
- LAIPE1, circa 2008, made with Intel Fortran
- LAIPE2, of recent years, compiled with gFortran, from which Mecej4 made a DLL to be used with FTN95

The dense matrices we started with were used exclusively for their simplicity; they have zero practical applicability in my case.

Dense LAIPE1 showed a factor of 1.2 speedup over dense LAIPE0, and dense LAIPE2 showed a 2x speedup over dense LAIPE1. The lure is clear: let's try LAIPE2 for sparse and then go drink champagne!

We tried sparse LAIPE2 versus sparse LAIPE0 and got ... only a 20% speed increase over the 20-year-older library. Here is LAIPE0...
Code:

 nEqu =    500      0.005 s
 nEqu =   1000      0.023 s
 nEqu =   1500      0.061 s
 nEqu =   2000      0.137 s
 nEqu =   2500      0.269 s
 nEqu =   3000      0.459 s
 nEqu =   3500      0.724 s
 nEqu =   4000      1.079 s
 nEqu =   4500      1.526 s
 nEqu =   5000      2.079 s
 nEqu =   5500      2.855 s
 nEqu =   6000      3.646 s
 nEqu =  10000      17.13 s
 nEqu =  20000      fail


That is for real*4 data, which runs 2x faster than real*8 with LAIPE, which is why I mostly use it. I summarized most of the test results in an older post a few days back.

Mecej4 tried the dense MKL solver and it showed such great promise of more than an order of magnitude speedup over LAIPE that I thought the sparse solver would also deliver such a speedup. Well, sparse MKL-Pardiso only delivered up to a 3x speedup, and only for super-large matrices of size 20000, which is good but very rarely used. During a simulation the matrix can also be very small, and changing methods on the fly is possible but potentially troublesome. On typical smallish sizes <6000 it is not faster than LAIPE, and on small ones ~1000 it is up to 3 times slower than LAIPE real*4. Oh yes, forgot to mention, and here at the end is the kick below the belt: single-precision MKL-Pardiso gets no visible speedup over real*8 MKL-Pardiso; see the columns called MKL_8 and MKL_4 in the post a few days back.

Double devil... which, by the way, is what the name of the place I recently moved to means in a foreign language. A beautiful place of course, as it should be :)

ADDITION: After being disappointed with the current sparse matrix test results for some time, I think there is one more chance: to ask the author of LAIPE to compile it with the latest Intel Fortran using SSE and AVX.


Last edited by DanRRight on Thu Mar 30, 2017 1:23 am; edited 2 times in total
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

Posted: Wed Mar 29, 2017 7:02 am

Addition 2
Wow! Finally! Look what I've got with the real*8 block sparse matrix: WAY faster than the fastest before.

Code:

 1000      0.007 s
 3000      0.115
 6000      0.742
10000      3.408
20000      27.21

And fasten your seat belts now. The same with real*4:
Code:

 1000      0.003 s
 2000      0.020
 3000      0.063
 4000      0.135
 5000      0.273
 6000      0.421
10000      1.743
20000      14.12


Killing speed!
To Intel Pardiso: "You are fired" (c)
Page 4 of 4

 


Powered by phpBB © 2001, 2005 phpBB Group