[petsc-users] What is the right way to implement a (block) Diagonal ILU as PC?

hong at aspiritech.org
Wed Feb 5 15:26:10 CST 2020


Hao:
Try '-sub_pc_type lu -ksp_type gmres -ksp_monitor'
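For example, combined with the 3-block bjacobi setup you describe below, the
full set of options would be something like (a sketch; adjust the block count
as needed):

  -ksp_type gmres -ksp_monitor -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type lu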
Hong

Thanks a lot for your suggestions, Hong and Barry -
>
> As you suggested, I first tried the LU direct solvers (built-in and MUMPS)
> out this morning, and they work perfectly, albeit slowly. As my problem is
> part of a PDE-based optimization, an exact solution within the search
> procedure is not necessary (I often set a relative tolerance of 1E-7/1E-8,
> etc. for Krylov subspace methods). I also tried BJACOBI with exact LU; the
> KSP converges in one or two iterations, as expected.
>
> I added the -ksp_view option for the D-ILU test (still with block Jacobi as
> the preconditioner and bcgs as the KSP solver). The -ksp_view output from
> one of the examples in a toy model looks like:
>
> KSP Object: 1 MPI processes
>    type: bcgs
>    maximum iterations=120, nonzero initial guess
>    tolerances:  relative=1e-07, absolute=1e-50, divergence=10000.
>    left preconditioning
>    using PRECONDITIONED norm type for convergence test
>  PC Object: 1 MPI processes
>    type: bjacobi
>      number of blocks = 3
>      Local solve is same for all blocks, in the following KSP and PC
> objects:
>      KSP Object: (sub_) 1 MPI processes
>        type: preonly
>        maximum iterations=10000, initial guess is zero
>        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
>        left preconditioning
>        using NONE norm type for convergence test
>      PC Object: (sub_) 1 MPI processes
>        type: ilu
>          out-of-place factorization
>          0 levels of fill
>          tolerance for zero pivot 2.22045e-14
>          matrix ordering: natural
>          factor fill ratio given 1., needed 1.
>            Factored matrix follows:
>              Mat Object: 1 MPI processes
>                type: seqaij
>                rows=11294, cols=11294
>                package used to perform factorization: petsc
>                total: nonzeros=76008, allocated nonzeros=76008
>                total number of mallocs used during MatSetValues calls=0
>                  not using I-node routines
>        linear system matrix = precond matrix:
>        Mat Object: 1 MPI processes
>          type: seqaij
>          rows=11294, cols=11294
>          total: nonzeros=76008, allocated nonzeros=76008
>          total number of mallocs used during MatSetValues calls=0
>            not using I-node routines
>    linear system matrix = precond matrix:
>    Mat Object: 1 MPI processes
>      type: mpiaij
>      rows=33880, cols=33880
>      total: nonzeros=436968, allocated nonzeros=436968
>      total number of mallocs used during MatSetValues calls=0
>        not using I-node (on process 0) routines
>
> Do you see something wrong with my setup?
>
> I also tried a quick performance test with a small 278906-by-278906
> matrix (3850990 nnz) using the following parameters:
>
> -ksp_type bcgs -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type
> ilu -ksp_view
>
> Reducing the relative residual to 1E-7:
>
>   3 blocks (-pc_bjacobi_local_blocks 3): 4.08 s, 41 bcgs iterations
>   6 blocks: 7.02 s, 73 bcgs iterations
>   9 blocks: 9.45 s, 101 bcgs iterations
>
> As a reference, my home-brew Fortran code solves the same problem (3-block
> D-ILU0) in 1.84 s with 24 bcgs iterations (the bcgs code is also a
> home-brew one)…
>
>
>
> Well, regarding “use explicit L/U matrix as preconditioner”, I wonder if a
> user is allowed to provide his own (separate) L and U Mat objects for
> preconditioning (as in Matlab's solvers), e.g.
>
> x = qmr(A,b,Tol,MaxIter,L,U,x)
>
> As I have already explicitly constructed the L and U matrices in Fortran, it
> is not hard to convert them to Mat objects in PETSc and test PETSc and my
> Fortran code head-to-head. In that case A, b, x, and L/U would all be
> identical, so it would be easier to see where the problem comes from.
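>
> (Since my factors are already stored in CSR arrays, I assume something like
> MatCreateSeqAIJWithArrays() would let me wrap them into Mat objects without
> copying, e.g.
>
>       ! ia/ja: CSR row pointers / column indices, va: values (assumed names)
>       call MatCreateSeqAIJWithArrays(PETSC_COMM_SELF,n,n,ia,ja,va,Lmat,ierr)
>
> though I have not checked the expected index base carefully.)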
>
>
>
> BTW, there is another thing I wondered about - is there a way to output the
> residual in the unpreconditioned norm? I tried
>
> call KSPSetNormType(ksp_local, KSP_NORM_UNPRECONDITIONED, ierr)
>
> But I always get an error that the current KSP does not support the
> unpreconditioned norm with LEFT/RIGHT preconditioning (either way). Is it
> possible to output the unpreconditioned residual in PETSc at all?
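>
> (Or is -ksp_monitor_true_residual the intended way to see the true, i.e.
> unpreconditioned, residual norm during the iteration?)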
>
> Cheers,
> Hao
>
>
> ------------------------------
> From: Smith, Barry F. <bsmith at mcs.anl.gov>
> Sent: Tuesday, February 4, 2020 8:27 PM
> To: Hao DONG <dong-hao at outlook.com>
> Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> Subject: Re: [petsc-users] What is the right way to implement a (block)
> Diagonal ILU as PC?
>
>
>
> > On Feb 4, 2020, at 12:41 PM, Hao DONG <dong-hao at outlook.com> wrote:
> >
> > Dear all,
> >
> >
> > I have a few questions about the implementation of a diagonal ILU PC in
> > PETSc. I want to solve a very simple system with KSP (in parallel); the
> > nature of the system (finite-difference time-harmonic Maxwell) is probably
> > not important to the question itself. Long story short, I just need to
> > solve a set of equations Ax = b with a block-diagonal system matrix,
> > like (not sure if the mono font works):
> >
> >    |X    |
> > A =|  Y  |
> >    |    Z|
> >
> > Note that A is not really block-diagonal - it’s just a multi-diagonal
> > matrix determined by the FD mesh, where most elements are close to the
> > diagonal. So instead of a full ILU decomposition, a D-ILU is a good
> > approximation as a preconditioner, and the number of blocks should not
> > matter too much:
> >
> >     |Lx      |         |Ux      |
> > L = |   Ly   | and U = |   Uy   |
> >     |      Lz|         |      Uz|
> >
> > Where [Lx, Ux] = ILU0(X), etc. Indeed, the D-ILU preconditioner (with 3N
> > blocks) is quite efficient with Krylov subspace methods like BiCGstab or
> > QMR in my sequential Fortran/Matlab code.
> >
> > So, like most users, I am looking for a parallel implementation of this
> > problem in PETSc. After looking through the manual, it seems that the most
> > straightforward way to do it is through PCBJACOBI. I am not sure I
> > understand it right: I just set up a 3-block PCBJACOBI and gave each of
> > the blocks a KSP with PCILU. Is this supposed to be equivalent to my D-ILU
> > preconditioner? My little block of Fortran code looks like:
> > ...
> >       call PCBJacobiSetTotalBlocks(pc_local,Ntotal,                   &
> >      &     isubs,ierr)
> >       call PCBJacobiSetLocalBlocks(pc_local, Nsub,                    &
> >      &    isubs(istart:iend),ierr)
> >       ! set up the block jacobi structure
> >       call KSPSetup(ksp_local,ierr)
> >       ! allocate sub ksps
> >       allocate(ksp_sub(Nsub))
> >       call PCBJacobiGetSubKSP(pc_local,Nsub,istart,                   &
> >      &     ksp_sub,ierr)
> >       do i=1,Nsub
> >           call KSPGetPC(ksp_sub(i),pc_sub,ierr)
> >           !ILU preconditioner
> >           call PCSetType(pc_sub,ptype,ierr)
> >           call PCFactorSetLevels(pc_sub,1,ierr) ! use ILU(1) here
> >           call KSPSetType(ksp_sub(i),KSPPREONLY,ierr)
> >       end do
> >       call KSPSetTolerances(ksp_local,KSSiter%tol,PETSC_DEFAULT_REAL, &
> >      &     PETSC_DEFAULT_REAL,KSSiter%maxit,ierr)
> > …
>
>      This code looks essentially right. You should run with -ksp_view to
> see exactly what PETSc is using for the solver.
>
> >
> > I understand that the parallel performance may not be comparable, so I
> > first set up a one-process test (with MPIAIJ, but all the rows are local
> > since there is only one process). The system is solved without any problem
> > (identical results within error). But the performance is actually a lot
> > worse (the code was built without debugging flags for the performance
> > tests) than my own home-brew implementation in Fortran (I wrote my own
> > ILU0 in CSR sparse matrix format), which is hard to believe. I suspect the
> > difference comes from the PC, as the PETSc version took many more BiCGstab
> > iterations (60-ish vs 100-ish) to converge to the same relative tolerance.
>
>    PETSc uses GMRES by default with a restart of 30 and left
> preconditioning. Different exact choices in the solver (which is why
> -ksp_view is so useful) could explain the differences in the runs between
> your code and PETSc's.
> >
> > This is further confirmed when I change the setup of the D-ILU (using 6
> > or 9 blocks instead of 3). While my Fortran/Matlab codes see a minimal
> > performance difference (<5%) when I play with the D-ILU setup, increasing
> > the number of D-ILU blocks from 3 to 9 caused the KSP setup with PCBJACOBI
> > to suffer a performance decrease of more than 50% in the sequential test.
>
>    This is odd: the more blocks, the smaller each block, so the quicker the
> ILU setup should be. You can run the various cases with -log_view and send
> us the output to see where the time goes in each part of the computation.
>
> > So my implementation in PETSc IS somewhat different. Am I missing
> > something in the PCBJACOBI setup? Or do I have some fundamental
> > misunderstanding of how PCBJACOBI works in PETSc?
>
>    Probably not.
> >
> > If this is not the right way to implement a block-diagonal ILU as a
> > (parallel) PC, please kindly point me in the right direction. I searched
> > through the mailing list for answers, only to find a couple of similar
> > questions... An example would be nice.
>
>    Your approach seems fundamentally right, but I cannot rule out possible
> bugs.
> >
> > On the other hand, does PETSc support a simple way to use an explicit L/U
> > matrix as a preconditioner? I could import the D-ILU matrix (I have
> > already converted my A matrix into a Mat) constructed in my Fortran code
> > to make a better comparison. Or do I have to construct my own PC using
> > PCSHELL? If so, is there a good tutorial/example to learn how to use
> > PCSHELL (in a more advanced sense, like how to set up the PC side and the
> > transpose)?
>
>    Not sure what you mean by an explicit L/U matrix as a preconditioner. As
> Hong said, yes, you can use a parallel LU from MUMPS, SuperLU_DIST, or
> PaStiX as the solver. You do not need any shell code. You simply need to
> set the PCType to lu.
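>
>    (If at some point you do want to apply your own explicitly stored D-ILU
> factors, a PCSHELL only needs an apply callback. A minimal Fortran sketch,
> assuming a hypothetical user routine my_dilu_solve() that applies your
> existing forward/backward substitution to raw arrays, would be
>
>       ! in the setup code, after KSPGetPC(ksp_local,pc_local,ierr):
>       external MyDILUApply
>       call PCSetType(pc_local,PCSHELL,ierr)
>       call PCShellSetApply(pc_local,MyDILUApply,ierr)
>
>       ! the callback applies the preconditioner: y = M^{-1} x
>       subroutine MyDILUApply(pc,x,y,ierr)
> #include <petsc/finclude/petscksp.h>
>       use petscksp
>       implicit none
>       PC pc
>       Vec x,y
>       PetscErrorCode ierr
>       PetscScalar, pointer :: xx(:),yy(:)
>       call VecGetArrayReadF90(x,xx,ierr)
>       call VecGetArrayF90(y,yy,ierr)
>       call my_dilu_solve(xx,yy)   ! your own L/U solves (hypothetical routine)
>       call VecRestoreArrayReadF90(x,xx,ierr)
>       call VecRestoreArrayF90(y,yy,ierr)
>       end subroutine MyDILUApply
>
> but as noted above that should not be necessary for this problem. There is
> also a Fortran PCSHELL example in the KSP tutorials, ex15f, if memory
> serves.)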
>
>    You can also set all these options from the command line and don't need
> to hardwire them in the code. Just call KSPSetFromOptions() and then, for
> example,
>
>     -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu (this
> last one is applied to each block, so you could use -sub_pc_type lu and it
> would use LU on each block)
>
>    -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps (do a
> parallel LU solve with MUMPS)
>
> By not hardwiring choices in the code and just using options, you can test
> out different cases much more quickly.
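>
> A minimal sketch of the Fortran driver this assumes (ksp, A, b, x being
> your existing objects) is just
>
>       call KSPCreate(PETSC_COMM_WORLD,ksp,ierr)
>       call KSPSetOperators(ksp,A,A,ierr)
>       call KSPSetFromOptions(ksp,ierr)  ! picks up -ksp_type, -pc_type, -sub_pc_type, ...
>       call KSPSolve(ksp,b,x,ierr)
>
> and everything else comes from the options database.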
>
> Use -ksp_view to make sure it is using the solver the way you expect.
>
> Barry
>
> >
> > Thanks in advance,
> >
> > Hao
>
>