[petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
Mark Adams
mfadams at lbl.gov
Wed Oct 13 20:25:17 CDT 2021
On Wed, Oct 13, 2021 at 9:04 PM Chang Liu <cliu at pppl.gov> wrote:
> Hi Mark,
>
> Thank you for sharing this. I totally agree that factorization and
> triangular solves can be slow on GPUs.
Note, the factorization has much more potential on a GPU than the forward
and backward solve (the solve) phase, because it has much more work and
higher arithmetic intensity (BLAS3 vs BLAS2 (or 1)).
The work complexity of a PDE sparse factorization is about O(N^2), versus
about O(N^3/2) for the solve. That is a big difference.
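
As a rough illustration (taking those exponents at face value; the exact
powers depend on the dimension and the ordering), for N = 10^6 unknowns:

   factorization work ~ N^2     = 10^12
   solve work         ~ N^(3/2) = 10^9

so the factorization has roughly a thousand times more work over which to
hide GPU latency.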
> However, I also find that other
> operations such as matrix-matrix multiplication can be very fast on GPUs,
> so some iterative solvers may perform well on GPUs, depending on the
> density and structure of the matrix.
>
> In my tests, I found that sometimes the GPU can give a 2-3x speedup for GMRES.
>
> Also, I think the SuperLU group has made significant progress on porting
> their code to GPUs recently, with impressive speedups (not published yet).
>
> Chang
>
> On 10/13/21 8:29 PM, Mark Adams wrote:
> >
> >
> > On Wed, Oct 13, 2021 at 1:53 PM Barry Smith <bsmith at petsc.dev> wrote:
> >
> >
> > Chang,
> >
> > You are correct: there are no MPI + GPU direct solvers that I am aware
> > of which currently do the triangular solves with MPI + GPU parallelism.
> >
> >
> > So SuperLU and MUMPS do MPI solves on the CPU. That is reasonable. I
> > have not been able to get decent performance with GPU solves: complex
> > code and low arithmetic intensity are not a good fit for GPUs. Little
> > work and all latency.
> >
> > Chang, you would find that GPU solves suck and, anyway, machines these
> > days are configured with significant (high-quality) CPU resources. I
> > think you would find that you can't get GPU solves to beat CPU solves,
> > except perhaps if you have enormous problems to solve.
> >
> > You are limited in that each individual triangular solve must be done
> > on a single GPU. I can only suggest making each subdomain as big as
> > possible, to utilize each GPU as much as possible for the direct
> > triangular solves.
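> >
> > A minimal sketch of that setup (essentially the options Chang already
> > uses below, but with one block-Jacobi block per GPU so each block is as
> > big as possible; the block count is a placeholder you would set to the
> > number of GPUs):
> >
> >   -ksp_type fgmres -mat_type aijcusparse \
> >   -pc_type bjacobi -pc_bjacobi_blocks <number of GPUs> \
> >   -sub_ksp_type preonly -sub_pc_type lu \
> >   -sub_pc_factor_mat_solver_type cusparse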
> >
> > Barry
> >
> >
> > > On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users
> > > <petsc-users at mcs.anl.gov> wrote:
> > >
> > > Hi Mark,
> > >
> > > '-mat_type aijcusparse' works with mpiaijcusparse with other
> > solvers, but with -pc_factor_mat_solver_type cusparse, it will give
> > an error.
> > >
> > > Yes, what I want is to have mumps or superlu do the
> > > factorization, and then do the rest, including the GMRES solve, on the
> > > GPU. Is that possible?
> > >
> > > I have tried to use aijcusparse with superlu_dist; it runs, but
> > > the iterative solver is still running on CPUs. I have contacted the
> > > superlu group and they confirmed that is the case right now. But if
> > > I set -pc_factor_mat_solver_type cusparse, it seems that the
> > > iterative solver is running on the GPU.
> > >
> > > Chang
> > >
> > > On 10/13/21 12:03 PM, Mark Adams wrote:
> > >> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu <cliu at pppl.gov> wrote:
> > >> Thank you Junchao for explaining this. I guess in my case the code
> > >> is just calling a seq solver like superlu to do factorization on GPUs.
> > >>
> > >> My idea is that I want to have a traditional MPI code to utilize GPUs
> > >> with cusparse. Right now cusparse does not support mpiaij matrix,
> > >>
> > >> Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse
> > >> matrix with > 1 processes.
> > >> (-mat_type mpiaijcusparse might also work with >1 proc).
> > >> However, I see in grepping the repo that all the mumps and superlu
> > >> tests use aij or sell matrix type.
> > >> MUMPS and SuperLU provide their own solves, I assume ... but you
> > >> might want to do other matrix operations on the GPU. Is that the issue?
> > >> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU and have
> > >> a problem? (there is no test with it, so it probably does not work)
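> > >>
> > >> An untested sketch of what I mean (the executable name and rank count
> > >> are placeholders):
> > >>
> > >>   mpiexec -n 4 ./your_app -mat_type aijcusparse -ksp_type gmres \
> > >>     -pc_type lu -pc_factor_mat_solver_type mumps
> > >>
> > >> i.e., let MUMPS do the factorization and triangular solves while the
> > >> Krylov work uses the cusparse matrix type.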
> > >> Thanks,
> > >> Mark
> > >> so I want the code to have a mpiaij matrix when adding all the
> > >> matrix terms, and then transform the matrix to seqaij when doing the
> > >> factorization and solve. This involves sending the data to the master
> > >> process, and I think the petsc mumps solver has something similar
> > >> already.
> > >>
> > >> Chang
> > >> On 10/13/21 10:18 AM, Junchao Zhang wrote:
> > >> >
> > >> >
> > >> >
> > >> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams <mfadams at lbl.gov> wrote:
> > >> >
> > >> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu <cliu at pppl.gov> wrote:
> > >> >
> > >> > Hi Mark,
> > >> >
> > >> > The options I use are like
> > >> >
> > >> > -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres
> > >> > -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse
> > >> > -sub_ksp_type preonly -sub_pc_type lu -ksp_max_it 2000
> > >> > -ksp_rtol 1.e-300 -ksp_atol 1.e-300
> > >> >
> > >> >
> > >> > Note, if you use -log_view the last column (rows are the method,
> > >> > like MatFactorNumeric) has the percent of work on the GPU.
> > >> >
> > >> > Junchao: *This* implies that we have a cuSparse LU factorization.
> > >> > Is that correct? (I don't think we do)
> > >> >
> > >> > No, we don't have cuSparse LU factorization. If you check
> > >> > MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls
> > >> > MatLUFactorSymbolic_SeqAIJ() instead.
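> > >> >
> > >> > If you want to verify at runtime which package actually did the
> > >> > factorization, a small sketch (assuming a plain -pc_type lu setup,
> > >> > an already set-up KSP named ksp, a program that includes
> > >> > petscksp.h; error checking omitted):
> > >> >
> > >> >   PC            pc;
> > >> >   Mat           F;
> > >> >   MatSolverType stype;
> > >> >   KSPGetPC(ksp, &pc);
> > >> >   PCFactorGetMatrix(pc, &F);         /* the factored matrix */
> > >> >   MatFactorGetSolverType(F, &stype); /* e.g. "petsc", "mumps" */
> > >> >   PetscPrintf(PETSC_COMM_WORLD, "factor solver: %s\n", stype);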
> > >> > So I don't understand Chang's idea. Do you want to make bigger
> > >> > blocks?
> > >> >
> > >> >
> > >> > I think this one does both factorization and solve on the gpu.
> > >> >
> > >> > You can check the runex72_aijcusparse.sh file in the petsc install
> > >> > directory, and try it yourself (this is only lu factorization,
> > >> > without the iterative solve).
> > >> >
> > >> > Chang
> > >> >
> > >> > On 10/12/21 1:17 PM, Mark Adams wrote:
> > >> > >
> > >> > >
> > >> > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu <cliu at pppl.gov> wrote:
> > >> > >
> > >> > > Hi Junchao,
> > >> > >
> > >> > > No, I only need it to be transferred within a node. I use the
> > >> > > block-Jacobi method and GMRES to solve the sparse matrix, so each
> > >> > > direct solver will take care of a sub-block of the whole matrix.
> > >> > > In this way, I can use one GPU to solve one sub-block, which is
> > >> > > stored within one node.
> > >> > >
> > >> > > It was stated in the documentation that the cusparse solver is
> > >> > > slow. However, in my test using ex72.c, the cusparse solver is
> > >> > > faster than mumps or superlu_dist on CPUs.
> > >> > >
> > >> > >
> > >> > > Are we talking about the factorization, the solve, or both?
> > >> > >
> > >> > > We do not have an interface to cuSparse's LU factorization (I
> > >> > > just learned that it exists a few weeks ago).
> > >> > > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type
> > >> > > aijcusparse'? This would be the CPU factorization, which is the
> > >> > > dominant cost.
> > >> > >
> > >> > >
> > >> > > Chang
> > >> > >
> > >> > > On 10/12/21 10:24 AM, Junchao Zhang wrote:
> > >> > > > Hi, Chang,
> > >> > > > For the mumps solver, we usually transfer matrix and vector
> > >> > > > data within a compute node. For the idea you propose, it looks
> > >> > > > like we need to gather data within MPI_COMM_WORLD, right?
> > >> > > >
> > >> > > > Mark, I remember you said the cusparse solve is slow and you
> > >> > > > would rather do it on the CPU. Is that right?
> > >> > > >
> > >> > > > --Junchao Zhang
> > >> > > >
> > >> > > >
> > >> > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users
> > >> > > > <petsc-users at mcs.anl.gov> wrote:
> > >> > > >
> > >> > > > Hi,
> > >> > > >
> > >> > > > Currently, it is possible to use the mumps solver in PETSc with
> > >> > > > the -mat_mumps_use_omp_threads option, so that multiple MPI
> > >> > > > processes will transfer the matrix and rhs data to the master
> > >> > > > rank, and then the master rank will call mumps with OpenMP to
> > >> > > > solve the matrix.
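> > >> > > >
> > >> > > > For example (an illustrative command line; it assumes a PETSc
> > >> > > > build with OpenMP support, and the executable name and counts
> > >> > > > are placeholders):
> > >> > > >
> > >> > > >   mpiexec -n 16 ./your_app -pc_type lu \
> > >> > > >     -pc_factor_mat_solver_type mumps \
> > >> > > >     -mat_mumps_use_omp_threads 4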
> > >> > > >
> > >> > > > I wonder if someone can develop a similar option for the
> > >> > > > cusparse solver. Right now, this solver does not work with
> > >> > > > mpiaijcusparse. I think a possible workaround is to transfer all
> > >> > > > the matrix data to one MPI process, and then upload the data to
> > >> > > > the GPU to solve. In this way, one can use the cusparse solver
> > >> > > > for an MPI program.
> > >> > > >
> > >> > > > Chang
> > >> > > > --
> > >> > > > Chang Liu
> > >> > > > Staff Research Physicist
> > >> > > > +1 609 243 3438
> > >> > > > cliu at pppl.gov
> > >> > > > Princeton Plasma Physics Laboratory
> > >> > > > 100 Stellarator Rd, Princeton NJ 08540, USA
> > >> > > >
> > >> > >
> > >> > > --
> > >> > > Chang Liu
> > >> > > Staff Research Physicist
> > >> > > +1 609 243 3438
> > >> > > cliu at pppl.gov
> > >> > > Princeton Plasma Physics Laboratory
> > >> > > 100 Stellarator Rd, Princeton NJ 08540, USA
> > >> > >
> > >> >
> > >> > --
> > >> > Chang Liu
> > >> > Staff Research Physicist
> > >> > +1 609 243 3438
> > >> > cliu at pppl.gov
> > >> > Princeton Plasma Physics Laboratory
> > >> > 100 Stellarator Rd, Princeton NJ 08540, USA
> > >> >
> > >> --
> > >> Chang Liu
> > >> Staff Research Physicist
> > >> +1 609 243 3438
> > >> cliu at pppl.gov
> > >> Princeton Plasma Physics Laboratory
> > >> 100 Stellarator Rd, Princeton NJ 08540, USA
> > >
> > > --
> > > Chang Liu
> > > Staff Research Physicist
> > > +1 609 243 3438
> > > cliu at pppl.gov
> > > Princeton Plasma Physics Laboratory
> > > 100 Stellarator Rd, Princeton NJ 08540, USA
> >
>
> --
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
> cliu at pppl.gov
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA
>