[petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver

Chang Liu cliu at pppl.gov
Wed Oct 13 20:04:30 CDT 2021


Hi Mark,

Thank you for sharing this. I totally agree that factorization and 
triangular solve can be slow on GPUs. However, I also find that other 
operations, such as matrix-matrix multiplication, can be very fast on 
GPUs, so some iterative solvers may perform well there, depending on 
the density and structure of the matrix.

In my tests, I found that the GPU can sometimes give a 2-3x speedup 
for GMRES.
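For example, this is roughly the kind of comparison I ran (just a 
sketch; the executable name and block count are placeholders, and the 
options are the ones discussed further down in this thread):

   # CPU-only block-Jacobi + fGMRES
   mpirun -n 16 ./your_app -ksp_type fgmres -pc_type bjacobi \
       -pc_bjacobi_blocks 16 -sub_ksp_type preonly -sub_pc_type lu

   # same solver configuration, but with matrices/vectors on the GPU
   # and the sub-block factorization/solve done through cusparse
   mpirun -n 16 ./your_app -ksp_type fgmres -pc_type bjacobi \
       -pc_bjacobi_blocks 16 -sub_ksp_type preonly -sub_pc_type lu \
       -mat_type aijcusparse -vec_type cuda \
       -sub_pc_factor_mat_solver_type cusparse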

Also, I think the SuperLU group has recently made significant progress 
on porting their code to GPUs, with impressive speedups (not yet 
published).

Chang

On 10/13/21 8:29 PM, Mark Adams wrote:
> 
> 
> On Wed, Oct 13, 2021 at 1:53 PM Barry Smith <bsmith at petsc.dev> wrote:
> 
> 
>        Chang,
> 
>          You are correct: there are no MPI + GPU direct solvers that I
>     am aware of that currently do the triangular solves with MPI + GPU
>     parallelism.
> 
> 
> So SuperLU and MUMPS do MPI solves on the CPU. That is reasonable. I 
> have not been able to get decent performance with GPU solves. Complex 
> code with low arithmetic intensity is not a good fit for GPUs: little 
> work and all latency.
> 
> Chang, you would find that GPU solves suck and, anyway, machines these 
> days are configured with significant (high-quality) CPU resources. I 
> think you would find that you can't get GPU solves to beat CPU solves, 
> except perhaps for enormous problems.
> 
>     You are limited in that each individual triangular solve must be
>     done on a single GPU. I can only suggest making each subdomain as
>     big as possible, to utilize each GPU as much as possible for the
>     direct triangular solves.
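
[CL: understood. If I read this right, with block Jacobi that would 
mean matching the number of blocks to the number of GPUs, e.g. with 4 
GPUs something like (just an illustration, untested):

   -pc_type bjacobi -pc_bjacobi_blocks 4 -sub_ksp_type preonly \
       -sub_pc_type lu -sub_pc_factor_mat_solver_type cusparse

so that each GPU gets one large subdomain to factor and solve.]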
> 
>         Barry
> 
> 
>      > On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users
>      > <petsc-users at mcs.anl.gov> wrote:
>      >
>      > Hi Mark,
>      >
>      > '-mat_type aijcusparse' works with mpiaijcusparse with other
>      > solvers, but with -pc_factor_mat_solver_type cusparse it gives
>      > an error.
>      >
>      > Yes, what I want is to have mumps or superlu do the
>      > factorization, and then do the rest, including the GMRES solve,
>      > on the GPU. Is that possible?
>      >
>      > I have tried aijcusparse with superlu_dist; it runs, but the
>      > iterative solver still runs on the CPU. I have contacted the
>      > superlu group and they confirmed that this is the case right
>      > now. But if I set -pc_factor_mat_solver_type cusparse, it seems
>      > that the iterative solver runs on the GPU.
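
[CL: to be concrete, the combinations I tried (with more than one MPI 
rank) were roughly:

   # runs, but the iterative solve still runs on the CPU
   -mat_type aijcusparse -pc_type lu -pc_factor_mat_solver_type superlu_dist

   # gives an error for me in parallel
   -mat_type aijcusparse -pc_type lu -pc_factor_mat_solver_type cusparse

(the executable and the remaining options are omitted here).]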
>      >
>      > Chang
>      >
>      > On 10/13/21 12:03 PM, Mark Adams wrote:
>      >> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu <cliu at pppl.gov>
>      >> wrote:
>      >>    Thank you Junchao for explaining this. I guess in my case the
>     code is
>      >>    just calling a seq solver like superlu to do factorization on
>     GPUs.
>      >>    My idea is that I want to have a traditional MPI code to
>     utilize GPUs
>      >>    with cusparse. Right now cusparse does not support mpiaij
>      >>    matrix,
>      >> Sure it does: '-mat_type aijcusparse' will give you an
>      >> mpiaijcusparse matrix with > 1 processes.
>      >> (-mat_type mpiaijcusparse might also work with > 1 proc.)
>      >> However, I see from grepping the repo that all the mumps and
>      >> superlu tests use the aij or sell matrix type.
>      >> MUMPS and SuperLU provide their own solves, I assume ... but
>      >> you might want to do other matrix operations on the GPU. Is
>      >> that the issue?
>      >> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU?
>      >> Did it have a problem? (There is no test with it, so it
>      >> probably does not work.)
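
[CL: for reference, I read this suggestion as something like the 
following (untested sketch, placeholder executable):

   mpirun -n 16 ./your_app -ksp_type gmres -pc_type lu \
       -pc_factor_mat_solver_type mumps \
       -mat_type aijcusparse -vec_type cuda

i.e. MUMPS (or superlu_dist) does the factorization and triangular 
solves on the CPU, while the Krylov-side matrix and vector operations 
use the cusparse/CUDA types.]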
>      >> Thanks,
>      >> Mark
>      >>    so I want the code to have an mpiaij matrix when adding all
>      >>    the matrix terms, and then transform the matrix to seqaij
>      >>    when doing the factorization and solve. This involves
>      >>    sending the data to the master process, and I think the
>      >>    petsc mumps solver has something similar already.
>      >>    Chang
>      >>    On 10/13/21 10:18 AM, Junchao Zhang wrote:
>      >>     >
>      >>     >
>      >>     >
>      >>     > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams
>      >>     > <mfadams at lbl.gov> wrote:
>      >>     >
>      >>     >
>      >>     >
>      >>     >     On Tue, Oct 12, 2021 at 1:45 PM Chang Liu
>      >>     >     <cliu at pppl.gov> wrote:
>      >>     >
>      >>     >         Hi Mark,
>      >>     >
>      >>     >         The option I use is like
>      >>     >
>      >>     >         -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres
>      >>     >         -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse
>      >>     >         -sub_ksp_type preonly -sub_pc_type lu -ksp_max_it 2000
>      >>     >         -ksp_rtol 1.e-300 -ksp_atol 1.e-300
>      >>     >
>      >>     >
>      >>     >     Note, if you use -log_view, the last column (the rows
>      >>     >     are methods like MatFactorNumeric) shows the
>      >>     >     percentage of the work done on the GPU.
>      >>     >
>      >>     >     Junchao: *This* implies that we have a cuSparse LU
>      >>    factorization. Is
>      >>     >     that correct? (I don't think we do)
>      >>     >
>      >>     > No, we don't have a cuSparse LU factorization.  If you check
>      >>     > MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls
>      >>     > MatLUFactorSymbolic_SeqAIJ() instead.
>      >>     > So I don't understand Chang's idea. Do you want to make
>      >>     > bigger blocks?
>      >>     >
>      >>     >
>      >>     >         I think this one does both factorization and
>      >>     >         solve on the GPU.
>      >>     >
>      >>     >         You can check the runex72_aijcusparse.sh file in
>      >>     >         the petsc install directory and try it yourself
>      >>     >         (this is only the lu factorization, without the
>      >>     >         iterative solve).
>      >>     >
>      >>     >         Chang
>      >>     >
>      >>     >         On 10/12/21 1:17 PM, Mark Adams wrote:
>      >>     >          >
>      >>     >          >
>      >>     >          > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu
>      >>     >          > <cliu at pppl.gov> wrote:
>      >>     >          >
>      >>     >          >     Hi Junchao,
>      >>     >          >
>      >>     >          >     No, I only need it to be transferred
>      >>     >          >     within a node. I use the block-Jacobi
>      >>     >          >     method and GMRES to solve the sparse
>      >>     >          >     matrix, so each direct solver takes care
>      >>     >          >     of a sub-block of the whole matrix. In
>      >>     >          >     this way, I can use one GPU to solve one
>      >>     >          >     sub-block, which is stored within one
>      >>     >          >     node.
>      >>     >          >
>      >>     >          >     It was stated in the documentation that
>      >>     >          >     the cusparse solver is slow. However, in
>      >>     >          >     my test using ex72.c, the cusparse solver
>      >>     >          >     is faster than mumps or superlu_dist on
>      >>     >          >     CPUs.
>      >>     >          >
>      >>     >          >
>      >>     >          > Are we talking about the factorization, the
>      >>     >          > solve, or both?
>      >>     >          >
>      >>     >          > We do not have an interface to cuSparse's LU
>      >>     >          > factorization (I just learned that it exists a
>      >>     >          > few weeks ago).
>      >>     >          > Perhaps your fast "cusparse solver" is
>      >>     >          > '-pc_type lu -mat_type aijcusparse'? This would
>      >>     >          > be the CPU factorization, which is the
>      >>     >          > dominant cost.
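
[CL: just to separate the two option sets being discussed, as I 
understand them:

   # PETSc's native LU with the cusparse matrix type (factorization on
   # the CPU, as Mark says above)
   -pc_type lu -mat_type aijcusparse

   # what I actually used: cusparse as the factor package for the
   # block-Jacobi sub-solves
   -pc_type bjacobi -sub_ksp_type preonly -sub_pc_type lu \
       -sub_pc_factor_mat_solver_type cusparse -mat_type aijcusparse

The remaining options are as in my earlier message.]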
>      >>     >          >
>      >>     >          >
>      >>     >          >     Chang
>      >>     >          >
>      >>     >          >     On 10/12/21 10:24 AM, Junchao Zhang wrote:
>      >>     >          >      > Hi, Chang,
>      >>     >          >      >     For the mumps solver, we usually
>      >>     >          >      > transfer matrix and vector data within
>      >>     >          >      > a compute node.  For the idea you
>      >>     >          >      > propose, it looks like we need to
>      >>     >          >      > gather data within MPI_COMM_WORLD,
>      >>     >          >      > right?
>      >>     >          >      >
>      >>     >          >      >     Mark, I remember you said the
>      >>     >          >      > cusparse solve is slow and you would
>      >>     >          >      > rather do it on the CPU. Is that right?
>      >>     >          >      >
>      >>     >          >      > --Junchao Zhang
>      >>     >          >      >
>      >>     >          >      >
>      >>     >          >      > On Mon, Oct 11, 2021 at 10:25 PM Chang
>      >>     >          >      > Liu via petsc-users
>      >>     >          >      > <petsc-users at mcs.anl.gov> wrote:
>      >>     >          >      >
>      >>     >          >      >     Hi,
>      >>     >          >      >
>      >>     >          >      >     Currently, it is possible to use
>      >>     >          >      >     the mumps solver in PETSC with the
>      >>     >          >      >     -mat_mumps_use_omp_threads option,
>      >>     >          >      >     so that multiple MPI processes will
>      >>     >          >      >     transfer the matrix and rhs data to
>      >>     >          >      >     the master rank, and then the
>      >>     >          >      >     master rank will call mumps with
>      >>     >          >      >     OpenMP to solve the matrix.
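
[CL: concretely, the mumps mode I am referring to is enabled with 
something like (placeholder executable and thread count):

   mpirun -n 16 ./your_app -ksp_type preonly -pc_type lu \
       -pc_factor_mat_solver_type mumps -mat_mumps_use_omp_threads 4

where the matrix data is gathered onto fewer ranks and mumps then uses 
OpenMP threads on those ranks.]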
>      >>     >          >      >
>      >>     >          >      >     I wonder if someone can develop a
>      >>     >          >      >     similar option for the cusparse
>      >>     >          >      >     solver. Right now, this solver does
>      >>     >          >      >     not work with mpiaijcusparse. I
>      >>     >          >      >     think a possible workaround is to
>      >>     >          >      >     transfer all the matrix data to one
>      >>     >          >      >     MPI process, and then upload the
>      >>     >          >      >     data to the GPU to solve. In this
>      >>     >          >      >     way, one can use the cusparse
>      >>     >          >      >     solver in an MPI program.
>      >>     >          >      >
>      >>     >          >      >     Chang
>      >>     >          >      >     --
>      >>     >          >      >     Chang Liu
>      >>     >          >      >     Staff Research Physicist
>      >>     >          >      >     +1 609 243 3438
>      >>     >          >      > cliu at pppl.gov
>      >>     >          >      >     Princeton Plasma Physics Laboratory
>      >>     >          >      >     100 Stellarator Rd, Princeton NJ 08540, USA
>      >>     >          >      >
>      >>     >          >
>      >>     >          >     --
>      >>     >          >     Chang Liu
>      >>     >          >     Staff Research Physicist
>      >>     >          >     +1 609 243 3438
>      >>     >          > cliu at pppl.gov
>      >>     >          >     Princeton Plasma Physics Laboratory
>      >>     >          >     100 Stellarator Rd, Princeton NJ 08540, USA
>      >>     >          >
>      >>     >
>      >>     >         --
>      >>     >         Chang Liu
>      >>     >         Staff Research Physicist
>      >>     >         +1 609 243 3438
>      >>     > cliu at pppl.gov
>      >>     >         Princeton Plasma Physics Laboratory
>      >>     >         100 Stellarator Rd, Princeton NJ 08540, USA
>      >>     >
>      >>    --
>      >>    Chang Liu
>      >>    Staff Research Physicist
>      >>    +1 609 243 3438
>      >> cliu at pppl.gov
>      >>    Princeton Plasma Physics Laboratory
>      >>    100 Stellarator Rd, Princeton NJ 08540, USA
>      >
>      > --
>      > Chang Liu
>      > Staff Research Physicist
>      > +1 609 243 3438
>      > cliu at pppl.gov
>      > Princeton Plasma Physics Laboratory
>      > 100 Stellarator Rd, Princeton NJ 08540, USA
> 

-- 
Chang Liu
Staff Research Physicist
+1 609 243 3438
cliu at pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA

