[petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver

Mark Adams mfadams at lbl.gov
Wed Oct 13 19:29:41 CDT 2021


On Wed, Oct 13, 2021 at 1:53 PM Barry Smith <bsmith at petsc.dev> wrote:

>
>   Chang,
>
>     You are correct, there are no MPI + GPU direct solvers that I am aware
> of that currently do the triangular solves with MPI + GPU parallelism.


So SuperLU and MUMPS do the MPI solves on the CPU. That is reasonable. I
have not been able to get decent performance with GPU solves. Complex code
with low arithmetic intensity (AI) is not a good fit for GPUs: little work
and all latency.

Chang, you will find that GPU solves perform poorly and, anyway, machines
these days are configured with significant (high-quality) CPU resources. I
think you will find that you can't get GPU solves to beat CPU solves,
except perhaps for enormous problems.


> You are limited in that each individual triangular solve must be done on a
> single GPU. I can only suggest making each subdomain as big as possible to
> utilize each GPU as much as possible for the direct triangular solves.
>
>    Barry
>
>
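
To make that concrete: with block Jacobi you can choose one block per GPU so
that each subdomain, and hence each triangular solve, is as large as
possible. A minimal sketch, assuming 4 ranks driving 4 GPUs and reusing the
ex72-style options quoted later in this thread (the executable name and the
counts are placeholders):

    mpiexec -n 4 ./ex72 -mat_type aijcusparse -vec_type cuda \
      -ksp_type fgmres -pc_type bjacobi -pc_bjacobi_blocks 4 \
      -sub_ksp_type preonly -sub_pc_type lu \
      -sub_pc_factor_mat_solver_type cusparse
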
> > On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users
> > <petsc-users at mcs.anl.gov> wrote:
> >
> > Hi Mark,
> >
> > '-mat_type aijcusparse' works with mpiaijcusparse with other solvers,
> > but with -pc_factor_mat_solver_type cusparse it will give an error.
> >
> > Yes, what I want is to have mumps or superlu do the factorization, and
> > then do the rest, including the GMRES solve, on the GPU. Is that possible?
> >
> > I have tried to use aijcusparse with superlu_dist; it runs, but the
> > iterative solver is still running on CPUs. I have contacted the superlu
> > group and they confirmed that this is the case right now. But if I set
> > -pc_factor_mat_solver_type cusparse, it seems that the iterative solver
> > is running on the GPU.
> >
> > Chang
> >
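
For reference, the combination Chang describes, a CPU direct solver for the
factorization with the Krylov method working on GPU matrix and vector types,
would be requested with options roughly like the sketch below (./app is a
placeholder; whether the GMRES work actually runs on the GPU depends on the
solver package, as Chang found with superlu_dist):

    mpiexec -n 4 ./app -ksp_type gmres -pc_type lu \
      -pc_factor_mat_solver_type mumps \
      -mat_type aijcusparse -vec_type cuda
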
> > On 10/13/21 12:03 PM, Mark Adams wrote:
> >> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu <cliu at pppl.gov> wrote:
> >>    Thank you Junchao for explaining this. I guess in my case the code is
> >>    just calling a seq solver like superlu to do factorization on GPUs.
> >>    My idea is that I want to have a traditional MPI code to utilize GPUs
> >>    with cusparse. Right now cusparse does not support mpiaij matrix,
> Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse
> matrix with > 1 processes.
> >> (-mat_type mpiaijcusparse might also work with >1 proc).
> >> However, I see in grepping the repo that all the mumps and superlu
> >> tests use the aij or sell matrix type.
> >> MUMPS and SuperLU provide their own solves, I assume ... but you might
> >> want to do other matrix operations on the GPU. Is that the issue?
> >> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU? Did it
> >> have a problem? (There is no test with it, so it probably does not work.)
> >> Thanks,
> >> Mark
> >>    so I want the code to have an mpiaij matrix when adding all the matrix
> >>    terms, and then transform the matrix to seqaij when doing the
> >>    factorization and solve. This involves sending the data to the master
> >>    process, and I think the petsc mumps solver has something similar
> >>    already.
> >>    Chang
> >>    On 10/13/21 10:18 AM, Junchao Zhang wrote:
> >>     >
> >>     >
> >>     >
> >>     > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams <mfadams at lbl.gov> wrote:
> >>     >
> >>     >     On Tue, Oct 12, 2021 at 1:45 PM Chang Liu <cliu at pppl.gov> wrote:
> >>     >
> >>     >         Hi Mark,
> >>     >
> >>     >         The option I use is like
> >>     >
> >>     >         -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres
> >>     >         -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse
> >>     >         -sub_ksp_type preonly -sub_pc_type lu -ksp_max_it 2000
> >>     >         -ksp_rtol 1.e-300 -ksp_atol 1.e-300
> >>     >
> >>     >
> >>     >     Note: if you use -log_view, the last column (rows are the
> >>     >     methods, like MatFactorNumeric) has the percentage of work
> >>     >     on the GPU.
> >>     >
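
For example, a run like the one Chang quotes, with logging enabled, might
look like the following sketch (the executable name and block count are
placeholders); the GPU column of rows such as MatLUFactorNum and MatSolve
then shows where that work ran:

    mpiexec -n 16 ./ex72 -pc_type bjacobi -pc_bjacobi_blocks 16 \
      -ksp_type fgmres -mat_type aijcusparse -sub_ksp_type preonly \
      -sub_pc_type lu -sub_pc_factor_mat_solver_type cusparse -log_view
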
> >>     >     Junchao: *This* implies that we have a cuSparse LU
> >>     >     factorization. Is that correct? (I don't think we do.)
> >>     >
> >>     > No, we don't have a cuSparse LU factorization. If you check
> >>     > MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls
> >>     > MatLUFactorSymbolic_SeqAIJ() instead.
> >>     > So I don't understand Chang's idea. Do you want to make bigger
> >>     > blocks?
> >>     >
> >>     >
> >>     >         I think this one does both the factorization and the solve
> >>     >         on the GPU.
> >>     >
> >>     >         You can check the runex72_aijcusparse.sh file in the petsc
> >>     >         install directory and try it yourself (this is only the LU
> >>     >         factorization, without an iterative solve).
> >>     >
> >>     >         Chang
> >>     >
> >>     >         On 10/12/21 1:17 PM, Mark Adams wrote:
> >>     >          >
> >>     >          >
> >>     >          > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu
> >>     >          > <cliu at pppl.gov> wrote:
> >>     >          >
> >>     >          >     Hi Junchao,
> >>     >          >
> >>     >          >     No, I only need it to be transferred within a node.
> >>     >          >     I use the block-Jacobi method and GMRES to solve the
> >>     >          >     sparse matrix, so each direct solver will take care
> >>     >          >     of a sub-block of the whole matrix. In this way, I
> >>     >          >     can use one GPU to solve one sub-block, which is
> >>     >          >     stored within one node.
> >>     >          >
> >>     >          >     It was stated in the documentation that the cusparse
> >>     >          >     solver is slow. However, in my test using ex72.c,
> >>     >          >     the cusparse solver is faster than mumps or
> >>     >          >     superlu_dist on CPUs.
> >>     >          >
> >>     >          >
> >>     >          > Are we talking about the factorization, the solve, or
> >>     >          > both?
> >>     >          >
> >>     >          > We do not have an interface to cuSparse's LU
> >>     >          > factorization (I just learned that it exists a few
> >>     >          > weeks ago).
> >>     >          > Perhaps your fast "cusparse solver" is '-pc_type lu
> >>     >          > -mat_type aijcusparse'? This would be the CPU
> >>     >          > factorization, which is the dominant cost.
> >>     >          >
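
In other words, the two configurations being compared are roughly (a sketch,
restating what is said elsewhere in this thread):

    # what Mark suspects above: the petsc (CPU) LU factorization,
    # with the matrix stored as aijcusparse
    -pc_type lu -mat_type aijcusparse

    # the cusparse solver type: the factorization is still done by the
    # SeqAIJ code on the CPU (per Junchao), the solves by the cusparse solver
    -pc_type lu -mat_type aijcusparse -pc_factor_mat_solver_type cusparse
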
> >>     >          >
> >>     >          >     Chang
> >>     >          >
> >>     >          >     On 10/12/21 10:24 AM, Junchao Zhang wrote:
> >>     >          >      > Hi, Chang,
> >>     >          >      >     For the mumps solver, we usually transfer
> >>     >          >      > matrix and vector data within a compute node.
> >>     >          >      > For the idea you propose, it looks like we need
> >>     >          >      > to gather data within MPI_COMM_WORLD, right?
> >>     >          >      >
> >>     >          >      >     Mark, I remember you said the cusparse solve
> >>     >          >      > is slow and you would rather do it on the CPU.
> >>     >          >      > Is that right?
> >>     >          >      >
> >>     >          >      > --Junchao Zhang
> >>     >          >      >
> >>     >          >      >
> >>     >          >      > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via
> >>     >          >      > petsc-users <petsc-users at mcs.anl.gov> wrote:
> >>     >          >      >
> >>     >          >      >     Hi,
> >>     >          >      >
> >>     >          >      >     Currently, it is possible to use the mumps
> >>     >          >      >     solver in PETSc with the
> >>     >          >      >     -mat_mumps_use_omp_threads option, so that
> >>     >          >      >     multiple MPI processes will transfer the
> >>     >          >      >     matrix and rhs data to the master rank, and
> >>     >          >      >     then the master rank will call mumps with
> >>     >          >      >     OpenMP to solve the matrix.
> >>     >          >      >
> >>     >          >      >     I wonder if someone can develop a similar
> >>     >          >      >     option for the cusparse solver. Right now,
> >>     >          >      >     this solver does not work with
> >>     >          >      >     mpiaijcusparse. I think a possible workaround
> >>     >          >      >     is to transfer all the matrix data to one MPI
> >>     >          >      >     process, and then upload the data to the GPU
> >>     >          >      >     to solve. In this way, one can use the
> >>     >          >      >     cusparse solver for an MPI program.
> >>     >          >      >
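
For context, the existing MUMPS path referred to here is selected with
runtime options along these lines (a sketch; ./app, the rank count, and the
thread count are placeholders, and PETSc must be configured with OpenMP for
the option to take effect):

    mpiexec -n 16 ./app -ksp_type preonly -pc_type lu \
      -pc_factor_mat_solver_type mumps -mat_mumps_use_omp_threads 4
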
> >>     >          >      >     Chang
> >>     >          >      >     --
> >>     >          >      >     Chang Liu
> >>     >          >      >     Staff Research Physicist
> >>     >          >      >     +1 609 243 3438
> >>     >          >      > cliu at pppl.gov
> >>     >          >      >     Princeton Plasma Physics Laboratory
> >>     >          >      >     100 Stellarator Rd, Princeton NJ 08540, USA
> >>     >          >      >
> >>     >          >
> >>     >          >     --
> >>     >          >     Chang Liu
> >>     >          >     Staff Research Physicist
> >>     >          >     +1 609 243 3438
> >>     >          > cliu at pppl.gov
> >>     >          >     Princeton Plasma Physics Laboratory
> >>     >          >     100 Stellarator Rd, Princeton NJ 08540, USA
> >>     >          >
> >>     >
> >>     >         --
> >>     >         Chang Liu
> >>     >         Staff Research Physicist
> >>     >         +1 609 243 3438
> >>     > cliu at pppl.gov
> >>     >         Princeton Plasma Physics Laboratory
> >>     >         100 Stellarator Rd, Princeton NJ 08540, USA
> >>     >
> >>    --
> >>    Chang Liu
> >>    Staff Research Physicist
> >>    +1 609 243 3438
> >>    cliu at pppl.gov
> >>    Princeton Plasma Physics Laboratory
> >>    100 Stellarator Rd, Princeton NJ 08540, USA
> >
> > --
> > Chang Liu
> > Staff Research Physicist
> > +1 609 243 3438
> > cliu at pppl.gov
> > Princeton Plasma Physics Laboratory
> > 100 Stellarator Rd, Princeton NJ 08540, USA
>
>

