[petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
Barry Smith
bsmith at petsc.dev
Wed Oct 13 12:53:27 CDT 2021
Chang,
You are correct: there are no MPI + GPU direct solvers that I am aware of that currently do the triangular solves with MPI + GPU parallelism. You are limited to having each individual triangular solve done on a single GPU. I can only suggest making each subdomain as big as possible to utilize each GPU as much as possible for the direct triangular solves.
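For example, a minimal sketch, assuming 4 GPUs (the executable name and the counts are placeholders, not from an actual run): run one MPI rank per GPU so that each block-Jacobi subdomain is as large as possible, e.g.

   mpiexec -n 4 ./your_app -ksp_type fgmres -mat_type aijcusparse \
     -pc_type bjacobi -pc_bjacobi_blocks 4 \
     -sub_ksp_type preonly -sub_pc_type lu \
     -sub_pc_factor_mat_solver_type cusparse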
Barry
> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users <petsc-users at mcs.anl.gov> wrote:
>
> Hi Mark,
>
> '-mat_type aijcusparse' works (giving an mpiaijcusparse matrix) with other solvers, but with -pc_factor_mat_solver_type cusparse it gives an error.
>
> Yes, what I want is to have mumps or superlu do the factorization, and then do the rest, including the GMRES solve, on the GPU. Is that possible?
>
> I have tried aijcusparse with superlu_dist; it runs, but the iterative solver still runs on the CPU. I contacted the superlu group and they confirmed that this is the case right now. But if I set -pc_factor_mat_solver_type cusparse, it seems that the iterative solver runs on the GPU.
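>
> For reference, a sketch of the two configurations I mean (the option sets are simplified here; inside a block-Jacobi setup the factor option becomes -sub_pc_factor_mat_solver_type):
>
>   -mat_type aijcusparse -pc_type lu -pc_factor_mat_solver_type superlu_dist
>       (runs, but the iterative solver stays on the CPU)
>   -mat_type aijcusparse -pc_type lu -pc_factor_mat_solver_type cusparse
>       (the iterative solver appears to run on the GPU)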
>
> Chang
>
> On 10/13/21 12:03 PM, Mark Adams wrote:
>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu <cliu at pppl.gov> wrote:
>> Thank you Junchao for explaining this. I guess in my case the code is
>> just calling a seq solver like superlu to do the factorization on GPUs.
>> My idea is that I want to have a traditional MPI code utilize GPUs
>> with cusparse. Right now cusparse does not support the mpiaij matrix,
>>
>> Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse
>> matrix with > 1 processes. (-mat_type mpiaijcusparse might also work
>> with > 1 proc.)
>>
>> However, I see in grepping the repo that all the mumps and superlu
>> tests use the aij or sell matrix type. MUMPS and SuperLU provide their
>> own solves, I assume ... but you might want to do other matrix
>> operations on the GPU. Is that the issue?
>>
>> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU and have a
>> problem? (There is no test with it, so it probably does not work.)
>>
>> Thanks,
>> Mark
>>
>> so I want the code to have an mpiaij matrix when adding all the matrix
>> terms, and then transform the matrix to seqaij when doing the
>> factorization and solve. This involves sending the data to the master
>> process, and I think the petsc mumps solver has something similar
>> already.
>>
>> Chang
>> On 10/13/21 10:18 AM, Junchao Zhang wrote:
>> >
>> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams <mfadams at lbl.gov> wrote:
>> >
>> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu <cliu at pppl.gov> wrote:
>> >
>> > Hi Mark,
>> >
>> > The option I use is like
>> >
>> > -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres
>> > -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse
>> > -sub_ksp_type preonly -sub_pc_type lu
>> > -ksp_max_it 2000 -ksp_rtol 1.e-300 -ksp_atol 1.e-300
>> >
>> >
>> > Note, if you use -log_view, the last column (the rows are the
>> > methods, like MatFactorNumeric) has the percent of the work done on
>> > the GPU.
>> >
>> > Junchao: *This* implies that we have a cuSparse LU factorization.
>> > Is that correct? (I don't think we do.)
>> >
>> > No, we don't have cuSparse LU factorization. If you check
>> > MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls
>> > MatLUFactorSymbolic_SeqAIJ() instead.
>> > So I don't understand Chang's idea. Do you want to make bigger
>> > blocks?
>> >
>> > I think this one does both the factorization and the solve on the GPU.
>> >
>> > You can check the runex72_aijcusparse.sh file in the petsc install
>> > directory, and try it yourself (this is only the lu factorization,
>> > without the iterative solve).
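>> >
>> > (For instance, something along these lines; the exact location
>> > depends on your install and arch, so this is just a sketch:
>> >   find $PETSC_DIR -name runex72_aijcusparse.sh
>> > and then run the script it finds with bash.)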
>> >
>> > Chang
>> >
>> > On 10/12/21 1:17 PM, Mark Adams wrote:
>> > >
>> > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu <cliu at pppl.gov> wrote:
>> > >
>> > > Hi Junchao,
>> > >
>> > > No, I only need it to be transferred within a node. I use the
>> > > block-Jacobi method and GMRES to solve the sparse matrix, so each
>> > > direct solver will take care of a sub-block of the whole matrix.
>> > > In this way, I can use one GPU to solve one sub-block, which is
>> > > stored within one node.
>> > >
>> > > It was stated in the documentation that the cusparse solver is
>> > > slow. However, in my test using ex72.c, the cusparse solver is
>> > > faster than mumps or superlu_dist on CPUs.
>> > >
>> > > Are we talking about the factorization, the solve, or both?
>> > >
>> > > We do not have an interface to cuSparse's LU factorization (I
>> > > just learned that it exists a few weeks ago).
>> > > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type
>> > > aijcusparse'? This would be the CPU factorization, which is the
>> > > dominant cost.
>> > >
>> > >
>> > > Chang
>> > >
>> > > On 10/12/21 10:24 AM, Junchao Zhang wrote:
>> > > > Hi, Chang,
>> > > > For the mumps solver, we usually transfer the matrix and vector
>> > > > data within a compute node. For the idea you propose, it looks
>> > > > like we need to gather data within MPI_COMM_WORLD, right?
>> > > >
>> > > > Mark, I remember you said the cusparse solve is slow and you
>> > > > would rather do it on the CPU. Is that right?
>> > > >
>> > > > --Junchao Zhang
>> > > >
>> > > >
>> > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users
>> > > > <petsc-users at mcs.anl.gov> wrote:
>> > > >
>> > > > Hi,
>> > > >
>> > > > Currently, it is possible to use the mumps solver in PETSc with
>> > > > the -mat_mumps_use_omp_threads option, so that multiple MPI
>> > > > processes will transfer the matrix and rhs data to the master
>> > > > rank, and then the master rank will call mumps with OpenMP to
>> > > > solve the matrix.
>> > > >
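>> > > > For illustration, a minimal sketch of that path (the executable
>> > > > name, rank count, and thread count are placeholders, not from an
>> > > > actual run):
>> > > >
>> > > >   mpiexec -n 16 ./app -ksp_type preonly -pc_type lu \
>> > > >     -pc_factor_mat_solver_type mumps -mat_mumps_use_omp_threads 4
>> > > >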
>> > > > I wonder if someone can develop a similar option for the
>> > > > cusparse solver. Right now, this solver does not work with
>> > > > mpiaijcusparse. I think a possible workaround is to transfer all
>> > > > the matrix data to one MPI process, and then upload the data to
>> > > > the GPU to solve. In this way, one can use the cusparse solver
>> > > > for an MPI program.
>> > > >
>> > > > Chang
>
> --
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
> cliu at pppl.gov
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA