[petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver

Thu Oct 14 16:15:56 CDT 2021

  You need to use PCTELESCOPE inside the block Jacobi, not outside it. Something like: -pc_type bjacobi -sub_pc_type telescope -sub_telescope_pc_type lu
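
As a rough illustration of how the nested option prefixes and the rank counts fit together, here is a minimal Python sketch. It only assembles option strings that already appear in this thread and does the arithmetic on ranks per block. The helper name `bjacobi_telescope_options` and its parameters are hypothetical, it assumes the ranks divide evenly among the blocks, and it does not call PETSc.

```python
def bjacobi_telescope_options(n_ranks, n_blocks):
    """Build PETSc options for block Jacobi with PCTELESCOPE on each block.

    Hypothetical helper: it only assembles option strings quoted in this
    thread. Assumes n_ranks is a multiple of n_blocks.
    """
    if n_blocks <= 0 or n_ranks % n_blocks != 0:
        raise ValueError("n_ranks must be a positive multiple of n_blocks")
    # Each block lives on this many MPI ranks. Using that number as the
    # telescope reduction factor leaves one rank per block.
    ranks_per_block = n_ranks // n_blocks
    return [
        "-mat_type", "aijcusparse",
        "-ksp_type", "fgmres",
        "-pc_type", "bjacobi",
        "-pc_bjacobi_blocks", str(n_blocks),
        # Options for the solver on each block carry the sub_ prefix.
        "-sub_ksp_type", "preonly",
        "-sub_pc_type", "telescope",
        "-sub_pc_telescope_reduction_factor", str(ranks_per_block),
        # Options for the solver inside PCTELESCOPE add telescope_.
        "-sub_telescope_ksp_type", "preonly",
        "-sub_telescope_pc_type", "lu",
        "-sub_telescope_pc_factor_mat_solver_type", "cusparse",
    ]


if __name__ == "__main__":
    opts = bjacobi_telescope_options(n_ranks=16, n_blocks=4)
    print("mpiexec -n 16 ./ex7 " + " ".join(opts))
```

For 16 ranks and 4 blocks this gives a reduction factor of 4, the same value used in the 16-rank command quoted below.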

> On Oct 14, 2021, at 4:14 PM, Chang Liu <cliu at pppl.gov> wrote:
> 
> Hi Pierre,
> 
> I wonder whether the PCTELESCOPE trick works only as a preconditioner and not for the solver. I ran some tests and found that, for a small matrix solved with -telescope_ksp_type preonly, it does work on the GPU with multiple MPI processes. However, with bjacobi and gmres it does not work.
> 
> The command-line options I used for the small matrix were:
> 
> mpiexec -n 4 --oversubscribe ./ex7 -m 100 -ksp_monitor_short -pc_type telescope -mat_type aijcusparse -telescope_pc_type lu -telescope_pc_factor_mat_solver_type cusparse -telescope_ksp_type preonly -pc_telescope_reduction_factor 4
> 
> which gives the correct output. For the iterative solver, I tried
> 
> mpiexec -n 16 --oversubscribe ./ex7 -m 400 -ksp_monitor_short -pc_type bjacobi -pc_bjacobi_blocks 4 -ksp_type fgmres -mat_type aijcusparse -sub_pc_type telescope -sub_ksp_type preonly -sub_telescope_ksp_type preonly -sub_telescope_pc_type lu -sub_telescope_pc_factor_mat_solver_type cusparse -sub_pc_telescope_reduction_factor 4 -ksp_max_it 2000 -ksp_rtol 1.e-9 -ksp_atol 1.e-20
> 
> for the large matrix. The output is:
> 
>  0 KSP Residual norm 40.1497
>  1 KSP Residual norm < 1.e-11
> Norm of error 400.999 iterations 1
> 
> So it seems to call a direct solver instead of an iterative one.
> 
> Can you please help check these options?
> 
> Chang
> 
> On 10/14/21 10:04 AM, Pierre Jolivet wrote:
>>> On 14 Oct 2021, at 3:50 PM, Chang Liu <cliu at pppl.gov> wrote:
>>> 
>>> Thank you Pierre. I was not aware of PCTELESCOPE before. This sounds like exactly what I need. I wonder if PCTELESCOPE can transform an mpiaijcusparse matrix to seqaijcusparse? Or do I have to do it manually?
>> PCTELESCOPE uses MatCreateMPIMatConcatenateSeqMat().
>> 1) I’m not sure this is implemented for cuSparse matrices, but it should be;
>> 2) at least for the implementations MatCreateMPIMatConcatenateSeqMat_MPIBAIJ() and MatCreateMPIMatConcatenateSeqMat_MPIAIJ(), the resulting MatType is MATBAIJ (resp. MATAIJ). Constructors are usually “smart” enough to detect if the MPI communicator on which the Mat lives is of size 1 (your case), and then the resulting Mat is of type MatSeqX instead of MatMPIX, so you would not need to worry about the transformation you are mentioning.
>> If you try this out and this does not work, please provide the backtrace (probably something like “Operation XYZ not implemented for MatType ABC”), and hopefully someone can add the missing plumbing.
>> I do not claim that this will be efficient, but I think this goes in the direction of what you want to achieve.
>> Thanks,
>> Pierre
>>> Chang
>>> 
>>> On 10/14/21 1:35 AM, Pierre Jolivet wrote:
>>>> Maybe I’m missing something, but can’t you use PCTELESCOPE as a subdomain solver, with a reduction factor equal to the number of MPI processes you have per block?
>>>> -sub_pc_type telescope -sub_pc_telescope_reduction_factor X -sub_telescope_pc_type lu
>>>> This does not work with MUMPS -mat_mumps_use_omp_threads because not only does the Mat need to be redistributed, the secondary processes also need to be “converted” to OpenMP threads.
>>>> Thus the need for specific code in mumps.c.
>>>> Thanks,
>>>> Pierre
>>>>> On 14 Oct 2021, at 6:00 AM, Chang Liu via petsc-users <petsc-users at mcs.anl.gov> wrote:
>>>>> 
>>>>> Hi Junchao,
>>>>> 
>>>>> Yes that is what I want.
>>>>> 
>>>>> Chang
>>>>> 
>>>>> On 10/13/21 11:42 PM, Junchao Zhang wrote:
>>>>>> On Wed, Oct 13, 2021 at 8:58 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>>>       Junchao,
>>>>>>          If I understand correctly Chang is using the block Jacobi
>>>>>>    method with a single block for a number of MPI ranks and a direct
>>>>>>    solver for each block so it uses PCSetUp_BJacobi_Multiproc() which
>>>>>>    is code Hong Zhang wrote a number of years ago for CPUs. For their
>>>>>>    particular problems this preconditioner works well, but using an
>>>>>>    iterative solver on the blocks does not work well.
>>>>>>          If we had complete MPI-GPU direct solvers he could just use
>>>>>>    the current code with MPIAIJCUSPARSE on each block, but since we do
>>>>>>    not, he would like to use a single GPU for each block. This means
>>>>>>    that the diagonal blocks of the global parallel MPI matrix need to be
>>>>>>    sent to a subset of the GPUs (one GPU per block, which has multiple
>>>>>>    MPI ranks associated with the block). Similarly, for the triangular
>>>>>>    solves the blocks of the right-hand side need to be shipped to the
>>>>>>    appropriate GPU and the resulting solution shipped back to the
>>>>>>    multiple GPUs. So Chang is absolutely correct, this is somewhat like
>>>>>>    your code for MUMPS with OpenMP. OK, I now understand the background.
>>>>>>    One could use PCSetUp_BJacobi_Multiproc() and get the blocks on the
>>>>>>    MPI ranks and then shrink each block down to a single GPU but this
>>>>>>    would be pretty inefficient, ideally one would go directly from the
>>>>>>    big MPI matrix on all the GPUs to the sub matrices on the subset of
>>>>>>    GPUs. But this may be a large coding project.
>>>>>> I don't understand these sentences. Why do you say "shrink"? In my mind, we just need to move each block (submatrix) living over multiple MPI ranks to one of them and solve directly there. In other words, we keep the blocks' sizes, with no shrinking or expanding.
>>>>>> As mentioned before, cusparse does not provide LU factorization. So the LU factorization would be done on the CPU, and the solve would be done on the GPU. I assume Chang wants to gain from the (potentially) faster solve (rather than the factorization) on the GPU.
>>>>>>       Barry
>>>>>>    Since the matrices being factored and solved directly are relatively
>>>>>>    large it is possible that the cusparse code could be reasonably
>>>>>>    efficient (they are not the tiny problems one gets at the coarse
>>>>>>    level of multigrid). Of course, this is speculation, I don't
>>>>>>    actually know how much better the cusparse code would be on the
>>>>>>    direct solver than a good CPU direct sparse solver.
>>>>>>     > On Oct 13, 2021, at 9:32 PM, Chang Liu <cliu at pppl.gov> wrote:
>>>>>>     >
>>>>>>     > Sorry I am not familiar with the details either. Can you please
>>>>>>    check the code in MatMumpsGatherNonzerosOnMaster in mumps.c?
>>>>>>     >
>>>>>>     > Chang
>>>>>>     >
>>>>>>     > On 10/13/21 9:24 PM, Junchao Zhang wrote:
>>>>>>     >> Hi Chang,
>>>>>>     >>   I did the work in mumps. It is easy for me to understand
>>>>>>    gathering matrix rows to one process.
>>>>>>     >>   But how do you gather blocks (submatrices) to form a large block? Can you draw a picture of that?
>>>>>>     >>   Thanks
>>>>>>     >> --Junchao Zhang
>>>>>>     >> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users <petsc-users at mcs.anl.gov> wrote:
>>>>>>     >>    Hi Barry,
>>>>>>     >>    I think the mumps solver in petsc does support that. You can check the
>>>>>>     >>    documentation on "-mat_mumps_use_omp_threads" at
>>>>>>     >>    https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html
>>>>>>     >>    and the code enclosed by #if defined(PETSC_HAVE_OPENMP_SUPPORT) in
>>>>>>     >>    functions MatMumpsSetUpDistRHSInfo and MatMumpsGatherNonzerosOnMaster in
>>>>>>     >>    mumps.c
>>>>>>     >>    1. I understand it is ideal to do one MPI rank per GPU. However, I am
>>>>>>     >>    working on an existing code that was developed based on MPI, and the
>>>>>>     >>    number of MPI ranks is typically equal to the number of CPU cores. We
>>>>>>     >>    don't want to change the whole structure of the code.
>>>>>>     >>    2. What you have suggested has been coded in mumps.c. See function
>>>>>>     >>    MatMumpsSetUpDistRHSInfo.
>>>>>>     >>    Regards,
>>>>>>     >>    Chang
>>>>>>     >>    On 10/13/21 7:53 PM, Barry Smith wrote:
>>>>>>     >>     >
>>>>>>     >>     >
>>>>>>     >>     >> On Oct 13, 2021, at 3:50 PM, Chang Liu <cliu at pppl.gov> wrote:
>>>>>>     >>     >>
>>>>>>     >>     >> Hi Barry,
>>>>>>     >>     >>
>>>>>>     >>     >> That is exactly what I want.
>>>>>>     >>     >>
>>>>>>     >>     >> Back to my original question, I am looking for an approach to
>>>>>>     >>     >> transfer matrix data from many MPI processes to "master" MPI
>>>>>>     >>     >> processes, each of which takes care of one GPU, and then upload
>>>>>>     >>     >> the data to the GPU to solve.
>>>>>>     >>     >> One can just grab some code from mumps.c for aijcusparse.cu.
>>>>>>     >>     >
>>>>>>     >>     >    mumps.c doesn't actually do that. It never needs to copy the
>>>>>>     >>     >    entire matrix to a single MPI rank.
>>>>>>     >>     >
>>>>>>     >>     >    It would be possible to write the code you suggest, but
>>>>>>     >>     >    it is not clear that it makes sense.
>>>>>>     >>     >
>>>>>>     >>     > 1) For normal PETSc GPU usage there is one GPU per MPI rank, so
>>>>>>     >>     >    while your one GPU per big domain is solving its systems the other
>>>>>>     >>     >    GPUs (with the other MPI ranks that share that domain) are doing
>>>>>>     >>     >    nothing.
>>>>>>     >>     >
>>>>>>     >>     > 2) For each triangular solve you would have to gather the right-hand
>>>>>>     >>     >    side from the multiple ranks to the single GPU to pass it to
>>>>>>     >>     >    the GPU solver and then scatter the resulting solution back to all
>>>>>>     >>     >    of its subdomain ranks.
>>>>>>     >>     >
>>>>>>     >>     >    What I was suggesting was to assign an entire subdomain to a
>>>>>>     >>     >    single MPI rank, thus it does everything on one GPU and can use the
>>>>>>     >>     >    GPU solver directly. If all the major computations of a subdomain
>>>>>>     >>     >    can fit and be done on a single GPU then you would be utilizing all
>>>>>>     >>     >    the GPUs you are using effectively.
>>>>>>     >>     >
>>>>>>     >>     >    Barry
>>>>>>     >>     >
>>>>>>     >>     >
>>>>>>     >>     >
>>>>>>     >>     >>
>>>>>>     >>     >> Chang
>>>>>>     >>     >>
>>>>>>     >>     >> On 10/13/21 1:53 PM, Barry Smith wrote:
>>>>>>     >>     >>>    Chang,
>>>>>>     >>     >>>      You are correct: there are no MPI + GPU direct solvers
>>>>>>     >>     >>>    that I am aware of that currently do the triangular solves with
>>>>>>     >>     >>>    MPI + GPU parallelism. Each individual triangular solve has to be
>>>>>>     >>     >>>    done on a single GPU. I can only suggest making each subdomain as
>>>>>>     >>     >>>    big as possible to utilize each GPU as much as possible for the
>>>>>>     >>     >>>    direct triangular solves.
>>>>>>     >>     >>>     Barry
>>>>>>     >>     >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users <petsc-users at mcs.anl.gov> wrote:
>>>>>>     >>     >>>>
>>>>>>     >>     >>>> Hi Mark,
>>>>>>     >>     >>>>
>>>>>>     >>     >>>> '-mat_type aijcusparse' works with mpiaijcusparse with other
>>>>>>     >>     >>>> solvers, but with -pc_factor_mat_solver_type cusparse, it will give
>>>>>>     >>     >>>> an error.
>>>>>>     >>     >>>>
>>>>>>     >>     >>>> Yes, what I want is to have mumps or superlu do the
>>>>>>     >>     >>>> factorization, and then do the rest, including the GMRES solver,
>>>>>>     >>     >>>> on the GPU. Is that possible?
>>>>>>     >>     >>>>
>>>>>>     >>     >>>> I have tried to use aijcusparse with superlu_dist; it runs, but
>>>>>>     >>     >>>> the iterative solver is still running on CPUs. I have contacted the
>>>>>>     >>     >>>> superlu group and they confirmed that is the case right now. But if
>>>>>>     >>     >>>> I set -pc_factor_mat_solver_type cusparse, it seems that the
>>>>>>     >>     >>>> iterative solver is running on the GPU.
>>>>>>     >>     >>>>
>>>>>>     >>     >>>> Chang
>>>>>>     >>     >>>>
>>>>>>     >>     >>>> On 10/13/21 12:03 PM, Mark Adams wrote:
>>>>>>     >>     >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu <cliu at pppl.gov> wrote:
>>>>>>     >>     >>>>>     Thank you Junchao for explaining this. I guess in my case the code is
>>>>>>     >>     >>>>>     just calling a seq solver like superlu to do factorization on GPUs.
>>>>>>     >>     >>>>>     My idea is that I want to have a traditional MPI code to utilize GPUs
>>>>>>     >>     >>>>>     with cusparse. Right now cusparse does not support mpiaij matrix,
>>>>>>     >>     >>>>> Sure it does: '-mat_type aijcusparse' will give you an
>>>>>>     >>     >>>>> mpiaijcusparse matrix with > 1 processes.
>>>>>>     >>     >>>>> (-mat_type mpiaijcusparse might also work with >1 proc).
>>>>>>     >>     >>>>> However, I see in grepping the repo that all the mumps and
>>>>>>     >>     >>>>> superlu tests use aij or sell matrix type.
>>>>>>     >>     >>>>> MUMPS and SuperLU provide their own solves, I assume .... but
>>>>>>     >>     >>>>> you might want to do other matrix operations on the GPU. Is that the
>>>>>>     >>     >>>>> issue?
>>>>>>     >>     >>>>> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU
>>>>>>     >>     >>>>> and have a problem? (There is no test with it, so it probably does not work.)
>>>>>>     >>     >>>>> Thanks,
>>>>>>     >>     >>>>> Mark
>>>>>>     >>     >>>>>     so I
>>>>>>     >>     >>>>>     want the code to have an mpiaij matrix when adding all the
>>>>>>     >>     >>>>>     matrix terms, and then transform the matrix to seqaij when doing
>>>>>>     >>     >>>>>     the factorization and solve. This involves sending the data to
>>>>>>     >>     >>>>>     the master process, and I think the petsc mumps solver has
>>>>>>     >>     >>>>>     something similar already.
>>>>>>     >>     >>>>>     Chang
>>>>>>     >>     >>>>>     On 10/13/21 10:18 AM, Junchao Zhang wrote:
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >     On Tue, Oct 12, 2021 at 1:45 PM Chang Liu <cliu at pppl.gov> wrote:
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >         Hi Mark,
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >         The option I use is like
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >         -pc_type bjacobi -pc_bjacobi_blocks 16
>>>>>>     >>    -ksp_type fgmres
>>>>>>     >>     >>>>>     -mat_type
>>>>>>     >>     >>>>>      >         aijcusparse *-sub_pc_factor_mat_solver_type
>>>>>>     >>    cusparse
>>>>>>     >>     >>>>>     *-sub_ksp_type
>>>>>>     >>     >>>>>      >         preonly *-sub_pc_type lu* -ksp_max_it 2000
>>>>>>     >>    -ksp_rtol 1.e-300
>>>>>>     >>     >>>>>      >         -ksp_atol 1.e-300
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >     Note: if you use -log_view, the last column (rows are
>>>>>>     >>     >>>>>      >     methods such as MatFactorNumeric) shows the percentage of
>>>>>>     >>     >>>>>      >     work done on the GPU.
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >     Junchao: *This* implies that we have a cuSparse LU
>>>>>>     >>     >>>>>      >     factorization. Is that correct? (I don't think we do)
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      > No, we don't have cuSparse LU factorization. If you check
>>>>>>     >>     >>>>>      > MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls
>>>>>>     >>     >>>>>      > MatLUFactorSymbolic_SeqAIJ() instead.
>>>>>>     >>     >>>>>      > So I don't understand Chang's idea. Do you want to make bigger
>>>>>>     >>     >>>>>      > blocks?
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >         I think this one does both factorization and solve on the GPU.
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >         You can check the runex72_aijcusparse.sh file in the petsc
>>>>>>     >>     >>>>>      >         install directory and try it yourself (this is only LU
>>>>>>     >>     >>>>>      >         factorization without an iterative solve).
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >         Chang
>>>>>>     >>     >>>>>      >
>>>>>>     >>     >>>>>      >         On 10/12/21 1:17 PM, Mark Adams wrote:
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu <cliu at pppl.gov> wrote:
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >     Hi Junchao,
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >     No, I only need it to be transferred within a node. I use the
>>>>>>     >>     >>>>>      >          >     block-Jacobi method and GMRES to solve the sparse matrix, so each
>>>>>>     >>     >>>>>      >          >     direct solver will take care of a sub-block of the whole matrix. In
>>>>>>     >>     >>>>>      >          >     this way, I can use one GPU to solve one sub-block, which is stored
>>>>>>     >>     >>>>>      >          >     within one node.
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >     It was stated in the documentation that the cusparse solver is slow.
>>>>>>     >>     >>>>>      >          >     However, in my test using ex72.c, the cusparse solver is faster than
>>>>>>     >>     >>>>>      >          >     mumps or superlu_dist on CPUs.
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          > Are we talking about the factorization, the solve, or both?
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          > We do not have an interface to cuSparse's LU factorization (I just
>>>>>>     >>     >>>>>      >          > learned that it exists a few weeks ago).
>>>>>>     >>     >>>>>      >          > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type
>>>>>>     >>     >>>>>      >          > aijcusparse'? This would be the CPU factorization, which is the
>>>>>>     >>     >>>>>      >          > dominant cost.
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >     Chang
>>>>>>     >>     >>>>>      >          >
>>>>>>     >>     >>>>>      >          >     On 10/12/21 10:24 AM, Junchao Zhang wrote:
>>>>>>     >>     >>>>>      >          >      > Hi, Chang,
>>>>>>     >>     >>>>>      >          >      >     For the mumps solver, we usually transfer matrix and vector
>>>>>>     >>     >>>>>      >          >      > data within a compute node.  For the idea you propose, it looks
>>>>>>     >>     >>>>>      >          >      > like we need to gather data within MPI_COMM_WORLD, right?
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >      >     Mark, I remember you said cusparse solve is slow and you would
>>>>>>     >>     >>>>>      >          >      > rather do it on CPU. Is it right?
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >      > --Junchao Zhang
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >      > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users <petsc-users at mcs.anl.gov> wrote:
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >      >     Hi,
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >      >     Currently, it is possible to use the mumps solver in PETSc with the
>>>>>>     >>     >>>>>      >          >      >     -mat_mumps_use_omp_threads option, so that multiple MPI processes will
>>>>>>     >>     >>>>>      >          >      >     transfer the matrix and rhs data to the master rank, and then the
>>>>>>     >>     >>>>>      >          >      >     master rank will call mumps with OpenMP to solve the matrix.
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >      >     I wonder if someone can develop a similar option for the cusparse
>>>>>>     >>     >>>>>      >          >      >     solver. Right now, this solver does not work with mpiaijcusparse. I
>>>>>>     >>     >>>>>      >          >      >     think a possible workaround is to transfer all the matrix data to one
>>>>>>     >>     >>>>>      >          >      >     MPI process, and then upload the data to the GPU to solve. In this
>>>>>>     >>     >>>>>      >          >      >     way, one can use the cusparse solver for an MPI program.
>>>>>>     >>     >>>>>      >          >      >
>>>>>>     >>     >>>>>      >          >      >     Chang
>>>>>>     >>     >>>>>      >          >      >     --
>>>>>>     >>     >>>>>      >          >      >     Chang Liu
>>>>>>     >>     >>>>>      >          >      >     Staff Research Physicist
>>>>>>     >>     >>>>>      >          >      >     +1 609 243 3438
>>>>>>     >>     >>>>>      >          >      >     cliu at pppl.gov
>>>>>>     >>     >>>>>      >          >      >     Princeton Plasma Physics
>>>>>>    Laboratory
>>>>>>     >>     >>>>>      >          >      >     100 Stellarator Rd,
>>>>>>    Princeton NJ
>>>>>>     >>    08540, USA
>>>>>>     >>     >>>>>      >          >      >