[petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver

Chang Liu cliu at pppl.gov
Wed Oct 13 20:32:29 CDT 2021


Sorry I am not familiar with the details either. Can you please check 
the code in MatMumpsGatherNonzerosOnMaster in mumps.c?
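
For what it is worth, here is a rough sketch (not actual PETSc code) of the 
kind of gathering I have in mind for the cusparse case: each GPU "master" 
rank pulls the whole block it is responsible for out of the distributed 
matrix. The function name and the [rstart, rend) arguments are placeholders, 
and it assumes petsc is configured with CUDA.

#include <petscmat.h>

/* Gather one subdomain block of a distributed AIJ matrix onto the
   designated "master" rank of that block, then convert it for the GPU.
   Every rank of the matrix's communicator must call this (the call is
   collective); only masters get a sequential copy back. */
PetscErrorCode GatherBlockToMaster(Mat A, PetscBool is_master,
                                   PetscInt rstart, PetscInt rend, Mat *Ablock)
{
  PetscErrorCode ierr;
  IS             is;
  Mat           *subs = NULL;
  PetscInt       n    = is_master ? 1 : 0;

  PetscFunctionBegin;
  /* masters ask for all rows/columns of their block; other ranks ask for nothing */
  ierr = ISCreateStride(PETSC_COMM_SELF, is_master ? rend - rstart : 0, rstart, 1, &is);CHKERRQ(ierr);
  ierr = MatCreateSubMatrices(A, n, &is, &is, MAT_INITIAL_MATRIX, &subs);CHKERRQ(ierr);
  if (is_master) {
    /* convert the gathered SeqAIJ block so the factorization/solve can run on the GPU */
    ierr = MatConvert(subs[0], MATSEQAIJCUSPARSE, MAT_INPLACE_MATRIX, &subs[0]);CHKERRQ(ierr);
    *Ablock = subs[0];
  } else *Ablock = NULL;
  ierr = ISDestroy(&is);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

The right hand side and solution would then need a matching gather/scatter 
for every solve, which is the cost Barry points out below.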

Chang

On 10/13/21 9:24 PM, Junchao Zhang wrote:
> Hi Chang,
>    I did the work in mumps. It is easy for me to understand gathering 
> matrix rows to one process.
>    But how to gather blocks (submatrices) to form a large block?  Can 
> you draw a picture of that?
>    Thanks
> --Junchao Zhang
> 
> 
> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users 
> <petsc-users at mcs.anl.gov> wrote:
> 
>     Hi Barry,
> 
>     I think the mumps solver in petsc does support that. You can check
>     the documentation on "-mat_mumps_use_omp_threads" at
> 
>     https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html
> 
>     and the code enclosed by #if defined(PETSC_HAVE_OPENMP_SUPPORT) in
>     functions MatMumpsSetUpDistRHSInfo and
>     MatMumpsGatherNonzerosOnMaster in
>     mumps.c
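> 
>     For example, a run along these lines (the executable name, rank count
>     and thread count are placeholders I made up, and this needs petsc
>     built with OpenMP support):
> 
>       mpiexec -n 16 ./ex72 -ksp_type preonly -pc_type lu \
>         -pc_factor_mat_solver_type mumps -mat_mumps_use_omp_threads 4
> 
>     should, as I understand it, gather the matrix and rhs from groups of
>     4 MPI ranks onto one rank each, which then calls mumps with 4 OpenMP
>     threads.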
> 
>     1. I understand it is ideal to do one MPI rank per GPU. However, I am
>     working on an existing code that was developed based on MPI, and the
>     # of mpi ranks is typically equal to the # of cpu cores. We don't
>     want to change the whole structure of the code.
> 
>     2. What you have suggested has been coded in mumps.c. See function
>     MatMumpsSetUpDistRHSInfo.
> 
>     Regards,
> 
>     Chang
> 
>     On 10/13/21 7:53 PM, Barry Smith wrote:
>      >
>      >
>      >> On Oct 13, 2021, at 3:50 PM, Chang Liu <cliu at pppl.gov> wrote:
>      >>
>      >> Hi Barry,
>      >>
>      >> That is exactly what I want.
>      >>
>      >> Back to my original question, I am looking for an approach to
>      >> transfer matrix data from many MPI processes to "master" MPI
>      >> processes, each of which takes care of one GPU, and then upload
>      >> the data to GPU to solve.
>      >> One can just grab some code from mumps.c to aijcusparse.cu.
>      >
>      >    mumps.c doesn't actually do that. It never needs to copy the
>      >    entire matrix to a single MPI rank.
>      >
>      >    It would be possible to write the code you suggest, but it is
>      >    not clear that it makes sense.
>      >
>      > 1)  For normal PETSc GPU usage there is one GPU per MPI rank, so
>      > while your one GPU per big domain is solving its systems the other
>      > GPUs (with the other MPI ranks that share that domain) are doing
>      > nothing.
>      >
>      > 2) For each triangular solve you would have to gather the right
>      > hand side from the multiple ranks to the single GPU to pass it to
>      > the GPU solver and then scatter the resulting solution back to all
>      > of its subdomain ranks.
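>      >
>      >    A rough sketch of that per-block gather/scatter, just to
>      >    illustrate the extra data movement (this is not existing PETSc
>      >    code; the function name is a placeholder and the GPU solve
>      >    itself is elided):
>      >
>      >    #include <petscvec.h>
>      >
>      >    PetscErrorCode SolveOnMasterGPU(Vec b, Vec x)
>      >    {
>      >      PetscErrorCode ierr;
>      >      VecScatter     scat;
>      >      Vec            bseq;  /* all entries of b, landing on rank 0 */
>      >
>      >      PetscFunctionBegin;
>      >      ierr = VecScatterCreateToZero(b, &scat, &bseq);CHKERRQ(ierr);
>      >      /* gather the distributed right hand side onto rank 0 */
>      >      ierr = VecScatterBegin(scat, b, bseq, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
>      >      ierr = VecScatterEnd(scat, b, bseq, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
>      >      /* ... rank 0 uploads bseq, does the triangular solves on its
>      >             GPU, and overwrites bseq with the solution ... */
>      >      /* scatter the solution back to the owning ranks */
>      >      ierr = VecScatterBegin(scat, bseq, x, INSERT_VALUES, SCATTER_REVERSE);CHKERRQ(ierr);
>      >      ierr = VecScatterEnd(scat, bseq, x, INSERT_VALUES, SCATTER_REVERSE);CHKERRQ(ierr);
>      >      ierr = VecScatterDestroy(&scat);CHKERRQ(ierr);
>      >      ierr = VecDestroy(&bseq);CHKERRQ(ierr);
>      >      PetscFunctionReturn(0);
>      >    }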
>      >
>      >    What I was suggesting was to assign an entire subdomain to a
>      >    single MPI rank, so that it does everything on one GPU and can
>      >    use the GPU solver directly. If all the major computations of a
>      >    subdomain can fit and be done on a single GPU, then you would be
>      >    utilizing all the GPUs you are using effectively.
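>      >
>      >    For example, with one MPI rank per GPU, something along the
>      >    lines of (the executable name and block count are placeholders)
>      >
>      >      mpiexec -n <num_gpus> ./app -mat_type aijcusparse \
>      >        -ksp_type fgmres -pc_type bjacobi -pc_bjacobi_blocks <num_gpus> \
>      >        -sub_ksp_type preonly -sub_pc_type lu \
>      >        -sub_pc_factor_mat_solver_type cusparse
>      >
>      >    lets each rank factor and solve its own subdomain entirely on
>      >    its GPU.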
>      >
>      >    Barry
>      >
>      >
>      >
>      >>
>      >> Chang
>      >>
>      >> On 10/13/21 1:53 PM, Barry Smith wrote:
>      >>>    Chang,
>      >>>      You are correct: there are no MPI + GPU direct solvers that
>      >>>      I am aware of that currently do the triangular solves with
>      >>>      MPI + GPU parallelism. You are limited to having each
>      >>>      individual triangular solve done on a single GPU. I can only
>      >>>      suggest making each subdomain as big as possible to utilize
>      >>>      each GPU as much as possible for the direct triangular
>      >>>      solves.
>      >>>     Barry
>      >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users
>      >>>> <petsc-users at mcs.anl.gov> wrote:
>      >>>>
>      >>>> Hi Mark,
>      >>>>
>      >>>> '-mat_type aijcusparse' works with mpiaijcusparse with other
>      >>>> solvers, but with -pc_factor_mat_solver_type cusparse, it will
>      >>>> give an error.
>      >>>>
>      >>>> Yes, what I want is to have mumps or superlu do the
>      >>>> factorization, and then do the rest, including the GMRES solver,
>      >>>> on gpu. Is that possible?
>      >>>>
>      >>>> I have tried to use aijcusparse with superlu_dist; it runs, but
>      >>>> the iterative solver is still running on CPUs. I have contacted
>      >>>> the superlu group and they confirmed that is the case right now.
>      >>>> But if I set -pc_factor_mat_solver_type cusparse, it seems that
>      >>>> the iterative solver is running on the GPU.
>      >>>>
>      >>>> Chang
>      >>>>
>      >>>> On 10/13/21 12:03 PM, Mark Adams wrote:
>      >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu <cliu at pppl.gov> wrote:
>      >>>>>     Thank you Junchao for explaining this. I guess in my case
>      >>>>>     the code is just calling a seq solver like superlu to do
>      >>>>>     factorization on GPUs.
>      >>>>>     My idea is that I want to have a traditional MPI code to
>      >>>>>     utilize GPUs with cusparse. Right now cusparse does not
>      >>>>>     support mpiaij matrix,
>      >>>>> Sure it does: '-mat_type aijcusparse' will give you an
>      >>>>> mpiaijcusparse matrix with > 1 processes.
>      >>>>> (-mat_type mpiaijcusparse might also work with >1 proc).
>      >>>>> However, I see in grepping the repo that all the mumps and
>      >>>>> superlu tests use aij or sell matrix type.
>      >>>>> MUMPS and SuperLU provide their own solves, I assume ... but
>      >>>>> you might want to do other matrix operations on the GPU. Is
>      >>>>> that the issue?
>      >>>>> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU and
>      >>>>> have a problem? (There is no test with it, so it probably does
>      >>>>> not work.)
>      >>>>> Thanks,
>      >>>>> Mark
>      >>>>>     so I want the code to have an mpiaij matrix when adding all
>      >>>>>     the matrix terms, and then transform the matrix to seqaij
>      >>>>>     when doing the factorization and solve. This involves
>      >>>>>     sending the data to the master process, and I think the
>      >>>>>     petsc mumps solver has something similar already.
>      >>>>>     Chang
>      >>>>>     On 10/13/21 10:18 AM, Junchao Zhang wrote:
>      >>>>>      >
>      >>>>>      >
>      >>>>>      >
>      >>>>>      > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams <mfadams at lbl.gov> wrote:
>      >>>>>      >
>      >>>>>      >
>      >>>>>      >
>      >>>>>      >     On Tue, Oct 12, 2021 at 1:45 PM Chang Liu <cliu at pppl.gov> wrote:
>      >>>>>      >
>      >>>>>      >         Hi Mark,
>      >>>>>      >
>      >>>>>      >         The option I use is like
>      >>>>>      >
>      >>>>>      >         -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres
>      >>>>>      >         -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse
>      >>>>>      >         -sub_ksp_type preonly -sub_pc_type lu -ksp_max_it 2000
>      >>>>>      >         -ksp_rtol 1.e-300 -ksp_atol 1.e-300
>      >>>>>      >
>      >>>>>      >
>      >>>>>      >     Note, if you use -log_view, the last column (rows
>      >>>>>      >     are the methods, like MatFactorNumeric) has the
>      >>>>>      >     percent of work on the GPU.
>      >>>>>      >
>      >>>>>      >     Junchao: *This* implies that we have a cuSparse LU
>      >>>>>     factorization. Is
>      >>>>>      >     that correct? (I don't think we do)
>      >>>>>      >
>      >>>>>      > No, we don't have cuSparse LU factorization.  If you check
>      >>>>>      > MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it
>      >>>>>      > calls MatLUFactorSymbolic_SeqAIJ() instead.
>      >>>>>      > So I don't understand Chang's idea. Do you want to make
>      >>>>>      > bigger blocks?
>      >>>>>      >
>      >>>>>      >
>      >>>>>      >         I think this one does both factorization and
>      >>>>>      >         solve on gpu.
>      >>>>>      >
>      >>>>>      >         You can check the runex72_aijcusparse.sh file
>      >>>>>      >         in the petsc install directory, and try it
>      >>>>>      >         yourself (this is only lu factorization without
>      >>>>>      >         iterative solve).
>      >>>>>      >
>      >>>>>      >         Chang
>      >>>>>      >
>      >>>>>      >         On 10/12/21 1:17 PM, Mark Adams wrote:
>      >>>>>      >          >
>      >>>>>      >          >
>      >>>>>      >          > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu <cliu at pppl.gov> wrote:
>      >>>>>      >          >
>      >>>>>      >          >     Hi Junchao,
>      >>>>>      >          >
>      >>>>>      >          >     No, I only need it to be transferred
>      >>>>>      >          >     within a node. I use the block-Jacobi
>      >>>>>      >          >     method and GMRES to solve the sparse
>      >>>>>      >          >     matrix, so each direct solver will take
>      >>>>>      >          >     care of a sub-block of the whole matrix.
>      >>>>>      >          >     In this way, I can use one GPU to solve
>      >>>>>      >          >     one sub-block, which is stored within
>      >>>>>      >          >     one node.
>      >>>>>      >          >
>      >>>>>      >          >     It was stated in the documentation that
>      >>>>>      >          >     cusparse solver is slow. However, in my
>      >>>>>      >          >     test using ex72.c, the cusparse solver is
>      >>>>>      >          >     faster than mumps or superlu_dist on CPUs.
>      >>>>>      >          >
>      >>>>>      >          >
>      >>>>>      >          > Are we talking about the factorization, the
>      >>>>>      >          > solve, or both?
>      >>>>>      >          >
>      >>>>>      >          > We do not have an interface to cuSparse's LU
>      >>>>>      >          > factorization (I just learned that it exists
>      >>>>>      >          > a few weeks ago).
>      >>>>>      >          > Perhaps your fast "cusparse solver" is
>      >>>>>      >          > '-pc_type lu -mat_type aijcusparse'? This
>      >>>>>      >          > would be the CPU factorization, which is the
>      >>>>>      >          > dominant cost.
>      >>>>>      >          >
>      >>>>>      >          >
>      >>>>>      >          >     Chang
>      >>>>>      >          >
>      >>>>>      >          >     On 10/12/21 10:24 AM, Junchao Zhang wrote:
>      >>>>>      >          >      > Hi, Chang,
>      >>>>>      >          >      >     For the mumps solver, we usually
>      >>>>>      >          >      > transfer matrix and vector data within
>      >>>>>      >          >      > a compute node.  For the idea you
>      >>>>>      >          >      > propose, it looks like we need to
>      >>>>>      >          >      > gather data within MPI_COMM_WORLD, right?
>      >>>>>      >          >      >
>      >>>>>      >          >      >     Mark, I remember you said cusparse
>      >>>>>      >          >      > solve is slow and you would rather do
>      >>>>>      >          >      > it on CPU. Is that right?
>      >>>>>      >          >      >
>      >>>>>      >          >      > --Junchao Zhang
>      >>>>>      >          >      >
>      >>>>>      >          >      >
>      >>>>>      >          >      > On Mon, Oct 11, 2021 at 10:25 PM Chang
>      >>>>>      >          >      > Liu via petsc-users
>      >>>>>      >          >      > <petsc-users at mcs.anl.gov> wrote:
>      >>>>>      >          >      >
>      >>>>>      >          >      >     Hi,
>      >>>>>      >          >      >
>      >>>>>      >          >      >     Currently, it is possible to use
>      >>>>>      >          >      >     the mumps solver in PETSC with the
>      >>>>>      >          >      >     -mat_mumps_use_omp_threads option,
>      >>>>>      >          >      >     so that multiple MPI processes will
>      >>>>>      >          >      >     transfer the matrix and rhs data to
>      >>>>>      >          >      >     the master rank, and then the
>      >>>>>      >          >      >     master rank will call mumps with
>      >>>>>      >          >      >     OpenMP to solve the matrix.
>      >>>>>      >          >      >
>      >>>>>      >          >      >     I wonder if someone can develop a
>      >>>>>      >          >      >     similar option for the cusparse
>      >>>>>      >          >      >     solver. Right now, this solver does
>      >>>>>      >          >      >     not work with mpiaijcusparse. I
>      >>>>>      >          >      >     think a possible workaround is to
>      >>>>>      >          >      >     transfer all the matrix data to one
>      >>>>>      >          >      >     MPI process, and then upload the
>      >>>>>      >          >      >     data to GPU to solve. In this way,
>      >>>>>      >          >      >     one can use the cusparse solver for
>      >>>>>      >          >      >     an MPI program.
>      >>>>>      >          >      >
>      >>>>>      >          >      >     Chang
>      >>>>>      >          >      >     --
>      >>>>>      >          >      >     Chang Liu
>      >>>>>      >          >      >     Staff Research Physicist
>      >>>>>      >          >      >     +1 609 243 3438
>      >>>>>      >          >      >     cliu at pppl.gov
>      >>>>>      >          >      >     Princeton Plasma Physics Laboratory
>      >>>>>      >          >      >     100 Stellarator Rd, Princeton NJ
>     08540, USA
>      >>>>>      >          >      >
>      >>>>>      >          >
>      >>>>>      >          >     --
>      >>>>>      >          >     Chang Liu
>      >>>>>      >          >     Staff Research Physicist
>      >>>>>      >          >     +1 609 243 3438
>      >>>>>      >          > cliu at pppl.gov
>      >>>>>      >          >     Princeton Plasma Physics Laboratory
>      >>>>>      >          >     100 Stellarator Rd, Princeton NJ 08540, USA
>      >>>>>      >          >
>      >>>>>      >
>      >>>>>      >         --
>      >>>>>      >         Chang Liu
>      >>>>>      >         Staff Research Physicist
>      >>>>>      >         +1 609 243 3438
>      >>>>>      > cliu at pppl.gov
>      >>>>>      >         Princeton Plasma Physics Laboratory
>      >>>>>      >         100 Stellarator Rd, Princeton NJ 08540, USA
>      >>>>>      >
>      >>>>>     --
>      >>>>>     Chang Liu
>      >>>>>     Staff Research Physicist
>      >>>>>     +1 609 243 3438
>      >>>>> cliu at pppl.gov
>      >>>>>     Princeton Plasma Physics Laboratory
>      >>>>>     100 Stellarator Rd, Princeton NJ 08540, USA
>      >>>>
>      >>>> --
>      >>>> Chang Liu
>      >>>> Staff Research Physicist
>      >>>> +1 609 243 3438
>      >>>> cliu at pppl.gov
>      >>>> Princeton Plasma Physics Laboratory
>      >>>> 100 Stellarator Rd, Princeton NJ 08540, USA
>      >>
>      >> --
>      >> Chang Liu
>      >> Staff Research Physicist
>      >> +1 609 243 3438
>      >> cliu at pppl.gov
>      >> Princeton Plasma Physics Laboratory
>      >> 100 Stellarator Rd, Princeton NJ 08540, USA
>      >
> 
>     -- 
>     Chang Liu
>     Staff Research Physicist
>     +1 609 243 3438
>     cliu at pppl.gov
>     Princeton Plasma Physics Laboratory
>     100 Stellarator Rd, Princeton NJ 08540, USA
> 

-- 
Chang Liu
Staff Research Physicist
+1 609 243 3438
cliu at pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA

