[petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver

Wed Oct 13 20:24:38 CDT 2021

Hi Chang,
  I did the work in mumps. It is easy for me to understand gathering matrix
rows to one process.
  But how would you gather blocks (submatrices) to form a larger block?  Can you
draw a picture of that?
  Thanks
--Junchao Zhang
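
For reference, "gathering matrix rows to one process" can be sketched as below. This is only a minimal pure-Python illustration, not the code in mumps.c: it assumes each rank owns a contiguous block of rows stored in CSR form with global column indices, and the function name gather_csr_rows and the sample data are hypothetical. No MPI calls are made; the list of per-rank blocks stands in for what the master rank would receive.

```python
def gather_csr_rows(local_blocks):
    """Concatenate per-rank CSR row blocks into one CSR matrix.

    Each entry of local_blocks is (rowptr, colidx, vals) for the rows
    owned by one rank, with colidx already holding global column indices.
    Blocks are assumed to be listed in rank order.
    """
    rowptr, colidx, vals = [0], [], []
    for local_rowptr, local_colidx, local_vals in local_blocks:
        offset = rowptr[-1]  # nonzeros gathered so far
        # Shift the local row pointers by the running nonzero count.
        rowptr.extend(offset + p for p in local_rowptr[1:])
        colidx.extend(local_colidx)
        vals.extend(local_vals)
    return rowptr, colidx, vals


# Two hypothetical ranks, each owning two rows of a 4x4 matrix.
rank0 = ([0, 2, 3], [0, 1, 1], [4.0, -1.0, 4.0])
rank1 = ([0, 2, 4], [2, 3, 0, 3], [4.0, -1.0, -1.0, 4.0])

rowptr, colidx, vals = gather_csr_rows([rank0, rank1])
print(rowptr)  # [0, 2, 3, 5, 7]
print(colidx)  # [0, 1, 1, 2, 3, 0, 3]
```

Row gathering is simple because only the row pointers need shifting. Gathering submatrix blocks into a larger block would also need a column renumbering step, which this sketch does not attempt.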

On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users <
petsc-users at mcs.anl.gov> wrote:

> Hi Barry,
>
> I think mumps solver in petsc does support that. You can check the
> documentation on "-mat_mumps_use_omp_threads" at
>
> https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html
>
> and the code enclosed by #if defined(PETSC_HAVE_OPENMP_SUPPORT) in
> functions MatMumpsSetUpDistRHSInfo and MatMumpsGatherNonzerosOnMaster in
> mumps.c
>
> 1. I understand it is ideal to do one MPI rank per GPU. However, I am
> working on an existing code that was developed based on MPI and the
> # of mpi ranks is typically equal to # of cpu cores. We don't want to
> change the whole structure of the code.
>
> 2. What you have suggested has been coded in mumps.c. See function
> MatMumpsSetUpDistRHSInfo.
>
> Regards,
>
> Chang
>
> On 10/13/21 7:53 PM, Barry Smith wrote:
> >
> >
> >> On Oct 13, 2021, at 3:50 PM, Chang Liu <cliu at pppl.gov> wrote:
> >>
> >> Hi Barry,
> >>
> >> That is exactly what I want.
> >>
> >> Back to my original question, I am looking for an approach to transfer
> >> matrix data from many MPI processes to "master" MPI processes, each of
> >> which takes care of one GPU, and then upload the data to the GPU to solve.
> >> One can just grab some codes from mumps.c to aijcusparse.cu.
> >
> >    mumps.c doesn't actually do that. It never needs to copy the entire
> > matrix to a single MPI rank.
> >
> >    It would be possible to write such a code that you suggest but it is
> > not clear that it makes sense
> >
> > 1)  For normal PETSc GPU usage there is one GPU per MPI rank, so while
> > your one GPU per big domain is solving its systems the other GPUs (with the
> > other MPI ranks that share that domain) are doing nothing.
> >
> > 2) For each triangular solve you would have to gather the right hand
> > side from the multiple ranks to the single GPU to pass it to the GPU solver
> > and then scatter the resulting solution back to all of its subdomain ranks.
> >
> >    What I was suggesting was to assign an entire subdomain to a single MPI
> > rank, so that it does everything on one GPU and can use the GPU solver
> > directly. If all the major computations of a subdomain can fit and be done
> > on a single GPU then you would be utilizing all the GPUs you are using
> > effectively.
> >
> >    Barry
> >
> >
> >
> >>
> >> Chang
> >>
> >> On 10/13/21 1:53 PM, Barry Smith wrote:
> >>>    Chang,
> >>>      You are correct, there are no MPI + GPU direct solvers that
> >>> currently do the triangular solves with MPI + GPU parallelism that I am
> >>> aware of. You are limited to individual triangular solves being done on a
> >>> single GPU. I can only suggest making each subdomain as big as possible to
> >>> utilize each GPU as much as possible for the direct triangular solves.
> >>>     Barry
> >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users <petsc-users at mcs.anl.gov> wrote:
> >>>>
> >>>> Hi Mark,
> >>>>
> >>>> '-mat_type aijcusparse' works with mpiaijcusparse with other solvers,
> >>>> but with -pc_factor_mat_solver_type cusparse, it will give an error.
> >>>>
> >>>> Yes what I want is to have mumps or superlu to do the factorization,
> >>>> and then do the rest, including GMRES solver, on gpu. Is that possible?
> >>>>
> >>>> I have tried to use aijcusparse with superlu_dist, it runs but the
> >>>> iterative solver is still running on CPUs. I have contacted the superlu
> >>>> group and they confirmed that is the case right now. But if I set
> >>>> -pc_factor_mat_solver_type cusparse, it seems that the iterative solver is
> >>>> running on GPU.
> >>>>
> >>>> Chang
> >>>>
> >>>> On 10/13/21 12:03 PM, Mark Adams wrote:
> >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu <cliu at pppl.gov> wrote:
> >>>>>     Thank you Junchao for explaining this. I guess in my case the code is
> >>>>>     just calling a seq solver like superlu to do factorization on GPUs.
> >>>>>     My idea is that I want to have a traditional MPI code to utilize GPUs
> >>>>>     with cusparse. Right now cusparse does not support mpiaij matrix,
> >>>>> Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse
> >>>>> matrix with > 1 processes.
> >>>>> (-mat_type mpiaijcusparse might also work with >1 proc).
> >>>>> However, I see in grepping the repo that all the mumps and superlu
> >>>>> tests use aij or sell matrix type.
> >>>>> MUMPS and SuperLU provide their own solves, I assume .... but you
> >>>>> might want to do other matrix operations on the GPU. Is that the issue?
> >>>>> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU and have a
> >>>>> problem? (no test with it so it probably does not work)
> >>>>> Thanks,
> >>>>> Mark
> >>>>>     so I
> >>>>>     want the code to have a mpiaij matrix when adding all the matrix terms,
> >>>>>     and then transform the matrix to seqaij when doing the factorization
> >>>>>     and solve. This involves sending the data to the master process, and I
> >>>>>     think the petsc mumps solver has something similar already.
> >>>>>     Chang
> >>>>>     On 10/13/21 10:18 AM, Junchao Zhang wrote:
> >>>>>      >
> >>>>>      >
> >>>>>      >
> >>>>>      > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams <mfadams at lbl.gov> wrote:
> >>>>>      >
> >>>>>      >
> >>>>>      >
> >>>>>      >     On Tue, Oct 12, 2021 at 1:45 PM Chang Liu <cliu at pppl.gov> wrote:
> >>>>>      >
> >>>>>      >         Hi Mark,
> >>>>>      >
> >>>>>      >         The option I use is like
> >>>>>      >
> >>>>>      >         -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres
> >>>>>      >         -mat_type aijcusparse *-sub_pc_factor_mat_solver_type cusparse*
> >>>>>      >         -sub_ksp_type preonly *-sub_pc_type lu* -ksp_max_it 2000
> >>>>>      >         -ksp_rtol 1.e-300 -ksp_atol 1.e-300
> >>>>>      >
> >>>>>      >
> >>>>>      >     Note, If you use -log_view the last column (rows are the method
> >>>>>      >     like MatFactorNumeric) has the percent of work in the GPU.
> >>>>>      >
> >>>>>      >     Junchao: *This* implies that we have a cuSparse LU factorization.
> >>>>>      >     Is that correct? (I don't think we do)
> >>>>>      >
> >>>>>      > No, we don't have cuSparse LU factorization.  If you check
> >>>>>      > MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls
> >>>>>      > MatLUFactorSymbolic_SeqAIJ() instead.
> >>>>>      > So I don't understand Chang's idea. Do you want to make bigger blocks?
> >>>>>      >
> >>>>>      >
> >>>>>      >         I think this one does both factorization and solve on gpu.
> >>>>>      >
> >>>>>      >         You can check the runex72_aijcusparse.sh file in petsc install
> >>>>>      >         directory, and try it yourself (this is only lu factorization
> >>>>>      >         without iterative solve).
> >>>>>      >
> >>>>>      >         Chang
> >>>>>      >
> >>>>>      >         On 10/12/21 1:17 PM, Mark Adams wrote:
> >>>>>      >          >
> >>>>>      >          >
> >>>>>      >          > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu <cliu at pppl.gov> wrote:
> >>>>>      >          >
> >>>>>      >          >     Hi Junchao,
> >>>>>      >          >
> >>>>>      >          >     No I only need it to be transferred within a node. I use
> >>>>>      >          >     block-Jacobi method and GMRES to solve the sparse matrix,
> >>>>>      >          >     so each direct solver will take care of a sub-block of the
> >>>>>      >          >     whole matrix. In this way, I can use one GPU to solve one
> >>>>>      >          >     sub-block, which is stored within one node.
> >>>>>      >          >
> >>>>>      >          >     It was stated in the documentation that cusparse solver
> >>>>>      >          >     is slow. However, in my test using ex72.c, the cusparse
> >>>>>      >          >     solver is faster than mumps or superlu_dist on CPUs.
> >>>>>      >          >
> >>>>>      >          >
> >>>>>      >          > Are we talking about the factorization, the solve, or both?
> >>>>>      >          >
> >>>>>      >          > We do not have an interface to cuSparse's LU factorization (I
> >>>>>      >          > just learned that it exists a few weeks ago).
> >>>>>      >          > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type
> >>>>>      >          > aijcusparse' ? This would be the CPU factorization, which is
> >>>>>      >          > the dominant cost.
> >>>>>      >          >
> >>>>>      >          >
> >>>>>      >          >     Chang
> >>>>>      >          >
> >>>>>      >          >     On 10/12/21 10:24 AM, Junchao Zhang wrote:
> >>>>>      >          >      > Hi, Chang,
> >>>>>      >          >      >     For the mumps solver, we usually transfer matrix and
> >>>>>      >          >      > vector data within a compute node.  For the idea you
> >>>>>      >          >      > propose, it looks like we need to gather data within
> >>>>>      >          >      > MPI_COMM_WORLD, right?
> >>>>>      >          >      >
> >>>>>      >          >      >     Mark, I remember you said cusparse solve is slow and
> >>>>>      >          >      > you would rather do it on CPU. Is it right?
> >>>>>      >          >      >
> >>>>>      >          >      > --Junchao Zhang
> >>>>>      >          >      >
> >>>>>      >          >      >
> >>>>>      >          >      > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users
> >>>>>      >          >      > <petsc-users at mcs.anl.gov> wrote:
> >>>>>      >          >      >
> >>>>>      >          >      >     Hi,
> >>>>>      >          >      >
> >>>>>      >          >      >     Currently, it is possible to use mumps solver in PETSC
> >>>>>      >          >      >     with -mat_mumps_use_omp_threads option, so that multiple
> >>>>>      >          >      >     MPI processes will transfer the matrix and rhs data to
> >>>>>      >          >      >     the master rank, and then master rank will call mumps
> >>>>>      >          >      >     with OpenMP to solve the matrix.
> >>>>>      >          >      >
> >>>>>      >          >      >     I wonder if someone can develop a similar option for
> >>>>>      >          >      >     cusparse solver. Right now, this solver does not work
> >>>>>      >          >      >     with mpiaijcusparse. I think a possible workaround is to
> >>>>>      >          >      >     transfer all the matrix data to one MPI process, and
> >>>>>      >          >      >     then upload the data to GPU to solve. In this way, one
> >>>>>      >          >      >     can use cusparse solver for an MPI program.
> >>>>>      >          >      >
> >>>>>      >          >      >     Chang
> >>>>>      >          >      >     --
> >>>>>      >          >      >     Chang Liu
> >>>>>      >          >      >     Staff Research Physicist
> >>>>>      >          >      >     +1 609 243 3438
> >>>>>      >          >      >     cliu at pppl.gov
> >>>>>      >          >      >     Princeton Plasma Physics Laboratory
> >>>>>      >          >      >     100 Stellarator Rd, Princeton NJ 08540, USA
> >>>>>      >          >      >
> >>>>>      >          >
> >>>>>      >          >
> >>>>>      >
> >>>>>      >
> >>>>
> >>
> >
>
>
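
The per-solve data movement in Barry's point 2) above (gather the right hand side from several ranks, do the triangular solve in one place, scatter the solution back) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the ranks are simulated with lists, the triangular factor is a small dense lower-triangular matrix, and the names gather, scatter and forward_substitution are hypothetical helpers, not PETSc or cuSPARSE calls.

```python
def gather(pieces):
    """Concatenate per-rank right-hand-side pieces, in rank order."""
    full = []
    for piece in pieces:
        full.extend(piece)
    return full


def scatter(full, sizes):
    """Split a full vector back into per-rank pieces of the given sizes."""
    out, start = [], 0
    for n in sizes:
        out.append(full[start:start + n])
        start += n
    return out


def forward_substitution(L, b):
    """Solve L x = b for a dense lower-triangular L with nonzero diagonal."""
    x = []
    for i, row in enumerate(L):
        partial = sum(row[j] * x[j] for j in range(i))
        x.append((b[i] - partial) / row[i])
    return x


# Hypothetical 4x4 lower-triangular factor held by the "master" rank.
L = [[2, 0, 0, 0],
     [1, 2, 0, 0],
     [0, 1, 2, 0],
     [0, 0, 1, 2]]

# Right-hand-side pieces owned by two simulated ranks.
rhs_pieces = [[2.0, 3.0], [3.0, 3.0]]

b = gather(rhs_pieces)                     # all ranks -> master
x = forward_substitution(L, b)             # solve in one place
x_pieces = scatter(x, [len(p) for p in rhs_pieces])  # master -> all ranks
print(x_pieces)  # [[1.0, 1.0], [1.0, 1.0]]
```

The gather and scatter happen once per solve, so inside an iterative method such as GMRES they are repeated on every preconditioner application. That repeated traffic is the cost Barry's point 2) refers to.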