[petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
Chang Liu
cliu at pppl.gov
Wed Oct 13 20:32:29 CDT 2021
Sorry, I am not familiar with the details either. Can you please check
the code in MatMumpsGatherNonzerosOnMaster in mumps.c?
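
In case it helps to see the data movement without the mumps-specific
details, here is a rough sketch (plain MPI, not the actual PETSc code;
all names are placeholders) of the kind of gathering I have in mind:
each rank hands its locally owned nonzero values to a master rank with
MPI_Gatherv, and the master can then pass the assembled arrays to a
sequential (GPU) solver. Row and column indices would be gathered the
same way.

  /* sketch only: gather each rank's CSR values onto a master rank */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  static void gather_values_on_master(MPI_Comm comm, const double *vals, int nlocal)
  {
    int     rank, size, total = 0;
    int    *counts = NULL, *displs = NULL;
    double *allvals = NULL;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == 0) {
      counts = (int *)malloc(size * sizeof(int));
      displs = (int *)malloc(size * sizeof(int));
    }
    /* the master learns how many nonzeros each rank contributes */
    MPI_Gather(&nlocal, 1, MPI_INT, counts, 1, MPI_INT, 0, comm);
    if (rank == 0) {
      for (int i = 0; i < size; i++) { displs[i] = total; total += counts[i]; }
      allvals = (double *)malloc(total * sizeof(double));
    }
    /* variable-length gather of the numerical values onto the master */
    MPI_Gatherv(vals, nlocal, MPI_DOUBLE, allvals, counts, displs,
                MPI_DOUBLE, 0, comm);
    if (rank == 0) {
      /* here the master would feed allvals (plus gathered indices)
         to the sequential factorization and solve */
      printf("master gathered %d values\n", total);
      free(counts); free(displs); free(allvals);
    }
  }

  int main(int argc, char **argv)
  {
    double vals[3] = {1.0, 2.0, 3.0}; /* pretend local nonzeros */
    MPI_Init(&argc, &argv);
    gather_values_on_master(MPI_COMM_WORLD, vals, 3);
    MPI_Finalize();
    return 0;
  }

I believe mumps.c does this kind of gathering within a per-node
sub-communicator rather than over all of MPI_COMM_WORLD, but please
check the source for the details.
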
Chang
On 10/13/21 9:24 PM, Junchao Zhang wrote:
> Hi Chang,
> I did the work in mumps. It is easy for me to understand gathering
> matrix rows to one process.
> But how to gather blocks (submatrices) to form a large block? Can
> you draw a picture of that?
> Thanks
> --Junchao Zhang
>
>
> On Wed, Oct 13, 2021 at 7:47 PM Chang Liu via petsc-users
> <petsc-users at mcs.anl.gov> wrote:
>
> Hi Barry,
>
>     I think the mumps solver in PETSc does support that. You can check the
>     documentation on "-mat_mumps_use_omp_threads" at
>
>     https://petsc.org/release/docs/manualpages/Mat/MATSOLVERMUMPS.html
>
>     and the code enclosed by #if defined(PETSC_HAVE_OPENMP_SUPPORT) in the
>     functions MatMumpsSetUpDistRHSInfo and MatMumpsGatherNonzerosOnMaster in
>     mumps.c.
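>
>     For example, a run using that option might look roughly like the
>     following (the executable name and counts are placeholders; this
>     assumes PETSc was configured with MUMPS and OpenMP support):
>
>       mpiexec -n 16 ./your_app -ksp_type preonly -pc_type lu \
>           -pc_factor_mat_solver_type mumps -mat_mumps_use_omp_threads 4
>
>     As I understand it, this gathers the matrix and rhs data from groups
>     of MPI ranks onto master ranks, each of which then calls MUMPS with
>     4 OpenMP threads.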
>
>     1. I understand it is ideal to do one MPI rank per GPU. However, I am
>     working on an existing code that was developed based on MPI, and the
>     number of MPI ranks is typically equal to the number of CPU cores. We
>     don't want to change the whole structure of the code.
>
> 2. What you have suggested has been coded in mumps.c. See function
> MatMumpsSetUpDistRHSInfo.
>
> Regards,
>
> Chang
>
> On 10/13/21 7:53 PM, Barry Smith wrote:
> >
> >
>     >> On Oct 13, 2021, at 3:50 PM, Chang Liu <cliu at pppl.gov> wrote:
> >>
> >> Hi Barry,
> >>
> >> That is exactly what I want.
> >>
>     >> Back to my original question: I am looking for an approach to
>     >> transfer matrix data from many MPI processes to "master" MPI
>     >> processes, each of which takes care of one GPU, and then upload the
>     >> data to the GPU to solve. One could just adapt some code from
>     >> mumps.c into aijcusparse.cu.
> >
> > mumps.c doesn't actually do that. It never needs to copy the
> entire matrix to a single MPI rank.
> >
>     >    It would be possible to write the code you suggest, but it is not
>     >    clear that it makes sense.
> >
> > 1) For normal PETSc GPU usage there is one GPU per MPI rank, so
> while your one GPU per big domain is solving its systems the other
> GPUs (with the other MPI ranks that share that domain) are doing
> nothing.
> >
> > 2) For each triangular solve you would have to gather the right
> hand side from the multiple ranks to the single GPU to pass it to
> the GPU solver and then scatter the resulting solution back to all
> of its subdomain ranks.
> >
>     >    What I was suggesting was to assign an entire subdomain to a
>     >    single MPI rank, so that it does everything on one GPU and can use
>     >    the GPU solver directly. If all the major computations of a
>     >    subdomain fit and can be done on a single GPU, then you would be
>     >    utilizing all of your GPUs effectively.
> >
> > Barry
> >
> >
> >
> >>
> >> Chang
> >>
> >> On 10/13/21 1:53 PM, Barry Smith wrote:
> >>> Chang,
>     >>>    You are correct: there are no MPI + GPU direct solvers that I am
>     >>> aware of that currently do the triangular solves with MPI + GPU
>     >>> parallelism. You are limited in that each individual triangular
>     >>> solve must be done on a single GPU. I can only suggest making each
>     >>> subdomain as big as possible to utilize each GPU as much as
>     >>> possible for the direct triangular solves.
> >>> Barry
> >>>> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users
>     >>>> <petsc-users at mcs.anl.gov> wrote:
> >>>>
> >>>> Hi Mark,
> >>>>
> >>>> '-mat_type aijcusparse' works with mpiaijcusparse with other
> solvers, but with -pc_factor_mat_solver_type cusparse, it will give
> an error.
> >>>>
>     >>>> Yes, what I want is to have mumps or superlu do the factorization,
>     >>>> and then do the rest, including the GMRES solve, on the GPU. Is
>     >>>> that possible?
> >>>>
>     >>>> I have tried to use aijcusparse with superlu_dist; it runs, but
>     >>>> the iterative solver is still running on CPUs. I have contacted
>     >>>> the superlu group and they confirmed that is the case right now.
>     >>>> But if I set -pc_factor_mat_solver_type cusparse, it seems that
>     >>>> the iterative solver runs on the GPU.
> >>>>
> >>>> Chang
> >>>>
> >>>> On 10/13/21 12:03 PM, Mark Adams wrote:
>     >>>>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu
>     >>>>> <cliu at pppl.gov> wrote:
>     >>>>>     Thank you Junchao for explaining this. I guess in my case the
>     >>>>>     code is just calling a seq solver like superlu to do
>     >>>>>     factorization on GPUs.
>     >>>>>     My idea is that I want to have a traditional MPI code to
>     >>>>>     utilize GPUs with cusparse. Right now cusparse does not
>     >>>>>     support mpiaij matrix,
>     >>>>> Sure it does: '-mat_type aijcusparse' will give you an
>     >>>>> mpiaijcusparse matrix with > 1 processes.
>     >>>>> (-mat_type mpiaijcusparse might also work with > 1 proc.)
>     >>>>> However, I see in grepping the repo that all the mumps and
>     >>>>> superlu tests use aij or sell matrix type.
>     >>>>> MUMPS and SuperLU provide their own solves, I assume ... but you
>     >>>>> might want to do other matrix operations on the GPU. Is that the
>     >>>>> issue?
>     >>>>> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU and
>     >>>>> have a problem? (There is no test with it, so it probably does
>     >>>>> not work.)
> >>>>> Thanks,
> >>>>> Mark
>     >>>>>     so I want the code to have an mpiaij matrix when adding all
>     >>>>>     the matrix terms, and then transform the matrix to seqaij
>     >>>>>     when doing the factorization and solve. This involves sending
>     >>>>>     the data to the master process, and I think the petsc mumps
>     >>>>>     solver has something similar already.
>     >>>>>     Chang
> >>>>> On 10/13/21 10:18 AM, Junchao Zhang wrote:
> >>>>> >
> >>>>> >
> >>>>> >
>     >>>>>  > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams
>     >>>>>  > <mfadams at lbl.gov> wrote:
> >>>>> >
> >>>>> >
> >>>>> >
>     >>>>>  >     On Tue, Oct 12, 2021 at 1:45 PM Chang Liu
>     >>>>>  >     <cliu at pppl.gov> wrote:
> >>>>> >
> >>>>> > Hi Mark,
> >>>>> >
> >>>>> > The option I use is like
> >>>>> >
>     >>>>>  >         -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres
>     >>>>>  >         -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse
>     >>>>>  >         -sub_ksp_type preonly -sub_pc_type lu -ksp_max_it 2000
>     >>>>>  >         -ksp_rtol 1.e-300 -ksp_atol 1.e-300
> >>>>> >
> >>>>> >
>     >>>>>  > Note: if you use -log_view, the last column (the rows are
>     >>>>>  > methods like MatFactorNumeric) has the percent of work on the GPU.
> >>>>> >
>     >>>>>  > Junchao: *This* implies that we have a cuSparse LU
>     >>>>>  > factorization. Is that correct? (I don't think we do.)
> >>>>> >
>     >>>>>  > No, we don't have cuSparse LU factorization. If you check
>     >>>>>  > MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls
>     >>>>>  > MatLUFactorSymbolic_SeqAIJ() instead.
>     >>>>>  > So I don't understand Chang's idea. Do you want to make bigger
>     >>>>>  > blocks?
> >>>>> >
> >>>>> >
>     >>>>>  >         I think this one does both factorization and solve on
>     >>>>>  >         the GPU.
>     >>>>>  >
>     >>>>>  >         You can check the runex72_aijcusparse.sh file in the
>     >>>>>  >         petsc install directory and try it yourself (this is
>     >>>>>  >         only LU factorization, without the iterative solve).
> >>>>> >
> >>>>> > Chang
> >>>>> >
> >>>>> > On 10/12/21 1:17 PM, Mark Adams wrote:
> >>>>> > >
> >>>>> > >
>     >>>>>  >     > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu
>     >>>>>  >     > <cliu at pppl.gov> wrote:
> >>>>> > >
> >>>>> > > Hi Junchao,
> >>>>> > >
>     >>>>>  >     >     No, I only need it to be transferred within a node.
>     >>>>>  >     >     I use the block-Jacobi method and GMRES to solve the
>     >>>>>  >     >     sparse matrix, so each direct solver will take care
>     >>>>>  >     >     of a sub-block of the whole matrix. In this way, I
>     >>>>>  >     >     can use one GPU to solve one sub-block, which is
>     >>>>>  >     >     stored within one node.
> >>>>> > >
>     >>>>>  >     >     It was stated in the documentation that the cusparse
>     >>>>>  >     >     solver is slow. However, in my test using ex72.c,
>     >>>>>  >     >     the cusparse solver is faster than mumps or
>     >>>>>  >     >     superlu_dist on CPUs.
> >>>>> > >
> >>>>> > >
>     >>>>>  >     > Are we talking about the factorization, the solve, or both?
> >>>>> > >
>     >>>>>  >     > We do not have an interface to cuSparse's LU
>     >>>>>  >     > factorization (I just learned that it exists a few weeks
>     >>>>>  >     > ago).
>     >>>>>  >     > Perhaps your fast "cusparse solver" is '-pc_type lu
>     >>>>>  >     > -mat_type aijcusparse'? This would be the CPU
>     >>>>>  >     > factorization, which is the dominant cost.
> >>>>> > >
> >>>>> > >
> >>>>> > > Chang
> >>>>> > >
> >>>>> > > On 10/12/21 10:24 AM, Junchao Zhang wrote:
> >>>>> > > > Hi, Chang,
>     >>>>>  >     >     >     For the mumps solver, we usually transfer
>     >>>>>  >     >     >     matrix and vector data within a compute node.
>     >>>>>  >     >     >     For the idea you propose, it looks like we
>     >>>>>  >     >     >     need to gather data within MPI_COMM_WORLD,
>     >>>>>  >     >     >     right?
> >>>>> > > >
>     >>>>>  >     >     >     Mark, I remember you said the cusparse solve
>     >>>>>  >     >     >     is slow and you would rather do it on the CPU.
>     >>>>>  >     >     >     Is that right?
> >>>>> > > >
> >>>>> > > > --Junchao Zhang
> >>>>> > > >
> >>>>> > > >
>     >>>>>  >     >     >     On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via
>     >>>>>  >     >     >     petsc-users <petsc-users at mcs.anl.gov> wrote:
> >>>>> > > >
> >>>>> > > > Hi,
> >>>>> > > >
>     >>>>>  >     >     >         Currently, it is possible to use the mumps
>     >>>>>  >     >     >         solver in PETSc with the
>     >>>>>  >     >     >         -mat_mumps_use_omp_threads option, so that
>     >>>>>  >     >     >         multiple MPI processes will transfer the
>     >>>>>  >     >     >         matrix and rhs data to the master rank,
>     >>>>>  >     >     >         and then the master rank will call mumps
>     >>>>>  >     >     >         with OpenMP to solve the matrix.
> >>>>> > > >
>     >>>>>  >     >     >         I wonder if someone can develop a similar
>     >>>>>  >     >     >         option for the cusparse solver. Right now,
>     >>>>>  >     >     >         this solver does not work with
>     >>>>>  >     >     >         mpiaijcusparse. I think a possible
>     >>>>>  >     >     >         workaround is to transfer all the matrix
>     >>>>>  >     >     >         data to one MPI process, and then upload
>     >>>>>  >     >     >         the data to the GPU to solve. In this way,
>     >>>>>  >     >     >         one can use the cusparse solver for an MPI
>     >>>>>  >     >     >         program.
> >>>>> > > >
> >>>>> > > > Chang
> >>>>> > > > --
> >>>>> > > > Chang Liu
> >>>>> > > > Staff Research Physicist
> >>>>> > > > +1 609 243 3438
>     >>>>>  >     >     >         cliu at pppl.gov
> >>>>> > > > Princeton Plasma Physics Laboratory
> >>>>> > > > 100 Stellarator Rd, Princeton NJ
> 08540, USA
> >>>>> > > >
> >>>>> > >
> >>>>> > > --
> >>>>> > > Chang Liu
> >>>>> > > Staff Research Physicist
> >>>>> > > +1 609 243 3438
>     >>>>>  >     > cliu at pppl.gov
> >>>>> > > Princeton Plasma Physics Laboratory
> >>>>> > > 100 Stellarator Rd, Princeton NJ 08540, USA
> >>>>> > >
> >>>>> >
> >>>>> > --
> >>>>> > Chang Liu
> >>>>> > Staff Research Physicist
> >>>>> > +1 609 243 3438
>     >>>>>  > cliu at pppl.gov
> >>>>> > Princeton Plasma Physics Laboratory
> >>>>> > 100 Stellarator Rd, Princeton NJ 08540, USA
> >>>>> >
>     >>>>> --
>     >>>>> Chang Liu
> >>>>> Staff Research Physicist
> >>>>> +1 609 243 3438
>     >>>>> cliu at pppl.gov
> >>>>> Princeton Plasma Physics Laboratory
> >>>>> 100 Stellarator Rd, Princeton NJ 08540, USA
> >>>>
> >>>> --
> >>>> Chang Liu
> >>>> Staff Research Physicist
> >>>> +1 609 243 3438
>     >>>> cliu at pppl.gov
> >>>> Princeton Plasma Physics Laboratory
> >>>> 100 Stellarator Rd, Princeton NJ 08540, USA
> >>
> >> --
> >> Chang Liu
> >> Staff Research Physicist
> >> +1 609 243 3438
>     >> cliu at pppl.gov
> >> Princeton Plasma Physics Laboratory
> >> 100 Stellarator Rd, Princeton NJ 08540, USA
> >
>
> --
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
>     cliu at pppl.gov
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA
>
--
Chang Liu
Staff Research Physicist
+1 609 243 3438
cliu at pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA