[petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
Barry Smith
bsmith at petsc.dev
Wed Oct 13 12:53:27 CDT 2021
Chang,
You are correct: there are no MPI + GPU direct solvers that I am aware of that currently do the triangular solves with MPI + GPU parallelism. You are limited to having each individual triangular solve done on a single GPU. I can only suggest making each subdomain as big as possible to utilize each GPU as much as possible for the direct triangular solves.
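For example, a minimal sketch, assuming 4 GPUs (the executable name and the counts are placeholders, not from an actual run): run one MPI rank per GPU so that each block-Jacobi subdomain is as large as possible, e.g.

   mpiexec -n 4 ./your_app -ksp_type fgmres -mat_type aijcusparse \
     -pc_type bjacobi -pc_bjacobi_blocks 4 \
     -sub_ksp_type preonly -sub_pc_type lu \
     -sub_pc_factor_mat_solver_type cusparse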
Barry
> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users <petsc-users at mcs.anl.gov> wrote:
>
> Hi Mark,
>
> '-mat_type aijcusparse' works (giving an mpiaijcusparse matrix) with other solvers, but with -pc_factor_mat_solver_type cusparse it gives an error.
>
> Yes, what I want is to have mumps or superlu do the factorization, and then do the rest, including the GMRES solve, on the GPU. Is that possible?
>
> I have tried aijcusparse with superlu_dist; it runs, but the iterative solver still runs on the CPU. I contacted the superlu group and they confirmed that this is the case right now. But if I set -pc_factor_mat_solver_type cusparse, it seems that the iterative solver runs on the GPU.
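>
> For reference, a sketch of the two configurations I mean (the option sets are simplified here; inside a block-Jacobi setup the factor option becomes -sub_pc_factor_mat_solver_type):
>
>   -mat_type aijcusparse -pc_type lu -pc_factor_mat_solver_type superlu_dist
>       (runs, but the iterative solver stays on the CPU)
>   -mat_type aijcusparse -pc_type lu -pc_factor_mat_solver_type cusparse
>       (the iterative solver appears to run on the GPU)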
>
> Chang
>
> On 10/13/21 12:03 PM, Mark Adams wrote:
>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu <cliu at pppl.gov> wrote:
>> Thank you Junchao for explaining this. I guess in my case the code is
>> just calling a seq solver like superlu to do the factorization on GPUs.
>> My idea is that I want to have a traditional MPI code utilize GPUs
>> with cusparse. Right now cusparse does not support the mpiaij matrix,
>>
>> Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse
>> matrix with > 1 processes. (-mat_type mpiaijcusparse might also work
>> with > 1 proc.)
>>
>> However, I see in grepping the repo that all the mumps and superlu
>> tests use the aij or sell matrix type. MUMPS and SuperLU provide their
>> own solves, I assume ... but you might want to do other matrix
>> operations on the GPU. Is that the issue?
>>
>> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU and have a
>> problem? (There is no test with it, so it probably does not work.)
>>
>> Thanks,
>> Mark
>>
>> so I want the code to have an mpiaij matrix when adding all the matrix
>> terms, and then transform the matrix to seqaij when doing the
>> factorization and solve. This involves sending the data to the master
>> process, and I think the petsc mumps solver has something similar
>> already.
>>
>> Chang
>> On 10/13/21 10:18 AM, Junchao Zhang wrote:
>> >
>> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams <mfadams at lbl.gov> wrote:
>> >
>> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu <cliu at pppl.gov> wrote:
>> >
>> > Hi Mark,
>> >
>> > The option I use is like
>> >
>> > -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres
>> > -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse
>> > -sub_ksp_type preonly -sub_pc_type lu
>> > -ksp_max_it 2000 -ksp_rtol 1.e-300 -ksp_atol 1.e-300
>> >
>> >
>> > Note, if you use -log_view, the last column (the rows are the
>> > methods, like MatFactorNumeric) has the percent of the work done on
>> > the GPU.
>> >
>> > Junchao: *This* implies that we have a cuSparse LU factorization.
>> > Is that correct? (I don't think we do.)
>> >
>> > No, we don't have cuSparse LU factorization. If you check
>> > MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls
>> > MatLUFactorSymbolic_SeqAIJ() instead.
>> > So I don't understand Chang's idea. Do you want to make bigger
>> > blocks?
>> >
>> > I think this one does both the factorization and the solve on the GPU.
>> >
>> > You can check the runex72_aijcusparse.sh file in the petsc install
>> > directory, and try it yourself (this is only the lu factorization,
>> > without the iterative solve).
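>> >
>> > (For instance, something along these lines; the exact location
>> > depends on your install and arch, so this is just a sketch:
>> >   find $PETSC_DIR -name runex72_aijcusparse.sh
>> > and then run the script it finds with bash.)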
>> >
>> > Chang
>> >
>> > On 10/12/21 1:17 PM, Mark Adams wrote:
>> > >
>> > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu <cliu at pppl.gov> wrote:
>> > >
>> > > Hi Junchao,
>> > >
>> > > No, I only need it to be transferred within a node. I use the
>> > > block-Jacobi method and GMRES to solve the sparse matrix, so each
>> > > direct solver will take care of a sub-block of the whole matrix.
>> > > In this way, I can use one GPU to solve one sub-block, which is
>> > > stored within one node.
>> > >
>> > > It was stated in the documentation that the cusparse solver is
>> > > slow. However, in my test using ex72.c, the cusparse solver is
>> > > faster than mumps or superlu_dist on CPUs.
>> > >
>> > > Are we talking about the factorization, the solve, or both?
>> > >
>> > > We do not have an interface to cuSparse's LU factorization (I
>> > > just learned that it exists a few weeks ago).
>> > > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type
>> > > aijcusparse'? This would be the CPU factorization, which is the
>> > > dominant cost.
>> > >
>> > >
>> > > Chang
>> > >
>> > > On 10/12/21 10:24 AM, Junchao Zhang wrote:
>> > > > Hi, Chang,
>> > > > For the mumps solver, we usually transfer the matrix and vector
>> > > > data within a compute node. For the idea you propose, it looks
>> > > > like we need to gather data within MPI_COMM_WORLD, right?
>> > > >
>> > > > Mark, I remember you said the cusparse solve is slow and you
>> > > > would rather do it on the CPU. Is that right?
>> > > >
>> > > > --Junchao Zhang
>> > > >
>> > > >
>> > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users
>> > > > <petsc-users at mcs.anl.gov> wrote:
>> > > >
>> > > > Hi,
>> > > >
>> > > > Currently, it is possible to use the mumps solver in PETSc with
>> > > > the -mat_mumps_use_omp_threads option, so that multiple MPI
>> > > > processes will transfer the matrix and rhs data to the master
>> > > > rank, and then the master rank will call mumps with OpenMP to
>> > > > solve the matrix.
>> > > >
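>> > > > For illustration, a minimal sketch of that path (the executable
>> > > > name, rank count, and thread count are placeholders, not from an
>> > > > actual run):
>> > > >
>> > > >   mpiexec -n 16 ./app -ksp_type preonly -pc_type lu \
>> > > >     -pc_factor_mat_solver_type mumps -mat_mumps_use_omp_threads 4
>> > > >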
>> > > > I wonder if someone can develop a similar option for the
>> > > > cusparse solver. Right now, this solver does not work with
>> > > > mpiaijcusparse. I think a possible workaround is to transfer all
>> > > > the matrix data to one MPI process, and then upload the data to
>> > > > the GPU to solve. In this way, one can use the cusparse solver
>> > > > for an MPI program.
>> > > >
>> > > > Chang
>
> --
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
> cliu at pppl.gov
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA