[petsc-users] [External] Re: request to add an option similar to use_omp_threads for mumps to cusparse solver
Chang Liu
cliu at pppl.gov
Wed Oct 13 11:16:47 CDT 2021
Hi Mark,
'-mat_type aijcusparse' gives an mpiaijcusparse matrix that works with other
solvers, but with -pc_factor_mat_solver_type cusparse it gives an error.
Yes, what I want is to have mumps or superlu do the factorization, and then
do the rest, including the GMRES solve, on the GPU. Is that possible?
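Concretely, the kind of option combination I have in mind is something like
this (illustrative only; I am not claiming this works as-is):

  -mat_type aijcusparse -ksp_type gmres -pc_type lu \
    -pc_factor_mat_solver_type mumps

i.e. mumps (or superlu) does the LU factorization on the CPU, while the
matrix-vector products and the GMRES iterations run on the GPU through
cusparse.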
I have tried using aijcusparse with superlu_dist; it runs, but the iterative
solver still runs on the CPUs. I have contacted the SuperLU group and they
confirmed that this is the case right now. But if I set
-pc_factor_mat_solver_type cusparse, it seems the iterative solver does run
on the GPU.
Chang
On 10/13/21 12:03 PM, Mark Adams wrote:
>
>
> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu <cliu at pppl.gov> wrote:
>
> Thank you Junchao for explaining this. I guess in my case the code is
> just calling a seq solver like superlu to do the factorization on GPUs.
>
> My idea is that I want to have a traditional MPI code utilize GPUs
> with cusparse. Right now cusparse does not support the mpiaij matrix,
>
>
> Sure it does: '-mat_type aijcusparse' will give you an
> mpiaijcusparse matrix with > 1 processes.
> (-mat_type mpiaijcusparse might also work with >1 proc).
>
> However, I see in grepping the repo that all the mumps and superlu tests
> use the aij or sell matrix types.
> MUMPS and SuperLU provide their own solves, I assume .... but you might
> want to do other matrix operations on the GPU. Is that the issue?
> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU and have a
> problem? (There is no test with it, so it probably does not work.)
>
> Thanks,
> Mark
>
> so I want the code to have an mpiaij matrix when adding all the matrix
> terms, and then transform the matrix to seqaij when doing the factorization
> and solve. This involves sending the data to the master process, and I
> think the petsc mumps solver has something similar already.
>
> Chang
>
> On 10/13/21 10:18 AM, Junchao Zhang wrote:
> >
> >
> >
> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams <mfadams at lbl.gov> wrote:
> >
> >
> >
> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu <cliu at pppl.gov> wrote:
> >
> > Hi Mark,
> >
> > The option I use is like
> >
> > -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres -mat_type
> > aijcusparse -sub_pc_factor_mat_solver_type cusparse -sub_ksp_type
> > preonly -sub_pc_type lu -ksp_max_it 2000 -ksp_rtol 1.e-300
> > -ksp_atol 1.e-300
> >
> >
> > Note, if you use -log_view, the last column (the rows are methods like
> > MatFactorNumeric) has the percent of work done on the GPU.
> >
> > Junchao: *This* implies that we have a cuSparse LU factorization. Is
> > that correct? (I don't think we do)
> >
> > No, we don't have cuSparse LU factorization. If you check
> > MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls
> > MatLUFactorSymbolic_SeqAIJ() instead.
> > So I don't understand Chang's idea. Do you want to make bigger blocks?
> >
> >
> > I think this one does both the factorization and the solve on the gpu.
> >
> > You can check the runex72_aijcusparse.sh file in the petsc install
> > directory and try it yourself (this does only the lu factorization,
> > without the iterative solve).
> >
> > Chang
> >
> > On 10/12/21 1:17 PM, Mark Adams wrote:
> > >
> > >
> > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu <cliu at pppl.gov> wrote:
> > >
> > > Hi Junchao,
> > >
> > > No, I only need it to be transferred within a node. I use the
> > > block-Jacobi method and GMRES to solve the sparse matrix, so each
> > > direct solver will take care of a sub-block of the whole matrix. In
> > > this way, I can use one GPU to solve one sub-block, which is stored
> > > within one node.
> > >
> > > It was stated in the documentation that the cusparse solver is slow.
> > > However, in my test using ex72.c, the cusparse solver is faster than
> > > mumps or superlu_dist on CPUs.
> > >
> > >
> > > Are we talking about the factorization, the solve, or both?
> > >
> > > We do not have an interface to cuSparse's LU factorization (I just
> > > learned that it exists a few weeks ago).
> > > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type
> > > aijcusparse'? This would be the CPU factorization, which is the
> > > dominant cost.
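> > >
> > > In other words (just to illustrate the distinction), the difference
> > > between
> > >
> > >     -ksp_type fgmres -pc_type lu -mat_type aijcusparse
> > >
> > > and
> > >
> > >     -ksp_type fgmres -pc_type lu -mat_type aijcusparse
> > >         -pc_factor_mat_solver_type cusparse
> > >
> > > is only whether the cusparse solver package is requested for the
> > > factorization at all.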
> > >
> > >
> > > Chang
> > >
> > > On 10/12/21 10:24 AM, Junchao Zhang wrote:
> > > > Hi, Chang,
> > > > For the mumps solver, we usually transfer matrix and vector data
> > > > within a compute node. For the idea you propose, it looks like we
> > > > need to gather data within MPI_COMM_WORLD, right?
> > > >
> > > > Mark, I remember you said the cusparse solve is slow and you would
> > > > rather do it on the CPU. Is that right?
> > > >
> > > > --Junchao Zhang
> > > >
> > > >
> > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users
> > > > <petsc-users at mcs.anl.gov> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Currently, it is possible to use the mumps solver in PETSc with the
> > > > -mat_mumps_use_omp_threads option, so that multiple MPI processes will
> > > > transfer the matrix and rhs data to the master rank, and then the
> > > > master rank will call mumps with OpenMP to solve the matrix.
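> > > >
> > > > Roughly, the usage (illustrative; I may be misremembering the exact
> > > > form) looks like
> > > >
> > > >     -pc_type lu -pc_factor_mat_solver_type mumps -mat_mumps_use_omp_threads 4
> > > >
> > > > with the number giving how many OpenMP threads the master rank uses.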
> > > >
> > > > I wonder if someone can develop a similar option for the cusparse
> > > > solver. Right now, this solver does not work with mpiaijcusparse. I
> > > > think a possible workaround is to transfer all the matrix data to one
> > > > MPI process, and then upload the data to the GPU to solve. In this
> > > > way, one can use the cusparse solver for an MPI program.
> > > >
> > > > Chang
> > > > --
> > > > Chang Liu
> > > > Staff Research Physicist
> > > > +1 609 243 3438
> > > > cliu at pppl.gov
> > > > Princeton Plasma Physics Laboratory
> > > > 100 Stellarator Rd, Princeton NJ 08540, USA
> > > >
> > >
> > > --
> > > Chang Liu
> > > Staff Research Physicist
> > > +1 609 243 3438
> > > cliu at pppl.gov
> > > Princeton Plasma Physics Laboratory
> > > 100 Stellarator Rd, Princeton NJ 08540, USA
> > >
> >
> > --
> > Chang Liu
> > Staff Research Physicist
> > +1 609 243 3438
> > cliu at pppl.gov
> > Princeton Plasma Physics Laboratory
> > 100 Stellarator Rd, Princeton NJ 08540, USA
> >
>
> --
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
> cliu at pppl.gov
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA
>
--
Chang Liu
Staff Research Physicist
+1 609 243 3438
cliu at pppl.gov
Princeton Plasma Physics Laboratory
100 Stellarator Rd, Princeton NJ 08540, USA