On Wed, Oct 13, 2021 at 11:10 AM Chang Liu <cliu@pppl.gov> wrote:
> Thank you Junchao for explaining this. I guess in my case the code is
> just calling a sequential solver like SuperLU to do the factorization on GPUs.
>
> My idea is that I want a traditional MPI code to utilize GPUs
> with cusparse. Right now cusparse does not support the mpiaij matrix type,

Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse matrix with more than one process.
('-mat_type mpiaijcusparse' might also work with more than one process.)

However, grepping the repo, I see that all the MUMPS and SuperLU tests use the aij or sell matrix type.
MUMPS and SuperLU provide their own solves, I assume, but you might want to do other matrix operations on the GPU. Is that the issue?
Did you try -mat_type aijcusparse with MUMPS and/or SuperLU and hit a problem? (There is no test with it, so it probably does not work.)

Thanks,
Mark
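For what it's worth, a quick experiment along these lines would answer that question. This is only a sketch: the executable, the -f matrix file, and the rank count are placeholders, and whether the aijcusparse + MUMPS/SuperLU combination works at all is exactly what is in doubt here:

    # combine a GPU matrix type with an external direct solver for the
    # factorization; -log_view reports how much work actually ran on the GPU
    mpiexec -n 4 ./my_app -f mymatrix.bin \
        -mat_type aijcusparse -vec_type cuda \
        -ksp_type preonly -pc_type lu \
        -pc_factor_mat_solver_type mumps \
        -log_view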
> so I
> want the code to have an mpiaij matrix when adding all the matrix terms,
> and then transform the matrix to seqaij when doing the factorization and
> solve. This involves sending the data to the master process, and I think
> the petsc mumps solver has something similar already.
>
> Chang
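(For reference, the existing MUMPS machinery being alluded to is the -mat_mumps_use_omp_threads path. A sketch, with a placeholder executable and thread count, and assuming PETSc was configured with OpenMP and hwloc support:

    # ranks within a node hand their pieces of the matrix and rhs to a
    # master rank, which then runs MUMPS with OpenMP threads (4 here,
    # an arbitrary choice for the sketch)
    mpiexec -n 16 ./my_app \
        -ksp_type preonly -pc_type lu \
        -pc_factor_mat_solver_type mumps \
        -mat_mumps_use_omp_threads 4

The request in this thread is an analogous gather-to-one-rank path for the cusparse factor type.)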
>
> On 10/13/21 10:18 AM, Junchao Zhang wrote:
>
> On Tue, Oct 12, 2021 at 1:07 PM Mark Adams <mfadams@lbl.gov> wrote:
>
> On Tue, Oct 12, 2021 at 1:45 PM Chang Liu <cliu@pppl.gov> wrote:
>
> Hi Mark,
>
> The options I use are like
>
> -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres -mat_type
> aijcusparse -sub_pc_factor_mat_solver_type cusparse -sub_ksp_type
> preonly -sub_pc_type lu -ksp_max_it 2000 -ksp_rtol 1.e-300
> -ksp_atol 1.e-300
>
>
> Note: if you use -log_view, the last column (the rows are methods like
> MatFactorNumeric) has the percent of work done on the GPU.
>
> Junchao: *This* implies that we have a cuSparse LU factorization. Is
> that correct? (I don't think we do.)
>
> No, we don't have cuSparse LU factorization. If you check
> MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls
> MatLUFactorSymbolic_SeqAIJ() instead.
> So I don't understand Chang's idea. Do you want to make bigger blocks?
>
> I think this one does both the factorization and the solve on the GPU.
>
> You can check the runex72_aijcusparse.sh file in the PETSc install
> directory and try it yourself (this is only the LU factorization,
> without the iterative solve).
>
> Chang
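For reference, spelled out as one command line, that run would look roughly like the following. This is a sketch only: the ex72 binary location, the -f matrix-file option, and the rank count are placeholders, and the solver options are simply the ones quoted earlier in this thread, not re-verified here:

    # block-Jacobi with one sequential cusparse LU per block; the last
    # column of -log_view shows the GPU share of MatLUFactorNum/MatSolve
    mpiexec -n 16 ./ex72 -f mymatrix.bin \
        -mat_type aijcusparse \
        -ksp_type fgmres -ksp_max_it 2000 -ksp_rtol 1.e-300 -ksp_atol 1.e-300 \
        -pc_type bjacobi -pc_bjacobi_blocks 16 \
        -sub_ksp_type preonly -sub_pc_type lu \
        -sub_pc_factor_mat_solver_type cusparse \
        -log_view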
>
> On 10/12/21 1:17 PM, Mark Adams wrote:
> >
> > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu <cliu@pppl.gov> wrote:
> >
> > Hi Junchao,
> >
> > No, I only need it to be transferred within a node. I use the
> > block-Jacobi method and GMRES to solve the sparse matrix, so each
> > direct solver will take care of a sub-block of the whole matrix.
> > In this way, I can use one GPU to solve one sub-block, which is
> > stored within one node.
> >
> > It was stated in the documentation that the cusparse solver is slow.
> > However, in my test using ex72.c, the cusparse solver is faster than
> > mumps or superlu_dist on CPUs.
> >
> > Are we talking about the factorization, the solve, or both?
> >
> > We do not have an interface to cuSparse's LU factorization (I just
> > learned that it exists a few weeks ago).
> > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type
> > aijcusparse'? This would be the CPU factorization, which is the
> > dominant cost.
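A generic way to check which case you are in (placeholder command; append it to whatever run you are timing): -ksp_view prints, for each (sub-)solver, the matrix type and the package used to perform the factorization, so it will show whether the factor is actually cusparse, mumps, or the default PETSc one:

    mpiexec -n 16 ./ex72 -f mymatrix.bin <solver options as above> -ksp_view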
> >
> > Chang
> >
> > On 10/12/21 10:24 AM, Junchao Zhang wrote:
> > > Hi, Chang,
> > >   For the mumps solver, we usually transfer matrix and vector data
> > > within a compute node. For the idea you propose, it looks like we
> > > need to gather data within MPI_COMM_WORLD, right?
> > >
> > >   Mark, I remember you said the cusparse solve is slow and you
> > > would rather do it on the CPU. Is that right?
> > >
> > > --Junchao Zhang
> > >
> > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users
> > > <petsc-users@mcs.anl.gov> wrote:
> > >
> > >     Hi,
> > >
> > >     Currently, it is possible to use the mumps solver in PETSc with
> > >     the -mat_mumps_use_omp_threads option, so that multiple MPI
> > >     processes will transfer the matrix and rhs data to the master
> > >     rank, and then the master rank will call mumps with OpenMP to
> > >     solve the matrix.
> > >
> > >     I wonder if someone can develop a similar option for the
> > >     cusparse solver. Right now, this solver does not work with
> > >     mpiaijcusparse. I think a possible workaround is to transfer
> > >     all the matrix data to one MPI process, and then upload the
> > >     data to the GPU to solve. In this way, one can use the cusparse
> > >     solver for an MPI program.
> > >
> > >     Chang
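(Concretely, the combination that does not work today is the naive one, sketched here with a placeholder executable: a parallel aijcusparse matrix handed directly to a cusparse LU factorization of the whole system:

    # unsupported with more than one rank at the time of this thread,
    # because the cusparse factor type only handles sequential
    # (seqaijcusparse) matrices
    mpiexec -n 4 ./my_app -mat_type aijcusparse -vec_type cuda \
        -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type cusparse

hence the request to gather the matrix onto one rank first.)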
>
> --
> Chang Liu
> Staff Research Physicist
> +1 609 243 3438
> cliu@pppl.gov
> Princeton Plasma Physics Laboratory
> 100 Stellarator Rd, Princeton NJ 08540, USA