<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Oct 13, 2021 at 1:53 PM Barry Smith <<a href="mailto:bsmith@petsc.dev">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
Chang,<br>
<br>
You are correct; there are no MPI + GPU direct solvers that I am aware of that currently do the triangular solves with MPI + GPU parallelism. </blockquote><div><br></div><div>So SuperLU and MUMPS do the MPI solves on the CPU. That is reasonable. I have not been able to get decent performance with GPU solves: complex code and low arithmetic intensity (AI) are not a good fit for GPUs. No work and all latency.</div><div><br></div><div>Chang, you would find that GPU solves suck and, anyway, machines these days are configured with significant (high quality) CPU resources. I think you will find that you can't get GPU solves to beat CPU solves, except perhaps on enormous problems.<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">You are limited in that each individual triangular solve must be done on a single GPU. I can only suggest making each subdomain as big as possible, to utilize each GPU as much as possible for the direct triangular solves.<br>
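</blockquote><div><br></div><div>For reference, a minimal sketch of that kind of setup (one big block-Jacobi block per GPU, with cusparse doing the block LU solves; untested here, and your_app, NP and NGPU are just placeholders for the executable, the MPI rank count and the number of GPUs/blocks) would be something like:<br><br>mpiexec -n NP ./your_app -ksp_type fgmres -pc_type bjacobi -pc_bjacobi_blocks NGPU -mat_type aijcusparse -sub_ksp_type preonly -sub_pc_type lu -sub_pc_factor_mat_solver_type cusparse -log_view<br><br>with the -log_view output showing, per method, what percent of the work ran on the GPU.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">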
<br>
Barry<br>
<br>
<br>
> On Oct 13, 2021, at 12:16 PM, Chang Liu via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>
> <br>
> Hi Mark,<br>
> <br>
> '-mat_type aijcusparse' (which gives mpiaijcusparse) works with other solvers, but with -pc_factor_mat_solver_type cusparse it gives an error.<br>
> <br>
> Yes, what I want is to have mumps or superlu do the factorization, and then do the rest, including the GMRES solve, on the GPU. Is that possible?<br>
> <br>
> I have tried to use aijcusparse with superlu_dist; it runs, but the iterative solver still runs on the CPU. I have contacted the SuperLU group and they confirmed that is the case right now. But if I set -pc_factor_mat_solver_type cusparse, it seems that the iterative solver runs on the GPU.<br>
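> (i.e., a combination along the lines of -ksp_type gmres -pc_type lu -pc_factor_mat_solver_type superlu_dist -mat_type aijcusparse — a sketch of the intended setup, not the exact command line.)<br>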
> <br>
> Chang<br>
> <br>
> On 10/13/21 12:03 PM, Mark Adams wrote:<br>
>> On Wed, Oct 13, 2021 at 11:10 AM Chang Liu <<a href="mailto:cliu@pppl.gov" target="_blank">cliu@pppl.gov</a>> wrote:<br>
>> Thank you Junchao for explaining this. I guess in my case the code is just calling a seq solver like superlu to do the factorization on GPUs.<br>
>> My idea is that I want to have a traditional MPI code utilize GPUs with cusparse. Right now cusparse does not support the mpiaij matrix,<br>
>> <br>
>> Sure it does: '-mat_type aijcusparse' will give you an mpiaijcusparse matrix with more than one process (-mat_type mpiaijcusparse might also work with more than one process).<br>
>> However, grepping the repo, I see that all the mumps and superlu tests use the aij or sell matrix type.<br>
>> MUMPS and SuperLU provide their own solves, I assume... but you might want to do other matrix operations on the GPU. Is that the issue?<br>
>> Did you try -mat_type aijcusparse with MUMPS and/or SuperLU? Did it have a problem? (There is no test with it, so it probably does not work.)<br>
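>> (That is, something like: -mat_type aijcusparse -pc_type lu -pc_factor_mat_solver_type mumps — just a sketch of the combination to try; untested.)<br>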
>> Thanks,<br>
>> Mark<br>
>> <br>
>> so I want the code to have an mpiaij matrix when adding all the matrix terms, and then transform the matrix to seqaij when doing the factorization and solve. This involves sending the data to the master process, and I think the petsc mumps solver has something similar already.<br>
>> Chang<br>
>> On 10/13/21 10:18 AM, Junchao Zhang wrote:<br>
>> ><br>
>> ><br>
>> ><br>
>> > On Tue, Oct 12, 2021 at 1:07 PM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br>
>> ><br>
>> > On Tue, Oct 12, 2021 at 1:45 PM Chang Liu <<a href="mailto:cliu@pppl.gov" target="_blank">cliu@pppl.gov</a>> wrote:<br>
>> ><br>
>> > Hi Mark,<br>
>> ><br>
>> > The options I use are like:<br>
>> ><br>
>> > -pc_type bjacobi -pc_bjacobi_blocks 16 -ksp_type fgmres -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse -sub_ksp_type preonly -sub_pc_type lu -ksp_max_it 2000 -ksp_rtol 1.e-300 -ksp_atol 1.e-300<br>
>> ><br>
>> ><br>
>> > Note: if you use -log_view, the last column (the rows are the methods, like MatFactorNumeric) has the percent of work on the GPU.<br>
>> ><br>
>> > Junchao: *this* implies that we have a cuSparse LU factorization. Is that correct? (I don't think we do.)<br>
>> ><br>
>> > No, we don't have a cuSparse LU factorization. If you check MatLUFactorSymbolic_SeqAIJCUSPARSE(), you will find it calls MatLUFactorSymbolic_SeqAIJ() instead.<br>
>> > So I don't understand Chang's idea. Do you want to make bigger blocks?<br>
>> ><br>
>> ><br>
>> > I think this one does both the factorization and the solve on the GPU.<br>
>> ><br>
>> > You can check the runex72_aijcusparse.sh file in the petsc install directory and try it yourself (this is only the LU factorization, without an iterative solve).<br>
>> ><br>
>> > Chang<br>
>> ><br>
>> > On 10/12/21 1:17 PM, Mark Adams wrote:<br>
>> > ><br>
>> > ><br>
>> > > On Tue, Oct 12, 2021 at 11:19 AM Chang Liu <<a href="mailto:cliu@pppl.gov" target="_blank">cliu@pppl.gov</a>> wrote:<br>
>> > ><br>
>> > > Hi Junchao,<br>
>> > ><br>
>> > > No, I only need it to be transferred within a node. I use the block-Jacobi method and GMRES to solve the sparse matrix, so each direct solver will take care of a sub-block of the whole matrix. In this way, I can use one GPU to solve one sub-block, which is stored within one node.<br>
>> > ><br>
>> > > It was stated in the documentation that the cusparse solver is slow. However, in my test using ex72.c, the cusparse solver is faster than mumps or superlu_dist on CPUs.<br>
>> > ><br>
>> > ><br>
>> > > Are we talking about the factorization, the solve, or both?<br>
>> > ><br>
>> > > We do not have an interface to cuSparse's LU factorization (I just learned that it exists a few weeks ago).<br>
>> > > Perhaps your fast "cusparse solver" is '-pc_type lu -mat_type aijcusparse'? This would be the CPU factorization, which is the dominant cost.<br>
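>> > > (That is, something like: -pc_type lu -mat_type aijcusparse — just a sketch; the numeric factorization would run in PETSc's CPU code, with only the triangular solves and other Mat/Vec operations possibly on the GPU.)<br>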
>> > ><br>
>> > ><br>
>> > > Chang<br>
>> > ><br>
>> > > On 10/12/21 10:24 AM, Junchao Zhang wrote:<br>
>> > > > Hi, Chang,<br>
>> > > > For the mumps solver, we usually transfer matrix and vector data within a compute node. For the idea you propose, it looks like we need to gather data within MPI_COMM_WORLD, right?<br>
>> > > ><br>
>> > > > Mark, I remember you said the cusparse solve is slow and you would rather do it on the CPU. Is that right?<br>
>> > > ><br>
>> > > > --Junchao Zhang<br>
>> > > ><br>
>> > > ><br>
>> > > > On Mon, Oct 11, 2021 at 10:25 PM Chang Liu via petsc-users <<a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>
>> > > ><br>
>> > > > Hi,<br>
>> > > ><br>
>> > > > Currently, it is possible to use the mumps solver in PETSc with the -mat_mumps_use_omp_threads option, so that multiple MPI processes will transfer the matrix and rhs data to the master rank, and then the master rank will call mumps with OpenMP to solve the matrix.<br>
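>> > > > (For example, something along the lines of: -pc_type lu -pc_factor_mat_solver_type mumps -mat_mumps_use_omp_threads 4 — a sketch, with the thread count just a placeholder, assuming an OpenMP-enabled PETSc/MUMPS build and OMP_NUM_THREADS set accordingly.)<br>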
>> > > ><br>
>> > > > I wonder if someone can develop a similar option for the cusparse solver. Right now, this solver does not work with mpiaijcusparse. I think a possible workaround is to transfer all the matrix data to one MPI process, and then upload the data to the GPU to solve. In this way, one can use the cusparse solver for an MPI program.<br>
>> > > ><br>
>> > > > Chang<br>
>> > > > --<br>
>> > > > Chang Liu<br>
>> > > > Staff Research Physicist<br>
>> > > > +1 609 243 3438<br>
>> > > > <a href="mailto:cliu@pppl.gov" target="_blank">cliu@pppl.gov</a><br>
>> > > > Princeton Plasma Physics Laboratory<br>
>> > > > 100 Stellarator Rd, Princeton NJ 08540, USA<br>
>> > > ><br>
>> > ><br>
>> > > --<br>
>> > > Chang Liu<br>
>> > > Staff Research Physicist<br>
>> > > +1 609 243 3438<br>
>> > > <a href="mailto:cliu@pppl.gov" target="_blank">cliu@pppl.gov</a><br>
>> > > Princeton Plasma Physics Laboratory<br>
>> > > 100 Stellarator Rd, Princeton NJ 08540, USA<br>
>> > ><br>
>> ><br>
>> > --<br>
>> > Chang Liu<br>
>> > Staff Research Physicist<br>
>> > +1 609 243 3438<br>
>> > <a href="mailto:cliu@pppl.gov" target="_blank">cliu@pppl.gov</a><br>
>> > Princeton Plasma Physics Laboratory<br>
>> > 100 Stellarator Rd, Princeton NJ 08540, USA<br>
>> ><br>
>> -- <br>
>> Chang Liu<br>
>> Staff Research Physicist<br>
>> +1 609 243 3438<br>
>> <a href="mailto:cliu@pppl.gov" target="_blank">cliu@pppl.gov</a><br>
>> Princeton Plasma Physics Laboratory<br>
>> 100 Stellarator Rd, Princeton NJ 08540, USA<br>
> <br>
> -- <br>
> Chang Liu<br>
> Staff Research Physicist<br>
> +1 609 243 3438<br>
> <a href="mailto:cliu@pppl.gov" target="_blank">cliu@pppl.gov</a><br>
> Princeton Plasma Physics Laboratory<br>
> 100 Stellarator Rd, Princeton NJ 08540, USA<br>
<br>
</blockquote></div></div>