<div dir="ltr"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">BTW, on unstructured grids, coloring requires a lot of colors and thus many times more bandwidth (due to multiple passes) than the operator itself.</blockquote><div><br></div><div>I've noticed -- in AMGx the multicolor GS was generally dramatically slower than Jacobi because it produced many colors with few elements each.<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>You can use sparse triangular kernels like ILU (provided by cuSPARSE), but they are so mindbogglingly slow that you'll go back to the drawing board and try to use a multigrid method of some sort with polynomial/point-block smoothing.</div></blockquote><div><br></div><div>I definitely need multigrid. I was under the impression that GAMG was relatively CUDA-complete; is that not the case? Which functionality runs fully on the GPU, with no host transfers (aside from what's needed for MPI), and which doesn't?<br><br>If I use -pc_type gamg -mg_levels_pc_type pbjacobi -mg_levels_ksp_type richardson, is that fully on device, while -mg_levels_pc_type ilu or -mg_levels_pc_type sor require transfers?<br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jan 10, 2023 at 2:47 PM Jed Brown <<a href="mailto:jed@jedbrown.org">jed@jedbrown.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">The joy of GPUs. You can use sparse triangular kernels like ILU (provided by cuSPARSE), but they are so mindbogglingly slow that you'll go back to the drawing board and try to use a multigrid method of some sort with polynomial/point-block smoothing.<br>

<br>

BTW, on unstructured grids, coloring requires a lot of colors and thus many times more bandwidth (due to multiple passes) than the operator itself.<br>

<br>

Mark Lohry <<a href="mailto:mlohry@gmail.com" target="_blank">mlohry@gmail.com</a>> writes:<br>

<br>

> Well that's suboptimal. What are my options for 100% GPU solves with no<br>

> host transfers?<br>

><br>

> On Tue, Jan 10, 2023, 2:23 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br>

><br>

>><br>

>><br>

>> On Jan 10, 2023, at 2:19 PM, Mark Lohry <<a href="mailto:mlohry@gmail.com" target="_blank">mlohry@gmail.com</a>> wrote:<br>

>><br>

>> Is DILU a point-block method? We have -pc_type pbjacobi (and vpbjacobi if<br>

>>> the node size is not uniform). These are good choices for scale-resolving CFD<br>

>>> on GPUs.<br>

>>><br>

>><br>

>> I was hoping you'd know :)  pbjacobi is underperforming ilu by a pretty<br>

>> wide margin on some of the systems I'm looking at.<br>

>><br>

>> We don't have colored smoothers currently in PETSc.<br>

>>><br>

>><br>

>> So what happens under the hood when I run -mg_levels_pc_type sor on GPU?<br>

>> Are you actually decomposing the matrix into lower and computing updates<br>

>> with matrix multiplications? Or is it just the standard serial algorithm<br>

>> with thread safety ignored?<br>

>><br>

>><br>

>>   It is running the regular SOR on the CPU and needs to copy up the vector<br>

>> and copy down the result.<br>

>><br>

>><br>

>> On Tue, Jan 10, 2023 at 1:52 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br>

>><br>

>>><br>

>>>   We don't have colored smoothers currently in PETSc.<br>

>>><br>

>>> > On Jan 10, 2023, at 12:56 PM, Jed Brown <<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>> wrote:<br>

>>> ><br>

>>> > Is DILU a point-block method? We have -pc_type pbjacobi (and vpbjacobi<br>

>>> if the node size is not uniform). These are good choices for scale-resolving<br>

>>> CFD on GPUs.<br>

>>> ><br>

>>> > Mark Lohry <<a href="mailto:mlohry@gmail.com" target="_blank">mlohry@gmail.com</a>> writes:<br>

>>> ><br>

>>> >> I'm running GAMG with CUDA, and I'm wondering how the nominally serial<br>

>>> >> smoother algorithms are implemented on GPU? Specifically SOR/GS and<br>

>>> ILU(0)<br>

>>> >> -- in e.g. AMGx these are applied by first creating a coloring, and the<br>

>>> >> smoother passes are done color by color. Is this how it's done in<br>

>>> PETSc AMG?<br>

>>> >><br>

>>> >> Tangential, AMGx and OpenFOAM offer something called "DILU", diagonal<br>

>>> ILU.<br>

>>> >> Is there an equivalent in PETSc?<br>

>>> >><br>

>>> >> Thanks,<br>

>>> >> Mark<br>

>>><br>

>>><br>

>><br>

</blockquote></div>
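<div><br></div>
<div dir="ltr">P.S. For anyone following the thread, here is a rough sketch of what a Richardson iteration preconditioned with point-block Jacobi computes, and why it maps well to a GPU: every block update reads only the previous iterate, so there is no ordering or coloring to respect. This is illustrative pure Python on dense lists, not PETSc's implementation. The function names, the fixed 2x2 block size, and the small test matrix are all made up for the example.</div>
<pre>
```python
# Illustrative sketch only: dense, pure-Python point-block Jacobi + Richardson.
# Hypothetical names and a fixed 2x2 block size; this is not PETSc code.

def invert_2x2(m):
    """Return the inverse of a 2x2 block given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(A, x):
    """Dense matrix-vector product A @ x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]


def pbjacobi_richardson(A, b, omega=1.0, iters=50):
    """Iterate x = x + omega * D^{-1} (b - A x), D = 2x2 block diagonal of A."""
    n = len(b)
    # Invert each diagonal block once, up front.
    inv_blocks = [
        invert_2x2([A[i][i:i + 2], A[i + 1][i:i + 2]]) for i in range(0, n, 2)
    ]
    x = [0.0] * n
    for _ in range(iters):
        # The residual uses only the previous iterate ...
        Ax = matvec(A, x)
        r = [bi - axi for bi, axi in zip(b, Ax)]
        # ... so every block update below is independent of the others.
        for k, inv in enumerate(inv_blocks):
            i = 2 * k
            x[i] += omega * (inv[0][0] * r[i] + inv[0][1] * r[i + 1])
            x[i + 1] += omega * (inv[1][0] * r[i] + inv[1][1] * r[i + 1])
    return x


# Small block-diagonally-dominant test system with a known solution.
A = [
    [4.0, 1.0, 0.5, 0.0],
    [1.0, 3.0, 0.0, 0.5],
    [0.5, 0.0, 4.0, 1.0],
    [0.0, 0.5, 1.0, 3.0],
]
x_true = [1.0, 2.0, 3.0, 4.0]
b = matvec(A, x_true)
x = pbjacobi_richardson(A, b)
print(x)
```
</pre>
<div dir="ltr">By contrast, SOR/GS would use the already-updated entries of x inside the inner loop, which is what forces either a serial sweep or a coloring with one pass per color.</div>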