[petsc-users] GPU implementation of serial smoothers

Mark Lohry mlohry at gmail.com
Tue Jan 10 13:54:23 CST 2023


>
> BTW, on unstructured grids, coloring requires a lot of colors and thus
> many times more bandwidth (due to multiple passes) than the operator itself.


I've noticed -- in AMGx the multicolor GS was generally dramatically slower
than Jacobi because of the many colors with only a few elements each.
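
For context, here's a rough sketch of how a multicolor GS sweep is typically
structured (hypothetical CSR layout and coloring arrays, not the AMGx or
PETSc code): every color is a separate pass over the matrix and vector data,
which is where the extra bandwidth goes when there are many small colors.

  /* Sketch of a multicolor Gauss-Seidel sweep over a CSR matrix.
     color_rows[] holds row indices grouped by color; color_start[c] ..
     color_start[c+1] delimits color c. Rows of the same color are
     independent, so each inner loop could be one parallel GPU kernel --
     but each color is a full extra pass over matrix and vector memory. */
  static void multicolor_gs_sweep(int ncolors, const int *color_start,
                                  const int *color_rows,
                                  const int *rowptr, const int *colind,
                                  const double *val, const double *b,
                                  double *x)
  {
    for (int c = 0; c < ncolors; ++c) {             /* serial over colors  */
      for (int k = color_start[c]; k < color_start[c+1]; ++k) {
        int    i    = color_rows[k];
        double diag = 1.0, sum = b[i];
        for (int j = rowptr[i]; j < rowptr[i+1]; ++j) {
          if (colind[j] == i) diag = val[j];        /* diagonal entry      */
          else                sum -= val[j] * x[colind[j]];
        }
        x[i] = sum / diag;                          /* Gauss-Seidel update */
      }
    }
  }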

> You can use sparse triangular kernels like ILU (provided by cuBLAS), but
> they are so mindbogglingly slow that you'll go back to the drawing board
> and try to use a multigrid method of some sort with polynomial/point-block
> smoothing.
>

I definitely need multigrid. I was under the impression that GAMG was
relatively CUDA-complete -- is that not the case? Which functionality runs
fully on the GPU and which doesn't, without any host transfers (aside from
what's needed for MPI)?

If I use -pc_type gamg -mg_levels_pc_type pbjacobi -mg_levels_ksp_type
richardson, is that fully on device, while -mg_levels_pc_type ilu or
-mg_levels_pc_type sor require transfers?
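
For concreteness, the kind of invocation I have in mind is roughly the
following (the executable name is a placeholder; I'm assuming
-mat_type aijcusparse / -vec_type cuda so the operators and vectors live on
the device, and that -log_view's CpuToGpu/GpuToCpu counters are the right
way to check for copies):

  ./mysolver -mat_type aijcusparse -vec_type cuda \
             -ksp_type gmres \
             -pc_type gamg \
             -mg_levels_ksp_type richardson \
             -mg_levels_pc_type pbjacobi \
             -log_view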


On Tue, Jan 10, 2023 at 2:47 PM Jed Brown <jed at jedbrown.org> wrote:

> The joy of GPUs. You can use sparse triangular kernels like ILU (provided
> by cuBLAS), but they are so mindbogglingly slow that you'll go back to the
> drawing board and try to use a multigrid method of some sort with
> polynomial/point-block smoothing.
>
> BTW, on unstructured grids, coloring requires a lot of colors and thus
> many times more bandwidth (due to multiple passes) than the operator itself.
>
> Mark Lohry <mlohry at gmail.com> writes:
>
> > Well that's suboptimal. What are my options for 100% GPU solves with no
> > host transfers?
> >
> > On Tue, Jan 10, 2023, 2:23 PM Barry Smith <bsmith at petsc.dev> wrote:
> >
> >>
> >>
> >> On Jan 10, 2023, at 2:19 PM, Mark Lohry <mlohry at gmail.com> wrote:
> >>
> >>> Is DILU a point-block method? We have -pc_type pbjacobi (and vpbjacobi
> >>> if the node size is not uniform). They are good choices for
> >>> scale-resolving CFD on GPUs.
> >>>
> >>
> >> I was hoping you'd know :)  pbjacobi is underperforming ilu by a pretty
> >> wide margin on some of the systems I'm looking at.
> >>
> >>> We don't have colored smoothers currently in PETSc.
> >>>
> >>
> >> So what happens under the hood when I run -mg_levels_pc_type sor on GPU?
> >> Are you actually decomposing the matrix into lower and computing updates
> >> with matrix multiplications? Or is it just the standard serial algorithm
> >> with thread safety ignored?
> >>
> >>
> >>   It is running the regular SOR on the CPU and needs to copy up the
> >> vector and copy down the result.
> >>
> >>
> >> On Tue, Jan 10, 2023 at 1:52 PM Barry Smith <bsmith at petsc.dev> wrote:
> >>
> >>>
> >>>   We don't have colored smoothers currently in PETSc.
> >>>
> >>> > On Jan 10, 2023, at 12:56 PM, Jed Brown <jed at jedbrown.org> wrote:
> >>> >
> >>> > Is DILU a point-block method? We have -pc_type pbjacobi (and vpbjacobi
> >>> > if the node size is not uniform). They are good choices for
> >>> > scale-resolving CFD on GPUs.
> >>> >
> >>> > Mark Lohry <mlohry at gmail.com> writes:
> >>> >
> >>> >> I'm running GAMG with CUDA, and I'm wondering how the nominally serial
> >>> >> smoother algorithms are implemented on GPU? Specifically SOR/GS and
> >>> >> ILU(0) -- in e.g. AMGx these are applied by first creating a coloring,
> >>> >> and the smoother passes are done color by color. Is this how it's done
> >>> >> in petsc AMG?
> >>> >>
> >>> >> Tangentially, AMGx and OpenFOAM offer something called "DILU",
> >>> >> diagonal ILU. Is there an equivalent in petsc?
> >>> >>
> >>> >> Thanks,
> >>> >> Mark
> >>>
> >>>
> >>
>