[petsc-dev] GPU performance of MatSOR()

Stephen Thomas stephethomas at gmail.com
Tue Aug 2 10:59:28 CDT 2022


Barry, Jed

Paul and I developed a polynomial Gauss-Seidel smoother and ILUTP/ILU(0)-based
smoothers that employ iterative triangular solves (a Neumann series, i.e.
Richardson iteration). These solves are faster than Jacobi, do not diverge, and
terminate in finitely many steps because the strictly upper triangular part of
U is nilpotent. We also use a row-scaled LDU factorization.
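
For anyone who has not seen the trick, here is a minimal sketch (plain
NumPy/SciPy with a made-up function name, not the Hypre or PeleLM code) of the
truncated Neumann-series triangular solve, where every term costs one sparse
mat-vec and the series terminates after n-1 terms because the scaled strict
upper triangle is nilpotent:

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def neumann_tri_solve(U, b, nterms=6):
        """Approximate x = U^{-1} b using nterms sparse mat-vecs.

        Write U = D (I + N) with D = diag(U) and N = D^{-1} * strict_upper(U).
        N is strictly upper triangular, hence nilpotent, so the series
        (I + N)^{-1} = sum_k (-N)^k terminates exactly after n-1 terms.
        """
        d = U.diagonal()
        Dinv = sp.diags(1.0 / d)
        N = (Dinv @ sp.triu(U, k=1)).tocsr()
        x = Dinv @ b                      # D^{-1} b, the k = 0 term
        term = x.copy()
        for _ in range(nterms):
            term = -(N @ term)            # each series term is one SpMV
            x += term
        return x

    # quick sanity check on a diagonally dominant upper triangular factor
    n = 200
    R = sp.random(n, n, density=0.05, format="csr", random_state=0)
    A = R + sp.diags(np.asarray(abs(R).sum(axis=1)).ravel() + 1.0)
    U = sp.triu(A, format="csr")
    b = np.ones(n)
    x_exact = spla.spsolve_triangular(U, b, lower=False)
    x_approx = neumann_tri_solve(U, b, nterms=6)
    print(np.linalg.norm(x_exact - x_approx) / np.linalg.norm(x_exact))

The number of terms needed depends on the matrix, as Paul notes below.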

Glad to send along more details; we have a paper in flight (a revision is
being sent to NLAA this week).

The problems we are solving with PeleLM are quite similar to those described
in the attached papers by Prenter et al. (2020) and Jomo et al. (2021), namely
cut-cell, immersed-boundary problems.

We are seeing a 5x speed-up on the NREL Eagle machine with NVIDIA V100 GPUs,
and a smaller 1.5-2x speed-up on Crusher with AMD MI250X GPUs (relative to the
ILUs with direct triangular solves). We also see 5x with the MFIX-Exa model
for ECP.

This work was motivated by Edmond Chow and Hartwig Anzt's work on Jacobi
iterations for triangular systems.

I also have a new GMRES formulation (I will be talking about it at CEED) that
is leading to good results for Krylov-Schur eigenvalue computations as well.

Cheers and best regards
Steve



On Tue, Aug 2, 2022 at 9:34 AM Paul Mullowney <paulmullowney at gmail.com>
wrote:

> The implementation is being (slowly) moved into Hypre. We have
> primarily used this technique with ILU-based smoothers for AMG. We did some
> comparisons against other smoothers like GS, but not against Chebyshev or
> polynomial smoothers.
>
> For the problems we cared about, ILU was an effective smoother. The power
> series representation of the solve provided some nice speedups. I've cc'ed
> Steve Thomas, who can say more.
>
> -Paul
>
> On Sun, Jul 31, 2022 at 10:14 PM Jed Brown <jed at jedbrown.org> wrote:
>
>> Do you have a test that compares this with a polynomial smoother for the
>> original problem (like Chebyshev for SPD)?
>>
>> Paul Mullowney <paulmullowney at gmail.com> writes:
>>
>> > One could also approximate the SOR triangular solves with a Neumann
>> > series, where each term in the series is an SpMV (great for GPUs). The
>> > number of terms needed in the series is matrix-dependent. We've seen this
>> > work to great effect for some problems.
>> >
>> > -Paul
>> >
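
To make Paul's point above concrete, here is a toy sketch (NumPy/SciPy with
made-up names, not the Hypre implementation) of one forward SOR sweep in which
the (D/omega + L)^{-1} solve is replaced by a truncated Neumann series, so the
whole sweep reduces to a handful of SpMVs:

    import numpy as np
    import scipy.sparse as sp

    def sor_sweep_neumann(A, x, b, omega=1.0, nterms=4):
        """One forward SOR sweep, x <- x + (D/omega + L)^{-1} (b - A x), with
        the triangular solve replaced by a truncated Neumann series of SpMVs."""
        d = A.diagonal()
        Dinv = sp.diags(1.0 / d)
        L = sp.tril(A, k=-1, format="csr")
        # (D/omega + L)^{-1} = (I + omega D^{-1} L)^{-1} (omega D^{-1});
        # omega D^{-1} L is strictly lower triangular, hence nilpotent.
        M = (omega * (Dinv @ L)).tocsr()
        z = omega * (Dinv @ (b - A @ x))
        term = z.copy()
        for _ in range(nterms):
            term = -(M @ term)            # one SpMV per series term
            z += term
        return x + z

    # toy usage on a symmetric, diagonally dominant sparse system
    n = 100
    R = sp.random(n, n, density=0.05, format="csr", random_state=1)
    S = R + R.T
    A = (S + sp.diags(np.asarray(abs(S).sum(axis=1)).ravel() + 1.0)).tocsr()
    b = np.ones(n)
    x = np.zeros(n)
    for _ in range(20):
        x = sor_sweep_neumann(A, x, b, omega=1.0, nterms=4)
    print(np.linalg.norm(b - A @ x))

Because the scaled strict lower triangle is nilpotent, taking enough terms
reproduces the exact sweep; how many terms are actually needed is
matrix-dependent, as Paul says.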
>> > On Wed, Jul 27, 2022 at 8:05 PM Barry Smith <bsmith at petsc.dev> wrote:
>> >
>> >>
>> >>   There are multicolor versions of SOR that theoretically offer good
>> >> parallelism on GPUs, but at the cost of multiple phases and slower
>> >> convergence rates. Unless someone already has one coded for CUDA or
>> >> Kokkos, it would take a good amount of code to produce one that offers
>> >> (but does not necessarily guarantee) reasonable performance on GPUs.
>> >>
>> >> > On Jul 27, 2022, at 7:57 PM, Jed Brown <jed at jedbrown.org> wrote:
>> >> >
>> >> > Unfortunately, MatSOR is a really bad operation for GPUs. We can make
>> >> > it use sparse triangular primitives from cuSPARSE, but those run on the
>> >> > GPU about 20x slower than MatMult with the same sparse matrix. So unless
>> >> > MatSOR reduces the iteration count by 20x compared to your next-best
>> >> > preconditioning option, you'll be better off finding a different
>> >> > preconditioner. This might be some elements of multigrid or polynomial
>> >> > smoothing with point-block Jacobi. If you can explain a bit about your
>> >> > application, we may be able to offer some advice.
>> >> >
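
For concreteness, one common way to act on Jed's suggestion in PETSc
(illustrative run-time options only; the right combination depends on the
problem) is to smooth inside an AMG hierarchy with Chebyshev plus point-block
Jacobi rather than SOR, keeping the matrix and vectors on the GPU:

    -ksp_type gmres -pc_type gamg \
    -mg_levels_ksp_type chebyshev -mg_levels_pc_type pbjacobi \
    -mat_type aijcusparse -vec_type cuda

With options along these lines the smoother is built entirely from SpMVs and
pointwise operations, which map well to the GPU.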
>> >> > Han Tran <hantran at cs.utah.edu> writes:
>> >> >
>> >> >> Hello,
>> >> >>
>> >> >> Running my example using VECMPICUDA for VecSetType() and
>> >> >> MATMPIAIJCUSPARSE for MatSetType(), I get the profiling results shown
>> >> >> below. It can be seen that MatSOR() has 0 in the GPU %F column and only
>> >> >> has a GpuToCpu count and size. Is it correct that PETSc currently does
>> >> >> not have MatSOR implemented on the GPU? I would appreciate an
>> >> >> explanation of how MatSOR() currently uses the GPU. In this example,
>> >> >> MatSOR takes considerable time relative to the other functions.
>> >> >>
>> >> >> Thank you.
>> >> >>
>> >> >> -Han
>> >> >>
>> >> >>
>> >> >> ------------------------------------------------------------------------------------------------------------------------
>> >> >> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>> >> >>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>> >> >>
>> >> >> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >> >>
>> >> >> --- Event Stage 0: Main Stage
>> >> >>
>> >> >> BuildTwoSided     220001 1.0 3.9580e+02139.9 0.00e+00 0.0 2.0e+00 4.0e+00 2.2e+05  4  0  0  0 20   4  0  0  0 20     0       0      0 0.00e+00    0 0.00e+00  0
>> >> >> BuildTwoSidedF    220000 1.0 3.9614e+02126.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.2e+05  4  0  0  0 20   4  0  0  0 20     0       0      0 0.00e+00    0 0.00e+00  0
>> >> >> VecMDot           386001 1.0 6.3426e+01 1.5 1.05e+11 1.0 0.0e+00 0.0e+00 3.9e+05  1 11  0  0 35   1 11  0  0 35  3311   26012   386001 1.71e+05    0 0.00e+00 100
>> >> >> VecNorm           496001 1.0 5.0877e+01 1.2 5.49e+10 1.0 0.0e+00 0.0e+00 5.0e+05  1  6  0  0 45   1  6  0  0 45  2159    3707   110000 4.87e+04    0 0.00e+00 100
>> >> >> VecScale          496001 1.0 7.9951e+00 1.0 2.75e+10 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  6869   13321      0 0.00e+00    0 0.00e+00 100
>> >> >> VecCopy           110000 1.0 1.9323e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> >> >> VecSet            330017 1.0 5.4319e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> >> >> VecAXPY           110000 1.0 1.5820e+00 1.0 1.22e+10 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0 15399   35566      0 0.00e+00    0 0.00e+00 100
>> >> >> VecMAXPY          496001 1.0 1.1505e+01 1.0 1.48e+11 1.0 0.0e+00 0.0e+00 0.0e+00  0 16  0  0  0   0 16  0  0  0 25665   39638      0 0.00e+00    0 0.00e+00 100
>> >> >> VecAssemblyBegin  110000 1.0 1.2021e+00 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+05  0  0  0  0 10   0  0  0  0 10     0       0      0 0.00e+00    0 0.00e+00  0
>> >> >> VecAssemblyEnd    110000 1.0 1.5988e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> >> >> VecScatterBegin   496001 1.0 1.3002e+01 1.0 0.00e+00 0.0 9.9e+05 1.3e+04 1.0e+00  0  0100100  0   0  0100100  0     0       0   110000 4.87e+04    0 0.00e+00  0
>> >> >> VecScatterEnd     496001 1.0 1.8988e+01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> >> >> VecNormalize      496001 1.0 5.8797e+01 1.1 8.24e+10 1.0 0.0e+00 0.0e+00 5.0e+05  1  9  0  0 45   1  9  0  0 45  2802    4881   110000 4.87e+04    0 0.00e+00 100
>> >> >> VecCUDACopyTo     716001 1.0 3.4483e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0       0   716001 3.17e+05    0 0.00e+00  0
>> >> >> VecCUDACopyFrom  1211994 1.0 5.1752e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0       0      0 0.00e+00 1211994 5.37e+05  0
>> >> >> MatMult           386001 1.0 4.8436e+01 1.0 1.90e+11 1.0 7.7e+05 1.3e+04 0.0e+00  1 21 78 78  0   1 21 78 78  0  7862   16962      0 0.00e+00    0 0.00e+00 100
>> >> >> MatMultAdd        110000 1.0 6.2666e+01 1.1 6.03e+10 1.0 2.2e+05 1.3e+04 1.0e+00  1  7 22 22  0   1  7 22 22  0  1926   16893   440000 3.39e+05    0 0.00e+00 100
>> >> >> MatSOR            496001 1.0 5.1821e+02 1.1 2.83e+11 1.0 0.0e+00 0.0e+00 0.0e+00 10 31  0  0  0  10 31  0  0  0  1090       0      0 0.00e+00 991994 4.39e+05  0
>> >> >> MatAssemblyBegin  110000 1.0 3.9732e+02109.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+05  4  0  0  0 10   4  0  0  0 10     0       0      0 0.00e+00    0 0.00e+00  0
>> >> >> MatAssemblyEnd    110000 1.0 5.3015e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> >> >> MatZeroEntries    110000 1.0 1.3179e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> >> >> MatCUSPARSCopyTo  220000 1.0 3.2805e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0       0   220000 2.41e+05    0 0.00e+00  0
>> >> >> KSPSetUp          110000 1.0 3.5344e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> >> >> KSPSolve          110000 1.0 6.8304e+02 1.0 8.20e+11 1.0 7.7e+05 1.3e+04 8.8e+05 13 89 78 78 80  13 89 78 78 80  2401   14311   496001 2.20e+05 991994 4.39e+05 66
>> >> >> KSPGMRESOrthog    386001 1.0 7.2820e+01 1.4 2.10e+11 1.0 0.0e+00 0.0e+00 3.9e+05  1 23  0  0 35   1 23  0  0 35  5765   30176   386001 1.71e+05    0 0.00e+00 100
>> >> >> PCSetUp           110000 1.0 1.8825e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> >> >> PCApply           496001 1.0 5.1857e+02 1.1 2.83e+11 1.0 0.0e+00 0.0e+00 0.0e+00 10 31  0  0  0  10 31  0  0  0  1090       0      0 0.00e+00 991994 4.39e+05  0
>> >> >> SFSetGraph             1 1.0 2.0936e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> >> >> SFSetUp                1 1.0 2.5347e-03 1.0 0.00e+00 0.0 4.0e+00 3.3e+03 1.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> >> >> SFPack            496001 1.0 3.0026e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> >> >> SFUnpack          496001 1.0 1.1296e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> >> >>
>> >> >> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >>
>> >>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2010.00881.pdf
Type: application/pdf
Size: 3724133 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20220802/021fa159/attachment-0003.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Prenter2020_Article_MultigridSolversForImmersedFin.pdf
Type: application/pdf
Size: 10568688 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20220802/021fa159/attachment-0004.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Post_Modern_GMRES (1).pdf
Type: application/pdf
Size: 800227 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20220802/021fa159/attachment-0005.pdf>

