[petsc-dev] GPU performance of MatSOR()
Stephen Thomas
stephethomas at gmail.com
Tue Aug 2 10:59:28 CDT 2022
Barry, Jed
Paul and I developed a polynomial Gauss-Seidel smoother and ILUTP- and
ILU(0)-based smoothers that employ iterative triangular solves (a Neumann
series, i.e. Richardson iteration). These are faster than Jacobi, do not
diverge, and terminate in finitely many steps because the strictly upper
triangular part of U is nilpotent; we also use a row-scaled LDU
factorization. Glad to send along more details, as we have a paper in
flight (a revision is being sent to NLAA this week).
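If a concrete sketch helps, this is the basic idea in plain scipy (not our
Hypre/PeleLM code; the function name, the number of terms, and the test
matrix below are purely illustrative):

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def neumann_tri_solve(L, b, num_terms):
    """Approximately solve L x = b for triangular L with a truncated
    Neumann series.  Write L = D (I - N) with D = diag(L), so that
    L^{-1} = (I + N + N^2 + ...) D^{-1}.  N is strictly triangular and
    hence nilpotent, so the series terminates exactly after at most n
    terms; each additional term costs one SpMV."""
    d_inv = 1.0 / L.diagonal()
    N = sp.identity(L.shape[0], format="csr") - sp.diags(d_inv) @ L
    t = d_inv * b              # k = 0 term: D^{-1} b
    x = t.copy()
    for _ in range(num_terms):
        t = N @ t              # one SpMV per extra series term
        x += t
    return x

# tiny check against an exact triangular solve
n = 50
L = sp.csr_matrix(np.eye(n) + 0.01 * np.tril(np.random.rand(n, n), k=-1))
b = np.random.rand(n)
print(np.linalg.norm(neumann_tri_solve(L, b, 5)
                     - spla.spsolve_triangular(L, b, lower=True)))

In the actual smoothers the same recurrence is applied to the ILU factors;
as Paul notes below, the number of terms needed is matrix dependent.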
The problems we are solving with PeleLM are quite similar to those
described in the attached papers by Prenter et al. (2020) and Jomo et al.
(2021), namely cut-cell, immersed-boundary problems.
We are seeing a 5x speed-up on the NREL Eagle machine with NVIDIA V100
GPUs, and a reduced 1.5-2x speed-up on Crusher with AMD MI250X GPUs (over
the ILU variants that use direct triangular solves). We also see 5x with
the MFIX-Exa model for ECP.
This work was motivated by Edmond Chow and Hartwig Anzt's work on Jacobi
iterations for triangular systems.
I also have a new GMRES formulation (which I am talking about at CEED)
that is leading to good results for Krylov-Schur eigenvalues as well.
Cheers and best regards
Steve
On Tue, Aug 2, 2022 at 9:34 AM Paul Mullowney <paulmullowney at gmail.com>
wrote:
> The implementation is being (slowly) moved into Hypre. We have
> primarily used this technique with ILU-based smoothers for AMG. We did some
> comparisons against other smoothers like Gauss-Seidel, but not with
> Chebyshev or polynomial smoothers.
>
> For the problems we cared about, ILU was an effective smoother. The power
> series representation of the solve provided some nice speedups. I've
> cc'ed Steve Thomas, who could say more.
>
> -Paul
>
> On Sun, Jul 31, 2022 at 10:14 PM Jed Brown <jed at jedbrown.org> wrote:
>
>> Do you have a test that compares this with a polynomial smoother for the
>> original problem (like Chebyshev for SPD)?
>>
>> Paul Mullowney <paulmullowney at gmail.com> writes:
>>
>> > One could also approximate the SOR triangular solves with a Neumann
>> > series, where each term in the series is an SpMV (great for GPUs). The
>> > number of terms needed in the series is matrix dependent.
>> > We've seen this work to great effect for some problems.
>> >
>> > -Paul
>> >
>> > On Wed, Jul 27, 2022 at 8:05 PM Barry Smith <bsmith at petsc.dev> wrote:
>> >
>> >>
>> >> There are multicolor versions of SOR that theoretically offer good
>> >> parallelism on GPUs but at the cost of multiple phases and slower
>> >> convergence rates. Unless someone already has one coded for CUDA or
>> >> Kokkos it would take a good amount of code to produce one that offers
>> >> (but does not necessarily guarantee) reasonable performance on GPUs.
>> >>
>> >> > On Jul 27, 2022, at 7:57 PM, Jed Brown <jed at jedbrown.org> wrote:
>> >> >
>> >> > Unfortunately, MatSOR is a really bad operation for GPUs. We can
>> >> > make it use sparse triangular primitives from cuSPARSE, but those
>> >> > run on the GPU about 20x slower than MatMult with the same sparse
>> >> > matrix. So unless MatSOR reduces iteration count by 20x compared to
>> >> > your next-best preconditioning option, you'll be better off finding
>> >> > a different preconditioner. This might be some elements of multigrid
>> >> > or polynomial smoothing with point-block Jacobi. If you can explain
>> >> > a bit about your application, we may be able to offer some advice.
>> >> >
>> >> > Han Tran <hantran at cs.utah.edu> writes:
>> >> >
>> >> >> Hello,
>> >> >>
>> >> >> Running my example using VECMPICUDA for VecSetType() and
>> >> >> MATMPIAIJCUSPARSE for MatSetType(), I have the profiling results
>> >> >> shown below. It is seen that MatSOR() has 0 %F on the GPU and only
>> >> >> has a GpuToCpu count and size. Is it correct that PETSc currently
>> >> >> does not have MatSOR implemented on GPU? It would be appreciated if
>> >> >> you could provide an explanation of how MatSOR() currently uses the
>> >> >> GPU. In this example, MatSOR takes considerable time compared to
>> >> >> the other functions.
>> >> >>
>> >> >> Thank you.
>> >> >>
>> >> >> -Han
>> >> >>
>> >> >>
>> >> >> ------------------------------------------------------------------------------------------------------------------------
>> >> >> Event                 Count      Time (sec)      Flop                                --- Global ---   --- Stage ----  Total     GPU    - CpuToGpu -     - GpuToCpu -  GPU
>> >> >>                     Max Ratio  Max      Ratio   Max      Ratio  Mess    AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s  Count   Size    Count   Size  %F
>> >> >> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >> >>
>> >> >> --- Event Stage 0: Main Stage
>> >> >>
>> >> >> BuildTwoSided      220001 1.0 3.9580e+02 139.9 0.00e+00 0.0 2.0e+00 4.0e+00 2.2e+05  4  0  0  0 20   4  0  0  0 20     0       0      0 0.00e+00      0 0.00e+00   0
>> >> >> BuildTwoSidedF     220000 1.0 3.9614e+02 126.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.2e+05  4  0  0  0 20   4  0  0  0 20     0       0      0 0.00e+00      0 0.00e+00   0
>> >> >> VecMDot            386001 1.0 6.3426e+01   1.5 1.05e+11 1.0 0.0e+00 0.0e+00 3.9e+05  1 11  0  0 35   1 11  0  0 35  3311   26012 386001 1.71e+05      0 0.00e+00 100
>> >> >> VecNorm            496001 1.0 5.0877e+01   1.2 5.49e+10 1.0 0.0e+00 0.0e+00 5.0e+05  1  6  0  0 45   1  6  0  0 45  2159    3707 110000 4.87e+04      0 0.00e+00 100
>> >> >> VecScale           496001 1.0 7.9951e+00   1.0 2.75e+10 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  6869   13321      0 0.00e+00      0 0.00e+00 100
>> >> >> VecCopy            110000 1.0 1.9323e+00   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00      0 0.00e+00   0
>> >> >> VecSet             330017 1.0 5.4319e+00   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00      0 0.00e+00   0
>> >> >> VecAXPY            110000 1.0 1.5820e+00   1.0 1.22e+10 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0 15399   35566      0 0.00e+00      0 0.00e+00 100
>> >> >> VecMAXPY           496001 1.0 1.1505e+01   1.0 1.48e+11 1.0 0.0e+00 0.0e+00 0.0e+00  0 16  0  0  0   0 16  0  0  0 25665   39638      0 0.00e+00      0 0.00e+00 100
>> >> >> VecAssemblyBegin   110000 1.0 1.2021e+00   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+05  0  0  0  0 10   0  0  0  0 10     0       0      0 0.00e+00      0 0.00e+00   0
>> >> >> VecAssemblyEnd     110000 1.0 1.5988e-01   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00      0 0.00e+00   0
>> >> >> VecScatterBegin    496001 1.0 1.3002e+01   1.0 0.00e+00 0.0 9.9e+05 1.3e+04 1.0e+00  0  0 100 100 0   0  0 100 100 0    0       0 110000 4.87e+04      0 0.00e+00   0
>> >> >> VecScatterEnd      496001 1.0 1.8988e+01   1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00      0 0.00e+00   0
>> >> >> VecNormalize       496001 1.0 5.8797e+01   1.1 8.24e+10 1.0 0.0e+00 0.0e+00 5.0e+05  1  9  0  0 45   1  9  0  0 45  2802    4881 110000 4.87e+04      0 0.00e+00 100
>> >> >> VecCUDACopyTo      716001 1.0 3.4483e+01   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0       0 716001 3.17e+05      0 0.00e+00   0
>> >> >> VecCUDACopyFrom   1211994 1.0 5.1752e+01   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0       0      0 0.00e+00 1211994 5.37e+05   0
>> >> >> MatMult            386001 1.0 4.8436e+01   1.0 1.90e+11 1.0 7.7e+05 1.3e+04 0.0e+00  1 21 78 78  0   1 21 78 78  0  7862   16962      0 0.00e+00      0 0.00e+00 100
>> >> >> MatMultAdd         110000 1.0 6.2666e+01   1.1 6.03e+10 1.0 2.2e+05 1.3e+04 1.0e+00  1  7 22 22  0   1  7 22 22  0  1926   16893 440000 3.39e+05      0 0.00e+00 100
>> >> >> MatSOR             496001 1.0 5.1821e+02   1.1 2.83e+11 1.0 0.0e+00 0.0e+00 0.0e+00 10 31  0  0  0  10 31  0  0  0  1090       0      0 0.00e+00 991994 4.39e+05   0
>> >> >> MatAssemblyBegin   110000 1.0 3.9732e+02 109.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+05  4  0  0  0 10   4  0  0  0 10     0       0      0 0.00e+00      0 0.00e+00   0
>> >> >> MatAssemblyEnd     110000 1.0 5.3015e-01   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00      0 0.00e+00   0
>> >> >> MatZeroEntries     110000 1.0 1.3179e+01   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00      0 0.00e+00   0
>> >> >> MatCUSPARSCopyTo   220000 1.0 3.2805e+01   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0       0 220000 2.41e+05      0 0.00e+00   0
>> >> >> KSPSetUp           110000 1.0 3.5344e-02   1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00      0 0.00e+00   0
>> >> >> KSPSolve           110000 1.0 6.8304e+02   1.0 8.20e+11 1.0 7.7e+05 1.3e+04 8.8e+05 13 89 78 78 80  13 89 78 78 80  2401   14311 496001 2.20e+05 991994 4.39e+05  66
>> >> >> KSPGMRESOrthog     386001 1.0 7.2820e+01   1.4 2.10e+11 1.0 0.0e+00 0.0e+00 3.9e+05  1 23  0  0 35   1 23  0  0 35  5765   30176 386001 1.71e+05      0 0.00e+00 100
>> >> >> PCSetUp            110000 1.0 1.8825e-02   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00      0 0.00e+00   0
>> >> >> PCApply            496001 1.0 5.1857e+02   1.1 2.83e+11 1.0 0.0e+00 0.0e+00 0.0e+00 10 31  0  0  0  10 31  0  0  0  1090       0      0 0.00e+00 991994 4.39e+05   0
>> >> >> SFSetGraph              1 1.0 2.0936e-05   1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00      0 0.00e+00   0
>> >> >> SFSetUp                 1 1.0 2.5347e-03   1.0 0.00e+00 0.0 4.0e+00 3.3e+03 1.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00      0 0.00e+00   0
>> >> >> SFPack             496001 1.0 3.0026e+00   1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00      0 0.00e+00   0
>> >> >> SFUnpack           496001 1.0 1.1296e-01   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00      0 0.00e+00   0
>> >> >> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >>
>> >>
>>
>
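As a footnote to Jed's suggestion above (polynomial smoothing with
point-block Jacobi instead of MatSOR), that kind of setup can be selected
entirely through run-time options. Here is a minimal petsc4py sketch,
assuming A, b, x are created elsewhere with MatSetFromOptions() and
VecSetFromOptions() so the type options take effect; the option values are
illustrative, not a tuned recommendation:

from petsc4py import PETSc

opts = PETSc.Options()
# keep the matrix and vectors GPU-resident
opts["mat_type"] = "aijcusparse"
opts["vec_type"] = "cuda"
# multigrid with Chebyshev + point-block Jacobi smoothing on the levels,
# rather than SOR, so the smoother is built from MatMult-like kernels
opts["pc_type"] = "gamg"
opts["mg_levels_ksp_type"] = "chebyshev"
opts["mg_levels_ksp_max_it"] = "2"
opts["mg_levels_pc_type"] = "pbjacobi"

ksp = PETSc.KSP().create()
ksp.setOperators(A)   # A, b, x assembled elsewhere (illustrative names)
ksp.setFromOptions()
ksp.solve(b, x)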
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2010.00881.pdf
Type: application/pdf
Size: 3724133 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20220802/021fa159/attachment-0003.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Prenter2020_Article_MultigridSolversForImmersedFin.pdf
Type: application/pdf
Size: 10568688 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20220802/021fa159/attachment-0004.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Post_Modern_GMRES (1).pdf
Type: application/pdf
Size: 800227 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20220802/021fa159/attachment-0005.pdf>