[petsc-dev] GPU performance of MatSOR()

Barry Smith bsmith at petsc.dev
Tue Aug 2 11:05:59 CDT 2022


  It would be great to see your publication when it is ready. I feel we need such approaches in PETSc.

  Barry


> On Aug 2, 2022, at 10:59 AM, Stephen Thomas <stephethomas at gmail.com> wrote:
> 
> 
> Barry, Jed
> 
> Paul and I developed a polynomial Gauss-Seidel smoother and ILUTP- and ILU(0)-based smoothers that use iterative
> methods (Neumann series, or Richardson iteration) for the triangular solves (faster than Jacobi, does not diverge, and finite due to the nilpotent
> strictly upper triangular part of U). We also use LDU (row scaled).
> 
> Glad to send along more details, as we have a paper in flight (revision being sent to NLAA this week).
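
The finiteness claim above can be made explicit. As a sketch only, assume the row-scaled factor is written U = D(I + N), with D the diagonal of U and N = D^{-1}(U - D) strictly upper triangular. The symbols D, N, n, and m are notation introduced here for illustration and are not taken from the paper.

```latex
\begin{aligned}
N^{n} &= 0 \quad \text{(strictly upper triangular, } n \times n\text{)},\\
U^{-1} &= (I + N)^{-1} D^{-1} = \sum_{k=0}^{n-1} (-N)^{k} D^{-1},\\
x_{0} &= D^{-1} b, \qquad x_{j+1} = D^{-1} b - N x_{j}
\;\Longrightarrow\; x_{m} = \sum_{k=0}^{m} (-N)^{k} D^{-1} b .
\end{aligned}
```

Under these assumptions, truncating after m terms costs m sparse matrix-vector products, and m = n - 1 reproduces the exact triangular solve.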
> 
> The problems we are solving with PeleLM are quite similar to those described in the attached papers
> by Prenter et al. (2020) and Jomo et al. (2021), namely cut-cell, immersed-boundary problems.
> 
> We are seeing a 5x speed-up on the NREL Eagle machine with NVIDIA V100 GPUs and a reduced 1.5-2x speed-up on
> Crusher with AMD MI250X GPUs (over ILU with direct triangular solves). We also see 5x with the MFIX-Exa model for ECP.
> 
> This work was motivated by Edmond Chow and Hartwig Anzt looking at Jacobi for triangular systems.
> 
> I also have a new GMRES formulation (I am talking about this at CEED) that is leading to good results
> for Krylov-Schur eigenvalues as well.
> 
> Cheers and best regards
> Steve
> 
> 
> 
> On Tue, Aug 2, 2022 at 9:34 AM Paul Mullowney <paulmullowney at gmail.com> wrote:
> The implementation is being (slowly) moved into Hypre. We have primarily used this technique with ILU-based smoothers for AMG. We did some comparisons against other smoothers like GS but not with Chebyshev or Polynomial. 
> 
> For the problems we cared about, ILU was an effective smoother. The power series representation of the solve provided some nice speedups. I've cc'ed Steve Thomas, who could say more.
> 
> -Paul
> 
> On Sun, Jul 31, 2022 at 10:14 PM Jed Brown <jed at jedbrown.org> wrote:
> Do you have a test that compares this with a polynomial smoother for the original problem (like Chebyshev for SPD)?
> 
> Paul Mullowney <paulmullowney at gmail.com> writes:
> 
> > One could also approximate the SOR triangular solves with a Neumann series,
> > where each term in the series is a SpMV (great for GPUs). The number of
> > terms needed in the series is matrix dependent.
> > We've seen this work to great effect for some problems.
> >
> > -Paul
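
As a minimal sketch of the idea above (not the authors' or hypre's implementation), the pure-Python code below approximates the forward solve (D + L) x = b with a truncated Neumann series. Each extra term is one sparse matrix-vector product in which all rows can be computed independently. The function names, the row-list storage format, and the 4x4 example data are hypothetical and chosen only for illustration.

```python
def forward_substitution(diag, strict_lower, b):
    """Exact sequential solve of (D + L) x = b, used here as the reference."""
    x = [0.0] * len(b)
    for i in range(len(b)):
        acc = sum(v * x[j] for j, v in strict_lower[i])
        x[i] = (b[i] - acc) / diag[i]
    return x


def neumann_lower_solve(diag, strict_lower, b, terms):
    """Approximate (D + L)^{-1} b by a Neumann series truncated after `terms`
    extra terms.

    diag         : list of diagonal entries of D
    strict_lower : row i is a list of (column, value) pairs with column < i
    """
    n = len(b)
    c = [b[i] / diag[i] for i in range(n)]  # zeroth term: D^{-1} b
    x = list(c)
    for _ in range(terms):
        # x <- D^{-1} b - D^{-1} L x: one SpMV, with no dependence between rows
        x = [c[i] - sum(v * x[j] for j, v in strict_lower[i]) / diag[i]
             for i in range(n)]
    return x


# Hypothetical 4x4 lower triangular system, for illustration only.
DIAG = [4.0, 4.0, 4.0, 4.0]
LOWER = [[], [(0, -1.0)], [(1, -1.0)], [(0, -1.0), (2, -1.0)]]
RHS = [1.0, 2.0, 3.0, 4.0]

if __name__ == "__main__":
    print(forward_substitution(DIAG, LOWER, RHS))
    for m in range(4):
        print(m, neumann_lower_solve(DIAG, LOWER, RHS, m))
```

Because D^{-1} L is strictly lower triangular, the series terminates after at most n - 1 extra terms for an n x n matrix. How many terms give a useful smoother in practice is matrix dependent, as noted above.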
> >
> > On Wed, Jul 27, 2022 at 8:05 PM Barry Smith <bsmith at petsc.dev> wrote:
> >
> >>
> >>   There are multicolor versions of SOR that theoretically offer good
> >> parallelism on GPUs but at the cost of multiple phases and slower
> >> convergence rates. Unless someone already has one coded for CUDA or Kokkos
> >> it would take a good amount of code to produce one that offers (but does
> >> not necessarily guarantee) reasonable performance on GPUs.
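
To illustrate the multicolor idea, here is a minimal sketch of its simplest case: two-color (red-black) Gauss-Seidel for the 1D Laplacian stencil [-1, 2, -1] with zero Dirichlet boundaries. This model problem is an assumption made for illustration; it is not PETSc code, and the function name is hypothetical. Within one color, every update reads only values of the other color, so each of the two phases could run as a single parallel kernel.

```python
def red_black_gauss_seidel_sweep(x, b):
    """One in-place red-black Gauss-Seidel sweep for the tridiagonal system
    2*x[i] - x[i-1] - x[i+1] = b[i], with x[-1] = x[n] = 0 at the boundaries."""
    n = len(x)
    for color in (0, 1):              # phase 0: even indices, phase 1: odd indices
        for i in range(color, n, 2):  # iterations of this loop are independent
            left = x[i - 1] if i > 0 else 0.0
            right = x[i + 1] if i < n - 1 else 0.0
            x[i] = (b[i] + left + right) / 2.0
    return x


if __name__ == "__main__":
    n = 7
    x = [0.0] * n
    b = [1.0] * n
    for _ in range(500):
        red_black_gauss_seidel_sweep(x, b)
    print(x)
```

For a general sparse matrix, the colors would come from a graph coloring of the matrix, and the number of phases equals the number of colors.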
> >>
> >> > On Jul 27, 2022, at 7:57 PM, Jed Brown <jed at jedbrown.org> wrote:
> >> >
> >> > Unfortunately, MatSOR is a really bad operation for GPUs. We can make it
> >> > use sparse triangular primitives from cuSPARSE, but those run on the GPU
> >> > about 20x slower than MatMult with the same sparse matrix. So unless MatSOR
> >> > reduces iteration count by 20x compared to your next-best preconditioning
> >> > option, you'll be better off finding a different preconditioner. This might
> >> > be some elements of multigrid or polynomial smoothing with point-block
> >> > Jacobi. If you can explain a bit about your application, we may be able to
> >> > offer some advice.
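
The break-even argument above can be written as a rough cost model. The sketch below measures everything in units of one MatMult and, by default, counts only the preconditioner application per iteration, which matches the simplification in the message. The function names are hypothetical, and the assumption that the alternative preconditioner costs about one MatMult per application is made only for illustration.

```python
def total_cost(iterations, pc_cost, krylov_cost=0.0):
    """Total solve cost in units of one MatMult.

    pc_cost     : cost of one preconditioner application
    krylov_cost : remaining per-iteration cost (0.0 reproduces the
                  simplified argument that looks only at the preconditioner)
    """
    return iterations * (pc_cost + krylov_cost)


def required_iteration_reduction(slow_pc_cost, fast_pc_cost, krylov_cost=0.0):
    """Factor by which the expensive preconditioner must cut the iteration
    count just to break even with the cheap one."""
    return (slow_pc_cost + krylov_cost) / (fast_pc_cost + krylov_cost)


if __name__ == "__main__":
    # Triangular solves about 20x slower than MatMult; alternative assumed ~1 MatMult.
    print(required_iteration_reduction(20.0, 1.0))
    # Including one MatMult per Krylov iteration lowers the break-even factor.
    print(required_iteration_reduction(20.0, 1.0, krylov_cost=1.0))
```

With these assumed costs the break-even factor is 20 when only the preconditioner is counted, and 10.5 when one MatMult per iteration is added to both sides.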
> >> >
> >> > Han Tran <hantran at cs.utah.edu> writes:
> >> >
> >> >> Hello,
> >> >>
> >> >> Running my example using VECMPICUDA for VecSetType() and MATMPIAIJCUSP
> >> >> for MatSetType(), I have the profiling results shown below. MatSOR()
> >> >> shows 0 for GPU %F and has only GpuToCpu count and size. Is it
> >> >> correct that PETSc currently does not have MatSOR implemented on the GPU? I
> >> >> would appreciate an explanation of how MatSOR()
> >> >> currently uses the GPU. In this example, MatSOR takes considerable time
> >> >> compared to other functions.
> >> >>
> >> >> Thank you.
> >> >>
> >> >> -Han
> >> >>
> >> >>
> >> >> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> >> >> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
> >> >>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> >> >> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> >> >>
> >> >> --- Event Stage 0: Main Stage
> >> >>
> >> >> BuildTwoSided     220001 1.0 3.9580e+02 139.9 0.00e+00 0.0 2.0e+00 4.0e+00 2.2e+05  4  0  0  0 20   4  0  0  0 20     0       0      0 0.00e+00    0 0.00e+00  0
> >> >> BuildTwoSidedF    220000 1.0 3.9614e+02 126.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.2e+05  4  0  0  0 20   4  0  0  0 20     0       0      0 0.00e+00    0 0.00e+00  0
> >> >> VecMDot           386001 1.0 6.3426e+01 1.5 1.05e+11 1.0 0.0e+00 0.0e+00 3.9e+05  1 11  0  0 35   1 11  0  0 35  3311   26012   386001 1.71e+05    0 0.00e+00 100
> >> >> VecNorm           496001 1.0 5.0877e+01 1.2 5.49e+10 1.0 0.0e+00 0.0e+00 5.0e+05  1  6  0  0 45   1  6  0  0 45  2159    3707   110000 4.87e+04    0 0.00e+00 100
> >> >> VecScale          496001 1.0 7.9951e+00 1.0 2.75e+10 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  6869   13321      0 0.00e+00    0 0.00e+00 100
> >> >> VecCopy           110000 1.0 1.9323e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >> >> VecSet            330017 1.0 5.4319e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >> >> VecAXPY           110000 1.0 1.5820e+00 1.0 1.22e+10 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0 15399   35566      0 0.00e+00    0 0.00e+00 100
> >> >> VecMAXPY          496001 1.0 1.1505e+01 1.0 1.48e+11 1.0 0.0e+00 0.0e+00 0.0e+00  0 16  0  0  0   0 16  0  0  0 25665   39638      0 0.00e+00    0 0.00e+00 100
> >> >> VecAssemblyBegin  110000 1.0 1.2021e+00 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+05  0  0  0  0 10   0  0  0  0 10     0       0      0 0.00e+00    0 0.00e+00  0
> >> >> VecAssemblyEnd    110000 1.0 1.5988e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >> >> VecScatterBegin   496001 1.0 1.3002e+01 1.0 0.00e+00 0.0 9.9e+05 1.3e+04 1.0e+00  0  0 100 100  0   0  0 100 100  0     0       0   110000 4.87e+04    0 0.00e+00  0
> >> >> VecScatterEnd     496001 1.0 1.8988e+01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >> >> VecNormalize      496001 1.0 5.8797e+01 1.1 8.24e+10 1.0 0.0e+00 0.0e+00 5.0e+05  1  9  0  0 45   1  9  0  0 45  2802    4881   110000 4.87e+04    0 0.00e+00 100
> >> >> VecCUDACopyTo     716001 1.0 3.4483e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0       0   716001 3.17e+05    0 0.00e+00  0
> >> >> VecCUDACopyFrom  1211994 1.0 5.1752e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0       0      0 0.00e+00 1211994 5.37e+05  0
> >> >> MatMult           386001 1.0 4.8436e+01 1.0 1.90e+11 1.0 7.7e+05 1.3e+04 0.0e+00  1 21 78 78  0   1 21 78 78  0  7862   16962      0 0.00e+00    0 0.00e+00 100
> >> >> MatMultAdd        110000 1.0 6.2666e+01 1.1 6.03e+10 1.0 2.2e+05 1.3e+04 1.0e+00  1  7 22 22  0   1  7 22 22  0  1926   16893   440000 3.39e+05    0 0.00e+00 100
> >> >> MatSOR            496001 1.0 5.1821e+02 1.1 2.83e+11 1.0 0.0e+00 0.0e+00 0.0e+00 10 31  0  0  0  10 31  0  0  0  1090       0      0 0.00e+00 991994 4.39e+05  0
> >> >> MatAssemblyBegin  110000 1.0 3.9732e+02 109.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+05  4  0  0  0 10   4  0  0  0 10     0       0      0 0.00e+00    0 0.00e+00  0
> >> >> MatAssemblyEnd    110000 1.0 5.3015e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >> >> MatZeroEntries    110000 1.0 1.3179e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >> >> MatCUSPARSCopyTo  220000 1.0 3.2805e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0       0   220000 2.41e+05    0 0.00e+00  0
> >> >> KSPSetUp          110000 1.0 3.5344e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >> >> KSPSolve          110000 1.0 6.8304e+02 1.0 8.20e+11 1.0 7.7e+05 1.3e+04 8.8e+05 13 89 78 78 80  13 89 78 78 80  2401   14311   496001 2.20e+05 991994 4.39e+05 66
> >> >> KSPGMRESOrthog    386001 1.0 7.2820e+01 1.4 2.10e+11 1.0 0.0e+00 0.0e+00 3.9e+05  1 23  0  0 35   1 23  0  0 35  5765   30176   386001 1.71e+05    0 0.00e+00 100
> >> >> PCSetUp           110000 1.0 1.8825e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >> >> PCApply           496001 1.0 5.1857e+02 1.1 2.83e+11 1.0 0.0e+00 0.0e+00 0.0e+00 10 31  0  0  0  10 31  0  0  0  1090       0      0 0.00e+00 991994 4.39e+05  0
> >> >> SFSetGraph             1 1.0 2.0936e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >> >> SFSetUp                1 1.0 2.5347e-03 1.0 0.00e+00 0.0 4.0e+00 3.3e+03 1.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >> >> SFPack            496001 1.0 3.0026e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >> >> SFUnpack          496001 1.0 1.1296e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> >> >> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>
> >>
> <2010.00881.pdf><Prenter2020_Article_MultigridSolversForImmersedFin.pdf><Post_Modern_GMRES (1).pdf>
