<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div class=""><br class=""></div> It would be great to see your publication when it is ready. I feel we need such approaches in PETSc.<div class=""><br class=""></div><div class=""> Barry</div><div class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Aug 2, 2022, at 10:59 AM, Stephen Thomas <<a href="mailto:stephethomas@gmail.com" class="">stephethomas@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div class=""><br class=""></div><div class="">Barry, Jed</div><br class=""><div class="">Paul and I developed a polynomial Gauss-Seidel smoother and ILUTP, ILU(0) based smoothers that employ iterative methods</div><div class="">(Neumann series, or Richardson iteration) for the triangular solves (faster than Jacobi, does not diverge, finite due to the nilpotent</div><div class="">strictly upper triangular part of U). We also use LDU (row scaled).</div><div class=""><br class=""></div><div class="">Glad to send along more details as we have a paper in flight (revision being sent to NLAA this week).</div><div class=""><br class=""></div><div class="">The problems we are solving with PeleLM are quite similar to those described in the attached papers</div><div class="">by Prenter et al (2020) and Jomo et al (2021) - namely cut-cell, immersed boundary.</div><div class=""><br class=""></div><div class="">We are seeing a 5x speed-up on the NREL Eagle machine with NVIDIA V100 and a reduced 1.5 - 2x speed-up on</div><div class="">Crusher with AMD MI250X GPUs (relative to ILU with direct triangular solves). 
We also see 5x with the MFIX-Exa model for ECP.</div><div class=""><br class=""></div><div class="">This work was motivated by Edmond Chow and Hartwig Anzt looking at Jacobi for triangular systems.</div><div class=""><br class=""></div><div class="">I also have a new GMRES formulation (talking about this at CEED) that is leading to good results</div><div class="">for Krylov-Schur eigenvalues as well.</div><div class=""><br class=""></div><div class="">Cheers and best regards</div><div class="">Steve</div><div class=""><br class=""></div><div class=""><br class=""></div></div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Aug 2, 2022 at 9:34 AM Paul Mullowney <<a href="mailto:paulmullowney@gmail.com" target="_blank" class="">paulmullowney@gmail.com</a>> wrote:<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr" class=""><div id="gmail-m_7386655642987321857gmail-m_-8522094473160406613gmail-:2f2" aria-label="Message Body" role="textbox" aria-multiline="true" style="direction:ltr;min-height:85px" class="">The implementation is being (slowly) moved into Hypre. We have primarily used this technique with ILU-based smoothers for AMG. We did some comparisons against other smoothers like GS but not with Chebyshev or polynomial smoothers. <div class=""><br class=""></div><div class="">For the problems we cared about, ILU was an effective smoother. The power series representation of the solve provided some nice speedups. 
I've cc'ed Steve Thomas, who could say more.<div class=""><br class=""></div><div class="">-Paul</div></div></div></div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Jul 31, 2022 at 10:14 PM Jed Brown <<a href="mailto:jed@jedbrown.org" target="_blank" class="">jed@jedbrown.org</a>> wrote:<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Do you have a test that compares this with a polynomial smoother for the original problem (like Chebyshev for SPD)?<br class="">
<br class="">
Paul Mullowney <<a href="mailto:paulmullowney@gmail.com" target="_blank" class="">paulmullowney@gmail.com</a>> writes:<br class="">
<br class="">
> One could also approximate the SOR triangular solves with a Neumann series,<br class="">
> where each term in the series is a SpMV (great for GPUs). The number of<br class="">
> terms needed in the series is matrix dependent.<br class="">
> We've seen this work to great effect for some problems.<br class="">
><br class="">
> -Paul<br class="">
><br class="">
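A minimal sketch of the Neumann-series idea in Paul's message above, assuming a lower triangular system (D + L) x = b with D the diagonal and L the strictly lower part stored in CSR form.<br class="">
The function names, the CSR-as-lists layout, and the 3x3 example are illustrative assumptions, not code from PETSc or Hypre.<br class="">
Each pass of the loop costs one sparse matrix-vector product and adds one more term of the series.<br class="">
<pre>
```python
def csr_matvec(indptr, indices, data, x):
    """Return A @ x for a CSR matrix stored as three plain lists."""
    y = [0.0] * (len(indptr) - 1)
    for i in range(len(y)):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y


def neumann_lower_solve(indptr, indices, data, diag, b, terms):
    """Approximate the solution of (D + L) x = b.

    L is strictly lower triangular (CSR lists), diag holds D.
    With N = D^-1 L and c = D^-1 b, the iteration x = c - N x
    started from x = c reproduces the truncated Neumann series
    sum_{k=0..terms} (-N)^k c. It is a Jacobi/Richardson sweep
    applied to the triangular system.
    """
    c = [bi / di for bi, di in zip(b, diag)]
    x = list(c)
    for _ in range(terms):
        lx = csr_matvec(indptr, indices, data, x)  # one SpMV per term
        x = [ci - li / di for ci, li, di in zip(c, lx, diag)]
    return x


# Hypothetical 3x3 example: D = 2*I, L has ones on the first subdiagonal.
indptr, indices, data = [0, 0, 1, 2], [0, 1], [1.0, 1.0]
diag = [2.0, 2.0, 2.0]
b = [2.0, 4.0, 6.0]
x_approx = neumann_lower_solve(indptr, indices, data, diag, b, terms=2)
```
</pre>
Because N is strictly triangular it is nilpotent, so for an n-by-n system the series terminates after at most n - 1 terms; in practice far fewer terms are used and, as noted above, the number needed is matrix dependent.<br class="">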
> On Wed, Jul 27, 2022 at 8:05 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank" class="">bsmith@petsc.dev</a>> wrote:<br class="">
><br class="">
>><br class="">
>> There are multicolor versions of SOR that theoretically offer good<br class="">
>> parallelism on GPUs but at the cost of multiple phases and slower<br class="">
>> convergence rates. Unless someone already has one coded for CUDA or Kokkos<br class="">
>> it would take a good amount of code to produce one that offers (but does<br class="">
>> not necessarily guarantee) reasonable performance on GPUs.<br class="">
>><br class="">
>> > On Jul 27, 2022, at 7:57 PM, Jed Brown <<a href="mailto:jed@jedbrown.org" target="_blank" class="">jed@jedbrown.org</a>> wrote:<br class="">
>> ><br class="">
>> > Unfortunately, MatSOR is a really bad operation for GPUs. We can make it<br class="">
>> use sparse triangular primitives from cuSPARSE, but on GPU those run<br class="">
>> about 20x slower than MatMult with the same sparse matrix. So unless MatSOR<br class="">
>> reduces iteration count by 20x compared to your next-best preconditioning<br class="">
>> option, you'll be better off finding a different preconditioner. This might<br class="">
>> be some elements of multigrid or polynomial smoothing with point-block<br class="">
>> Jacobi. If you can explain a bit about your application, we may be able to<br class="">
>> offer some advice.<br class="">
>> ><br class="">
>> > Han Tran <<a href="mailto:hantran@cs.utah.edu" target="_blank" class="">hantran@cs.utah.edu</a>> writes:<br class="">
>> ><br class="">
>> >> Hello,<br class="">
>> >><br class="">
>> >> Running my example using VECMPICUDA for VecSetType(), and MATMPIAIJCUSP<br class="">
>> for MatSetType(), I have the profiling results shown below. MatSOR()<br class="">
>> reports 0 for GPU %F and has only GpuToCpu count and size. Is it<br class="">
>> correct that PETSc currently does not have MatSOR implemented on the GPU? I<br class="">
>> would appreciate an explanation of how MatSOR()<br class="">
>> currently uses the GPU. In this example, MatSOR takes considerable time<br class="">
>> compared to the other functions.<br class="">
>> >><br class="">
>> >> Thank you.<br class="">
>> >><br class="">
>> >> -Han<br class="">
>> >><br class="">
>> >><br class="">
>> >> ------------------------------------------------------------------------------------------------------------------------<br class="">
>> >> Event Count Time (sec) Flop --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu - GPU<br class="">
>> >> Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count Size %F<br class="">
>> >> ------------------------------------------------------------------------------------------------------------------------<br class="">
>> >><br class="">
>> >> --- Event Stage 0: Main Stage<br class="">
>> >><br class="">
>> >> BuildTwoSided 220001 1.0 3.9580e+02 139.9 0.00e+00 0.0 2.0e+00 4.0e+00 2.2e+05 4 0 0 0 20 4 0 0 0 20 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
>> >> BuildTwoSidedF 220000 1.0 3.9614e+02 126.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.2e+05 4 0 0 0 20 4 0 0 0 20 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
>> >> VecMDot 386001 1.0 6.3426e+01 1.5 1.05e+11 1.0 0.0e+00 0.0e+00 3.9e+05 1 11 0 0 35 1 11 0 0 35 3311 26012 386001 1.71e+05 0 0.00e+00 100<br class="">
>> >> VecNorm 496001 1.0 5.0877e+01 1.2 5.49e+10 1.0 0.0e+00 0.0e+00 5.0e+05 1 6 0 0 45 1 6 0 0 45 2159 3707 110000 4.87e+04 0 0.00e+00 100<br class="">
>> >> VecScale 496001 1.0 7.9951e+00 1.0 2.75e+10 1.0 0.0e+00 0.0e+00 0.0e+00 0 3 0 0 0 0 3 0 0 0 6869 13321 0 0.00e+00 0 0.00e+00 100<br class="">
>> >> VecCopy 110000 1.0 1.9323e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
>> >> VecSet 330017 1.0 5.4319e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
>> >> VecAXPY 110000 1.0 1.5820e+00 1.0 1.22e+10 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 15399 35566 0 0.00e+00 0 0.00e+00 100<br class="">
>> >> VecMAXPY 496001 1.0 1.1505e+01 1.0 1.48e+11 1.0 0.0e+00 0.0e+00 0.0e+00 0 16 0 0 0 0 16 0 0 0 25665 39638 0 0.00e+00 0 0.00e+00 100<br class="">
>> >> VecAssemblyBegin 110000 1.0 1.2021e+00 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+05 0 0 0 0 10 0 0 0 0 10 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
>> >> VecAssemblyEnd 110000 1.0 1.5988e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
>> >> VecScatterBegin 496001 1.0 1.3002e+01 1.0 0.00e+00 0.0 9.9e+05 1.3e+04 1.0e+00 0 0 100 100 0 0 0 100 100 0 0 0 110000 4.87e+04 0 0.00e+00 0<br class="">
>> >> VecScatterEnd 496001 1.0 1.8988e+01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
>> >> VecNormalize 496001 1.0 5.8797e+01 1.1 8.24e+10 1.0 0.0e+00 0.0e+00 5.0e+05 1 9 0 0 45 1 9 0 0 45 2802 4881 110000 4.87e+04 0 0.00e+00 100<br class="">
>> >> VecCUDACopyTo 716001 1.0 3.4483e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 716001 3.17e+05 0 0.00e+00 0<br class="">
>> >> VecCUDACopyFrom 1211994 1.0 5.1752e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 0 0.00e+00 1211994 5.37e+05 0<br class="">
>> >> MatMult 386001 1.0 4.8436e+01 1.0 1.90e+11 1.0 7.7e+05 1.3e+04 0.0e+00 1 21 78 78 0 1 21 78 78 0 7862 16962 0 0.00e+00 0 0.00e+00 100<br class="">
>> >> MatMultAdd 110000 1.0 6.2666e+01 1.1 6.03e+10 1.0 2.2e+05 1.3e+04 1.0e+00 1 7 22 22 0 1 7 22 22 0 1926 16893 440000 3.39e+05 0 0.00e+00 100<br class="">
>> >> MatSOR 496001 1.0 5.1821e+02 1.1 2.83e+11 1.0 0.0e+00 0.0e+00 0.0e+00 10 31 0 0 0 10 31 0 0 0 1090 0 0 0.00e+00 991994 4.39e+05 0<br class="">
>> >> MatAssemblyBegin 110000 1.0 3.9732e+02 109.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+05 4 0 0 0 10 4 0 0 0 10 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
>> >> MatAssemblyEnd 110000 1.0 5.3015e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
>> >> MatZeroEntries 110000 1.0 1.3179e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
>> >> MatCUSPARSCopyTo 220000 1.0 3.2805e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 0 220000 2.41e+05 0 0.00e+00 0<br class="">
>> >> KSPSetUp 110000 1.0 3.5344e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
>> >> KSPSolve 110000 1.0 6.8304e+02 1.0 8.20e+11 1.0 7.7e+05 1.3e+04 8.8e+05 13 89 78 78 80 13 89 78 78 80 2401 14311 496001 2.20e+05 991994 4.39e+05 66<br class="">
>> >> KSPGMRESOrthog 386001 1.0 7.2820e+01 1.4 2.10e+11 1.0 0.0e+00 0.0e+00 3.9e+05 1 23 0 0 35 1 23 0 0 35 5765 30176 386001 1.71e+05 0 0.00e+00 100<br class="">
>> >> PCSetUp 110000 1.0 1.8825e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
>> >> PCApply 496001 1.0 5.1857e+02 1.1 2.83e+11 1.0 0.0e+00 0.0e+00 0.0e+00 10 31 0 0 0 10 31 0 0 0 1090 0 0 0.00e+00 991994 4.39e+05 0<br class="">
>> >> SFSetGraph 1 1.0 2.0936e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
>> >> SFSetUp 1 1.0 2.5347e-03 1.0 0.00e+00 0.0 4.0e+00 3.3e+03 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
>> >> SFPack 496001 1.0 3.0026e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
>> >> SFUnpack 496001 1.0 1.1296e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0<br class="">
>> >> ------------------------------------------------------------------------------------------------------------------------<br class="">
>><br class="">
>><br class="">
</blockquote></div>
</blockquote></div>
<span id="cid:f_l6cd60ng0"><2010.00881.pdf></span><span id="cid:f_l6cd60nk1"><Prenter2020_Article_MultigridSolversForImmersedFin.pdf></span><span id="cid:f_l6cd7wks2"><Post_Modern_GMRES (1).pdf></span></div></blockquote></div><br class=""></div></body></html>