<div dir="ltr">Hi Mark,<div><br></div><div>Sure, I will try a 3D lid-driven cavity case combining OpenFOAM, PETSc, and HYPRE, and see what happens.</div><div><br></div><div>Kind regards,</div><div>Qi</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Mar 28, 2022 at 11:04 PM Mark Adams <<a href="mailto:mfadams@lbl.gov">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi Qi, these are good discussions and data, and we like to share, so let's keep this on the list.<div><br></div><div>* I would suggest you use a 3D test; that is more relevant to what HPC applications do (a rough sketch of a 3D grid for such a test is appended at the end of this message).</div><div>* In my experience, hypre's default parameters are tuned for 2D low-order problems like this one, so I would start with the defaults. I think they should be fine for 3D as well.</div><div>* As I think I said before, we have an AMGx interface under development, and I heard yesterday that it should not be long until it is available. It would be great if you could test that, and we can work with the NVIDIA developer to optimize it. We will let you know when it's available.</div><div><br></div><div>Cheers,</div><div>Mark</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Mar 28, 2022 at 10:44 AM Qi Yang <<a href="mailto:qiyang@oakland.edu" target="_blank">qiyang@oakland.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"> Hi Mark and Barry,<div><br></div><div>I really appreciate your explanation of the setup process. Over the past few days I tried the HYPRE AMG solver (BoomerAMG) in place of the PETSc GAMG solver I used before.
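 In code, that switch amounts to roughly the following (an untested sketch; it assumes a KSP named "ksp" has already been created and its operator set):<pre>
#include <petscksp.h>

/* Sketch: programmatic equivalent of -ksp_type cg -pc_type hypre -pc_hypre_type boomeramg. */
PC pc;
PetscCall(KSPSetType(ksp, KSPCG));
PetscCall(KSPGetPC(ksp, &pc));
PetscCall(PCSetType(pc, PCHYPRE));            /* use hypre as the preconditioner */
PetscCall(PCHYPRESetType(pc, "boomeramg"));   /* select BoomerAMG within hypre */
PetscCall(KSPSetFromOptions(ksp));            /* still picks up the -pc_hypre_boomeramg_* flags */
</pre>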
</div><div><br></div><div>The solver settings for HYPRE are as follows (a sketch of injecting these options from code is appended at the end of this message):</div><div>mpiexec -n 1 ./ex50 -da_grid_x 3000 -da_grid_y 3000 -ksp_type cg -pc_type hypre -pc_hypre_type boomeramg -pc_hypre_boomeramg_max_iter 1 -pc_hypre_boomeramg_strong_threshold 0.7 -pc_hypre_boomeramg_grid_sweeps_up 1 -pc_hypre_boomeramg_grid_sweeps_down 1 -pc_hypre_boomeramg_agg_nl 2 -pc_hypre_boomeramg_agg_num_paths 1 -pc_hypre_boomeramg_max_levels 25 <b>-pc_hypre_boomeramg_coarsen_type PMIS</b> -pc_hypre_boomeramg_interp_type ext+i -pc_hypre_boomeramg_P_max 2 -pc_hypre_boomeramg_truncfactor 0.2 -vec_type cuda -mat_type aijcusparse -ksp_monitor -ksp_view -log_view<br></div><div><br></div><div><img src="cid:ii_l1aryxjh0" alt="PMIS.PNG" width="562" height="194"><br></div><div><br></div><div>The interesting part is that I chose PMIS as the coarsening type; looking through the code, PMIS appears to be the only coarsening type with both host and device (GPU) code paths.</div><div>* HYPRE does reduce the solution time, from 20s to 8s.</div><div>* Memory-mapping operations show up inside the solve, which causes several gaps in the NVIDIA Nsight Systems profile below; I am not sure what they mean.</div><div><img src="cid:ii_l1at6u1x2" alt="image.png" width="510" height="263"><br></div><div>I am really interested in running some benchmarks with the HYPRE AMG solver. In fact, I have already connected OpenFOAM, PETSc, HYPRE and AMGX through the petsc4Foam interface (<a href="https://develop.openfoam.com/modules/external-solver/-/tree/amgxwrapper/src/petsc4Foam" target="_blank">https://develop.openfoam.com/modules/external-solver/-/tree/amgxwrapper/src/petsc4Foam</a>). I prefer to use PETSc as the base matrix solver because of a possible HIP implementation in the future; that way I can compare NVIDIA and AMD GPUs. It seems there are many benchmark cases I could run in the future.</div><div><br></div><div>Regards,</div><div>Qi</div><div><br></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 23, 2022 at 9:39 AM Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">A few points, but first, this is a nice start. If you are interested in working on benchmarking, that would be great. If so, read on.<div><br></div><div>* Barry pointed out the SOR issues that are thrashing the memory system. This solve would run faster on the CPU (maybe; 9M equations is a lot).</div><div>* Most applications run for some time, doing 100-1,000 or more solves with one configuration, and this amortizes the setup cost for each mesh
 (what I call the "mesh setup" cost).</div><div>* Many applications are nonlinear and use a full Newton solver that does a "matrix setup" for each solve, but many applications can also amortize this matrix setup (the PtAP stuff in the output, which is small for 2D problems but can be large for 3D problems).</div><div>* Now, hypre's mesh setup is definitely better than GAMG's, and AMGx is out of this world.</div><div> - AMGx is the result of a serious development effort by NVIDIA about 15 years ago, with many tens of NVIDIA developer-years in it (I am guessing, but I know it was a serious effort for a few years).</div><div> + We are currently working with the current AMGx developer, Matt, to provide an AMGx interface in PETSc, like the hypre one (DOE does not like us working with non-portable solvers, but AMGx is very good).</div><div>* Hypre and AMGx use "classical" AMG, which is like geometric multigrid (fast) for M-matrices (very low-order Laplacians, like ex50).</div><div>* GAMG uses "smoothed aggregation" AMG because this algorithm has better theoretical properties for high-order and elasticity problems, and its implementation and default parameters have been optimized for these types of problems.</div><div><br></div><div>It would be interesting to add hypre to your study (ex50) and to add a high-order 3D elasticity problem (e.g., snes/tests/ex13, or Jed Brown has some nice elasticity problems).</div><div>If you are interested, we can give you hypre parameters for elasticity problems.</div><div>I have no experience with AMGx on elasticity, but the NVIDIA developer is available and can be looped in.</div><div>For that matter, we could bring the main hypre developer, Ruipeng, in as well.</div><div>I would also suggest timing the setup (you can combine mesh and matrix setup if you like) and the solve phase separately (a rough log-stage sketch is appended at the end of this message). ex13 does this, and we should find another 5-point-stencil example that does this if ex50 does not.</div><div><br></div><div>BTW, I have been intending to write a benchmarking paper this year with Matt and Ruipeng, but I am just not getting around to it ...</div><div>If you want to lead a paper and the experiments, we can help optimize and tune our solvers, set up tests, write background material, etc. </div><div><br></div><div>Cheers,</div><div>Mark</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Mar 22, 2022 at 12:30 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div>Indeed, PCSetUp is taking most of the time (79%). In the version of PETSc you are running, a great deal of the setup work is done on the CPU.
You can see there is a lot of data movement between the CPU and GPU (in both directions) during the setup: the trailing counts on the PCSetUp line, 64 1.91e+03 54 1.21e+03 90.<div><br></div><div>Clearly, we need help in porting all the parts of the GAMG setup that still occur on the CPU to the GPU.</div><div><br></div><div> Barry</div><div><br><div><br></div><div><br><div><br><blockquote type="cite"><div>On Mar 22, 2022, at 12:07 PM, Qi Yang <<a href="mailto:qiyang@oakland.edu" target="_blank">qiyang@oakland.edu</a>> wrote:</div><br><div><div dir="ltr">Dear Barry,<div><br></div><div>Your advice is helpful: the total time is reduced from 30s to 20s (and the matrix operations now all run on the GPU). I have also tried other settings for the AMG preconditioner, such as -pc_gamg_threshold 0.05 -pc_gamg_threshold_scale 0.5, but they do not seem to help much.</div><div>It seems the key point is the PCSetUp process: from the log it takes the most time, and in the new Nsight Systems analysis there is a big gap before the KSP solve starts, which looks like the PCSetUp process. Am I right? (A sketch of forcing the setup to run separately is appended at the end of this message.)</div><div><span><3.png></span><br></div><div><br></div><div>PCSetUp 2 1.0 1.5594e+01 1.0 3.06e+09 1.0 0.0e+00 0.0e+00 0.0e+00 79 78 0 0 0 79 78 0 0 0 196 8433 64 1.91e+03 54 1.21e+03 90<br></div><div><br></div><div><br></div><div>Regards,</div><div>Qi</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Mar 22, 2022 at 10:44 PM Barry Smith <<a href="mailto:bsmith@petsc.dev" target="_blank">bsmith@petsc.dev</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br></div> It is using <div><br></div><div>MatSOR 369 1.0 9.1214e+00 1.0 7.32e+09 1.0 0.0e+00 0.0e+00 0.0e+00 29 27 0 0 0 29 27 0 0 0 803 0 0 0.00e+00 565 1.35e+03 0</div><div><br></div><div>which runs on the CPU, not the GPU, hence the large amount of time spent in memory copies and the poor performance. We are switching the default to Chebyshev/Jacobi, which runs completely on the GPU (this may already be switched in the main branch). </div><div><br></div><div>You can run with <span style="font-family:Menlo;font-size:14px">-mg_levels_pc_type</span><span style="font-family:Menlo;font-size:14px"> jacobi</span>. You should then see almost the entire solver running on the GPU.</div><div><br></div><div>You may need to tune the number of smoothing steps or other parameters of GAMG to get a faster solution time.</div><div><br></div><div> Barry</div><div><br><div><br><blockquote type="cite"><div>On Mar 22, 2022, at 10:30 AM, Qi Yang <<a href="mailto:qiyang@oakland.edu" target="_blank">qiyang@oakland.edu</a>> wrote:</div><br><div><div dir="ltr"><div dir="ltr"><div dir="ltr">To whom it may concern,<div><div><br></div><div>I have tried PETSc ex50 (Poisson) with CUDA, the CG KSP solver, and the GAMG preconditioner; however, it ran for about 30s. I also tried NVIDIA AMGX with the same solver and the same grid (3000*3000), and it took only 2s. I used the Nsight Systems software to analyze the two cases and found that PETSc spent much more time in memory operations (63% of total time, versus only 19% for AMGX).
Screenshots of both runs are attached.</div><div><br></div><div>The PETSc command is: mpiexec -n 1 ./ex50 -da_grid_x 3000 -da_grid_y 3000 -ksp_type cg -pc_type gamg -pc_gamg_type agg -pc_gamg_agg_nsmooths 1 -vec_type cuda -mat_type aijcusparse -ksp_monitor -ksp_view -log_view </div><div><br></div><div>The log file is also attached.</div><div><br></div><div>Regards,</div><div>Qi</div><div><br></div><div><span><1.png></span><br></div></div><div><span><2.png></span><br></div></div></div></div>
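<div><br></div><div>For reference, the -vec_type cuda -mat_type aijcusparse flags above are the option-based way of selecting PETSc's GPU backends. A rough, untested sketch of doing the same in a code that creates its own Mat and Vec follows (ex50 builds its matrix from a DMDA, so the command-line route used above is the natural one there; "n" here stands for the global problem size and is only illustrative):</div><div><pre>
#include <petscksp.h>

Mat A;
Vec x, b;
PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
PetscCall(MatSetType(A, MATAIJCUSPARSE));   /* store and apply the matrix on the GPU */
PetscCall(VecCreate(PETSC_COMM_WORLD, &x));
PetscCall(VecSetSizes(x, PETSC_DECIDE, n));
PetscCall(VecSetType(x, VECCUDA));          /* keep vector data resident on the GPU */
PetscCall(VecDuplicate(x, &b));
</pre></div>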
<span><log.PETSc_cg_amg_ex50_gpu_cuda></span></div></blockquote></div><br></div></div></blockquote></div>
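<div><br></div><div>For reference, one way to confirm that the big gap is indeed the preconditioner setup is to force the setup to run on its own, before the solve. This is only a rough, untested sketch; it assumes the KSP "ksp", the matrix "A", and the vectors "b" and "x" already exist in the code being profiled:</div><div><pre>
/* Force PCSetUp (the GAMG setup) to run here, separate from the Krylov iterations,
 * so the corresponding region in the Nsight Systems timeline is unambiguous. */
PetscCall(KSPSetOperators(ksp, A, A));
PetscCall(KSPSetUp(ksp));        /* GAMG coarsening, Galerkin products, etc. happen here */
PetscCall(KSPSolve(ksp, b, x));  /* only CG iterations and smoother applications remain */
</pre></div>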
<span><log.PETSc_cg_amg_jacobi_ex50_gpu_cuda></span></div></blockquote></div><br></div></div></div></blockquote></div>
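<div><br></div><div>Regarding the suggestion above to time the setup and solve phases separately, here is a rough, untested sketch of how the split could be done with PETSc log stages so that -log_view reports them separately (the stage names are illustrative, and ksp, b, x are assumed to exist already):</div><div><pre>
PetscLogStage stage_setup, stage_solve;
PetscCall(PetscLogStageRegister("Solver setup", &stage_setup));
PetscCall(PetscLogStageRegister("Solver solve", &stage_solve));

PetscCall(PetscLogStagePush(stage_setup));
PetscCall(KSPSetUp(ksp));                 /* mesh/matrix setup is attributed to this stage */
PetscCall(PetscLogStagePop());

PetscCall(PetscLogStagePush(stage_solve));
PetscCall(KSPSolve(ksp, b, x));           /* the solve phase is timed on its own */
PetscCall(PetscLogStagePop());
</pre></div>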
</blockquote></div>
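<div><br></div><div>Since the BoomerAMG option list above is getting long, here is a rough, untested sketch of injecting the same flags from code instead of the command line (it assumes a KSP "ksp" on which KSPSetFromOptions has not yet been called):</div><div><pre>
/* Insert the BoomerAMG settings used above into the global options database,
 * then let KSPSetFromOptions pick them up as usual. */
PetscCall(PetscOptionsInsertString(NULL,
          "-pc_type hypre -pc_hypre_type boomeramg "
          "-pc_hypre_boomeramg_coarsen_type PMIS "
          "-pc_hypre_boomeramg_interp_type ext+i "
          "-pc_hypre_boomeramg_agg_nl 2 -pc_hypre_boomeramg_agg_num_paths 1 "
          "-pc_hypre_boomeramg_strong_threshold 0.7 -pc_hypre_boomeramg_truncfactor 0.2"));
PetscCall(KSPSetFromOptions(ksp));
</pre></div>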
</blockquote></div>
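<div><br></div><div>On the 3D test suggested above: a rough, untested sketch of the kind of 3D structured grid meant here, analogous to the 2D DMDA used by ex50 (the grid size is only an example; 200^3 is about 8M unknowns, comparable to the 3000^2 2D run):</div><div><pre>
#include <petscdmda.h>

DM da;
PetscCall(DMDACreate3d(PETSC_COMM_WORLD,
                       DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                       DMDA_STENCIL_STAR,
                       200, 200, 200,                               /* global grid */
                       PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,    /* processor layout */
                       1, 1, NULL, NULL, NULL, &da));               /* 1 dof, stencil width 1 */
PetscCall(DMSetFromOptions(da));   /* honors -da_grid_x/y/z, -dm_mat_type, -dm_vec_type */
PetscCall(DMSetUp(da));
/* ...assemble the 7-point Laplacian on this DMDA and hand it to the KSP as usual... */
</pre></div>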
</blockquote></div>