[petsc-dev] PETSc amg solver with gpu seems run slowly

Justin Chang jychang48 at gmail.com
Mon Mar 28 11:35:55 CDT 2022


Hi Qi, Mark,

My colleague Suyash Tandon has almost completed a PETSc HIP port
(essentially a hipification of the CUDA port) and has been trying to test
it on the same OpenFOAM 3D Lid-driven case. It would be interesting to see
what the optimal HYPRE parameters are as we could experiment from the AMD
side.

Thanks,
Justin


On Mon, Mar 28, 2022 at 10:28 AM Qi Yang <qiyang at oakland.edu> wrote:

> Hi Mark,
>
> Sure, I will try a 3D Lid-driven case by combining OpenFOAM, PETSc and
> HYPRE, let's see what would happen.
>
> Kind regards,
> Qi
>
> On Mon, Mar 28, 2022 at 11:04 PM Mark Adams <mfadams at lbl.gov> wrote:
>
>> Hi Qi, these are good discussions and data and we like to share, so let's
>> keep this on the list.
>>
>> * I would suggest you use a 3D test. This is more relevant to what HPC
>> applications do.
>> * In my experience, hypre's default parameters are tuned for 2D low order
>> problems like this so I would start with the defaults. I think they should
>> be fine for 3D also.
>> * As I think I said before, we have an AMGx interface under
>> development, and I heard yesterday that it should not be long until it is
>> available. It would be great if you could test that and we can work with
>> the NVIDIA developer to optimize it. We will let you know when
>> it's available.
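
A minimal sketch of the 3D test suggested above, assuming the KSP tutorial ex45 (a 3D Laplacian on a DMDA) is built the same way as ex50; the grid sizes and the use of hypre defaults here are illustrative assumptions, not tuned values:

```shell
# Hypothetical 3D analog of the ex50 run: ex45 solves a 3D Laplacian on a
# DMDA grid. Grid sizes and solver options are assumptions for illustration.
mpiexec -n 1 ./ex45 -da_grid_x 200 -da_grid_y 200 -da_grid_z 200 \
  -ksp_type cg -pc_type hypre -pc_hypre_type boomeramg \
  -vec_type cuda -mat_type aijcusparse \
  -ksp_monitor -ksp_view -log_view
```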
>>
>> Cheers,
>> Mark
>>
>>
>> On Mon, Mar 28, 2022 at 10:44 AM Qi Yang <qiyang at oakland.edu> wrote:
>>
>>>   Hi Mark and Barry,
>>>
>>> I really appreciate your explanation of the setup process. These past
>>> few days I tried to use the HYPRE AMG solver to replace the original AMG
>>> solver in PETSc.
>>>
>>> The solver settings of HYPRE are as follows:
>>> mpiexec -n 1 ./ex50 -da_grid_x 3000 -da_grid_y 3000 -ksp_type cg
>>> -pc_type hypre -pc_hypre_type boomeramg -pc_hypre_boomeramg_max_iter 1
>>> -pc_hypre_boomeramg_strong_threshold 0.7
>>> -pc_hypre_boomeramg_grid_sweeps_up 1 -pc_hypre_boomeramg_grid_sweeps_down 1
>>> -pc_hypre_boomeramg_agg_nl 2 -pc_hypre_boomeramg_agg_num_paths 1
>>> -pc_hypre_boomeramg_max_levels 25 -pc_hypre_boomeramg_coarsen_type PMIS
>>> -pc_hypre_boomeramg_interp_type ext+i -pc_hypre_boomeramg_P_max 2
>>> -pc_hypre_boomeramg_truncfactor 0.2 -vec_type cuda -mat_type aijcusparse
>>> -ksp_monitor -ksp_view -log_view
>>>
>>> [image: PMIS.PNG]
>>>
>>> The interesting part is that I chose PMIS as the coarsen type; reading
>>> through the code, you can find that only PMIS has both host and device
>>> (GPU) code paths.
>>> * HYPRE does reduce the solution time from 20s to 8s.
>>> * A memory-mapping process appears inside the solve, which causes several
>>> gaps in the following NVIDIA Nsight Systems profile; I am not sure what
>>> it means:
>>> [image: image.png]
>>> I am really interested in doing some benchmarks using the hypre AMG
>>> solver. Actually, I have already connected OpenFOAM, PETSc, HYPRE and
>>> AMGX together using the petsc4foam API (
>>> https://develop.openfoam.com/modules/external-solver/-/tree/amgxwrapper/src/petsc4Foam).
>>> I prefer to use PETSc as the base matrix solver because of a possible HIP
>>> implementation in the future; that way, I can compare NVIDIA and AMD
>>> GPUs. It seems there are many benchmark cases I can do in the future.
>>>
>>> Regards,
>>> Qi
>>>
>>>
>>>
>>>
>>> On Wed, Mar 23, 2022 at 9:39 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>> A few points, but first this is a nice start. If you are interested in
>>>> working on benchmarking that would be great. If so, read on.
>>>>
>>>> * Barry pointed out the SOR issues that are thrashing the
>>>> memory system. This solve would run faster on the CPU (maybe, 9M eqs is a
>>>> lot).
>>>> * Most applications run for some time, doing 100-1,000 or more solves
>>>> with one configuration, and this amortizes the setup costs for each
>>>> mesh: what I call the "mesh setup" cost.
>>>> * Many applications are nonlinear and use a full Newton solver that
>>>> does a "matrix setup" for each solve, but many applications can also
>>>> amortize this matrix setup (the PtAP stuff in the output, which is small
>>>> for a 2D problem but can be large for 3D problems).
>>>> * Now hypre's mesh setup is definitely better than GAMG's, and AMGx is
>>>> out of this world.
>>>>   - AMGx is the result of a serious development effort by NVIDIA about
>>>> 15 years ago with many 10's of NVIDIA developer years in it (I am guessing
>>>> but I know it was a serious effort for a few years)
>>>>     + We are currently working with the current AMG developer, Matt, to
>>>> provide an AMGx interface in PETSc, like hypre (DOE does not like us
>>>> working with non-portable solvers but AMGx is very good)
>>>> * Hypre and AMGx use "classic" AMG, which is like geometric multigrid
>>>> (fast) for M-matrices (very low order Laplacians, like ex50).
>>>> * GAMG uses "smoothed aggregation" AMG  because this algorithm has
>>>> better theoretical properties for high order and elasticity problems and
>>>> the algorithm's implementations and default parameters have been optimized
>>>> for these types of problems.
>>>>
>>>> It would be interesting to add Hypre to your study (Ex50) and add a
>>>> high order 3D elasticity problem (eg, snes/tests/ex13, or Jed Brown has
>>>> some nice elasticity problems).
>>>> If you are interested we can give you Hypre parameters for elasticity
>>>> problems.
>>>> I have no experience with AMGx on elasticity but the NVIDIA developer
>>>> is available and can be looped in.
>>>> For that matter we could bring the main hypre developer, Ruipeng, in as
>>>> well.
>>>> I would also suggest timing the setup (you can combine mesh and matrix
>>>> if you like) and solve phase separately. ex13 does this and we should find
>>>> another 5-point stencil example that does this if ex50 does not.
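
One way to read that setup/solve split off without modifying the example is to pull the PCSetUp and KSPSolve rows out of an existing -log_view table; a minimal sketch, using the log file name attached elsewhere in this thread:

```shell
# Extract the setup vs. solve timing rows from a saved -log_view output.
# The file name is the log attached in this thread; adjust as needed.
grep -E '^(PCSetUp|KSPSolve)' log.PETSc_cg_amg_ex50_gpu_cuda
```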
>>>>
>>>> BTW, I have been intending to write a benchmarking paper this year with
>>>> Matt and Ruipeng, but I am just not getting around to it ...
>>>> If you want to lead a paper and the experiments, we can help optimize
>>>> and tune our solvers, setup tests, write background material, etc.
>>>>
>>>> Cheers,
>>>> Mark
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Mar 22, 2022 at 12:30 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>>
>>>>> Indeed PCSetUp is taking most of the time (79%). In the version of
>>>>> PETSc you are running it is doing a great deal of the setup work on the
>>>>> CPU. You can see there is a lot of data movement between the CPU and GPU
>>>>> (in both directions) during the setup: "64 1.91e+03   54 1.21e+03 90".
>>>>>
>>>>> Clearly, we need help in porting all the parts of the GAMG setup that
>>>>> still occur on the CPU to the GPU.
>>>>>
>>>>>  Barry
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mar 22, 2022, at 12:07 PM, Qi Yang <qiyang at oakland.edu> wrote:
>>>>>
>>>>> Dear Barry,
>>>>>
>>>>> Your advice is helpful: the total time is reduced from 30s to 20s (now
>>>>> all matrix operations run on the GPU). Actually, I have tried other
>>>>> settings for the AMG preconditioner, such as -pc_gamg_threshold 0.05
>>>>> -pc_gamg_threshold_scale 0.5, but they did not seem to help much.
>>>>> The key point seems to be the PCSetUp process: from the log, it takes
>>>>> the most time, and the new Nsight Systems analysis shows a big gap
>>>>> before the KSP solver starts, which seems to be the PCSetUp process.
>>>>> I am not sure, am I right?
>>>>> <3.png>
>>>>>
>>>>> PCSetUp                2 1.0 1.5594e+01 1.0 3.06e+09 1.0 0.0e+00
>>>>> 0.0e+00 0.0e+00 79 78  0  0  0  79 78  0  0  0   196    8433     64
>>>>> 1.91e+03   54 1.21e+03 90
>>>>>
>>>>>
>>>>> Regards,
>>>>> Qi
>>>>>
>>>>> On Tue, Mar 22, 2022 at 10:44 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>>
>>>>>>
>>>>>>   It is using
>>>>>>
>>>>>> MatSOR               369 1.0 9.1214e+00 1.0 7.32e+09 1.0 0.0e+00
>>>>>> 0.0e+00 0.0e+00 29 27  0  0  0  29 27  0  0  0   803       0      0
>>>>>> 0.00e+00  565 1.35e+03  0
>>>>>>
>>>>>> which runs on the CPU not the GPU hence the large amount of time in
>>>>>> memory copies and poor performance. We are switching the default to be
>>>>>> Chebyshev/Jacobi which runs completely on the GPU (may already be switched
>>>>>> in the main branch).
>>>>>>
>>>>>> You can run with -mg_levels_pc_type jacobi. You should then see
>>>>>> almost the entire solver running on the GPU.
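
Combined with the original ex50 invocation from the first message in this thread, the suggestion above amounts to the following (a sketch; every option other than -mg_levels_pc_type jacobi is taken verbatim from that command):

```shell
# Original GAMG run with the level smoother switched from SOR (CPU-only)
# to Jacobi, which runs entirely on the GPU.
mpiexec -n 1 ./ex50 -da_grid_x 3000 -da_grid_y 3000 \
  -ksp_type cg -pc_type gamg -pc_gamg_type agg -pc_gamg_agg_nsmooths 1 \
  -mg_levels_pc_type jacobi \
  -vec_type cuda -mat_type aijcusparse -ksp_monitor -ksp_view -log_view
```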
>>>>>>
>>>>>> You may need to tune the number of smoothing steps or other
>>>>>> parameters of GAMG to get the faster solution time.
>>>>>>
>>>>>>   Barry
>>>>>>
>>>>>>
>>>>>> On Mar 22, 2022, at 10:30 AM, Qi Yang <qiyang at oakland.edu> wrote:
>>>>>>
>>>>>> To whom it may concern,
>>>>>>
>>>>>> I have tried petsc ex50 (Poisson) with CUDA, the KSP cg solver and
>>>>>> the gamg preconditioner; however, it ran for about 30s. I also tried
>>>>>> NVIDIA AMGX with the same solver and the same grid (3000*3000), and it
>>>>>> only took 2s. I used the Nsight Systems software to analyze those two
>>>>>> cases and found that petsc took much time in the memory process (63%
>>>>>> of total time, whereas amgx only took 19%). Attached are screenshots
>>>>>> of them.
>>>>>>
>>>>>> The petsc command is: mpiexec -n 1 ./ex50 -da_grid_x 3000
>>>>>> -da_grid_y 3000 -ksp_type cg -pc_type gamg -pc_gamg_type agg
>>>>>> -pc_gamg_agg_nsmooths 1 -vec_type cuda -mat_type aijcusparse -ksp_monitor
>>>>>> -ksp_view -log_view
>>>>>>
>>>>>> The log file is also attached.
>>>>>>
>>>>>> Regards,
>>>>>> Qi
>>>>>>
>>>>>> <1.png>
>>>>>> <2.png>
>>>>>> <log.PETSc_cg_amg_ex50_gpu_cuda>
>>>>>>
>>>>>>
>>>>>> <log.PETSc_cg_amg_jacobi_ex50_gpu_cuda>
>>>>>
>>>>>
>>>>>