[petsc-dev] PETSc amg solver with gpu seems run slowly

Jed Brown jed at jedbrown.org
Mon Mar 28 12:21:45 CDT 2022


You can run on AMD GPUs now with -dm_vec_type kokkos -dm_mat_type aijkokkos, for example. GAMG works that way, with the PtAP setup on device. If you use MatSetValuesCOO, then matrix assembly is also entirely on-device.
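For concreteness, here is a minimal sketch of what COO assembly looks like (the 2x2 matrix and its index/value arrays below are just placeholders, and I am assuming a recent PETSc that has PetscCall and the COO interface):

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat         A;
  PetscInt    coo_i[] = {0, 0, 1, 1};            /* row indices of the nonzeros    */
  PetscInt    coo_j[] = {0, 1, 0, 1};            /* column indices of the nonzeros */
  PetscScalar coo_v[] = {2.0, -1.0, -1.0, 2.0};  /* values; may be a device array  */

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, 2, 2, PETSC_DETERMINE, PETSC_DETERMINE));
  PetscCall(MatSetType(A, MATAIJKOKKOS));                 /* or aijcusparse / aijhipsparse   */
  PetscCall(MatSetPreallocationCOO(A, 4, coo_i, coo_j));  /* nonzero pattern, set once       */
  PetscCall(MatSetValuesCOO(A, coo_v, INSERT_VALUES));    /* value insertion runs on device  */
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}

The point is that the (i,j) pattern is communicated once, so repeated value insertion (e.g., in a time loop) stays entirely on the device.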

Justin Chang <jychang48 at gmail.com> writes:

> Hi Qi, Mark,
>
> My colleague Suyash Tandon has almost completed a PETSc HIP port
> (essentially a hipification of the CUDA port) and has been trying to test
> it on the same OpenFOAM 3D lid-driven case. It would be interesting to see
> what the optimal HYPRE parameters are, so that we could experiment from the
> AMD side.
>
> Thanks,
> Justin
>
>
> On Mon, Mar 28, 2022 at 10:28 AM Qi Yang <qiyang at oakland.edu> wrote:
>
>> Hi Mark,
>>
>> Sure, I will try a 3D lid-driven case combining OpenFOAM, PETSc, and
>> HYPRE; let's see what happens.
>>
>> Kind regards,
>> Qi
>>
>> On Mon, Mar 28, 2022 at 11:04 PM Mark Adams <mfadams at lbl.gov> wrote:
>>
>>> Hi Qi, this is a good discussion with useful data, and we like to share, so
>>> let's keep this on the list.
>>>
>>> * I would suggest you use a 3D test. This is more relevant to what HPC
>>> applications do.
>>> * In my experience, hypre's default parameters are tuned for 2D low-order
>>> problems like this, so I would start with the defaults. I think they should
>>> be fine for 3D also.
>>> * As I think I said before, we have an AMGx interface under
>>> development, and I heard yesterday that it should not be long before it is
>>> available. It would be great if you could test it, and we can work with
>>> the NVIDIA developer to optimize it. We will let you know when
>>> it's available.
>>>
>>> Cheers,
>>> Mark
>>>
>>>
>>> On Mon, Mar 28, 2022 at 10:44 AM Qi Yang <qiyang at oakland.edu> wrote:
>>>
>>>>   Hi Mark and Barry,
>>>>
>>>> I really appreciate your explanation of the setup process. Over the past
>>>> few days I have been trying to use the HYPRE AMG solver in place of the
>>>> native AMG solver in PETSc.
>>>>
>>>> The solver settings of HYPRE are as follows:
>>>> mpiexec -n 1 ./ex50 -da_grid_x 3000 -da_grid_y 3000 -ksp_type cg
>>>> -pc_type hypre -pc_hypre_type boomeramg -pc_hypre_boomeramg_max_iter 1
>>>> -pc_hypre_boomeramg_strong_threshold 0.7
>>>> -pc_hypre_boomeramg_grid_sweeps_up 1 -pc_hypre_boomeramg_grid_sweeps_down 1
>>>> -pc_hypre_boomeramg_agg_nl 2 -pc_hypre_boomeramg_agg_num_paths 1
>>>> -pc_hypre_boomeramg_max_levels 25 -pc_hypre_boomeramg_coarsen_type PMIS
>>>> -pc_hypre_boomeramg_interp_type ext+i -pc_hypre_boomeramg_P_max 2
>>>> -pc_hypre_boomeramg_truncfactor 0.2 -vec_type cuda -mat_type aijcusparse
>>>> -ksp_monitor -ksp_view -log_view
>>>>
>>>> [image: PMIS.PNG]
>>>>
>>>> The interesting part is that I chose PMIS as the coarsening type; looking
>>>> through the code, you can find that only PMIS has GPU code paths (host and
>>>> device).
>>>> * HYPRE does reduce the solution time from 20s to 8s
>>>> * Memory-mapping operations show up inside the solve, which causes several
>>>> gaps in the following NVIDIA Nsight Systems profile; I am not sure what
>>>> this means:
>>>> [image: image.png]
>>>> I am really interested in running some benchmarks with the hypre AMG
>>>> solver. Actually, I have already connected OpenFOAM, PETSc, HYPRE, and AMGX
>>>> together through the petsc4Foam API (
>>>> https://develop.openfoam.com/modules/external-solver/-/tree/amgxwrapper/src/petsc4Foam).
>>>> I prefer to use PETSc as the base matrix solver because of a possible HIP
>>>> implementation in the future; that way I can compare NVIDIA and AMD GPUs.
>>>> It seems there are many benchmark cases I can run in the future.
>>>>
>>>> Regards,
>>>> Qi
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Mar 23, 2022 at 9:39 AM Mark Adams <mfadams at lbl.gov> wrote:
>>>>
>>>>> A few points, but first this is a nice start. If you are interested in
>>>>> working on benchmarking that would be great. If so, read on.
>>>>>
>>>>> * Barry pointed out the SOR issues that are thrashing the
>>>>> memory system. This solve would run faster on the CPU (maybe, 9M eqs is a
>>>>> lot).
>>>>> * Most applications run for some time, doing 100-1,000 or more solves
>>>>> with one configuration, and this amortizes the setup costs for each mesh
>>>>> (what I call the "mesh setup" cost).
>>>>> * Many applications are nonlinear and use a full Newton solver that
>>>>> does a "matrix setup" for each solve, but many applications can also
>>>>> amortize this matrix setup (the PtAP stuff in the output, which is small
>>>>> for 2D problems but can be large for 3D problems); see the sketch after
>>>>> this list.
>>>>> * Now, hypre's mesh setup is definitely better than GAMG's, and AMGx's is
>>>>> out of this world.
>>>>>   - AMGx is the result of a serious development effort by NVIDIA about
>>>>> 15 years ago, with many tens of NVIDIA developer-years in it (I am
>>>>> guessing, but I know it was a serious effort for a few years)
>>>>>     + We are currently working with the current AMGx developer, Matt, to
>>>>> provide an AMGx interface in PETSc, like the hypre interface (DOE does not
>>>>> like us working with non-portable solvers, but AMGx is very good)
>>>>> * Hypre and AMGx use "classic" AMG, which is like geometric multigrid
>>>>> (fast) for M-matrices (very low order Laplacians, like ex50).
>>>>> * GAMG uses "smoothed aggregation" AMG because this algorithm has
>>>>> better theoretical properties for high-order and elasticity problems, and
>>>>> its implementation and default parameters have been optimized
>>>>> for these types of problems.
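>>>>>
>>>>> To illustrate the amortization point above, here is a rough sketch (just a
>>>>> fragment; it assumes ksp, A, b, and x already exist and nsteps is whatever
>>>>> your application does):
>>>>>
>>>>>   PetscCall(KSPSetOperators(ksp, A, A)); /* hierarchy is built once, at the first solve */
>>>>>   for (PetscInt step = 0; step < nsteps; step++) {
>>>>>     /* ... update the right-hand side b for this step ... */
>>>>>     PetscCall(KSPSolve(ksp, b, x));      /* reuses the existing setup unless A changes  */
>>>>>   }
>>>>>
>>>>> If A itself changes every solve (full Newton), the matrix setup (the PtAP
>>>>> part) is redone, and that is the cost that is harder to amortize.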
>>>>>
>>>>> It would be interesting to add hypre to your study (ex50) and to add a
>>>>> high-order 3D elasticity problem (e.g., snes/tests/ex13, or Jed Brown has
>>>>> some nice elasticity problems).
>>>>> If you are interested, we can give you hypre parameters for elasticity
>>>>> problems.
>>>>> I have no experience with AMGx on elasticity, but the NVIDIA developer
>>>>> is available and can be looped in.
>>>>> For that matter, we could bring the main hypre developer, Ruipeng, in as
>>>>> well.
>>>>> I would also suggest timing the setup (you can combine mesh and matrix
>>>>> setup if you like) and the solve phase separately. ex13 does this, and we
>>>>> should find another 5-point stencil example that does if ex50 does not.
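>>>>> As a sketch of how to get separate timings with user-defined log stages
>>>>> (a fragment; it assumes ksp, b, and x already exist, and the stage names
>>>>> are arbitrary):
>>>>>
>>>>>   PetscLogStage setup_stage, solve_stage;
>>>>>   PetscCall(PetscLogStageRegister("MySetup", &setup_stage));
>>>>>   PetscCall(PetscLogStageRegister("MySolve", &solve_stage));
>>>>>
>>>>>   PetscCall(PetscLogStagePush(setup_stage));
>>>>>   PetscCall(KSPSetUp(ksp));              /* preconditioner setup is logged under MySetup */
>>>>>   PetscCall(PetscLogStagePop());
>>>>>
>>>>>   PetscCall(PetscLogStagePush(solve_stage));
>>>>>   PetscCall(KSPSolve(ksp, b, x));        /* the solve itself is logged under MySolve     */
>>>>>   PetscCall(PetscLogStagePop());
>>>>>
>>>>> With -log_view, each stage then gets its own section in the performance
>>>>> summary.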
>>>>>
>>>>> BTW, I have been intending to write a benchmarking paper this year with
>>>>> Matt and Ruipeng, but I am just not getting around to it ...
>>>>> If you want to lead a paper and the experiments, we can help optimize
>>>>> and tune our solvers, set up tests, write background material, etc.
>>>>>
>>>>> Cheers,
>>>>> Mark
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 22, 2022 at 12:30 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>>
>>>>>>
>>>>>> Indeed, PCSetUp is taking most of the time (79%). In the version of
>>>>>> PETSc you are running, it is doing a great deal of the setup work on the
>>>>>> CPU. You can see there is a lot of data movement between the CPU and GPU
>>>>>> (in both directions) during the setup: the end of the PCSetUp log line,
>>>>>> "64 1.91e+03   54 1.21e+03 90", gives the CpuToGpu count and size, the
>>>>>> GpuToCpu count and size, and the percentage of flops done on the GPU.
>>>>>>
>>>>>> Clearly, we need help in porting all the parts of the GAMG setup that
>>>>>> still occur on the CPU to the GPU.
>>>>>>
>>>>>>  Barry
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mar 22, 2022, at 12:07 PM, Qi Yang <qiyang at oakland.edu> wrote:
>>>>>>
>>>>>> Dear Barry,
>>>>>>
>>>>>> Your advice was helpful: the total time is now reduced from 30s to 20s
>>>>>> (and all matrix operations now run on the GPU). I have also tried other
>>>>>> settings for the AMG preconditioner, such as -pc_gamg_threshold 0.05
>>>>>> -pc_gamg_threshold_scale 0.5, but they did not seem to help much.
>>>>>> It seems the key point is the PCSetUp process: from the log it takes
>>>>>> the most time, and in the new Nsight Systems analysis there is a big gap
>>>>>> before the KSP solver starts, which looks like the PCSetUp process. I am
>>>>>> not sure; am I right?
>>>>>> <3.png>
>>>>>>
>>>>>> PCSetUp                2 1.0 1.5594e+01 1.0 3.06e+09 1.0 0.0e+00
>>>>>> 0.0e+00 0.0e+00 79 78  0  0  0  79 78  0  0  0   196    8433     64
>>>>>> 1.91e+03   54 1.21e+03 90
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Qi
>>>>>>
>>>>>> On Tue, Mar 22, 2022 at 10:44 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>>>
>>>>>>>
>>>>>>>   It is using
>>>>>>>
>>>>>>> MatSOR               369 1.0 9.1214e+00 1.0 7.32e+09 1.0 0.0e+00
>>>>>>> 0.0e+00 0.0e+00 29 27  0  0  0  29 27  0  0  0   803       0      0
>>>>>>> 0.00e+00  565 1.35e+03  0
>>>>>>>
>>>>>>> which runs on the CPU, not the GPU, hence the large amount of time spent
>>>>>>> in memory copies and the poor performance. We are switching the default
>>>>>>> to Chebyshev/Jacobi, which runs completely on the GPU (it may already be
>>>>>>> switched in the main branch).
>>>>>>>
>>>>>>> You can run with -mg_levels_pc_type jacobi. You should then see
>>>>>>> almost the entire solver running on the GPU.
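>>>>>>>
>>>>>>> For example, something along these lines (the same ex50 run as before
>>>>>>> with the smoother options added; I have not run this exact command):
>>>>>>>
>>>>>>> mpiexec -n 1 ./ex50 -da_grid_x 3000 -da_grid_y 3000 -ksp_type cg
>>>>>>> -pc_type gamg -pc_gamg_type agg -pc_gamg_agg_nsmooths 1
>>>>>>> -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi
>>>>>>> -vec_type cuda -mat_type aijcusparse -ksp_monitor -ksp_view -log_view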
>>>>>>>
>>>>>>> You may need to tune the number of smoothing steps or other
>>>>>>> parameters of GAMG to get the fastest solution time.
>>>>>>>
>>>>>>>   Barry
>>>>>>>
>>>>>>>
>>>>>>> On Mar 22, 2022, at 10:30 AM, Qi Yang <qiyang at oakland.edu> wrote:
>>>>>>>
>>>>>>> To whom it may concern,
>>>>>>>
>>>>>>> I have tried PETSc ex50 (Poisson) with CUDA, the KSP CG solver, and the
>>>>>>> GAMG preconditioner; however, it ran for about 30s. I also tried NVIDIA
>>>>>>> AMGX with the same solver and the same grid (3000x3000), and it took only
>>>>>>> 2s. I used the Nsight Systems software to analyze the two cases and found
>>>>>>> that PETSc spent much of its time in memory operations (63% of the total
>>>>>>> time, whereas AMGX spent only 19%). Screenshots of both are attached.
>>>>>>>
>>>>>>> The petsc command is: mpiexec -n 1 ./ex50 -da_grid_x 3000
>>>>>>> -da_grid_y 3000 -ksp_type cg -pc_type gamg -pc_gamg_type agg
>>>>>>> -pc_gamg_agg_nsmooths 1 -vec_type cuda -mat_type aijcusparse -ksp_monitor
>>>>>>> -ksp_view -log_view
>>>>>>>
>>>>>>> The log file is also attached.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Qi
>>>>>>>
>>>>>>> <1.png>
>>>>>>> <2.png>
>>>>>>> <log.PETSc_cg_amg_ex50_gpu_cuda>
>>>>>>>
>>>>>>>
>>>>>>> <log.PETSc_cg_amg_jacobi_ex50_gpu_cuda>
>>>>>>
>>>>>>
>>>>>>

