[petsc-dev] PETSc amg solver with gpu seems run slowly

Qi Yang qiyang at oakland.edu
Mon Mar 28 10:28:23 CDT 2022


Hi Mark,

Sure, I will try a 3D lid-driven case by combining OpenFOAM, PETSc and
HYPRE; let's see what happens.

Kind regards,
Qi

On Mon, Mar 28, 2022 at 11:04 PM Mark Adams <mfadams at lbl.gov> wrote:

> Hi Qi, these are good discussions and data, and we like to share them, so
> let's keep this on the list.
>
> * I would suggest you use a 3D test (see the example command at the end of
> this list). This is more relevant to what HPC applications do.
> * In my experience, hypre's default parameters are tuned for 2D low-order
> problems like this, so I would start with the defaults. I think they should
> be fine for 3D also.
> * As I think I said before, we have an AMGx interface under development, and
> I heard yesterday that it should not be long until it is available. It
> would be great if you could test that, and we can work with the NVIDIA
> developer to optimize it. We will let you know when it's available.
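>
> For reference, a 3D analogue of your ex50 run might look something like the
> command below. This is only a sketch: I am assuming the ksp tutorial ex45
> (the 3D DMDA Laplacian) here, and the grid size and options are just a
> starting point.
>
> mpiexec -n 1 ./ex45 -da_grid_x 200 -da_grid_y 200 -da_grid_z 200 -ksp_type cg
> -pc_type hypre -pc_hypre_type boomeramg -vec_type cuda -mat_type aijcusparse
> -ksp_monitor -log_view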
>
> Cheers,
> Mark
>
>
> On Mon, Mar 28, 2022 at 10:44 AM Qi Yang <qiyang at oakland.edu> wrote:
>
>>   Hi Mark and Barry,
>>
>> I really appreciate your explanation of the setup process. Over the past
>> few days I have been trying the HYPRE AMG solver as a replacement for
>> PETSc's built-in AMG solver (GAMG).
>>
>> The solver settings of HYPRE are as follows:
>> mpiexec -n 1 ./ex50  -da_grid_x 3000 -da_grid_y 3000 -ksp_type cg
>> -pc_type hypre -pc_hypre_type boomeramg -pc_hypre_boomeramg_max_iter 1
>>  -pc_hypre_boomeramg_strong_threshold 0.7
>> -pc_hypre_boomeramg_grid_sweeps_up 1 -pc_hypre_boomeramg_grid_sweeps_down 1
>> -pc_hypre_boomeramg_agg_nl 2 -pc_hypre_boomeramg_agg_num_paths 1
>> -pc_hypre_boomeramg_max_levels 25 *-pc_hypre_boomeramg_coarsen_type PMIS*
>> -pc_hypre_boomeramg_interp_type ext+i -pc_hypre_boomeramg_P_max 2
>> -pc_hypre_boomeramg_truncfactor 0.2 -vec_type cuda -mat_type aijcusparse
>> -ksp_monitor -ksp_view -log_view
>>
>> [image: PMIS.PNG]
>>
>> The interesting part is that I chose PMIS as the coarsening type; looking
>> through the code, PMIS appears to be the only coarsening with GPU code
>> paths (host and device).
>> * HYPRE does reduce the solution time from 20s to 8s.
>> * Memory-mapping operations show up inside the solve itself, which causes
>> several gaps in the NVIDIA Nsight Systems profile below; I am not sure what
>> that means.
>> [image: image.png]
>> I am really interested in doing some benchmarks with the hypre AMG solver.
>> In fact, I have already connected OpenFOAM, PETSc, HYPRE and AMGX together
>> through the petsc4Foam API (
>> https://develop.openfoam.com/modules/external-solver/-/tree/amgxwrapper/src/petsc4Foam).
>> I prefer to use PETSc as the base matrix solver because of a possible HIP
>> implementation in the future; that way I can compare NVIDIA and AMD GPUs.
>> It seems there are many benchmark cases I could run in the future.
>>
>> Regards,
>> Qi
>>
>>
>>
>>
>> On Wed, Mar 23, 2022 at 9:39 AM Mark Adams <mfadams at lbl.gov> wrote:
>>
>>> A few points, but first, this is a nice start. If you are interested in
>>> working on benchmarking, that would be great. If so, read on.
>>>
>>> * Barry pointed out the SOR issues that are thrashing the memory system.
>>> This solve would run faster on the CPU (maybe, 9M eqs is a lot).
>>> * Most applications run for some time, doing 100-1,000 or more solves
>>> with one configuration, and this amortizes the setup cost for each mesh,
>>> what I call the "mesh setup" cost.
>>> * Many applications are nonlinear and use a full Newton solver that does a
>>> "matrix setup" for each solve, but many applications can also amortize
>>> this matrix setup (the PtAP stuff in the output, which is small for a 2D
>>> problem but can be large for 3D problems).
>>> * Now, hypre's mesh setup is definitely better than GAMG's, and AMGx is
>>> out of this world.
>>>   - AMGx is the result of a serious development effort by NVIDIA about
>>> 15 years ago, with many tens of NVIDIA developer-years in it (I am
>>> guessing, but I know it was a serious effort for a few years).
>>>     + We are currently working with the current AMGx developer, Matt, to
>>> provide an AMGx interface in PETSc, like the hypre one (DOE does not like
>>> us working with non-portable solvers, but AMGx is very good).
>>> * Hypre and AMGx use "classic" AMG, which is like geometric multigrid
>>> (fast) for M-matrices (very low order Laplacians, like ex50).
>>> * GAMG uses "smoothed aggregation" AMG because this algorithm has better
>>> theoretical properties for high-order and elasticity problems, and its
>>> implementations and default parameters have been optimized for these types
>>> of problems.
>>>
>>> It would be interesting to add Hypre to your study (ex50) and to add a
>>> high-order 3D elasticity problem (e.g., snes/tests/ex13, or Jed Brown has
>>> some nice elasticity problems).
>>> If you are interested, we can give you Hypre parameters for elasticity
>>> problems.
>>> I have no experience with AMGx on elasticity, but the NVIDIA developer is
>>> available and can be looped in.
>>> For that matter, we could bring the main hypre developer, Ruipeng, in as
>>> well.
>>> I would also suggest timing the setup (you can combine the mesh and matrix
>>> setup if you like) and solve phases separately. ex13 does this, and we
>>> should find another 5-point stencil example that does this if ex50 does
>>> not.
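>>>
>>> To separate the two timings, a minimal sketch using PETSc log stages and an
>>> explicit KSPSetUp might look like the fragment below. This is illustrative
>>> only: it assumes the KSP, matrix, and vectors (ksp, A, b, x) already exist,
>>> the stage names are arbitrary, and error checking is omitted.
>>>
>>>   #include <petscksp.h>
>>>
>>>   PetscLogStage stage_setup, stage_solve;
>>>   PetscLogStageRegister("PC setup", &stage_setup);
>>>   PetscLogStageRegister("Solve", &stage_solve);
>>>
>>>   PetscLogStagePush(stage_setup);
>>>   KSPSetOperators(ksp, A, A);
>>>   KSPSetUp(ksp);        /* forces the preconditioner setup to happen here */
>>>   PetscLogStagePop();
>>>
>>>   PetscLogStagePush(stage_solve);
>>>   KSPSolve(ksp, b, x);  /* only the solve itself is timed in this stage */
>>>   PetscLogStagePop();
>>>
>>> The two stages then show up as separate sections in the -log_view output.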
>>>
>>> BTW, I have been intending to write a benchmarking paper this year with
>>> Matt and Ruipeng, but I am just not getting around to it ...
>>> If you want to lead a paper and the experiments, we can help optimize
>>> and tune our solvers, set up tests, write background material, etc.
>>>
>>> Cheers,
>>> Mark
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Mar 22, 2022 at 12:30 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>
>>>>
>>>> Indeed, PCSetUp is taking most of the time (79%). In the version of
>>>> PETSc you are running, it is doing a great deal of the setup work on the
>>>> CPU. You can see there is a lot of data movement between the CPU and GPU
>>>> (in both directions) during the setup, in the GPU columns of the PCSetUp
>>>> line below: 64 1.91e+03   54 1.21e+03 90
>>>>
>>>> Clearly, we need help in porting all the parts of the GAMG setup that
>>>> still occur on the CPU to the GPU.
>>>>
>>>>  Barry
>>>>
>>>>
>>>>
>>>>
>>>> On Mar 22, 2022, at 12:07 PM, Qi Yang <qiyang at oakland.edu> wrote:
>>>>
>>>> Dear Barry,
>>>>
>>>> Your advice is helpful: the total time is now reduced from 30s to 20s
>>>> (all matrix operations now run on the GPU). I have also tried other
>>>> settings for the AMG preconditioner, such as -pc_gamg_threshold 0.05
>>>> -pc_gamg_threshold_scale 0.5, but they do not seem to help much.
>>>> It seems the key point is the PCSetUp process; from the log it takes the
>>>> most time, and in the new Nsight Systems analysis there is a big gap
>>>> before the KSP solver starts, which looks like the PCSetUp process. I am
>>>> not sure, am I right?
>>>> <3.png>
>>>>
>>>> PCSetUp                2 1.0 1.5594e+01 1.0 3.06e+09 1.0 0.0e+00
>>>> 0.0e+00 0.0e+00 79 78  0  0  0  79 78  0  0  0   196    8433     64
>>>> 1.91e+03   54 1.21e+03 90
>>>>
>>>>
>>>> Regards,
>>>> Qi
>>>>
>>>> On Tue, Mar 22, 2022 at 10:44 PM Barry Smith <bsmith at petsc.dev> wrote:
>>>>
>>>>>
>>>>>   It is using
>>>>>
>>>>> MatSOR               369 1.0 9.1214e+00 1.0 7.32e+09 1.0 0.0e+00
>>>>> 0.0e+00 0.0e+00 29 27  0  0  0  29 27  0  0  0   803       0      0
>>>>> 0.00e+00  565 1.35e+03  0
>>>>>
>>>>> which runs on the CPU, not the GPU, hence the large amount of time spent
>>>>> in memory copies and the poor performance. We are switching the default
>>>>> smoother to Chebyshev/Jacobi, which runs completely on the GPU (it may
>>>>> already be switched in the main branch).
>>>>>
>>>>> You can run with -mg_levels_pc_type jacobi. You should then see almost
>>>>> the entire solver running on the GPU.
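>>>>>
>>>>> For example, adapting your original command (untested here, just to show
>>>>> where the option goes):
>>>>>
>>>>> mpiexec -n 1 ./ex50 -da_grid_x 3000 -da_grid_y 3000 -ksp_type cg
>>>>> -pc_type gamg -pc_gamg_type agg -pc_gamg_agg_nsmooths 1
>>>>> -mg_levels_pc_type jacobi -vec_type cuda -mat_type aijcusparse
>>>>> -ksp_monitor -ksp_view -log_view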
>>>>>
>>>>> You may need to tune the number of smoothing steps or other GAMG
>>>>> parameters to get the fastest solution time.
>>>>>
>>>>>   Barry
>>>>>
>>>>>
>>>>> On Mar 22, 2022, at 10:30 AM, Qi Yang <qiyang at oakland.edu> wrote:
>>>>>
>>>>> To whom it may concern,
>>>>>
>>>>> I have tried PETSc ex50 (Poisson) with CUDA, the KSP CG solver, and
>>>>> GAMG preconditioning; however, it ran for about 30s. I also tried NVIDIA
>>>>> AMGX with the same solver and the same grid (3000*3000), and it took only
>>>>> 2s. I used the Nsight Systems software to analyze the two cases and found
>>>>> that PETSc spent much of its time in memory operations (63% of the total
>>>>> time, whereas AMGX took only 19%). Screenshots of both are attached.
>>>>>
>>>>> The PETSc command is: mpiexec -n 1 ./ex50 -da_grid_x 3000 -da_grid_y
>>>>> 3000 -ksp_type cg -pc_type gamg -pc_gamg_type agg -pc_gamg_agg_nsmooths 1
>>>>> -vec_type cuda -mat_type aijcusparse -ksp_monitor -ksp_view -log_view
>>>>>
>>>>> The log file is also attached.
>>>>>
>>>>> Regards,
>>>>> Qi
>>>>>
>>>>> <1.png>
>>>>> <2.png>
>>>>> <log.PETSc_cg_amg_ex50_gpu_cuda>
>>>>>
>>>>>
>>>>> <log.PETSc_cg_amg_jacobi_ex50_gpu_cuda>
>>>>
>>>>
>>>>