[petsc-dev] PETSc amg solver with gpu seems run slowly

Mark Adams mfadams at lbl.gov
Tue Mar 22 20:39:03 CDT 2022


A few points, but first this is a nice start. If you are interested in
working on benchmarking that would be great. If so, read on.

* Barry pointed out the SOR issues that are thrashing the memory system.
This solve would run faster on the CPU (maybe, 9M eqs is a lot).
* Most applications run for some time, doing 100-1,000 or more solves with
one configuration, and this amortizes the setup cost for each mesh (what I
call the "mesh setup" cost).
* Many applications are nonlinear and use a full Newton solver that does a
"matrix setup" for each solve, but many applications can also amortize this
matrix setup (the PtAP stuff in the output, which is small for 2D problems
but can be large for 3D problems).
* Now hypre's mesh setup is definitely better than GAMG's, and AMGx is out
of this world.
  - AMGx is the result of a serious development effort by NVIDIA about 15
years ago with many tens of NVIDIA developer-years in it (I am guessing, but
I know it was a serious effort for a few years).
    + We are currently working with the current AMG developer, Matt, to
provide an AMGx interface in PETSc, like hypre (DOE does not like us
working with non-portable solvers but AMGx is very good)
* Hypre and AMGx use "classic" AMG, which is like geometric multigrid
(fast) for M-matrices (very low order Laplacians, like ex50).
* GAMG uses "smoothed aggregation" AMG because this algorithm has better
theoretical properties for high order and elasticity problems and the
algorithm's implementations and default parameters have been optimized for
these types of problems.
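
The amortization argument in the bullets above can be made concrete with a
little arithmetic. The sketch below is only illustrative: it takes the PCSetUp
time and percentage quoted from the log later in this thread, and it ASSUMES
that everything outside PCSetUp counts as "solve" and that a single setup can
be reused unchanged for every solve (no distinction between mesh setup and
matrix setup). The function name is made up for this example.

```python
def setup_fraction(setup_s, solve_s, n_solves):
    """Fraction of run time spent in setup when one setup serves n_solves solves."""
    total = setup_s + n_solves * solve_s
    return setup_s / total

# From the log quoted in this thread: PCSetUp took 1.5594e+01 s, shown as 79% of run time.
SETUP_S = 15.594
TOTAL_S = SETUP_S / 0.79      # about 19.7 s, in line with the ~20 s reported
SOLVE_S = TOTAL_S - SETUP_S   # ASSUMPTION: all non-PCSetUp time is treated as one solve

for n in (1, 10, 100, 1000):
    pct = 100.0 * setup_fraction(SETUP_S, SOLVE_S, n)
    print(f"{n:5d} solves per setup: setup is {pct:5.1f}% of run time")
```

Under these assumptions the setup share drops from 79% for a single solve to a
few percent once the same setup is reused for 100 solves, which is why a
one-solve benchmark overstates the setup cost for many applications.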

It would be interesting to add Hypre to your study (ex50) and add a high
order 3D elasticity problem (e.g., snes/tests/ex13, or Jed Brown has some
nice elasticity problems).
If you are interested we can give you Hypre parameters for elasticity
problems.
I have no experience with AMGx on elasticity but the NVIDIA developer is
available and can be looped in.
For that matter we could bring the main hypre developer, Ruipeng, in as
well.
I would also suggest timing the setup (you can combine mesh and matrix if
you like) and solve phase separately. ex13 does this and we should find
another 5-point stencil example that does this if ex50 does not.
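
One rough way to look at setup and solve costs separately after the fact is to
read the per-event rows that -log_view already prints. The sketch below parses
the two rows quoted further down in this thread. The column positions are an
ASSUMPTION based on my reading of the GPU-enabled -log_view layout (count, time,
percentages, Mflop/s, then CpuToGpu count/size, GpuToCpu count/size, GPU %F),
the type and function names are made up for this example, and the two rows
come from different runs, so they should be compared, not added.

```python
from typing import NamedTuple

class GpuEvent(NamedTuple):
    name: str
    count: int
    time_s: float
    pct_time: int
    cpu_to_gpu_count: int
    cpu_to_gpu_size: float   # "Size" column as printed (units assumed to be MB)
    gpu_to_cpu_count: int
    gpu_to_cpu_size: float
    gpu_pct_flops: int

def parse_event(line: str) -> GpuEvent:
    """Parse one -log_view event row; ASSUMES the 27-field GPU layout seen in this thread."""
    f = line.split()
    if len(f) != 27:
        raise ValueError(f"expected 27 fields, got {len(f)}")
    return GpuEvent(
        name=f[0], count=int(f[1]), time_s=float(f[3]), pct_time=int(f[10]),
        cpu_to_gpu_count=int(f[22]), cpu_to_gpu_size=float(f[23]),
        gpu_to_cpu_count=int(f[24]), gpu_to_cpu_size=float(f[25]),
        gpu_pct_flops=int(f[26]),
    )

# Rows copied from the messages quoted below (line wrapping removed).
PCSETUP = ("PCSetUp 2 1.0 1.5594e+01 1.0 3.06e+09 1.0 0.0e+00 0.0e+00 0.0e+00 "
           "79 78 0 0 0 79 78 0 0 0 196 8433 64 1.91e+03 54 1.21e+03 90")
MATSOR = ("MatSOR 369 1.0 9.1214e+00 1.0 7.32e+09 1.0 0.0e+00 0.0e+00 0.0e+00 "
          "29 27 0 0 0 29 27 0 0 0 803 0 0 0.00e+00 565 1.35e+03 0")

for row in (PCSETUP, MATSOR):
    ev = parse_event(row)
    print(f"{ev.name:8s} {ev.time_s:8.3f} s ({ev.pct_time}% of run), "
          f"{ev.cpu_to_gpu_count} CPU->GPU and {ev.gpu_to_cpu_count} GPU->CPU copies, "
          f"{ev.gpu_pct_flops}% of flops on GPU")
```

The same parser could be pointed at KSPSolve and PCSetUp rows from a single
run to report the two phases side by side, provided the log uses this layout.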

BTW, I have been intending to write a benchmarking paper this year with
Matt and Ruipeng, but I am just not getting around to it ...
If you want to lead a paper and the experiments, we can help optimize and
tune our solvers, set up tests, write background material, etc.

Cheers,
Mark


On Tue, Mar 22, 2022 at 12:30 PM Barry Smith <bsmith at petsc.dev> wrote:

>
> Indeed PCSetUp is taking most of the time (79%). In the version of PETSc
> you are running it is doing a great deal of the setup work on the CPU. You
> can see there is a lot of data movement between the CPU and GPU (in both
> directions) during the setup: the last columns of the PCSetUp line, 64
> 1.91e+03   54 1.21e+03 90, show 64 CPU-to-GPU copies (size 1.91e+03) and
> 54 GPU-to-CPU copies (size 1.21e+03).
>
> Clearly, we need help in porting all the parts of the GAMG setup that
> still occur on the CPU to the GPU.
>
>  Barry
>
>
>
>
> On Mar 22, 2022, at 12:07 PM, Qi Yang <qiyang at oakland.edu> wrote:
>
> Dear Barry,
>
> Your advice is helpful: the total time is now reduced from 30s to 20s (all
> matrix operations now run on the GPU). I have also tried other settings for
> the AMG preconditioner, such as -pc_gamg_threshold 0.05
> -pc_gamg_threshold_scale 0.5, but they do not seem to help much.
> The key point seems to be the PCSetUp process: according to the log it takes
> the most time, and the new Nsight Systems analysis shows a big gap before
> the KSP solver starts, which looks like the PCSetUp process. Am I right?
> <3.png>
>
> PCSetUp                2 1.0 1.5594e+01 1.0 3.06e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 79 78  0  0  0  79 78  0  0  0   196    8433     64 1.91e+03   54
> 1.21e+03 90
>
>
> Regards,
> Qi
>
> On Tue, Mar 22, 2022 at 10:44 PM Barry Smith <bsmith at petsc.dev> wrote:
>
>>
>>   It is using
>>
>> MatSOR               369 1.0 9.1214e+00 1.0 7.32e+09 1.0 0.0e+00 0.0e+00
>> 0.0e+00 29 27  0  0  0  29 27  0  0  0   803       0      0 0.00e+00  565
>> 1.35e+03  0
>>
>> which runs on the CPU, not the GPU, hence the large amount of time in
>> memory copies and poor performance. We are switching the default to be
>> Chebyshev/Jacobi which runs completely on the GPU (may already be switched
>> in the main branch).
>>
>> You can run with -mg_levels_pc_type jacobi. You should then see almost
>> the entire solver running on the GPU.
>>
>> You may need to tune the number of smoothing steps or other parameters of
>> GAMG to get a faster solution time.
>>
>>   Barry
>>
>>
>> On Mar 22, 2022, at 10:30 AM, Qi Yang <qiyang at oakland.edu> wrote:
>>
>> To whom it may concern,
>>
>> I have tried PETSc ex50 (Poisson) with CUDA, the KSP CG solver and the
>> GAMG preconditioner; however, it ran for about 30s. I also tried NVIDIA AMGX
>> with the same solver and the same grid (3000*3000), and it took only 2s. I
>> used Nsight Systems to analyze the two cases and found that PETSc spent much
>> of its time in memory operations (63% of total time, whereas AMGX spent only
>> 19%). Screenshots of both are attached.
>>
>> The PETSc command is: mpiexec -n 1 ./ex50 -da_grid_x 3000 -da_grid_y
>> 3000 -ksp_type cg -pc_type gamg -pc_gamg_type agg -pc_gamg_agg_nsmooths 1
>> -vec_type cuda -mat_type aijcusparse -ksp_monitor -ksp_view -log_view
>>
>> The log file is also attached.
>>
>> Regards,
>> Qi
>>
>> <1.png>
>> <2.png>
>> <log.PETSc_cg_amg_ex50_gpu_cuda>
>>
>>
>> <log.PETSc_cg_amg_jacobi_ex50_gpu_cuda>