[petsc-users] About recent changes in GAMG

Mark Adams mfadams at lbl.gov
Thu Apr 18 15:30:13 CDT 2024


Yikes, it looks like we have been off the list this whole time.
I am not the only PETSc developer, nor the only person who knows about
PETSc!

These folks are seeing some strange behavior with GAMG going from 1 to 2
cores, using lots of memory, but one question they have, which I don't
understand either, is this:

>> Yea, my interpretation of these methods is also that
>> "PetscMemoryGetMaximumUsage" should be >= "PetscMallocGetMaximumUsage".
>> But you are seeing the opposite.


We are using PETSc main and have found a case where memory consumption
explodes in parallel.
Also, we see a non-negligible difference between PetscMemoryGetMaximumUsage()
and PetscMallocGetMaximumUsage().
Running in serial through /usr/bin/time, the max. resident set size matches
the PetscMallocGetMaximumUsage() result.
I would have expected it to match PetscMemoryGetMaximumUsage() instead.




                   PetscMemoryGetMaximumUsage  PetscMallocGetMaximumUsage  Time
Serial + Option 1  4.8 GB                      7.4 GB                      112 sec
2 core + Option 1  15.2 GB                     45.5 GB                     150 sec
Serial + Option 2  3.1 GB                      3.8 GB                      167 sec
2 core + Option 2  13.1 GB                     17.4 GB                     89 sec
Serial + Option 3  4.7 GB                      5.2 GB                      693 sec
2 core + Option 3  23.2 GB                     26.4 GB                     411 sec
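For reference, a minimal sketch of how the two numbers in the table above can
be queried at the end of a run (assuming PetscMemorySetGetMaximumUsage() is
called right after PetscInitialize() so that the resident-set maximum is
tracked; the assembly and solve are elided):

  #include <petscsys.h>

  int main(int argc, char **argv)
  {
    PetscLogDouble mem_max, malloc_max;

    PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
    PetscCall(PetscMemorySetGetMaximumUsage()); /* enable tracking of the OS memory high-water mark */

    /* ... assemble the matrix and run the solve here ... */

    PetscCall(PetscMemoryGetMaximumUsage(&mem_max));    /* process (resident set) maximum, in bytes */
    PetscCall(PetscMallocGetMaximumUsage(&malloc_max)); /* PetscMalloc()'d maximum, in bytes */
    PetscCall(PetscPrintf(PETSC_COMM_WORLD, "Memory max %g GB, Malloc max %g GB\n",
                          mem_max / 1.073741824e9, malloc_max / 1.073741824e9));
    PetscCall(PetscFinalize());
    return 0;
  }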


On Thu, Apr 18, 2024 at 4:13 PM Mark Adams <mfadams at lbl.gov> wrote:

> The next thing you might try is not using the null space argument.
> Hypre does not use it, but GAMG does.
> You could also run with -malloc_view to see some info on mallocs. It is
> probably in the Mat objects.
> You can also run with "-info" and grep on GAMG in the output and send that.
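> A sketch of what that could look like (reusing the ./ex1 driver and the
> option sets listed further down in this thread; "<Option 1 flags>" is just a
> placeholder here):
>
>   mpirun -n 2 ./ex1 <Option 1 flags> -malloc_view > malloc.log 2>&1
>   mpirun -n 2 ./ex1 <Option 1 flags> -info 2>&1 | grep GAMG > gamg_info.log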
>
> Mark
>
> On Thu, Apr 18, 2024 at 12:03 PM Ashish Patel <ashish.patel at ansys.com>
> wrote:
>
>> Hi Mark,
>>
>> Thanks for your response and suggestion. With hypre, both memory and time
>> look good; here is the data for that:
>>
>>                    PetscMemoryGetMaximumUsage  PetscMallocGetMaximumUsage  Time
>> Serial + Option 4  5.55 GB                     5.17 GB                     15.7 sec
>> 2 core + Option 4  5.85 GB                     4.69 GB                     21.9 sec
>>
>> Option 4
>> mpirun -n _ ./ex1 -A_name matrix.dat -b_name vector.dat -n_name
>> _null_space.dat -num_near_nullspace 6 -ksp_type cg -pc_type hypre
>> -pc_hypre_boomeramg_strong_threshold 0.9 -ksp_view -log_view
>> -log_view_memory -info :pc
>>
>> I am also attaching a standalone program to reproduce these options, and
>> a link to the matrix, RHS, and near nullspace vectors (serial.tar 2.xz
>> <https://urldefense.us/v3/__https://ansys-my.sharepoint.com/:u:/p/ashish_patel/EbUM5Ahp-epNi4xDxR9mnN0B1dceuVzGhVXQQYJzI5Py2g__;!!G_uCfscf7eWS!ar7t_MsQ-W6SXcDyEWpSDZP_YngFSqVsz2D-8chGJHSz7IZzkLBvN4UpJ1GXrRBGyhEHqmDUQGBfqTKf5x_BPXo$ >
>> ) if you would like to try as well. Please let me know if you have
>> trouble accessing the link.
>>
>> Ashish
>> ------------------------------
>> *From:* Mark Adams <mfadams at lbl.gov>
>> *Sent:* Wednesday, April 17, 2024 7:52 PM
>> *To:* Jeremy Theler (External) <jeremy.theler-ext at ansys.com>
>> *Cc:* Ashish Patel <ashish.patel at ansys.com>; Scott McClennan <
>> scott.mcclennan at ansys.com>
>> *Subject:* Re: About recent changes in GAMG
>>
>>
>>
>>
>> On Wed, Apr 17, 2024 at 7:20 AM Jeremy Theler (External) <
>> jeremy.theler-ext at ansys.com> wrote:
>>
>> Hey Mark. Long time no see! How are things going over there?
>>
>> We are using PETSc main and have found a case where memory consumption
>> explodes in parallel.
>> Also, we see a non-negligible difference between
>> PetscMemoryGetMaximumUsage() and PetscMallocGetMaximumUsage().
>> Running in serial through /usr/bin/time, the max. resident set size
>> matches the PetscMallocGetMaximumUsage() result.
>> I would have expected it to match PetscMemoryGetMaximumUsage() instead.
>>
>>
>> Yea, my interpretation of these methods is also that "Memory" should be
>> >= "Malloc". But you are seeing the opposite.
>>
>> I don't have any idea what is going on with your big memory penalty going
>> from 1 to 2 cores on this test, but the first thing to do is try other
>> solvers and see how they behave. Hypre in particular would be a good thing
>> to try because it is a similar algorithm.
>>
>> Mark
>>
>>
>>
>> The matrix size is around 1 million. We can share it with you if you
>> want, along with the RHS and the 6 near nullspace vectors and a modified
>> ex1.c which will read these files and show the following behavior.
>>
>> Observations using the latest main for an elastic matrix with a block size of 3
>> (after removing bonded glue-like DOFs with direct elimination) and a near
>> nullspace provided:
>>
>>    - Big memory penalty going from serial to parallel (2 cores)
>>    - Big difference between PetscMemoryGetMaximumUsage and
>>    PetscMallocGetMaximumUsage; why?
>>    - The memory penalty decreases with -pc_gamg_aggressive_square_graph false
>>    (Option 2)
>>    - The difference between PetscMemoryGetMaximumUsage and
>>    PetscMallocGetMaximumUsage shrinks when -pc_gamg_threshold is
>>    increased from 0 to 0.01 (Option 3), although the solve time increases a lot.
>>
>>
>>
>>
>>
>>                    PetscMemoryGetMaximumUsage  PetscMallocGetMaximumUsage  Time
>> Serial + Option 1  4.8 GB                      7.4 GB                      112 sec
>> 2 core + Option 1  15.2 GB                     45.5 GB                     150 sec
>> Serial + Option 2  3.1 GB                      3.8 GB                      167 sec
>> 2 core + Option 2  13.1 GB                     17.4 GB                     89 sec
>> Serial + Option 3  4.7 GB                      5.2 GB                      693 sec
>> 2 core + Option 3  23.2 GB                     26.4 GB                     411 sec
>>
>> Option 1
>> mpirun -n _ ./ex1 -A_name matrix.dat -b_name vector.dat -n_name
>> _null_space.dat -num_near_nullspace 6 -ksp_type cg -pc_type gamg
>> -pc_gamg_coarse_eq_limit 1000 -ksp_view -log_view -log_view_memory
>> -pc_gamg_aggressive_square_graph true -pc_gamg_threshold 0.0 -info :pc
>>
>> Option 2
>> mpirun -n _ ./ex1 -A_name matrix.dat -b_name vector.dat -n_name
>> _null_space.dat -num_near_nullspace 6 -ksp_type cg -pc_type gamg
>> -pc_gamg_coarse_eq_limit 1000 -ksp_view -log_view -log_view_memory
>> -pc_gamg_aggressive_square_graph false -pc_gamg_threshold 0.0 -info :pc
>>
>> Option 3
>> mpirun -n _ ./ex1 -A_name matrix.dat -b_name vector.dat -n_name
>> _null_space.dat -num_near_nullspace 6 -ksp_type cg -pc_type gamg
>> -pc_gamg_coarse_eq_limit 1000 -ksp_view -log_view -log_view_memory
>> -pc_gamg_aggressive_square_graph true -pc_gamg_threshold 0.01 -info :pc
>> ------------------------------
>> *From:* Mark Adams <mfadams at lbl.gov>
>> *Sent:* Tuesday, November 14, 2023 11:28 AM
>> *To:* Jeremy Theler (External) <jeremy.theler-ext at ansys.com>
>> *Cc:* Ashish Patel <ashish.patel at ansys.com>
>> *Subject:* Re: About recent changes in GAMG
>>
>>
>> Sounds good,
>>
>> I think the non-square-graph "aggressive" coarsening is the only issue that I
>> see, and you can fix this by using:
>>
>> -mat_coarsen_type mis
>>
>> As an aside, '-pc_gamg_aggressive_square_graph' should do it also; you can
>> use both, and the options will be ignored by earlier versions.
>>
>> If you see a difference, then the first thing to do is run with '-info :pc'
>> and send that to me (you can grep for 'GAMG' and send just that if you like,
>> to reduce the data).
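>> For example, appended to the GAMG options you are already using (a sketch;
>> "<your GAMG options>" is a placeholder), that could be:
>>
>>   mpirun -n 2 ./ex1 <your GAMG options> -mat_coarsen_type mis \
>>     -pc_gamg_aggressive_square_graph true -info :pc 2>&1 | grep GAMG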
>>
>> Thanks,
>> Mark
>>
>>
>> On Tue, Nov 14, 2023 at 8:49 AM Jeremy Theler (External) <
>> jeremy.theler-ext at ansys.com> wrote:
>>
>> Hi Mark.
>> Thanks for reaching out. For now, we are going to stick with 3.19 for our
>> production code because the changes in 3.20 affect our tests in
>> different ways (some of them perform better, some perform worse).
>> I have now switched to another task, investigating structural elements in
>> DMPlex.
>> I'll go back to analyzing the new changes in GAMG in a couple of weeks, so
>> we can then decide whether to upgrade to 3.20 or wait until 3.21.
>>
>> Thanks for your work and your kindness.
>> --
>> jeremy
>> ------------------------------
>> *From:* Mark Adams <mfadams at lbl.gov>
>> *Sent:* Tuesday, November 14, 2023 9:35 AM
>> *To:* Jeremy Theler (External) <jeremy.theler-ext at ansys.com>
>> *Cc:* Ashish Patel <ashish.patel at ansys.com>
>> *Subject:* Re: About recent changes in GAMG
>>
>>
>> Hi Jeremy,
>>
>> Just following up.
>> I appreciate your digging into performance regressions in GAMG.
>> AMG is really a pain sometimes and we want GAMG to be solid, at least for
>> mainstream options, and your efforts are appreciated.
>> So feel free to start this discussion up.
>>
>> Thanks,
>> Mark
>>
>> On Wed, Oct 25, 2023 at 9:52 PM Jeremy Theler (External) <
>> jeremy.theler-ext at ansys.com> wrote:
>>
>> Dear Mark
>>
>> Thanks for the follow up and sorry for the delay.
>> I'm taking some days off. I'll be back at full throttle next week, so we can
>> continue the discussion about these changes in GAMG.
>>
>> Regards,
>> Jeremy
>>
>> ------------------------------
>> *From:* Mark Adams <mfadams at lbl.gov>
>> *Sent:* Wednesday, October 18, 2023 9:15 AM
>> *To:* Jeremy Theler (External) <jeremy.theler-ext at ansys.com>; PETSc
>> users list <petsc-users at mcs.anl.gov>
>> *Cc:* Ashish Patel <ashish.patel at ansys.com>
>> *Subject:* Re: About recent changes in GAMG
>>
>>
>> Hi Jeremy,
>>
>> I hope you don't mind my putting this on the list (without the data), but this
>> serves as documentation, and you are the second user to find regressions.
>> Sorry for the churn.
>>
>> There is a lot here so we can iterate, but here is a pass at your
>> questions.
>>
>> *** Using MIS-2 instead of the square graph was motivated by setup
>> cost/performance, but on GPUs, with some recent fixes in Kokkos (in a branch),
>> the square graph seems OK.
>> My experience was that the square graph is better in terms of quality, and we
>> have a power user, like you all, who found this also.
>> So I switched the default back to the square graph.
>>
>> It is interesting that you found that MIS-2 (the new method) could be faster,
>> but it might be because the two methods coarsen at different rates, and that
>> can make a big difference.
>> (The way to test would be to adjust parameters to get similar coarsening
>> rates, but I digress.)
>> It's hard to understand the differences between these two methods in
>> terms of aggregate quality, so we need to just experiment and have options.
>>
>> *** As for your thermal problem: there was a complaint that the eigen
>> estimates for the Chebyshev smoother were not recomputed for nonlinear
>> problems, so I added an option to do that and turned it on by default.
>> Use '-pc_gamg_recompute_esteig false' to get back to the original behavior.
>> (I should have turned it off by default.)
>>
>> Now, if your problem is symmetric and you use CG to compute the eigen
>> estimates, there should be no difference.
>> However, if you use CG to compute the eigen estimates in GAMG (and have GAMG
>> give them to Chebyshev, the default), note that when you recompute the eigen
>> estimates the Chebyshev eigen estimator is used, and it will use GMRES by
>> default unless you set the SPD property on your matrix.
>> So if you set '-pc_gamg_esteig_ksp_type cg', you want to also set
>> '-mg_levels_esteig_ksp_type cg' (verify with -ksp_view and -options_left).
>> CG is a much better estimator for SPD problems.
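>> As a concrete sketch (assuming a matrix A assembled in a driver like the ex1
>> above), marking the matrix SPD in the code:
>>
>>   PetscCall(MatSetOption(A, MAT_SPD, PETSC_TRUE)); /* declare the operator SPD */
>>
>> together with the command-line options
>>
>>   -pc_gamg_esteig_ksp_type cg -mg_levels_esteig_ksp_type cg -ksp_view -options_left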
>>
>> I also found that the Chebyshev eigen estimator uses an LAPACK *eigenvalue*
>> method to compute the eigen bounds, while GAMG uses a *singular value* method.
>> The two give very different results on the lid-driven cavity test (ex19);
>> the eigenvalue estimate is lower, which is safer but not optimal if it is too low.
>> I have a branch to have Chebyshev use the singular value method, but I don't
>> plan on merging it (enough churn, and I don't understand these differences).
>>
>> *** '-pc_gamg_low_memory_threshold_filter false' recovers the old
>> filtering method.
>> This is the default now because there is a bug in the (new) low-memory
>> filter.
>> The bug is very rare but catastrophic.
>> We are working on it and will turn the new filter on by default when it's
>> fixed.
>> This choice does not affect the semantics of the solver, just the work and
>> memory complexity.
>>
>> *** As far as tet4 vs tet10 goes, I would guess that tet4 wants more
>> aggressive coarsening.
>> The default is to do aggressive coarsening on one (1) level.
>> You might want more levels for tet4.
>> Also, the new MIS-k coarsening can use any k (the default is 2) with
>> '-mat_coarsen_misk_distance k' (e.g., k=3).
>> I have not added hooks for a more complex schedule that specifies the
>> method on each level.
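>> For example (a sketch; "<Option 1 flags>" stands for the GAMG options shown
>> earlier in this thread, and the misk coarsen type is assumed to be the one
>> driven by this option), coarsening more aggressively with MIS-3 could look
>> like:
>>
>>   mpirun -n 2 ./ex1 <Option 1 flags> -mat_coarsen_type misk -mat_coarsen_misk_distance 3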
>>
>> Thanks,
>> Mark
>>
>> On Tue, Oct 17, 2023 at 9:33 PM Jeremy Theler (External) <
>> jeremy.theler-ext at ansys.com> wrote:
>>
>> Hey Mark
>>
>> Regarding the changes in the coarsening algorithm in 3.20 with respect to
>> 3.19: in general we see that for some problems the MIS strategy gives an
>> overall performance which is slightly better, and for some others it is
>> slightly worse than the "baseline" from 3.19.
>> We also saw that current main has switched back to the old square-graph
>> coarsening algorithm by default, which again, in some cases is better and
>> in others is worse than 3.19 without any extra command-line options.
>>
>> Now what seems weird to us is that we have a test case which is a heat
>> conduction problem with radiation boundary conditions (so it is nonlinear)
>> using tet10, and we see
>>
>>    1. that in parallel v3.20 is way worse than v3.19, although the
>>    memory usage is similar
>>    2. that PETSc main (with no extra flags, just the defaults) recovers
>>    the 3.19 performance, but the memory usage is significantly larger
>>
>>
>> I tried using the -pc_gamg_low_memory_threshold_filter flag and the
>> results were the same.
>>
>> Find attached the log and SNES views for 3.19, 3.20, and main using 4 MPI
>> ranks.
>> Is there any explanation for these two points we are seeing?
>> Another weird finding is that if we use tet4 instead of tet10, v3.20 is
>> only 10% slower than the other two, and main does not need more memory than
>> the other two.
>>
>> BTW, I have dozens of other -log_view outputs comparing 3.19, 3.20, and
>> main, should you be interested.
>>
>> Let me know if it is better to move this discussion into the PETSc
>> mailing list.
>>
>> Regards,
>> jeremy theler
>>
>>
>>