[petsc-dev] [petsc-users] performance regression with GAMG

Mark Adams mfadams at lbl.gov
Fri Oct 6 08:15:33 CDT 2023


Stephan Kramer has a reproducer (see this thread) that runs on about 1500
processors. They have lots of tests and this seemed to be the only one, or
the smallest one, that failed.

I added a switch in GAMG so that your "low memory filter" can be used for this.
Stephan is accommodating and would test a fix.
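
For anyone trying to reproduce this, the GAMG options Stephan has been testing
with (quoted verbatim further down in this thread) can be set from the command
line or from code. A minimal sketch - everything besides the two option names
is generic PETSc boilerplate, not code from the branch under discussion:

    #include <petscksp.h>

    int main(int argc, char **argv)
    {
      KSP ksp;
      PC  pc;

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
      /* Option names taken verbatim from the messages below in this thread */
      PetscCall(PetscOptionsSetValue(NULL, "-pc_gamg_use_minimum_degree_ordering", "false"));
      PetscCall(PetscOptionsSetValue(NULL, "-pc_gamg_use_aggressive_square_graph", "true"));
      PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
      PetscCall(KSPGetPC(ksp, &pc));
      PetscCall(PCSetType(pc, PCGAMG));
      PetscCall(KSPSetFromOptions(ksp)); /* a real run would also call KSPSetOperators() and KSPSolve() */
      PetscCall(KSPDestroy(&ksp));
      PetscCall(PetscFinalize());
      return 0;
    }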

Stephan posted stack traces (4 Oct) in this thread, and one of them, as I
recall, hung waiting for MPI receives on an "nrecv" member of Mat.
I think "nrecv" has to be recomputed because, my guess is, the filter removed
a processor edge, so the sending processor no longer had any data to send
and did not send an empty message.
Just a guess.
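
A rough sketch of the kind of recount that would be needed, using PETSc's
existing message-counting utility; "comm" and the per-rank array "sendlen[]"
of post-filter send sizes are assumptions for illustration, and this is not
the actual MatFilter/GAMG code:

    /* After filtering, some neighbours may have nothing left to send, so the
       number of messages each rank should expect has to be recomputed rather
       than reused from the pre-filter matrix. */
    PetscMPIInt size, nrecvs, *iflags;

    PetscCallMPI(MPI_Comm_size(comm, &size));
    PetscCall(PetscCalloc1(size, &iflags));
    for (PetscMPIInt p = 0; p < size; p++) iflags[p] = (sendlen[p] > 0) ? 1 : 0; /* 1 iff we still send to rank p */
    PetscCall(PetscGatherNumberOfMessages(comm, iflags, NULL, &nrecvs));         /* fresh number of expected receives */
    PetscCall(PetscFree(iflags));

With an up-to-date nrecvs, only the matching receives (and only non-empty
sends) would be posted, so no rank waits on a message that is never sent.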

Thanks,
Mark



On Fri, Oct 6, 2023 at 12:30 AM Pierre Jolivet <pierre at joliv.et> wrote:

>
>
> On Oct 6, 2023, at 3:48 AM, Mark Adams <mfadams at lbl.gov> wrote:
>
> 
> Pierre, (moved to dev)
>
> It looks like there is a subtle bug in the new MatFilter.
> My guess is that after the compression/filter the communication buffers
> and lists need to be recomputed because the graph has changed.
>
>
> Maybe an issue with MatHeaderReplace()?
> Do you have a reproducer?
> I use this routine for AIJ, BAIJ, and SBAIJ and never ran into this
> (though the subsequent Mat is not involved in the same kind of operations
> as in GAMG).
>
> Thanks,
> Pierre
>
> And, the Mat-Mat Mults failed or hung because the communication
> requirements, as seen in the graph, did not match the cached communication
> lists.
> The old way just created a whole new matrix, which took care of that.
>
> Mark
>
>
>
> On Thu, Oct 5, 2023 at 8:51 PM Mark Adams <mfadams at lbl.gov> wrote:
>
>> Fantastic, it will get merged soon.
>>
>> Thank you for your diligence and patience.
>> This would have been a time bomb waiting to explode.
>>
>> Mark
>>
>> On Thu, Oct 5, 2023 at 7:23 PM Stephan Kramer <s.kramer at imperial.ac.uk>
>> wrote:
>>
>>> Great, that seems to fix the issue indeed - i.e. on the branch with the
>>> low memory filtering switched off (by default) we no longer see the
>>> "inconsistent data" error or hangs, and going back to the square graph
>>> aggressive coarsening brings back the old performance. So we'd be
>>> keen to have that branch merged indeed.
>>> Many thanks for your assistance with this.
>>> Stephan
>>>
>>> On 05/10/2023 01:11, Mark Adams wrote:
>>> > Thanks Stephan,
>>> >
>>> > It looks like the matrix is in a bad/incorrect state and parallel
>>> Mat-Mat
>>> > is waiting for messages that were not sent. A bug.
>>> >
>>> > Can you try my branch, adams/gamg-fast-filter, which is ready to merge?
>>> > We added a new filtering method in main that uses low memory but I
>>> found it
>>> > was slow, so this branch brings back the old filter code, used by
>>> default,
>>> > and keeps the low memory version as an option.
>>> > It is possible this low memory filtering messed up the internals of
>>> the Mat
>>> > in some way.
>>> > I hope this is it, but if not we can continue.
>>> >
>>> > This MR also makes square graph the default.
>>> > I have found it does create better aggregates, and on GPUs, with Kokkos
>>> > bug fixes from Junchao, Mat-Mat is fast (it might be slow on CPUs).
>>> >
>>> > Mark
>>> >
>>> >
>>> >
>>> >
>>> > On Wed, Oct 4, 2023 at 12:30 AM Stephan Kramer <
>>> s.kramer at imperial.ac.uk>
>>> > wrote:
>>> >
>>> >> Hi Mark
>>> >>
>>> >> Thanks again for re-enabling the square graph aggressive coarsening
>>> >> option, which seems to have restored performance for most of our cases.
>>> >> Unfortunately we do have a remaining issue, which only seems to occur
>>> >> for the larger mesh size ("level 7", which has 6,389,890 vertices and
>>> >> which we normally run on 1536 cpus): we either get a "Petsc has
>>> >> generated inconsistent data" error, or a hang - both when constructing
>>> >> the square graph matrix. So this is with the new
>>> >> -pc_gamg_aggressive_square_graph=true option; without the option there
>>> >> is no error, but of course we would be back to the worse performance.
>>> >>
>>> >> Backtrace for the "inconsistent data" error. Note this is actually
>>> just
>>> >> petsc main from 17 Sep, git 9a75acf6e50cfe213617e - so after your
>>> merge
>>> >> of adams/gamg-add-old-coarsening into main - with one unrelated commit
>>> >> from firedrake
>>> >>
>>> >> [0]PETSC ERROR: --------------------- Error Message
>>> >> --------------------------------------------------------------
>>> >> [0]PETSC ERROR: Petsc has generated inconsistent data
>>> >> [0]PETSC ERROR: j 8 not equal to expected number of sends 9
>>> >> [0]PETSC ERROR: Petsc Development GIT revision:
>>> >> v3.4.2-43104-ga3b76b71a1  GIT Date: 2023-09-18 10:26:04 +0100
>>> >> [0]PETSC ERROR: stokes_cubed_sphere_7e3_A3_TS1.py on a  named
>>> >> gadi-cpu-clx-0241.gadi.nci.org.au by sck551 Wed Oct  4 14:30:41 2023
>>> >> [0]PETSC ERROR: Configure options --prefix=/tmp/firedrake-prefix
>>> >> --with-make-np=4 --with-debugging=0 --with-shared-libraries=1
>>> >> --with-fortran-bindings=0 --with-zlib --with-c2html=0
>>> >> --with-mpiexec=mpiexec --with-cc=mpicc --with-cxx=mpicxx
>>> >> --with-fc=mpifort --download-hdf5 --download-hypre
>>> >> --download-superlu_dist --download-ptscotch --download-suitesparse
>>> >> --download-pastix --download-hwloc --download-metis
>>> --download-scalapack
>>> >> --download-mumps --download-chaco --download-ml
>>> >> CFLAGS=-diag-disable=10441 CXXFLAGS=-diag-disable=10441
>>> >> [0]PETSC ERROR: #1 PetscGatherMessageLengths2() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/sys/utils/mpimesg.c:270
>>> >> [0]PETSC ERROR: #2 MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ() at
>>> >>
>>> /jobfs/95504034.gadi-pbs/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c:1867
>>> >> [0]PETSC ERROR: #3 MatProductSymbolic_AtB_MPIAIJ_MPIAIJ() at
>>> >>
>>> /jobfs/95504034.gadi-pbs/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c:2071
>>> >> [0]PETSC ERROR: #4 MatProductSymbolic() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/mat/interface/matproduct.c:795
>>> >> [0]PETSC ERROR: #5 PCGAMGSquareGraph_GAMG() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/pc/impls/gamg/gamg.c:489
>>> >> [0]PETSC ERROR: #6 PCGAMGCoarsen_AGG() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/pc/impls/gamg/agg.c:969
>>> >> [0]PETSC ERROR: #7 PCSetUp_GAMG() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/pc/impls/gamg/gamg.c:645
>>> >> [0]PETSC ERROR: #8 PCSetUp() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/pc/interface/precon.c:1069
>>> >> [0]PETSC ERROR: #9 PCApply() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/pc/interface/precon.c:484
>>> >> [0]PETSC ERROR: #10 PCApply() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/pc/interface/precon.c:487
>>> >> [0]PETSC ERROR: #11 KSP_PCApply() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/include/petsc/private/kspimpl.h:383
>>> >> [0]PETSC ERROR: #12 KSPSolve_CG() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/ksp/impls/cg/cg.c:162
>>> >> [0]PETSC ERROR: #13 KSPSolve_Private() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/ksp/interface/itfunc.c:910
>>> >> [0]PETSC ERROR: #14 KSPSolve() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/ksp/interface/itfunc.c:1082
>>> >> [0]PETSC ERROR: #15 PCApply_FieldSplit_Schur() at
>>> >>
>>> >>
>>> /jobfs/95504034.gadi-pbs/petsc/src/ksp/pc/impls/fieldsplit/fieldsplit.c:1175
>>> >> [0]PETSC ERROR: #16 PCApply() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/pc/interface/precon.c:487
>>> >> [0]PETSC ERROR: #17 KSP_PCApply() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/include/petsc/private/kspimpl.h:383
>>> >> [0]PETSC ERROR: #18 KSPSolve_PREONLY() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/ksp/impls/preonly/preonly.c:25
>>> >> [0]PETSC ERROR: #19 KSPSolve_Private() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/ksp/interface/itfunc.c:910
>>> >> [0]PETSC ERROR: #20 KSPSolve() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/ksp/interface/itfunc.c:1082
>>> >> [0]PETSC ERROR: #21 SNESSolve_KSPONLY() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/snes/impls/ksponly/ksponly.c:49
>>> >> [0]PETSC ERROR: #22 SNESSolve() at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/snes/interface/snes.c:4635
>>> >>
>>> >> Last -info :pc messages:
>>> >>
>>> >> [0] <pc:gamg> PCSetUp(): Setting up PC for first time
>>> >> [0] <pc:gamg> PCSetUp_GAMG(): Stokes_fieldsplit_0_assembled_: level 0)
>>> >> N=152175366, n data rows=3, n data cols=6, nnz/row (ave)=191, np=1536
>>> >> [0] <pc:gamg> PCGAMGCreateGraph_AGG(): Filtering left 100. % edges in
>>> >> graph (1.588710e+07 1.765233e+06)
>>> >> [0] <pc:gamg> PCGAMGSquareGraph_GAMG():
>>> Stokes_fieldsplit_0_assembled_:
>>> >> Square Graph on level 1
>>> >> [0] <pc:gamg> fixAggregatesWithSquare(): isMPI = yes
>>> >> [0] <pc:gamg> PCGAMGProlongator_AGG(): Stokes_fieldsplit_0_assembled_:
>>> >> New grid 380144 nodes
>>> >> [0] <pc:gamg> PCGAMGOptProlongator_AGG():
>>> >> Stokes_fieldsplit_0_assembled_: Smooth P0: max eigen=4.489376e+00
>>> >> min=9.015236e-02 PC=jacobi
>>> >> [0] <pc:gamg> PCGAMGOptProlongator_AGG():
>>> >> Stokes_fieldsplit_0_assembled_: Smooth P0: level 0, cache spectra
>>> >> 0.0901524 4.48938
>>> >> [0] <pc:gamg> PCGAMGCreateLevel_GAMG():
>>> Stokes_fieldsplit_0_assembled_:
>>> >> Coarse grid reduction from 1536 to 1536 active processes
>>> >> [0] <pc:gamg> PCSetUp_GAMG(): Stokes_fieldsplit_0_assembled_: 1)
>>> >> N=2280864, n data cols=6, nnz/row (ave)=503, 1536 active pes
>>> >> [0] <pc:gamg> PCGAMGCreateGraph_AGG(): Filtering left 36.2891 % edges
>>> in
>>> >> graph (5.310360e+05 5.353000e+03)
>>> >> [0] <pc:gamg> PCGAMGSquareGraph_GAMG():
>>> Stokes_fieldsplit_0_assembled_:
>>> >> Square Graph on level 2
>>> >>
>>> >> The hang (on a slightly different model configuration but on the same
>>> >> mesh and number of cores) seems to occur in the same location. If I use
>>> >> gdb to attach to the running processes, it seems on some cores it has
>>> >> somehow managed to fall out of the PCSetUp and is waiting in the first
>>> >> norm calculation in the outside CG iteration:
>>> >>
>>> >> #0  0x000014cce9999119 in
>>> >> hmca_bcol_basesmuma_bcast_k_nomial_knownroot_progress () from
>>> >> /apps/hcoll/4.7.3202/lib/hcoll/hmca_bcol_basesmuma.so
>>> >> #1  0x000014ccef2c2737 in _coll_ml_allreduce () from
>>> >> /apps/hcoll/4.7.3202/lib/libhcoll.so.1
>>> >> #2  0x000014ccef5dd95b in mca_coll_hcoll_allreduce (sbuf=0x1,
>>> >> rbuf=0x7fff74ecbee8, count=1, dtype=0x14cd26ce6f80 <ompi_mpi_double>,
>>> >> op=0x14cd26cfbc20 <ompi_mpi_op_sum>, comm=0x3076fb0, module=0x43a0110)
>>> >> at
>>> >>
>>> >>
>>> /jobfs/35226956.gadi-pbs/0/openmpi/4.0.7/source/openmpi-4.0.7/ompi/mca/coll/hcoll/coll_hcoll_ops.c:228
>>> >> #3  0x000014cd26a1de28 in PMPI_Allreduce (sendbuf=0x1,
>>> >> recvbuf=<optimized out>, count=1, datatype=<optimized out>,
>>> >> op=0x14cd26cfbc20 <ompi_mpi_op_sum>, comm=0x3076fb0) at
>>> pallreduce.c:113
>>> >> #4  0x000014cd271c9889 in VecNorm_MPI_Default (xin=<optimized out>,
>>> >> type=<optimized out>, z=<optimized out>, VecNorm_SeqFn=<optimized
>>> out>)
>>> >> at
>>> >>
>>> >>
>>> /jobfs/95504034.gadi-pbs/petsc/include/../src/vec/vec/impls/mpi/pvecimpl.h:168
>>> >> #5  VecNorm_MPI (xin=0x14ccee1ddb80, type=3924123648, z=0x22d) at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/vec/vec/impls/mpi/pvec2.c:39
>>> >> #6  0x000014cd2718cddd in VecNorm (x=0x14ccee1ddb80, type=3924123648,
>>> >> val=0x22d) at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/vec/vec/interface/rvector.c:214
>>> >> #7  0x000014cd27f5a0b9 in KSPSolve_CG (ksp=0x14ccee1ddb80) at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/ksp/impls/cg/cg.c:163
>>> >> etc.
>>> >>
>>> >> but with other cores still stuck at:
>>> >>
>>> >> #0  0x000015375cf41e8a in ucp_worker_progress () from
>>> >> /apps/ucx/1.12.0/lib/libucp.so.0
>>> >> #1  0x000015377d4bd57b in opal_progress () at
>>> >>
>>> >>
>>> /jobfs/35226956.gadi-pbs/0/openmpi/4.0.7/source/openmpi-4.0.7/opal/runtime/opal_progress.c:231
>>> >> #2  0x000015377d4c3ba5 in ompi_sync_wait_mt
>>> >> (sync=sync at entry=0x7ffd6aedf6f0) at
>>> >>
>>> >>
>>> /jobfs/35226956.gadi-pbs/0/openmpi/4.0.7/source/openmpi-4.0.7/opal/threads/wait_sync.c:85
>>> >> #3  0x000015378bf7cf38 in ompi_request_default_wait_any (count=8,
>>> >> requests=0x8d465a0, index=0x7ffd6aedfa60, status=0x7ffd6aedfa10) at
>>> >>
>>> >>
>>> /jobfs/35226956.gadi-pbs/0/openmpi/4.0.7/source/openmpi-4.0.7/ompi/request/req_wait.c:124
>>> >> #4  0x000015378bfc1b4b in PMPI_Waitany (count=8, requests=0x8d465a0,
>>> >> indx=0x7ffd6aedfa60, status=<optimized out>) at pwaitany.c:86
>>> >> #5  0x000015378c88ef2c in MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ
>>> >> (P=0x2cc7500, A=0x1, fill=2.1219957934356005e-314, C=0xc0fe132c) at
>>> >>
>>> /jobfs/95504034.gadi-pbs/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c:1884
>>> >> #6  0x000015378c88dd4f in MatProductSymbolic_AtB_MPIAIJ_MPIAIJ
>>> >> (C=0x2cc7500) at
>>> >>
>>> /jobfs/95504034.gadi-pbs/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c:2071
>>> >> #7  0x000015378cc665b8 in MatProductSymbolic (mat=0x2cc7500) at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/mat/interface/matproduct.c:795
>>> >> #8  0x000015378d294473 in PCGAMGSquareGraph_GAMG (a_pc=0x2cc7500,
>>> >> Gmat1=0x1, Gmat2=0xc0fe132c) at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/pc/impls/gamg/gamg.c:489
>>> >> #9  0x000015378d27b83e in PCGAMGCoarsen_AGG (a_pc=0x2cc7500,
>>> >> a_Gmat1=0x1, agg_lists=0xc0fe132c) at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/pc/impls/gamg/agg.c:969
>>> >> #10 0x000015378d294c73 in PCSetUp_GAMG (pc=0x2cc7500) at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/pc/impls/gamg/gamg.c:645
>>> >> #11 0x000015378d215721 in PCSetUp (pc=0x2cc7500) at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/pc/interface/precon.c:1069
>>> >> #12 0x000015378d216b82 in PCApply (pc=0x2cc7500, x=0x1, y=0xc0fe132c)
>>> at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/pc/interface/precon.c:484
>>> >> #13 0x000015378eb91b2f in __pyx_pw_8petsc4py_5PETSc_2PC_45apply
>>> >> (__pyx_v_self=0x2cc7500, __pyx_args=0x1, __pyx_nargs=3237876524,
>>> >> __pyx_kwds=0x1) at src/petsc4py/PETSc.c:259082
>>> >> #14 0x000015379e0a69f7 in method_vectorcall_FASTCALL_KEYWORDS
>>> >> (func=0x15378f302890, args=0x83b3218, nargsf=<optimized out>,
>>> >> kwnames=<optimized out>) at ../Objects/descrobject.c:405
>>> >> #15 0x000015379e11d435 in _PyObject_VectorcallTstate (kwnames=0x0,
>>> >> nargsf=<optimized out>, args=0x83b3218, callable=0x15378f302890,
>>> >> tstate=0x23e0020) at ../Include/cpython/abstract.h:114
>>> >> #16 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>,
>>> >> args=0x83b3218, callable=0x15378f302890) at
>>> >> ../Include/cpython/abstract.h:123
>>> >> #17 call_function (kwnames=0x0, oparg=<optimized out>,
>>> >> pp_stack=<synthetic pointer>, trace_info=0x7ffd6aee0390,
>>> >> tstate=<optimized out>) at ../Python/ceval.c:5867
>>> >> #18 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized
>>> out>,
>>> >> throwflag=<optimized out>) at ../Python/ceval.c:4198
>>> >> #19 0x000015379e11b63b in _PyEval_EvalFrame (throwflag=0, f=0x83b3080,
>>> >> tstate=0x23e0020) at ../Include/internal/pycore_ceval.h:46
>>> >> #20 _PyEval_Vector (tstate=<optimized out>, con=<optimized out>,
>>> >> locals=<optimized out>, args=<optimized out>, argcount=4,
>>> >> kwnames=<optimized out>) at ../Python/ceval.c:5065
>>> >> #21 0x000015378ee1e057 in __Pyx_PyObject_FastCallDict (func=<optimized
>>> >> out>, args=0x1, _nargs=<optimized out>, kwargs=<optimized out>) at
>>> >> src/petsc4py/PETSc.c:548022
>>> >> #22 __pyx_f_8petsc4py_5PETSc_PCApply_Python (__pyx_v_pc=0x2cc7500,
>>> >> __pyx_v_x=0x1, __pyx_v_y=0xc0fe132c) at src/petsc4py/PETSc.c:31979
>>> >> #23 0x000015378d216cba in PCApply (pc=0x2cc7500, x=0x1, y=0xc0fe132c)
>>> at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/pc/interface/precon.c:487
>>> >> #24 0x000015378d4d153c in KSP_PCApply (ksp=0x2cc7500, x=0x1,
>>> >> y=0xc0fe132c) at
>>> >> /jobfs/95504034.gadi-pbs/petsc/include/petsc/private/kspimpl.h:383
>>> >> #25 0x000015378d4d1097 in KSPSolve_CG (ksp=0x2cc7500) at
>>> >> /jobfs/95504034.gadi-pbs/petsc/src/ksp/ksp/impls/cg/cg.c:162
>>> >>
>>> >> Let me know if there is anything further we can try to debug this
>>> >> issue.
>>> >>
>>> >> Kind regards
>>> >> Stephan Kramer
>>> >>
>>> >>
>>> >> On 02/09/2023 01:58, Mark Adams wrote:
>>> >>> Fantastic!
>>> >>>
>>> >>> I fixed a memory free problem. You should be OK now.
>>> >>> I am pretty sure you are good but I would like to wait to get any
>>> >> feedback
>>> >>> from you.
>>> >>> We should have a release at the end of the month and it would be
>>> nice to
>>> >>> get this into it.
>>> >>>
>>> >>> Thanks,
>>> >>> Mark
>>> >>>
>>> >>>
>>> >>> On Fri, Sep 1, 2023 at 7:07 AM Stephan Kramer <
>>> s.kramer at imperial.ac.uk>
>>> >>> wrote:
>>> >>>
>>> >>>> Hi Mark
>>> >>>>
>>> >>>> Sorry it took a while to report back. We have tried your branch but
>>> >>>> hit a few issues, some of which we're not entirely sure are related.
>>> >>>>
>>> >>>> First switching off minimum degree ordering, and then switching to
>>> >>>> the old version of aggressive coarsening, as you suggested, got us
>>> >>>> back to the coarsening behaviour that we had previously, but then we
>>> >>>> also observed an even further worsening of the iteration count: it
>>> >>>> had previously gone up by 50% already (with the newer main petsc),
>>> >>>> but now was more than double that of "old" petsc. It took us a while
>>> >>>> to realize this was due to the default smoother changing from
>>> >>>> Cheby+SOR to Cheby+Jacobi. Switching this also back to the old
>>> >>>> default, we get back to very similar coarsening levels (see below
>>> >>>> for more details if it is of interest) and iteration counts.
>>> >>>>
>>> >>>> So that's all very good news. However, we were also starting to see
>>> >>>> memory errors (double free or corruption) when we switched off the
>>> >>>> minimum degree ordering. Because this was at an earlier version of
>>> >>>> your branch we then rebuilt, hoping this was just an earlier bug that
>>> >>>> had been fixed, but then we were having MPI-lockup issues. We have
>>> >>>> now figured out that the MPI issues are completely unrelated - some
>>> >>>> combination of a newer MPI build and firedrake on our cluster, which
>>> >>>> also occurs using main branches of everything. So, switching back to
>>> >>>> an older MPI build, we are hoping to now test your most recent
>>> >>>> version of adams/gamg-add-old-coarsening with these options and see
>>> >>>> whether the memory errors are still there. Will let you know.
>>> >>>>
>>> >>>> Best wishes
>>> >>>> Stephan Kramer
>>> >>>>
>>> >>>> Coarsening details with various options for Level 6 of the test
>>> case:
>>> >>>>
>>> >>>> In our original setup (using "old" petsc), we had:
>>> >>>>
>>> >>>>              rows=516, cols=516, bs=6
>>> >>>>              rows=12660, cols=12660, bs=6
>>> >>>>              rows=346974, cols=346974, bs=6
>>> >>>>              rows=19169670, cols=19169670, bs=3
>>> >>>>
>>> >>>> Then with the newer main petsc we had
>>> >>>>
>>> >>>>              rows=666, cols=666, bs=6
>>> >>>>              rows=7740, cols=7740, bs=6
>>> >>>>              rows=34902, cols=34902, bs=6
>>> >>>>              rows=736578, cols=736578, bs=6
>>> >>>>              rows=19169670, cols=19169670, bs=3
>>> >>>>
>>> >>>> Then on your branch with minimum_degree_ordering False:
>>> >>>>
>>> >>>>              rows=504, cols=504, bs=6
>>> >>>>              rows=2274, cols=2274, bs=6
>>> >>>>              rows=11010, cols=11010, bs=6
>>> >>>>              rows=35790, cols=35790, bs=6
>>> >>>>              rows=430686, cols=430686, bs=6
>>> >>>>              rows=19169670, cols=19169670, bs=3
>>> >>>>
>>> >>>> And with minimum_degree_ordering False and
>>> use_aggressive_square_graph
>>> >>>> True:
>>> >>>>
>>> >>>>              rows=498, cols=498, bs=6
>>> >>>>              rows=12672, cols=12672, bs=6
>>> >>>>              rows=346974, cols=346974, bs=6
>>> >>>>              rows=19169670, cols=19169670, bs=3
>>> >>>>
>>> >>>> So that is indeed pretty much back to what it was before
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On 31/08/2023 23:40, Mark Adams wrote:
>>> >>>>> Hi Stephan,
>>> >>>>>
>>> >>>>> This branch is settling down.  adams/gamg-add-old-coarsening
>>> >>>>> <
>>> >>
>>> https://gitlab.com/petsc/petsc/-/commits/adams/gamg-add-old-coarsening>
>>> >>>>> I made the old, not minimum degree, ordering the default but kept
>>> the
>>> >> new
>>> >>>>> "aggressive" coarsening as the default, so I am hoping that just
>>> adding
>>> >>>>> "-pc_gamg_use_aggressive_square_graph true" to your regression
>>> tests
>>> >> will
>>> >>>>> get you back to where you were before.
>>> >>>>> Fingers crossed ... let me know if you have any success or not.
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Mark
>>> >>>>>
>>> >>>>>
>>> >>>>> On Tue, Aug 15, 2023 at 1:45 PM Mark Adams <mfadams at lbl.gov>
>>> wrote:
>>> >>>>>
>>> >>>>>> Hi Stephan,
>>> >>>>>>
>>> >>>>>> I have a branch that you can try: adams/gamg-add-old-coarsening
>>> >>>>>> <
>>> >>
>>> https://gitlab.com/petsc/petsc/-/commits/adams/gamg-add-old-coarsening
>>> >>>>>> Things to test:
>>> >>>>>> * First, verify that nothing unintended changed by reproducing
>>> your
>>> >> bad
>>> >>>>>> results with this branch (the defaults are the same)
>>> >>>>>> * Try not using the minimum degree ordering that I suggested
>>> >>>>>> with: -pc_gamg_use_minimum_degree_ordering false
>>> >>>>>>      -- I am eager to see if that is the main problem.
>>> >>>>>> * Go back to what I think is the old method:
>>> >>>>>> -pc_gamg_use_minimum_degree_ordering
>>> >>>>>> false -pc_gamg_use_aggressive_square_graph true
>>> >>>>>>
>>> >>>>>> When we get back to where you were, I would like to try to get
>>> modern
>>> >>>>>> stuff working.
>>> >>>>>> I did add a -pc_gamg_aggressive_mis_k <2> option.
>>> >>>>>> You could do another step of MIS coarsening with
>>> >>>>>> -pc_gamg_aggressive_mis_k 3
>>> >>>>>>
>>> >>>>>> Anyway, lots to look at but, alas, AMG does have a lot of
>>> parameters.
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>> Mark
>>> >>>>>>
>>> >>>>>> On Mon, Aug 14, 2023 at 4:26 PM Mark Adams <mfadams at lbl.gov>
>>> wrote:
>>> >>>>>>
>>> >>>>>>> On Mon, Aug 14, 2023 at 11:03 AM Stephan Kramer <
>>> >>>> s.kramer at imperial.ac.uk>
>>> >>>>>>> wrote:
>>> >>>>>>>
>>> >>>>>>>> Many thanks for looking into this, Mark
>>> >>>>>>>>> My 3D tests were not that different and I see you lowered the
>>> >>>>>>>>> threshold.
>>> >>>>>>>>> Note, you can set the threshold to zero, but your test is running
>>> >>>>>>>>> so differently from mine that there is something else going on.
>>> >>>>>>>>> Note, the new, bad, coarsening rate of 30:1 is what we tend to
>>> >>>>>>>>> shoot for in 3D.
>>> >>>>>>>>>
>>> >>>>>>>>> So it is not clear what the problem is.  Some questions:
>>> >>>>>>>>>
>>> >>>>>>>>> * do you have a picture of this mesh to show me?
>>> >>>>>>>> It's just a standard hexahedral cubed sphere mesh with the
>>> >>>>>>>> refinement level giving the number of times each of the six sides
>>> >>>>>>>> has been subdivided: so Level_5 means 2^5 x 2^5 squares per side,
>>> >>>>>>>> which is extruded to 16 layers. So the total number of elements at
>>> >>>>>>>> Level_5 is 6 x 32 x 32 x 16 = 98304 hexes. And everything doubles
>>> >>>>>>>> in all 3 dimensions (so 2^3) going to the next Level.
>>> >>>>>>>>
>>> >>>>>>> I see, and I assume these are pretty stretched elements.
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>>> * what do you mean by Q1-Q2 elements?
>>> >>>>>>>> Q2-Q1, basically Taylor hood on hexes, so (tri)quadratic for
>>> >> velocity
>>> >>>>>>>> and (tri)linear for pressure
>>> >>>>>>>>
>>> >>>>>>>> I guess you could argue we could/should just do good old geometric
>>> >>>>>>>> multigrid instead. More generally, we use this solver configuration
>>> >>>>>>>> a lot for tetrahedral Taylor Hood (P2-P1), in particular also for
>>> >>>>>>>> our adaptive mesh runs - would it be worth seeing whether we have
>>> >>>>>>>> the same performance issues with tetrahedral P2-P1?
>>> >>>>>>>>
>>> >>>>>>> No, you have a clear reproducer, if not minimal.
>>> >>>>>>> The first coarsening is very different.
>>> >>>>>>>
>>> >>>>>>> I am working on this and I see that I added a heuristic for thin
>>> >>>>>>> bodies where the vertices are ordered in the greedy algorithm with
>>> >>>>>>> minimum degree first.
>>> >>>>>>> This will tend to pick corners first, then edges, then faces, etc.
>>> >>>>>>> That may be the problem. I would like to understand it better (see
>>> >>>>>>> below).
>>> >>>>>>>
>>> >>>>>>>>> It would be nice to see if the new and old codes are similar
>>> >>>>>>>>> without aggressive coarsening.
>>> >>>>>>>>> This was the intended major change in this time frame, as you
>>> >>>>>>>>> noticed.
>>> >>>>>>>>> If these jobs are easy to run, could you check that the old and
>>> >>>>>>>>> new versions are similar with "-pc_gamg_square_graph 0" (and you
>>> >>>>>>>>> only need one time step)?
>>> >>>>>>>>> All you need to do is check that the first coarse grid has about
>>> >>>>>>>>> the same number of equations (large).
>>> >>>>>>>> Unfortunately we're seeing some memory errors when we use this
>>> >> option,
>>> >>>>>>>> and I'm not entirely clear whether we're just running out of
>>> memory
>>> >>>> and
>>> >>>>>>>> need to put it on a special queue.
>>> >>>>>>>>
>>> >>>>>>>> The run with square_graph 0 using new PETSc managed to get
>>> through
>>> >> one
>>> >>>>>>>> solve at level 5, and is giving the following mg levels:
>>> >>>>>>>>
>>> >>>>>>>>             rows=174, cols=174, bs=6
>>> >>>>>>>>               total: nonzeros=30276, allocated nonzeros=30276
>>> >>>>>>>> --
>>> >>>>>>>>               rows=2106, cols=2106, bs=6
>>> >>>>>>>>               total: nonzeros=4238532, allocated
>>> nonzeros=4238532
>>> >>>>>>>> --
>>> >>>>>>>>               rows=21828, cols=21828, bs=6
>>> >>>>>>>>               total: nonzeros=62588232, allocated
>>> nonzeros=62588232
>>> >>>>>>>> --
>>> >>>>>>>>               rows=589824, cols=589824, bs=6
>>> >>>>>>>>               total: nonzeros=1082528928, allocated
>>> >> nonzeros=1082528928
>>> >>>>>>>> --
>>> >>>>>>>>               rows=2433222, cols=2433222, bs=3
>>> >>>>>>>>               total: nonzeros=456526098, allocated
>>> nonzeros=456526098
>>> >>>>>>>>
>>> >>>>>>>> comparing with square_graph 100 with new PETSc
>>> >>>>>>>>
>>> >>>>>>>>               rows=96, cols=96, bs=6
>>> >>>>>>>>               total: nonzeros=9216, allocated nonzeros=9216
>>> >>>>>>>> --
>>> >>>>>>>>               rows=1440, cols=1440, bs=6
>>> >>>>>>>>               total: nonzeros=647856, allocated nonzeros=647856
>>> >>>>>>>> --
>>> >>>>>>>>               rows=97242, cols=97242, bs=6
>>> >>>>>>>>               total: nonzeros=65656836, allocated
>>> nonzeros=65656836
>>> >>>>>>>> --
>>> >>>>>>>>               rows=2433222, cols=2433222, bs=3
>>> >>>>>>>>               total: nonzeros=456526098, allocated
>>> nonzeros=456526098
>>> >>>>>>>>
>>> >>>>>>>> and old PETSc with square_graph 100
>>> >>>>>>>>
>>> >>>>>>>>               rows=90, cols=90, bs=6
>>> >>>>>>>>               total: nonzeros=8100, allocated nonzeros=8100
>>> >>>>>>>> --
>>> >>>>>>>>               rows=1872, cols=1872, bs=6
>>> >>>>>>>>               total: nonzeros=1234080, allocated
>>> nonzeros=1234080
>>> >>>>>>>> --
>>> >>>>>>>>               rows=47652, cols=47652, bs=6
>>> >>>>>>>>               total: nonzeros=23343264, allocated
>>> nonzeros=23343264
>>> >>>>>>>> --
>>> >>>>>>>>               rows=2433222, cols=2433222, bs=3
>>> >>>>>>>>               total: nonzeros=456526098, allocated
>>> nonzeros=456526098
>>> >>>>>>>> --
>>> >>>>>>>>
>>> >>>>>>>> Unfortunately old PETSc with square_graph 0 did not complete a
>>> >> single
>>> >>>>>>>> solve before giving the memory error
>>> >>>>>>>>
>>> >>>>>>> OK, thanks for trying.
>>> >>>>>>>
>>> >>>>>>> I am working on this and I will give you a branch to test, but
>>> if you
>>> >>>> can
>>> >>>>>>> rebuild PETSc here is a quick test that might fix your problem.
>>> >>>>>>> In src/ksp/pc/impls/gamg/agg.c you will see:
>>> >>>>>>>
>>> >>>>>>>        PetscCall(PetscSortIntWithArray(nloc, degree, permute));
>>> >>>>>>>
>>> >>>>>>> If you can comment this out in the new code and compare with the
>>> old,
>>> >>>>>>> that might fix the problem.
>>> >>>>>>>
>>> >>>>>>> Thanks,
>>> >>>>>>> Mark
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>>> BTW, I am starting to think I should add the old method back
>>> as an
>>> >>>>>>>> option.
>>> >>>>>>>>> I did not think this change would cause large differences.
>>> >>>>>>>> Yes, I think that would be much appreciated. Let us know if we
>>> can
>>> >> do
>>> >>>>>>>> any testing
>>> >>>>>>>>
>>> >>>>>>>> Best wishes
>>> >>>>>>>> Stephan
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>> Thanks,
>>> >>>>>>>>> Mark
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>>> Note that we are providing the rigid body near nullspace,
>>> >>>>>>>>>> hence the bs=3 to bs=6.
>>> >>>>>>>>>> We have tried different values for the gamg_threshold but it
>>> >> doesn't
>>> >>>>>>>>>> really seem to significantly alter the coarsening amount in
>>> that
>>> >>>> first
>>> >>>>>>>>>> step.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Do you have any suggestions for further things we should
>>> try/look
>>> >>>> at?
>>> >>>>>>>>>> Any feedback would be much appreciated
>>> >>>>>>>>>>
>>> >>>>>>>>>> Best wishes
>>> >>>>>>>>>> Stephan Kramer
>>> >>>>>>>>>>
>>> >>>>>>>>>> Full logs including log_view timings available from
>>> >>>>>>>>>> https://github.com/stephankramer/petsc-scaling/
>>> >>>>>>>>>>
>>> >>>>>>>>>> In particular:
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>
>>> https://github.com/stephankramer/petsc-scaling/blob/main/before/Level_5/output_2.dat
>>> >>
>>> https://github.com/stephankramer/petsc-scaling/blob/main/after/Level_5/output_2.dat
>>> >>
>>> https://github.com/stephankramer/petsc-scaling/blob/main/before/Level_6/output_2.dat
>>> >>
>>> https://github.com/stephankramer/petsc-scaling/blob/main/after/Level_6/output_2.dat
>>> >>
>>> https://github.com/stephankramer/petsc-scaling/blob/main/before/Level_7/output_2.dat
>>> >>
>>> https://github.com/stephankramer/petsc-scaling/blob/main/after/Level_7/output_2.dat
>>> >>
>>>
>>>