[petsc-users] GAMG parallel convergence sensitivity

Tue Mar 12 16:55:17 CDT 2019

Hi all, I've run into an unexpected issue with GAMG stagnating under a
certain condition. I'm running a 3D high-order DG discretization of the
compressible Navier-Stokes equations, using matrix-free GMRES + AMG, with
the relevant PETSc configuration:

-pc_type gamg
-ksp_type fgmres
-pc_gamg_agg_nsmooths 0
-mg_levels_ksp_type gmres
-mg_levels_pc_type bjacobi
-mg_levels_ksp_max_it 20
-mg_levels_ksp_rtol 0.0001
-pc_mg_cycle_type v
-pc_mg_type full

So: FGMRES on top, with AMG using GMRES preconditioned by block Jacobi
(ILU(0) on each block) as the smoother. The -ksp_view output is pasted at
the bottom. This setup has been working fairly robustly.

I'm testing two small mesh resolutions, with 1,536 and 6,144 cells,
where each cell contributes a 50x50 dense block to the Jacobian, with 4
off-diagonal neighbor blocks per cell. For each mesh I'm testing 2
configurations of the same problem, one at Mach 0.1 and the other at Mach
0.01 (the latter makes the system much worse conditioned, a kind of stress
test).

In serial, everything converges well to a relative tolerance of 0.01:
1,536 cells, Mach 0.1:  2 iterations
6,144 cells, Mach 0.1:  2 iterations
1,536 cells, Mach 0.01: 5 iterations
6,144 cells, Mach 0.01: 5 iterations

In parallel (-np 16), most cases also converge well:
1,536 cells, Mach 0.1:  3 iterations
6,144 cells, Mach 0.1:  4 iterations
1,536 cells, Mach 0.01: 11 iterations

but the 6,144-cell Mach 0.01 case is catastrophically worse:
    0 SNES Function norm 6.934657276072e+05
      0 KSP Residual norm 6.934657276072e+05
      1 KSP Residual norm 6.934440650708e+05
      2 KSP Residual norm 6.934157525695e+05
      3 KSP Residual norm 6.934145135179e+05
...
     48 KSP Residual norm 6.830785654915e+05
     49 KSP Residual norm 6.821332742917e+05
     50 KSP Residual norm 6.807807049444e+05

and it soon stalls entirely, failing to converge in hundreds of iterations.
The exact same case in serial converges nicely:
    0 SNES Function norm 6.934657276072e+05
      0 KSP Residual norm 6.934657276072e+05
      1 KSP Residual norm 1.705989154365e+05
      2 KSP Residual norm 3.183292610749e+04
      3 KSP Residual norm 1.568738082749e+04
      4 KSP Residual norm 9.875297457387e+03
      5 KSP Residual norm 6.489083537720e+03
    Linear solve converged due to CONVERGED_RTOL iterations 5

And the next coarser 1,536-cell case with the same physics is also
healthy with -np 16:

    0 SNES Function norm 2.400990060398e+05
      0 KSP Residual norm 2.400990060398e+05
      1 KSP Residual norm 2.391625967890e+05
      2 KSP Residual norm 1.388195699805e+05
      3 KSP Residual norm 3.072388366914e+04
      4 KSP Residual norm 2.151010198865e+04
      5 KSP Residual norm 1.305330349765e+04
      6 KSP Residual norm 8.126579575968e+03
      7 KSP Residual norm 6.186198840355e+03
      8 KSP Residual norm 4.673764041449e+03
      9 KSP Residual norm 3.332141521573e+03
     10 KSP Residual norm 2.811481187948e+03
     11 KSP Residual norm 2.189632613389e+03
    Linear solve converged due to CONVERGED_RTOL iterations 11

Any thoughts here? Is there anything obviously wrong with my setup? Is
there any way to reduce the dependence of the iteration count on the
parallel decomposition? Obviously I expect the iteration count to be
higher in parallel, but I didn't expect such a catastrophic failure.

Thanks as always,
Mark

-ksp_view:

0 TS dt 30. time 0.
    0 SNES Function norm 2.856641938332e+04
      0 KSP Residual norm 2.856641938332e+04
      1 KSP Residual norm 1.562096645358e+03
      2 KSP Residual norm 3.008746074553e+02
      3 KSP Residual norm 1.463990835793e+02
    Linear solve converged due to CONVERGED_RTOL iterations 3
KSP Object: 16 MPI processes
  type: fgmres
    restart=100, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    happy breakdown tolerance 1e-30
  maximum iterations=100, initial guess is zero
  tolerances:  relative=0.01, absolute=1e-06, divergence=10.
  right preconditioning
  using UNPRECONDITIONED norm type for convergence test
PC Object: 16 MPI processes
  type: gamg
    type is FULL, levels=5 cycles=v
      Using externally compute Galerkin coarse grid matrices
      GAMG specific options
        Threshold for dropping small values in graph on each level =   0.   0.   0.
        Threshold scaling factor for each level not specified = 1.
        AGG specific options
          Symmetric graph false
          Number of levels to square graph 1
          Number smoothing steps 0
  Coarse grid solver -- level -------------------------------
    KSP Object: (mg_coarse_) 16 MPI processes
      type: preonly
      maximum iterations=10000, initial guess is zero
      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
      left preconditioning
      using NONE norm type for convergence test
    PC Object: (mg_coarse_) 16 MPI processes
      type: bjacobi
        number of blocks = 16
        Local solve is same for all blocks, in the following KSP and PC objects:
      KSP Object: (mg_coarse_sub_) 1 MPI processes
        type: preonly
        maximum iterations=1, initial guess is zero
        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
        left preconditioning
        using NONE norm type for convergence test
      PC Object: (mg_coarse_sub_) 1 MPI processes
        type: lu
          out-of-place factorization
          tolerance for zero pivot 2.22045e-14
          using diagonal shift on blocks to prevent zero pivot [INBLOCKS]
          matrix ordering: nd
          factor fill ratio given 5., needed 1.10526
            Factored matrix follows:
              Mat Object: 1 MPI processes
                type: seqaij
                rows=25, cols=25, bs=5
                package used to perform factorization: petsc
                total: nonzeros=525, allocated nonzeros=525
                total number of mallocs used during MatSetValues calls =0
                  using I-node routines: found 5 nodes, limit used is 5
        linear system matrix = precond matrix:
        Mat Object: 1 MPI processes
          type: seqaij
          rows=25, cols=25, bs=5
          total: nonzeros=475, allocated nonzeros=475
          total number of mallocs used during MatSetValues calls =0
            using I-node routines: found 5 nodes, limit used is 5
      linear system matrix = precond matrix:
      Mat Object: 16 MPI processes
        type: mpiaij
        rows=25, cols=25, bs=5
        total: nonzeros=475, allocated nonzeros=475
        total number of mallocs used during MatSetValues calls =0
          using I-node (on process 0) routines: found 5 nodes, limit used is 5
  Down solver (pre-smoother) on level 1 -------------------------------
    KSP Object: (mg_levels_1_) 16 MPI processes
      type: gmres
        restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
        happy breakdown tolerance 1e-30
      maximum iterations=20, nonzero initial guess
      tolerances:  relative=0.0001, absolute=1e-50, divergence=10000.
      left preconditioning
      using NONE norm type for convergence test
    PC Object: (mg_levels_1_) 16 MPI processes
      type: bjacobi
        number of blocks = 16
        Local solve is same for all blocks, in the following KSP and PC objects:
      KSP Object: (mg_levels_1_sub_) 1 MPI processes
        type: preonly
        maximum iterations=10000, initial guess is zero
        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
        left preconditioning
        using NONE norm type for convergence test
      PC Object: (mg_levels_1_sub_) 1 MPI processes
        type: ilu
          out-of-place factorization
          0 levels of fill
          tolerance for zero pivot 2.22045e-14
          matrix ordering: natural
          factor fill ratio given 1., needed 1.
            Factored matrix follows:
              Mat Object: 1 MPI processes
                type: seqaij
                rows=75, cols=75, bs=5
                package used to perform factorization: petsc
                total: nonzeros=1925, allocated nonzeros=1925
                total number of mallocs used during MatSetValues calls =0
                  using I-node routines: found 15 nodes, limit used is 5
        linear system matrix = precond matrix:
        Mat Object: 1 MPI processes
          type: seqaij
          rows=75, cols=75, bs=5
          total: nonzeros=1925, allocated nonzeros=1925
          total number of mallocs used during MatSetValues calls =0
            using I-node routines: found 15 nodes, limit used is 5
      linear system matrix = precond matrix:
      Mat Object: 16 MPI processes
        type: mpiaij
        rows=75, cols=75, bs=5
        total: nonzeros=1925, allocated nonzeros=1925
        total number of mallocs used during MatSetValues calls =0
          using I-node (on process 0) routines: found 15 nodes, limit used is 5
  Up solver (post-smoother) same as down solver (pre-smoother)
  Down solver (pre-smoother) on level 2 -------------------------------
    KSP Object: (mg_levels_2_) 16 MPI processes
      type: gmres
        restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
        happy breakdown tolerance 1e-30
      maximum iterations=20, nonzero initial guess
      tolerances:  relative=0.0001, absolute=1e-50, divergence=10000.
      left preconditioning
      using NONE norm type for convergence test
    PC Object: (mg_levels_2_) 16 MPI processes
      type: bjacobi
        number of blocks = 16
        Local solve is same for all blocks, in the following KSP and PC objects:
      KSP Object: (mg_levels_2_sub_) 1 MPI processes
        type: preonly
        maximum iterations=10000, initial guess is zero
        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
        left preconditioning
        using NONE norm type for convergence test
      PC Object: (mg_levels_2_sub_) 1 MPI processes
        type: ilu
          out-of-place factorization
          0 levels of fill
          tolerance for zero pivot 2.22045e-14
          matrix ordering: natural
          factor fill ratio given 1., needed 1.
            Factored matrix follows:
              Mat Object: 1 MPI processes
                type: seqaij
                rows=35, cols=35, bs=5
                package used to perform factorization: petsc
                total: nonzeros=675, allocated nonzeros=675
                total number of mallocs used during MatSetValues calls =0
                  using I-node routines: found 7 nodes, limit used is 5
        linear system matrix = precond matrix:
        Mat Object: 1 MPI processes
          type: seqaij
          rows=35, cols=35, bs=5
          total: nonzeros=675, allocated nonzeros=675
          total number of mallocs used during MatSetValues calls =0
            using I-node routines: found 7 nodes, limit used is 5
      linear system matrix = precond matrix:
      Mat Object: 16 MPI processes
        type: mpiaij
        rows=305, cols=305, bs=5
        total: nonzeros=8675, allocated nonzeros=8675
        total number of mallocs used during MatSetValues calls =0
          using I-node (on process 0) routines: found 7 nodes, limit used is 5
  Up solver (post-smoother) same as down solver (pre-smoother)
  Down solver (pre-smoother) on level 3 -------------------------------
    KSP Object: (mg_levels_3_) 16 MPI processes
      type: gmres
        restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
        happy breakdown tolerance 1e-30
      maximum iterations=20, nonzero initial guess
      tolerances:  relative=0.0001, absolute=1e-50, divergence=10000.
      left preconditioning
      using NONE norm type for convergence test
    PC Object: (mg_levels_3_) 16 MPI processes
      type: bjacobi
        number of blocks = 16
        Local solve is same for all blocks, in the following KSP and PC objects:
      KSP Object: (mg_levels_3_sub_) 1 MPI processes
        type: preonly
        maximum iterations=10000, initial guess is zero
        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
        left preconditioning
        using NONE norm type for convergence test
      PC Object: (mg_levels_3_sub_) 1 MPI processes
        type: ilu
          out-of-place factorization
          0 levels of fill
          tolerance for zero pivot 2.22045e-14
          matrix ordering: natural
          factor fill ratio given 1., needed 1.
            Factored matrix follows:
              Mat Object: 1 MPI processes
                type: seqaij
                rows=50, cols=50, bs=5
                package used to perform factorization: petsc
                total: nonzeros=1050, allocated nonzeros=1050
                total number of mallocs used during MatSetValues calls =0
                  using I-node routines: found 10 nodes, limit used is 5
        linear system matrix = precond matrix:
        Mat Object: 1 MPI processes
          type: seqaij
          rows=50, cols=50, bs=5
          total: nonzeros=1050, allocated nonzeros=1050
          total number of mallocs used during MatSetValues calls =0
            using I-node routines: found 10 nodes, limit used is 5
      linear system matrix = precond matrix:
      Mat Object: 16 MPI processes
        type: mpiaij
        rows=1090, cols=1090, bs=5
        total: nonzeros=32050, allocated nonzeros=32050
        total number of mallocs used during MatSetValues calls =0
          using nonscalable MatPtAP() implementation
          using I-node (on process 0) routines: found 10 nodes, limit used is 5
  Up solver (post-smoother) same as down solver (pre-smoother)
  Down solver (pre-smoother) on level 4 -------------------------------
    KSP Object: (mg_levels_4_) 16 MPI processes
      type: gmres
        restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
        happy breakdown tolerance 1e-30
      maximum iterations=20, nonzero initial guess
      tolerances:  relative=0.0001, absolute=1e-50, divergence=10000.
      left preconditioning
      using NONE norm type for convergence test
    PC Object: (mg_levels_4_) 16 MPI processes
      type: bjacobi
        number of blocks = 16
        Local solve is same for all blocks, in the following KSP and PC objects:
      KSP Object: (mg_levels_4_sub_) 1 MPI processes
        type: preonly
        maximum iterations=10000, initial guess is zero
        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
        left preconditioning
        using NONE norm type for convergence test
      PC Object: (mg_levels_4_sub_) 1 MPI processes
        type: ilu
          out-of-place factorization
          0 levels of fill
          tolerance for zero pivot 2.22045e-14
          matrix ordering: natural
          factor fill ratio given 1., needed 1.
            Factored matrix follows:
              Mat Object: 1 MPI processes
                type: seqaij
                rows=4850, cols=4850, bs=5
                package used to perform factorization: petsc
                total: nonzeros=1117500, allocated nonzeros=1117500
                total number of mallocs used during MatSetValues calls =0
                  using I-node routines: found 970 nodes, limit used is 5
        linear system matrix = precond matrix:
        Mat Object: 1 MPI processes
          type: seqaij
          rows=4850, cols=4850, bs=5
          total: nonzeros=1117500, allocated nonzeros=1117500
          total number of mallocs used during MatSetValues calls =0
            using I-node routines: found 970 nodes, limit used is 5
      linear system matrix followed by preconditioner matrix:
      Mat Object: 16 MPI processes
        type: mffd
        rows=76800, cols=76800
          Matrix-free approximation:
            err=1.49012e-08 (relative error in function evaluation)
            Using wp compute h routine
                Does not compute normU
      Mat Object: 16 MPI processes
        type: mpiaij
        rows=76800, cols=76800, bs=5
        total: nonzeros=18880000, allocated nonzeros=18880000
        total number of mallocs used during MatSetValues calls =0
          using I-node (on process 0) routines: found 970 nodes, limit used is 5
  Up solver (post-smoother) same as down solver (pre-smoother)
  linear system matrix followed by preconditioner matrix:
  Mat Object: 16 MPI processes
    type: mffd
    rows=76800, cols=76800
      Matrix-free approximation:
        err=1.49012e-08 (relative error in function evaluation)
        Using wp compute h routine
            Does not compute normU
  Mat Object: 16 MPI processes
    type: mpiaij
    rows=76800, cols=76800, bs=5
    total: nonzeros=18880000, allocated nonzeros=18880000
    total number of mallocs used during MatSetValues calls =0
      using I-node (on process 0) routines: found 970 nodes, limit used is 5
        Line search: Using full step: fnorm 2.856641938332e+04 gnorm 3.868815397561e+03
    1 SNES Function norm 3.868815397561e+03