[petsc-users] Strange GAMG performance for mixed FE formulation

Justin Chang jychang48 at gmail.com
Wed Mar 2 16:28:49 CST 2016


Dear all,

Using the Firedrake project, I am solving this simple mixed Poisson problem:

from firedrake import *

mesh = UnitCubeMesh(40, 40, 40)
V = FunctionSpace(mesh, "RT", 1)   # lowest-order Raviart-Thomas space for the vector variable
Q = FunctionSpace(mesh, "DG", 0)   # piecewise-constant space for the scalar variable
W = V * Q

v, p = TrialFunctions(W)
w, q = TestFunctions(W)

f = Function(Q)
f.interpolate(Expression("12*pi*pi*sin(pi*x[0]*2)*sin(pi*x[1]*2)*sin(2*pi*x[2])"))

a = dot(v, w)*dx - p*div(w)*dx + div(v)*q*dx
L = f*q*dx

u = Function(W)
solve(a == L, u, solver_parameters={...})
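
For reference, the forms above encode the standard mixed weak form of the Poisson problem (this is just my transcription of the UFL, not anything Firedrake prints): find (v, p) in RT_1 x DG_0 such that

\int_\Omega v \cdot w \, dx \;-\; \int_\Omega p \, (\nabla\!\cdot w) \, dx \;+\; \int_\Omega (\nabla\!\cdot v) \, q \, dx \;=\; \int_\Omega f \, q \, dx
\qquad \text{for all } (w, q) \in RT_1 \times DG_0 .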

This problem has 1161600 degrees of freedom. The solver_parameters are:

-ksp_type gmres
-pc_type fieldsplit
-pc_fieldsplit_type schur
-pc_fieldsplit_schur_fact_type upper
-pc_fieldsplit_schur_precondition selfp
-fieldsplit_0_ksp_type preonly
-fieldsplit_0_pc_type bjacobi
-fieldsplit_1_ksp_type preonly
-fieldsplit_1_pc_type hypre/ml/gamg
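
In the script these are passed through the solver_parameters dict elided above. A sketch of what that dict looks like (key names are just the option names without the leading dash; my actual run also sets a "solver_" options prefix, as the -ksp_view below shows, which I omit here):

solver_parameters = {
    "ksp_type": "gmres",
    "pc_type": "fieldsplit",
    "pc_fieldsplit_type": "schur",
    "pc_fieldsplit_schur_fact_type": "upper",
    "pc_fieldsplit_schur_precondition": "selfp",
    "fieldsplit_0_ksp_type": "preonly",
    "fieldsplit_0_pc_type": "bjacobi",
    "fieldsplit_1_ksp_type": "preonly",
    "fieldsplit_1_pc_type": "gamg",  # swapped for "hypre" or "ml" in the other runs
}
solve(a == L, u, solver_parameters=solver_parameters)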

For the last option, I compared wall-clock timings for hypre, ml, and gamg. Here are
the strong-scaling results on up to 64 cores (8 cores per Intel Xeon E5-2670 node):

hypre:
1 core: 47.5 s, 12 solver iters
2 cores: 34.1 s, 15 solver iters
4 cores: 21.5 s, 15 solver iters
8 cores: 16.6 s, 15 solver iters
16 cores: 10.2 s, 15 solver iters
24 cores: 7.66 s, 15 solver iters
32 cores: 6.31 s, 15 solver iters
40 cores: 5.68 s, 15 solver iters
48 cores: 5.36 s, 16 solver iters
56 cores: 5.12 s, 16 solver iters
64 cores: 4.99 s, 16 solver iters

ml:
1 core: 4.44 s, 14 solver iters
2 cores: 2.85 s, 16 solver iters
4 cores: 1.6 s, 17 solver iters
8 cores: 0.966 s, 17 solver iters
16 cores: 0.585 s, 18 solver iters
24 cores: 0.440 s, 18 solver iters
32 cores: 0.375 s, 18 solver iters
40 cores: 0.332 s, 18 solver iters
48 cores: 0.307 s, 17 solver iters
56 cores: 0.290 s, 18 solver iters
64 cores: 0.281 s, 18 solver iters

gamg:
1 core: 613 s, 12 solver iters
2 cores: 204 s, 15 solver iters
4 cores: 77.1 s, 15 solver iters
8 cores: 38.1 s, 15 solver iters
16 cores: 15.9 s, 16 solver iters
24 cores: 9.24 s, 16 solver iters
32 cores: 5.92 s, 16 solver iters
40 cores: 4.72 s, 16 solver iters
48 cores: 3.89 s, 16 solver iters
56 cores: 3.65 s, 16 solver iters
64 cores: 3.46 s, 16 solver iters
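
Just to put numbers on the scaling behavior I describe below, here is a throwaway script (timings copied from the tables above) that prints each solver's speedup relative to its own single-core run:

# Speedup relative to 1 core, using the wall-clock times listed above.
cores = [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64]
times = {
    "hypre": [47.5, 34.1, 21.5, 16.6, 10.2, 7.66, 6.31, 5.68, 5.36, 5.12, 4.99],
    "ml":    [4.44, 2.85, 1.60, 0.966, 0.585, 0.440, 0.375, 0.332, 0.307, 0.290, 0.281],
    "gamg":  [613.0, 204.0, 77.1, 38.1, 15.9, 9.24, 5.92, 4.72, 3.89, 3.65, 3.46],
}
for name, t in times.items():
    line = ", ".join("%d cores: %.1fx" % (c, t[0] / ti) for c, ti in zip(cores, t))
    print("%s -> %s" % (name, line))
# gamg already shows a ~3.0x speedup on 2 cores (613/204), well above the ideal 2x,
# whereas hypre and ml stay below ideal, as one would normally expect.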

The performance difference between ML and HYPRE makes sense to me, but what really
confuses me is GAMG. It is very slow on a single core, yet something internal causes
it to speed up super-linearly as I increase the number of MPI processes. Shouldn't ML
and GAMG have roughly the same performance? I am not sure which log outputs to send
you, but for starters, below is the -ksp_view output for the single-core case with
GAMG:

KSP Object:(solver_) 1 MPI processes

  type: gmres

    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt
Orthogonalization with no iterative refinement

    GMRES: happy breakdown tolerance 1e-30

  maximum iterations=10000, initial guess is zero

  tolerances:  relative=1e-07, absolute=1e-50, divergence=10000.

  left preconditioning

  using PRECONDITIONED norm type for convergence test

PC Object:(solver_) 1 MPI processes

  type: fieldsplit

    FieldSplit with Schur preconditioner, factorization UPPER

    Preconditioner for the Schur complement formed from Sp, an assembled
approximation to S, which uses (lumped, if requested) A00's diagonal's
inverse

    Split info:

    Split number 0 Defined by IS

    Split number 1 Defined by IS

    KSP solver for A00 block

      KSP Object:      (solver_fieldsplit_0_)       1 MPI processes

        type: preonly

        maximum iterations=10000, initial guess is zero

        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.

        left preconditioning

        using NONE norm type for convergence test

      PC Object:      (solver_fieldsplit_0_)       1 MPI processes

        type: bjacobi

          block Jacobi: number of blocks = 1

          Local solve is same for all blocks, in the following KSP and PC
objects:

          KSP Object:          (solver_fieldsplit_0_sub_)           1 MPI
processes

            type: preonly

            maximum iterations=10000, initial guess is zero

            tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.

            left preconditioning

            using NONE norm type for convergence test

          PC Object:          (solver_fieldsplit_0_sub_)           1 MPI
processes

            type: ilu

              ILU: out-of-place factorization

              0 levels of fill

              tolerance for zero pivot 2.22045e-14

              matrix ordering: natural

              factor fill ratio given 1., needed 1.

                Factored matrix follows:

                  Mat Object:                   1 MPI processes

                    type: seqaij

                    rows=777600, cols=777600

                    package used to perform factorization: petsc

                    total: nonzeros=5385600, allocated nonzeros=5385600

                    total number of mallocs used during MatSetValues calls
=0

                      not using I-node routines

            linear system matrix = precond matrix:

            Mat Object:            (solver_fieldsplit_0_)             1 MPI
processes

              type: seqaij

              rows=777600, cols=777600

              total: nonzeros=5385600, allocated nonzeros=5385600

              total number of mallocs used during MatSetValues calls =0

                not using I-node routines

        linear system matrix = precond matrix:

        Mat Object:        (solver_fieldsplit_0_)         1 MPI processes

          type: seqaij

          rows=777600, cols=777600

          total: nonzeros=5385600, allocated nonzeros=5385600

          total number of mallocs used during MatSetValues calls =0

            not using I-node routines

    KSP solver for S = A11 - A10 inv(A00) A01

      KSP Object:      (solver_fieldsplit_1_)       1 MPI processes

        type: preonly

        maximum iterations=10000, initial guess is zero

        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.

        left preconditioning

        using NONE norm type for convergence test

      PC Object:      (solver_fieldsplit_1_)       1 MPI processes

        type: gamg

          MG: type is MULTIPLICATIVE, levels=5 cycles=v

            Cycles per PCApply=1

            Using Galerkin computed coarse grid matrices

            GAMG specific options

              Threshold for dropping small values from graph 0.

              AGG specific options

                Symmetric graph false

        Coarse grid solver -- level -------------------------------

          KSP Object:          (solver_fieldsplit_1_mg_coarse_)           1
MPI processes

            type: preonly

            maximum iterations=1, initial guess is zero

            tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.

            left preconditioning

            using NONE norm type for convergence test

          PC Object:          (solver_fieldsplit_1_mg_coarse_)           1
MPI processes

            type: bjacobi

              block Jacobi: number of blocks = 1

              Local solve is same for all blocks, in the following KSP and
PC objects:

              KSP Object:              (solver_fieldsplit_1_mg_coarse_sub_)
              1 MPI processes

                type: preonly

                maximum iterations=1, initial guess is zero

                tolerances:  relative=1e-05, absolute=1e-50,
divergence=10000.

                left preconditioning

                using NONE norm type for convergence test

              PC Object:              (solver_fieldsplit_1_mg_coarse_sub_)
              1 MPI processes

                type: lu

                  LU: out-of-place factorization

                  tolerance for zero pivot 2.22045e-14

                  using diagonal shift on blocks to prevent zero pivot
[INBLOCKS]

                  matrix ordering: nd

                  factor fill ratio given 5., needed 1.

                    Factored matrix follows:

                      Mat Object:                       1 MPI processes

                        type: seqaij

                        rows=9, cols=9

                        package used to perform factorization: petsc

                        total: nonzeros=81, allocated nonzeros=81

                        total number of mallocs used during MatSetValues
calls =0

                          using I-node routines: found 2 nodes, limit used
is 5

                linear system matrix = precond matrix:

                Mat Object:                 1 MPI processes

                  type: seqaij

                  rows=9, cols=9

                  total: nonzeros=81, allocated nonzeros=81

                  total number of mallocs used during MatSetValues calls =0

                    using I-node routines: found 2 nodes, limit used is 5

            linear system matrix = precond matrix:

            Mat Object:             1 MPI processes

              type: seqaij

              rows=9, cols=9

              total: nonzeros=81, allocated nonzeros=81

              total number of mallocs used during MatSetValues calls =0

                using I-node routines: found 2 nodes, limit used is 5

        Down solver (pre-smoother) on level 1
-------------------------------

          KSP Object:          (solver_fieldsplit_1_mg_levels_1_)
1 MPI processes

            type: chebyshev

              Chebyshev: eigenvalue estimates:  min = 0.0999525, max =
1.09948

              Chebyshev: eigenvalues estimated using gmres with
translations  [0. 0.1; 0. 1.1]

              KSP Object:
(solver_fieldsplit_1_mg_levels_1_esteig_)               1 MPI processes

                type: gmres

                  GMRES: restart=30, using Classical (unmodified)
Gram-Schmidt Orthogonalization with no iterative refinement

                  GMRES: happy breakdown tolerance 1e-30

                maximum iterations=10, initial guess is zero

                tolerances:  relative=1e-12, absolute=1e-50,
divergence=10000.

                left preconditioning

                using PRECONDITIONED norm type for convergence test

            maximum iterations=2

            tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.

            left preconditioning

            using nonzero initial guess

            using NONE norm type for convergence test

          PC Object:          (solver_fieldsplit_1_mg_levels_1_)
1 MPI processes

            type: sor

              SOR: type = local_symmetric, iterations = 1, local iterations
= 1, omega = 1.

            linear system matrix = precond matrix:

            Mat Object:             1 MPI processes

              type: seqaij

              rows=207, cols=207

              total: nonzeros=42849, allocated nonzeros=42849

              total number of mallocs used during MatSetValues calls =0

                using I-node routines: found 42 nodes, limit used is 5

        Up solver (post-smoother) same as down solver (pre-smoother)

        Down solver (pre-smoother) on level 2
-------------------------------

          KSP Object:          (solver_fieldsplit_1_mg_levels_2_)
1 MPI processes

            type: chebyshev

              Chebyshev: eigenvalue estimates:  min = 0.0996628, max =
1.09629

              Chebyshev: eigenvalues estimated using gmres with
translations  [0. 0.1; 0. 1.1]

              KSP Object:
(solver_fieldsplit_1_mg_levels_2_esteig_)               1 MPI processes

                type: gmres

                  GMRES: restart=30, using Classical (unmodified)
Gram-Schmidt Orthogonalization with no iterative refinement

                  GMRES: happy breakdown tolerance 1e-30

                maximum iterations=10, initial guess is zero

                tolerances:  relative=1e-12, absolute=1e-50,
divergence=10000.

                left preconditioning

                using PRECONDITIONED norm type for convergence test

            maximum iterations=2

            tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.

            left preconditioning

            using nonzero initial guess

            using NONE norm type for convergence test

          PC Object:          (solver_fieldsplit_1_mg_levels_2_)
1 MPI processes

            type: sor

              SOR: type = local_symmetric, iterations = 1, local iterations
= 1, omega = 1.

            linear system matrix = precond matrix:

            Mat Object:             1 MPI processes

              type: seqaij

              rows=5373, cols=5373

              total: nonzeros=28852043, allocated nonzeros=28852043

              total number of mallocs used during MatSetValues calls =0

                using I-node routines: found 1481 nodes, limit used is 5

        Up solver (post-smoother) same as down solver (pre-smoother)

        Down solver (pre-smoother) on level 3
-------------------------------

          KSP Object:          (solver_fieldsplit_1_mg_levels_3_)
1 MPI processes

            type: chebyshev

              Chebyshev: eigenvalue estimates:  min = 0.0994294, max =
1.09372

              Chebyshev: eigenvalues estimated using gmres with
translations  [0. 0.1; 0. 1.1]

              KSP Object:
(solver_fieldsplit_1_mg_levels_3_esteig_)               1 MPI processes

                type: gmres

                  GMRES: restart=30, using Classical (unmodified)
Gram-Schmidt Orthogonalization with no iterative refinement

                  GMRES: happy breakdown tolerance 1e-30

                maximum iterations=10, initial guess is zero

                tolerances:  relative=1e-12, absolute=1e-50,
divergence=10000.

                left preconditioning

                using PRECONDITIONED norm type for convergence test

            maximum iterations=2

            tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.

            left preconditioning

            using nonzero initial guess

            using NONE norm type for convergence test

          PC Object:          (solver_fieldsplit_1_mg_levels_3_)
1 MPI processes

            type: sor

              SOR: type = local_symmetric, iterations = 1, local iterations
= 1, omega = 1.

            linear system matrix = precond matrix:

            Mat Object:             1 MPI processes

              type: seqaij

              rows=52147, cols=52147

              total: nonzeros=38604909, allocated nonzeros=38604909

              total number of mallocs used during MatSetValues calls =2

                not using I-node routines

        Up solver (post-smoother) same as down solver (pre-smoother)

        Down solver (pre-smoother) on level 4
-------------------------------

          KSP Object:          (solver_fieldsplit_1_mg_levels_4_)
1 MPI processes

            type: chebyshev

              Chebyshev: eigenvalue estimates:  min = 0.158979, max =
1.74876

              Chebyshev: eigenvalues estimated using gmres with
translations  [0. 0.1; 0. 1.1]

              KSP Object:
(solver_fieldsplit_1_mg_levels_4_esteig_)               1 MPI processes

                type: gmres

                  GMRES: restart=30, using Classical (unmodified)
Gram-Schmidt Orthogonalization with no iterative refinement

                  GMRES: happy breakdown tolerance 1e-30

                maximum iterations=10, initial guess is zero

                tolerances:  relative=1e-12, absolute=1e-50,
divergence=10000.

                left preconditioning

                using PRECONDITIONED norm type for convergence test

            maximum iterations=2

            tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.

            left preconditioning

            using nonzero initial guess

            using NONE norm type for convergence test

          PC Object:          (solver_fieldsplit_1_mg_levels_4_)
1 MPI processes

            type: sor

              SOR: type = local_symmetric, iterations = 1, local iterations
= 1, omega = 1.

            linear system matrix followed by preconditioner matrix:

            Mat Object:            (solver_fieldsplit_1_)             1 MPI
processes

              type: schurcomplement

              rows=384000, cols=384000

                Schur complement A11 - A10 inv(A00) A01

                A11

                  Mat Object:                  (solver_fieldsplit_1_)
            1 MPI processes

                    type: seqaij

                    rows=384000, cols=384000

                    total: nonzeros=384000, allocated nonzeros=384000

                    total number of mallocs used during MatSetValues calls
=0

                      not using I-node routines

                A10

                  Mat Object:                   1 MPI processes

                    type: seqaij

                    rows=384000, cols=777600

                    total: nonzeros=1919999, allocated nonzeros=1919999

                    total number of mallocs used during MatSetValues calls
=0

                      not using I-node routines

                KSP of A00

                  KSP Object:                  (solver_fieldsplit_0_)
            1 MPI processes

                    type: preonly

                    maximum iterations=10000, initial guess is zero

                    tolerances:  relative=1e-05, absolute=1e-50,
divergence=10000.

                    left preconditioning

                    using NONE norm type for convergence test

                  PC Object:                  (solver_fieldsplit_0_)
            1 MPI processes

                    type: bjacobi

                      block Jacobi: number of blocks = 1

                      Local solve is same for all blocks, in the following
KSP and PC objects:

                      KSP Object:
(solver_fieldsplit_0_sub_)                       1 MPI processes

                        type: preonly

                        maximum iterations=10000, initial guess is zero

                        tolerances:  relative=1e-05, absolute=1e-50,
divergence=10000.

                        left preconditioning

                        using NONE norm type for convergence test

                      PC Object:
(solver_fieldsplit_0_sub_)                       1 MPI processes

                        type: ilu

                          ILU: out-of-place factorization

                          0 levels of fill

                          tolerance for zero pivot 2.22045e-14

                          matrix ordering: natural

                          factor fill ratio given 1., needed 1.

                            Factored matrix follows:

                              Mat Object:                               1
MPI processes

                                type: seqaij

                                rows=777600, cols=777600

                                package used to perform factorization: petsc

                                total: nonzeros=5385600, allocated
nonzeros=5385600

                                total number of mallocs used during
MatSetValues calls =0

                                  not using I-node routines

                        linear system matrix = precond matrix:

                        Mat Object:
(solver_fieldsplit_0_)                         1 MPI processes

                          type: seqaij

                          rows=777600, cols=777600

                          total: nonzeros=5385600, allocated
nonzeros=5385600

                          total number of mallocs used during MatSetValues
calls =0

                            not using I-node routines

                    linear system matrix = precond matrix:

                    Mat Object:                    (solver_fieldsplit_0_)
                  1 MPI processes

                      type: seqaij

                      rows=777600, cols=777600

                      total: nonzeros=5385600, allocated nonzeros=5385600

                      total number of mallocs used during MatSetValues
calls =0

                        not using I-node routines

                A01

                  Mat Object:                   1 MPI processes

                    type: seqaij

                    rows=777600, cols=384000

                    total: nonzeros=1919999, allocated nonzeros=1919999

                    total number of mallocs used during MatSetValues calls
=0

                      not using I-node routines

            Mat Object:             1 MPI processes

              type: seqaij

              rows=384000, cols=384000

              total: nonzeros=3416452, allocated nonzeros=3416452

              total number of mallocs used during MatSetValues calls =0

                not using I-node routines

        Up solver (post-smoother) same as down solver (pre-smoother)

        linear system matrix followed by preconditioner matrix:

        Mat Object:        (solver_fieldsplit_1_)         1 MPI processes

          type: schurcomplement

          rows=384000, cols=384000

            Schur complement A11 - A10 inv(A00) A01

            A11

              Mat Object:              (solver_fieldsplit_1_)
1 MPI processes

                type: seqaij

                rows=384000, cols=384000

                total: nonzeros=384000, allocated nonzeros=384000

                total number of mallocs used during MatSetValues calls =0

                  not using I-node routines

            A10

              Mat Object:               1 MPI processes

                type: seqaij

                rows=384000, cols=777600

                total: nonzeros=1919999, allocated nonzeros=1919999

                total number of mallocs used during MatSetValues calls =0

                  not using I-node routines

            KSP of A00

              KSP Object:              (solver_fieldsplit_0_)
1 MPI processes

                type: preonly

                maximum iterations=10000, initial guess is zero

                tolerances:  relative=1e-05, absolute=1e-50,
divergence=10000.

                left preconditioning

                using NONE norm type for convergence test

              PC Object:              (solver_fieldsplit_0_)
1 MPI processes

                type: bjacobi

                  block Jacobi: number of blocks = 1

                  Local solve is same for all blocks, in the following KSP
and PC objects:

                  KSP Object:                  (solver_fieldsplit_0_sub_)
                1 MPI processes

                    type: preonly

                    maximum iterations=10000, initial guess is zero

                    tolerances:  relative=1e-05, absolute=1e-50,
divergence=10000.

                    left preconditioning

                    using NONE norm type for convergence test

                  PC Object:                  (solver_fieldsplit_0_sub_)
                1 MPI processes

                    type: ilu

                      ILU: out-of-place factorization

                      0 levels of fill

                      tolerance for zero pivot 2.22045e-14

                      matrix ordering: natural

                      factor fill ratio given 1., needed 1.

                        Factored matrix follows:

                          Mat Object:                           1 MPI
processes

                            type: seqaij

                            rows=777600, cols=777600

                            package used to perform factorization: petsc

                            total: nonzeros=5385600, allocated
nonzeros=5385600

                            total number of mallocs used during
MatSetValues calls =0

                              not using I-node routines

                    linear system matrix = precond matrix:

                    Mat Object:                    (solver_fieldsplit_0_)
                  1 MPI processes

                      type: seqaij

                      rows=777600, cols=777600

                      total: nonzeros=5385600, allocated nonzeros=5385600

                      total number of mallocs used during MatSetValues
calls =0

                        not using I-node routines

                linear system matrix = precond matrix:

                Mat Object:                (solver_fieldsplit_0_)
      1 MPI processes

                  type: seqaij

                  rows=777600, cols=777600

                  total: nonzeros=5385600, allocated nonzeros=5385600

                  total number of mallocs used during MatSetValues calls =0

                    not using I-node routines

            A01

              Mat Object:               1 MPI processes

                type: seqaij

                rows=777600, cols=384000

                total: nonzeros=1919999, allocated nonzeros=1919999

                total number of mallocs used during MatSetValues calls =0

                  not using I-node routines

        Mat Object:         1 MPI processes

          type: seqaij

          rows=384000, cols=384000

          total: nonzeros=3416452, allocated nonzeros=3416452

          total number of mallocs used during MatSetValues calls =0

            not using I-node routines

  linear system matrix = precond matrix:

  Mat Object:   1 MPI processes

    type: nest

    rows=1161600, cols=1161600

      Matrix object:

        type=nest, rows=2, cols=2

        MatNest structure:

        (0,0) : prefix="solver_fieldsplit_0_", type=seqaij, rows=777600,
cols=777600

        (0,1) : type=seqaij, rows=777600, cols=384000

        (1,0) : type=seqaij, rows=384000, cols=777600

        (1,1) : prefix="solver_fieldsplit_1_", type=seqaij, rows=384000,
cols=384000

Any insight as to what's happening? By the way, this firedrake/petsc-mapdes build is
from way back in October 2015 (yes, much has changed since then, but
reinstalling/updating firedrake and petsc on LANL's firewalled HPC machines is a big
pain in the ass).

Thanks,
Justin