[petsc-users] Strange GAMG performance for mixed FE formulation
Barry Smith
bsmith at mcs.anl.gov
Wed Mar 2 17:11:35 CST 2016
Justin,
Do you have the -log_summary output for these runs?
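For reference, a minimal sketch of one way to produce it from a Firedrake script (this assumes Firedrake initializes petsc4py from sys.argv and that this PETSc vintage still spells the flag -log_summary; the script name below is hypothetical):

  import sys
  sys.argv.append("-log_summary")  # append before the firedrake import so PETSc sees it at initialization
  from firedrake import *
  # ... rest of the script; then run, e.g.: mpiexec -n 64 python mixed_poisson.py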
Barry
> On Mar 2, 2016, at 4:28 PM, Justin Chang <jychang48 at gmail.com> wrote:
>
> Dear all,
>
> Using the Firedrake project, I am solving this simple mixed Poisson problem:
>
> from firedrake import *
>
> mesh = UnitCubeMesh(40,40,40)
> V = FunctionSpace(mesh,"RT",1)
> Q = FunctionSpace(mesh,"DG",0)
> W = V*Q
>
> v, p = TrialFunctions(W)
> w, q = TestFunctions(W)
>
> f = Function(Q)
> f.interpolate(Expression("12*pi*pi*sin(pi*x[0]*2)*sin(pi*x[1]*2)*sin(2*pi*x[2])"))
>
> a = dot(v,w)*dx - p*div(w)*dx + div(v)*q*dx
> L = f*q*dx
>
> u = Function(W)
> solve(a==L,u,solver_parameters={...})
>
> This problem has 1161600 degrees of freedom. The solver_parameters are (an equivalent dictionary form is sketched after the list):
>
> -ksp_type gmres
> -pc_type fieldsplit
> -pc_fieldsplit_type schur
> -pc_fieldsplit_schur_fact_type upper
> -pc_fieldsplit_schur_precondition selfp
> -fieldsplit_0_ksp_type preonly
> -fieldsplit_0_pc_type bjacobi
> -fieldsplit_1_ksp_type preonly
> -fieldsplit_1_pc_type hypre/ml/gamg
>
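> For reference, a minimal sketch of the same options expressed as a Firedrake solver_parameters dictionary (the -ksp_view output below shows a "solver_" options prefix, so the real run presumably also set an options prefix; "gamg" is swapped for "hypre" or "ml" in the other runs):
>
> parameters = {
>     "ksp_type": "gmres",
>     "pc_type": "fieldsplit",
>     "pc_fieldsplit_type": "schur",
>     "pc_fieldsplit_schur_fact_type": "upper",
>     "pc_fieldsplit_schur_precondition": "selfp",
>     "fieldsplit_0_ksp_type": "preonly",
>     "fieldsplit_0_pc_type": "bjacobi",
>     "fieldsplit_1_ksp_type": "preonly",
>     "fieldsplit_1_pc_type": "gamg",
> }
> solve(a == L, u, solver_parameters=parameters)
>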
> For the last option, I compared the wall-clock timings for hypre, ml, and gamg. Here are the strong-scaling results (up to 64 cores, 8 cores per Intel Xeon E5-2670 node) for each; a quick speedup check on these numbers follows the tables:
>
> hypre:
> 1 core: 47.5 s, 12 solver iters
> 2 cores: 34.1 s, 15 solver iters
> 4 cores: 21.5 s, 15 solver iters
> 8 cores: 16.6 s, 15 solver iters
> 16 cores: 10.2 s, 15 solver iters
> 24 cores: 7.66 s, 15 solver iters
> 32 cores: 6.31 s, 15 solver iters
> 40 cores: 5.68 s, 15 solver iters
> 48 cores: 5.36 s, 16 solver iters
> 56 cores: 5.12 s, 16 solver iters
> 64 cores: 4.99 s, 16 solver iters
>
> ml:
> 1 core: 4.44 s, 14 solver iters
> 2 cores: 2.85 s, 16 solver iters
> 4 cores: 1.6 s, 17 solver iters
> 8 cores: 0.966 s, 17 solver iters
> 16 cores: 0.585 s, 18 solver iters
> 24 cores: 0.440 s, 18 solver iters
> 32 cores: 0.375 s, 18 solver iters
> 40 cores: 0.332 s, 18 solver iters
> 48 cores: 0.307 s, 17 solver iters
> 56 cores: 0.290 s, 18 solver iters
> 64 cores: 0.281 s, 18 solver iters
>
> gamg:
> 1 core: 613 s, 12 solver iters
> 2 cores: 204 s, 15 solver iters
> 4 cores: 77.1 s, 15 solver iters
> 8 cores: 38.1 s, 15 solver iters
> 16 cores: 15.9 s, 16 solver iters
> 24 cores: 9.24 s, 16 solver iters
> 32 cores: 5.92 s, 16 solver iters
> 40 cores: 4.72 s, 16 solver iters
> 48 cores: 3.89 s, 16 solver iters
> 56 cores: 3.65 s, 16 solver iters
> 64 cores: 3.46 s, 16 solver iters
>
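> A quick arithmetic check on the timings above (just a sketch recomputing the 64-core speedups from the numbers as listed):
>
> # 64-core speedup relative to the 1-core run for each preconditioner
> for name, t1, t64 in [("hypre", 47.5, 4.99), ("ml", 4.44, 0.281), ("gamg", 613.0, 3.46)]:
>     print(name, "64-core speedup:", round(t1 / t64, 1))
> # prints hypre ~9.5x, ml ~15.8x, gamg ~177.2x, i.e. gamg is far beyond the ideal 64x
>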
> The performance difference between ML and HYPRE makes sense to me, but what I am really confused about is GAMG. It seems GAMG is really slow on a single core, but something internal is causing it to speed up super-linearly as I increase the number of MPI processes. Shouldn't ML and GAMG have roughly the same performance? I am not sure what log outputs to give you, but for starters, below is the -ksp_view output for the single-core case with GAMG:
>
> KSP Object:(solver_) 1 MPI processes
> type: gmres
> GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
> GMRES: happy breakdown tolerance 1e-30
> maximum iterations=10000, initial guess is zero
> tolerances: relative=1e-07, absolute=1e-50, divergence=10000.
> left preconditioning
> using PRECONDITIONED norm type for convergence test
> PC Object:(solver_) 1 MPI processes
> type: fieldsplit
> FieldSplit with Schur preconditioner, factorization UPPER
> Preconditioner for the Schur complement formed from Sp, an assembled approximation to S, which uses (lumped, if requested) A00's diagonal's inverse
> Split info:
> Split number 0 Defined by IS
> Split number 1 Defined by IS
> KSP solver for A00 block
> KSP Object: (solver_fieldsplit_0_) 1 MPI processes
> type: preonly
> maximum iterations=10000, initial guess is zero
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
> left preconditioning
> using NONE norm type for convergence test
> PC Object: (solver_fieldsplit_0_) 1 MPI processes
> type: bjacobi
> block Jacobi: number of blocks = 1
> Local solve is same for all blocks, in the following KSP and PC objects:
> KSP Object: (solver_fieldsplit_0_sub_) 1 MPI processes
> type: preonly
> maximum iterations=10000, initial guess is zero
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
> left preconditioning
> using NONE norm type for convergence test
> PC Object: (solver_fieldsplit_0_sub_) 1 MPI processes
> type: ilu
> ILU: out-of-place factorization
> 0 levels of fill
> tolerance for zero pivot 2.22045e-14
> matrix ordering: natural
> factor fill ratio given 1., needed 1.
> Factored matrix follows:
> Mat Object: 1 MPI processes
> type: seqaij
> rows=777600, cols=777600
> package used to perform factorization: petsc
> total: nonzeros=5385600, allocated nonzeros=5385600
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> linear system matrix = precond matrix:
> Mat Object: (solver_fieldsplit_0_) 1 MPI processes
> type: seqaij
> rows=777600, cols=777600
> total: nonzeros=5385600, allocated nonzeros=5385600
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> linear system matrix = precond matrix:
> Mat Object: (solver_fieldsplit_0_) 1 MPI processes
> type: seqaij
> rows=777600, cols=777600
> total: nonzeros=5385600, allocated nonzeros=5385600
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> KSP solver for S = A11 - A10 inv(A00) A01
> KSP Object: (solver_fieldsplit_1_) 1 MPI processes
> type: preonly
> maximum iterations=10000, initial guess is zero
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
> left preconditioning
> using NONE norm type for convergence test
> PC Object: (solver_fieldsplit_1_) 1 MPI processes
> type: gamg
> MG: type is MULTIPLICATIVE, levels=5 cycles=v
> Cycles per PCApply=1
> Using Galerkin computed coarse grid matrices
> GAMG specific options
> Threshold for dropping small values from graph 0.
> AGG specific options
> Symmetric graph false
> Coarse grid solver -- level -------------------------------
> KSP Object: (solver_fieldsplit_1_mg_coarse_) 1 MPI processes
> type: preonly
> maximum iterations=1, initial guess is zero
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
> left preconditioning
> using NONE norm type for convergence test
> PC Object: (solver_fieldsplit_1_mg_coarse_) 1 MPI processes
> type: bjacobi
> block Jacobi: number of blocks = 1
> Local solve is same for all blocks, in the following KSP and PC objects:
> KSP Object: (solver_fieldsplit_1_mg_coarse_sub_) 1 MPI processes
> type: preonly
> maximum iterations=1, initial guess is zero
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
> left preconditioning
> using NONE norm type for convergence test
> PC Object: (solver_fieldsplit_1_mg_coarse_sub_) 1 MPI processes
> type: lu
> LU: out-of-place factorization
> tolerance for zero pivot 2.22045e-14
> using diagonal shift on blocks to prevent zero pivot [INBLOCKS]
> matrix ordering: nd
> factor fill ratio given 5., needed 1.
> Factored matrix follows:
> Mat Object: 1 MPI processes
> type: seqaij
> rows=9, cols=9
> package used to perform factorization: petsc
> total: nonzeros=81, allocated nonzeros=81
> total number of mallocs used during MatSetValues calls =0
> using I-node routines: found 2 nodes, limit used is 5
> linear system matrix = precond matrix:
> Mat Object: 1 MPI processes
> type: seqaij
> rows=9, cols=9
> total: nonzeros=81, allocated nonzeros=81
> total number of mallocs used during MatSetValues calls =0
> using I-node routines: found 2 nodes, limit used is 5
> linear system matrix = precond matrix:
> Mat Object: 1 MPI processes
> type: seqaij
> rows=9, cols=9
> total: nonzeros=81, allocated nonzeros=81
> total number of mallocs used during MatSetValues calls =0
> using I-node routines: found 2 nodes, limit used is 5
> Down solver (pre-smoother) on level 1 -------------------------------
> KSP Object: (solver_fieldsplit_1_mg_levels_1_) 1 MPI processes
> type: chebyshev
> Chebyshev: eigenvalue estimates: min = 0.0999525, max = 1.09948
> Chebyshev: eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1]
> KSP Object: (solver_fieldsplit_1_mg_levels_1_esteig_) 1 MPI processes
> type: gmres
> GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
> GMRES: happy breakdown tolerance 1e-30
> maximum iterations=10, initial guess is zero
> tolerances: relative=1e-12, absolute=1e-50, divergence=10000.
> left preconditioning
> using PRECONDITIONED norm type for convergence test
> maximum iterations=2
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
> left preconditioning
> using nonzero initial guess
> using NONE norm type for convergence test
> PC Object: (solver_fieldsplit_1_mg_levels_1_) 1 MPI processes
> type: sor
> SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
> linear system matrix = precond matrix:
> Mat Object: 1 MPI processes
> type: seqaij
> rows=207, cols=207
> total: nonzeros=42849, allocated nonzeros=42849
> total number of mallocs used during MatSetValues calls =0
> using I-node routines: found 42 nodes, limit used is 5
> Up solver (post-smoother) same as down solver (pre-smoother)
> Down solver (pre-smoother) on level 2 -------------------------------
> KSP Object: (solver_fieldsplit_1_mg_levels_2_) 1 MPI processes
> type: chebyshev
> Chebyshev: eigenvalue estimates: min = 0.0996628, max = 1.09629
> Chebyshev: eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1]
> KSP Object: (solver_fieldsplit_1_mg_levels_2_esteig_) 1 MPI processes
> type: gmres
> GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
> GMRES: happy breakdown tolerance 1e-30
> maximum iterations=10, initial guess is zero
> tolerances: relative=1e-12, absolute=1e-50, divergence=10000.
> left preconditioning
> using PRECONDITIONED norm type for convergence test
> maximum iterations=2
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
> left preconditioning
> using nonzero initial guess
> using NONE norm type for convergence test
> PC Object: (solver_fieldsplit_1_mg_levels_2_) 1 MPI processes
> type: sor
> SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
> linear system matrix = precond matrix:
> Mat Object: 1 MPI processes
> type: seqaij
> rows=5373, cols=5373
> total: nonzeros=28852043, allocated nonzeros=28852043
> total number of mallocs used during MatSetValues calls =0
> using I-node routines: found 1481 nodes, limit used is 5
> Up solver (post-smoother) same as down solver (pre-smoother)
> Down solver (pre-smoother) on level 3 -------------------------------
> KSP Object: (solver_fieldsplit_1_mg_levels_3_) 1 MPI processes
> type: chebyshev
> Chebyshev: eigenvalue estimates: min = 0.0994294, max = 1.09372
> Chebyshev: eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1]
> KSP Object: (solver_fieldsplit_1_mg_levels_3_esteig_) 1 MPI processes
> type: gmres
> GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
> GMRES: happy breakdown tolerance 1e-30
> maximum iterations=10, initial guess is zero
> tolerances: relative=1e-12, absolute=1e-50, divergence=10000.
> left preconditioning
> using PRECONDITIONED norm type for convergence test
> maximum iterations=2
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
> left preconditioning
> using nonzero initial guess
> using NONE norm type for convergence test
> PC Object: (solver_fieldsplit_1_mg_levels_3_) 1 MPI processes
> type: sor
> SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
> linear system matrix = precond matrix:
> Mat Object: 1 MPI processes
> type: seqaij
> rows=52147, cols=52147
> total: nonzeros=38604909, allocated nonzeros=38604909
> total number of mallocs used during MatSetValues calls =2
> not using I-node routines
> Up solver (post-smoother) same as down solver (pre-smoother)
> Down solver (pre-smoother) on level 4 -------------------------------
> KSP Object: (solver_fieldsplit_1_mg_levels_4_) 1 MPI processes
> type: chebyshev
> Chebyshev: eigenvalue estimates: min = 0.158979, max = 1.74876
> Chebyshev: eigenvalues estimated using gmres with translations [0. 0.1; 0. 1.1]
> KSP Object: (solver_fieldsplit_1_mg_levels_4_esteig_) 1 MPI processes
> type: gmres
> GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
> GMRES: happy breakdown tolerance 1e-30
> maximum iterations=10, initial guess is zero
> tolerances: relative=1e-12, absolute=1e-50, divergence=10000.
> left preconditioning
> using PRECONDITIONED norm type for convergence test
> maximum iterations=2
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
> left preconditioning
> using nonzero initial guess
> using NONE norm type for convergence test
> PC Object: (solver_fieldsplit_1_mg_levels_4_) 1 MPI processes
> type: sor
> SOR: type = local_symmetric, iterations = 1, local iterations = 1, omega = 1.
> linear system matrix followed by preconditioner matrix:
> Mat Object: (solver_fieldsplit_1_) 1 MPI processes
> type: schurcomplement
> rows=384000, cols=384000
> Schur complement A11 - A10 inv(A00) A01
> A11
> Mat Object: (solver_fieldsplit_1_) 1 MPI processes
> type: seqaij
> rows=384000, cols=384000
> total: nonzeros=384000, allocated nonzeros=384000
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> A10
> Mat Object: 1 MPI processes
> type: seqaij
> rows=384000, cols=777600
> total: nonzeros=1919999, allocated nonzeros=1919999
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> KSP of A00
> KSP Object: (solver_fieldsplit_0_) 1 MPI processes
> type: preonly
> maximum iterations=10000, initial guess is zero
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
> left preconditioning
> using NONE norm type for convergence test
> PC Object: (solver_fieldsplit_0_) 1 MPI processes
> type: bjacobi
> block Jacobi: number of blocks = 1
> Local solve is same for all blocks, in the following KSP and PC objects:
> KSP Object: (solver_fieldsplit_0_sub_) 1 MPI processes
> type: preonly
> maximum iterations=10000, initial guess is zero
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
> left preconditioning
> using NONE norm type for convergence test
> PC Object: (solver_fieldsplit_0_sub_) 1 MPI processes
> type: ilu
> ILU: out-of-place factorization
> 0 levels of fill
> tolerance for zero pivot 2.22045e-14
> matrix ordering: natural
> factor fill ratio given 1., needed 1.
> Factored matrix follows:
> Mat Object: 1 MPI processes
> type: seqaij
> rows=777600, cols=777600
> package used to perform factorization: petsc
> total: nonzeros=5385600, allocated nonzeros=5385600
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> linear system matrix = precond matrix:
> Mat Object: (solver_fieldsplit_0_) 1 MPI processes
> type: seqaij
> rows=777600, cols=777600
> total: nonzeros=5385600, allocated nonzeros=5385600
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> linear system matrix = precond matrix:
> Mat Object: (solver_fieldsplit_0_) 1 MPI processes
> type: seqaij
> rows=777600, cols=777600
> total: nonzeros=5385600, allocated nonzeros=5385600
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> A01
> Mat Object: 1 MPI processes
> type: seqaij
> rows=777600, cols=384000
> total: nonzeros=1919999, allocated nonzeros=1919999
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> Mat Object: 1 MPI processes
> type: seqaij
> rows=384000, cols=384000
> total: nonzeros=3416452, allocated nonzeros=3416452
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> Up solver (post-smoother) same as down solver (pre-smoother)
> linear system matrix followed by preconditioner matrix:
> Mat Object: (solver_fieldsplit_1_) 1 MPI processes
> type: schurcomplement
> rows=384000, cols=384000
> Schur complement A11 - A10 inv(A00) A01
> A11
> Mat Object: (solver_fieldsplit_1_) 1 MPI processes
> type: seqaij
> rows=384000, cols=384000
> total: nonzeros=384000, allocated nonzeros=384000
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> A10
> Mat Object: 1 MPI processes
> type: seqaij
> rows=384000, cols=777600
> total: nonzeros=1919999, allocated nonzeros=1919999
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> KSP of A00
> KSP Object: (solver_fieldsplit_0_) 1 MPI processes
> type: preonly
> maximum iterations=10000, initial guess is zero
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
> left preconditioning
> using NONE norm type for convergence test
> PC Object: (solver_fieldsplit_0_) 1 MPI processes
> type: bjacobi
> block Jacobi: number of blocks = 1
> Local solve is same for all blocks, in the following KSP and PC objects:
> KSP Object: (solver_fieldsplit_0_sub_) 1 MPI processes
> type: preonly
> maximum iterations=10000, initial guess is zero
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
> left preconditioning
> using NONE norm type for convergence test
> PC Object: (solver_fieldsplit_0_sub_) 1 MPI processes
> type: ilu
> ILU: out-of-place factorization
> 0 levels of fill
> tolerance for zero pivot 2.22045e-14
> matrix ordering: natural
> factor fill ratio given 1., needed 1.
> Factored matrix follows:
> Mat Object: 1 MPI processes
> type: seqaij
> rows=777600, cols=777600
> package used to perform factorization: petsc
> total: nonzeros=5385600, allocated nonzeros=5385600
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> linear system matrix = precond matrix:
> Mat Object: (solver_fieldsplit_0_) 1 MPI processes
> type: seqaij
> rows=777600, cols=777600
> total: nonzeros=5385600, allocated nonzeros=5385600
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> linear system matrix = precond matrix:
> Mat Object: (solver_fieldsplit_0_) 1 MPI processes
> type: seqaij
> rows=777600, cols=777600
> total: nonzeros=5385600, allocated nonzeros=5385600
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> A01
> Mat Object: 1 MPI processes
> type: seqaij
> rows=777600, cols=384000
> total: nonzeros=1919999, allocated nonzeros=1919999
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> Mat Object: 1 MPI processes
> type: seqaij
> rows=384000, cols=384000
> total: nonzeros=3416452, allocated nonzeros=3416452
> total number of mallocs used during MatSetValues calls =0
> not using I-node routines
> linear system matrix = precond matrix:
> Mat Object: 1 MPI processes
> type: nest
> rows=1161600, cols=1161600
> Matrix object:
> type=nest, rows=2, cols=2
> MatNest structure:
> (0,0) : prefix="solver_fieldsplit_0_", type=seqaij, rows=777600, cols=777600
> (0,1) : type=seqaij, rows=777600, cols=384000
> (1,0) : type=seqaij, rows=384000, cols=777600
> (1,1) : prefix="solver_fieldsplit_1_", type=seqaij, rows=384000, cols=384000
>
> Any insight as to what's happening? By the way, this firedrake/petsc-mapdes install is from way back in October 2015 (yes, much has changed since, but reinstalling/updating firedrake and petsc on LANL's firewalled HPC machines is a big pain in the ass).
>
> Thanks,
> Justin