[petsc-users] Strange GAMG performance for mixed FE formulation
Justin Chang
jychang48 at gmail.com
Wed Mar 2 16:28:49 CST 2016
Dear all,
Using the Firedrake project, I am solving this simple mixed Poisson problem:

from firedrake import *

mesh = UnitCubeMesh(40, 40, 40)

# Lowest-order Raviart-Thomas / DG0 mixed space
V = FunctionSpace(mesh, "RT", 1)
Q = FunctionSpace(mesh, "DG", 0)
W = V*Q

v, p = TrialFunctions(W)
w, q = TestFunctions(W)

# Manufactured forcing term
f = Function(Q)
f.interpolate(Expression("12*pi*pi*sin(pi*x[0]*2)*sin(pi*x[1]*2)*sin(2*pi*x[2])"))

# Mixed weak form and linear solve
a = dot(v, w)*dx - p*div(w)*dx + div(v)*q*dx
L = f*q*dx
u = Function(W)
solve(a == L, u, solver_parameters={...})
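In conventional notation, the forms above are just the standard dual (mixed) formulation of the Poisson problem: find (v, p) in RT1 x DG0 such that

\[
\begin{aligned}
(v, w) - (p, \nabla\cdot w) &= 0      && \forall\, w \in RT_1,\\
(\nabla\cdot v, q)          &= (f, q) && \forall\, q \in DG_0.
\end{aligned}
\]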
This problem has 1161600 degrees of freedom. The solver_parameters are:
-ksp_type gmres
-pc_type fieldsplit
-pc_fieldsplit_type schur
-pc_fieldsplit_schur_fact_type upper
-pc_fieldsplit_schur_precondition selfp
-fieldsplit_0_ksp_type preonly
-fieldsplit_0_pc_type bjacobi
-fieldsplit_1_ksp_type preonly
-fieldsplit_1_pc_type hypre/ml/gamg
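These options go in through Firedrake's solver_parameters dict; a minimal sketch (the amg_type variable is just illustrative shorthand for whichever package is being tested) looks like:

amg_type = "gamg"  # or "hypre" / "ml"; the only thing varied between runs

parameters = {
    "ksp_type": "gmres",
    "pc_type": "fieldsplit",
    "pc_fieldsplit_type": "schur",
    "pc_fieldsplit_schur_fact_type": "upper",
    "pc_fieldsplit_schur_precondition": "selfp",
    "fieldsplit_0_ksp_type": "preonly",
    "fieldsplit_0_pc_type": "bjacobi",
    "fieldsplit_1_ksp_type": "preonly",
    "fieldsplit_1_pc_type": amg_type,
}

solve(a == L, u, solver_parameters=parameters)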
For the last option, I compared wall-clock timings for hypre, ml, and gamg. Here are the strong-scaling results, up to 64 cores (8 cores per Intel Xeon E5-2670 node):
hypre:
1 core: 47.5 s, 12 solver iters
2 cores: 34.1 s, 15 solver iters
4 cores: 21.5 s, 15 solver iters
8 cores: 16.6 s, 15 solver iters
16 cores: 10.2 s, 15 solver iters
24 cores: 7.66 s, 15 solver iters
32 cores: 6.31 s, 15 solver iters
40 cores: 5.68 s, 15 solver iters
48 cores: 5.36 s, 16 solver iters
56 cores: 5.12 s, 16 solver iters
64 cores: 4.99 s, 16 solver iters
ml:
1 core: 4.44 s, 14 solver iters
2 cores: 2.85 s, 16 solver iters
4 cores: 1.6 s, 17 solver iters
8 cores: 0.966 s, 17 solver iters
16 cores: 0.585 s, 18 solver iters
24 cores: 0.440 s, 18 solver iters
32 cores: 0.375 s, 18 solver iters
40 cores: 0.332 s, 18 solver iters
48 cores: 0.307 s, 17 solver iters
56 cores: 0.290 s, 18 solver iters
64 cores: 0.281 s, 18 solver iters
gamg:
1 core: 613 s, 12 solver iters
2 cores: 204 s, 15 solver iters
4 cores: 77.1 s, 15 solver iters
8 cores: 38.1 s, 15 solver iters
16 cores: 15.9 s, 16 solver iters
24 cores: 9.24 s, 16 solver iters
32 cores: 5.92 s, 16 solver iters
40 cores: 4.72 s, 16 solver iters
48 cores: 3.89 s, 16 solver iters
56 cores: 3.65 s, 16 solver iters
64 cores: 3.46 s, 16 solver iters
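As a quick sanity check on the scaling, dividing each package's 1-core time by its 64-core time gives:

# Speedup at 64 cores relative to 1 core, using the timings listed above
timings = {"hypre": (47.5, 4.99), "ml": (4.44, 0.281), "gamg": (613.0, 3.46)}
for name, (t1, t64) in timings.items():
    print(f"{name}: {t1 / t64:.0f}x speedup on 64 cores (ideal: 64x)")
# hypre: ~10x, ml: ~16x, gamg: ~177x -- far beyond ideal, i.e. super-linear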
The performance difference between ML and HYPRE makes sense to me, but what really confuses me is GAMG. It is extremely slow on a single core, yet something internal causes it to speed up super-linearly as I increase the number of MPI processes. Shouldn't ML and GAMG give comparable performance? I am not sure which log outputs to send you, but for starters, below is the -ksp_view output for the single-core case with GAMG:
KSP Object:(solver_) 1 MPI processes
type: gmres
GMRES: restart=30, using Classical (unmodified) Gram-Schmidt
Orthogonalization with no iterative refinement
GMRES: happy breakdown tolerance 1e-30
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-07, absolute=1e-50, divergence=10000.
left preconditioning
using PRECONDITIONED norm type for convergence test
PC Object:(solver_) 1 MPI processes
type: fieldsplit
FieldSplit with Schur preconditioner, factorization UPPER
Preconditioner for the Schur complement formed from Sp, an assembled
approximation to S, which uses (lumped, if requested) A00's diagonal's
inverse
Split info:
Split number 0 Defined by IS
Split number 1 Defined by IS
KSP solver for A00 block
KSP Object: (solver_fieldsplit_0_) 1 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (solver_fieldsplit_0_) 1 MPI processes
type: bjacobi
block Jacobi: number of blocks = 1
Local solve is same for all blocks, in the following KSP and PC
objects:
KSP Object: (solver_fieldsplit_0_sub_) 1 MPI
processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (solver_fieldsplit_0_sub_) 1 MPI
processes
type: ilu
ILU: out-of-place factorization
0 levels of fill
tolerance for zero pivot 2.22045e-14
matrix ordering: natural
factor fill ratio given 1., needed 1.
Factored matrix follows:
Mat Object: 1 MPI processes
type: seqaij
rows=777600, cols=777600
package used to perform factorization: petsc
total: nonzeros=5385600, allocated nonzeros=5385600
total number of mallocs used during MatSetValues calls
=0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: (solver_fieldsplit_0_) 1 MPI
processes
type: seqaij
rows=777600, cols=777600
total: nonzeros=5385600, allocated nonzeros=5385600
total number of mallocs used during MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: (solver_fieldsplit_0_) 1 MPI processes
type: seqaij
rows=777600, cols=777600
total: nonzeros=5385600, allocated nonzeros=5385600
total number of mallocs used during MatSetValues calls =0
not using I-node routines
KSP solver for S = A11 - A10 inv(A00) A01
KSP Object: (solver_fieldsplit_1_) 1 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (solver_fieldsplit_1_) 1 MPI processes
type: gamg
MG: type is MULTIPLICATIVE, levels=5 cycles=v
Cycles per PCApply=1
Using Galerkin computed coarse grid matrices
GAMG specific options
Threshold for dropping small values from graph 0.
AGG specific options
Symmetric graph false
Coarse grid solver -- level -------------------------------
KSP Object: (solver_fieldsplit_1_mg_coarse_) 1
MPI processes
type: preonly
maximum iterations=1, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (solver_fieldsplit_1_mg_coarse_) 1
MPI processes
type: bjacobi
block Jacobi: number of blocks = 1
Local solve is same for all blocks, in the following KSP and
PC objects:
KSP Object: (solver_fieldsplit_1_mg_coarse_sub_)
1 MPI processes
type: preonly
maximum iterations=1, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50,
divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (solver_fieldsplit_1_mg_coarse_sub_)
1 MPI processes
type: lu
LU: out-of-place factorization
tolerance for zero pivot 2.22045e-14
using diagonal shift on blocks to prevent zero pivot
[INBLOCKS]
matrix ordering: nd
factor fill ratio given 5., needed 1.
Factored matrix follows:
Mat Object: 1 MPI processes
type: seqaij
rows=9, cols=9
package used to perform factorization: petsc
total: nonzeros=81, allocated nonzeros=81
total number of mallocs used during MatSetValues
calls =0
using I-node routines: found 2 nodes, limit used
is 5
linear system matrix = precond matrix:
Mat Object: 1 MPI processes
type: seqaij
rows=9, cols=9
total: nonzeros=81, allocated nonzeros=81
total number of mallocs used during MatSetValues calls =0
using I-node routines: found 2 nodes, limit used is 5
linear system matrix = precond matrix:
Mat Object: 1 MPI processes
type: seqaij
rows=9, cols=9
total: nonzeros=81, allocated nonzeros=81
total number of mallocs used during MatSetValues calls =0
using I-node routines: found 2 nodes, limit used is 5
Down solver (pre-smoother) on level 1
-------------------------------
KSP Object: (solver_fieldsplit_1_mg_levels_1_)
1 MPI processes
type: chebyshev
Chebyshev: eigenvalue estimates: min = 0.0999525, max =
1.09948
Chebyshev: eigenvalues estimated using gmres with
translations [0. 0.1; 0. 1.1]
KSP Object:
(solver_fieldsplit_1_mg_levels_1_esteig_) 1 MPI processes
type: gmres
GMRES: restart=30, using Classical (unmodified)
Gram-Schmidt Orthogonalization with no iterative refinement
GMRES: happy breakdown tolerance 1e-30
maximum iterations=10, initial guess is zero
tolerances: relative=1e-12, absolute=1e-50,
divergence=10000.
left preconditioning
using PRECONDITIONED norm type for convergence test
maximum iterations=2
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (solver_fieldsplit_1_mg_levels_1_)
1 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations
= 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 1 MPI processes
type: seqaij
rows=207, cols=207
total: nonzeros=42849, allocated nonzeros=42849
total number of mallocs used during MatSetValues calls =0
using I-node routines: found 42 nodes, limit used is 5
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 2
-------------------------------
KSP Object: (solver_fieldsplit_1_mg_levels_2_)
1 MPI processes
type: chebyshev
Chebyshev: eigenvalue estimates: min = 0.0996628, max =
1.09629
Chebyshev: eigenvalues estimated using gmres with
translations [0. 0.1; 0. 1.1]
KSP Object:
(solver_fieldsplit_1_mg_levels_2_esteig_) 1 MPI processes
type: gmres
GMRES: restart=30, using Classical (unmodified)
Gram-Schmidt Orthogonalization with no iterative refinement
GMRES: happy breakdown tolerance 1e-30
maximum iterations=10, initial guess is zero
tolerances: relative=1e-12, absolute=1e-50,
divergence=10000.
left preconditioning
using PRECONDITIONED norm type for convergence test
maximum iterations=2
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (solver_fieldsplit_1_mg_levels_2_)
1 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations
= 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 1 MPI processes
type: seqaij
rows=5373, cols=5373
total: nonzeros=28852043, allocated nonzeros=28852043
total number of mallocs used during MatSetValues calls =0
using I-node routines: found 1481 nodes, limit used is 5
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 3
-------------------------------
KSP Object: (solver_fieldsplit_1_mg_levels_3_)
1 MPI processes
type: chebyshev
Chebyshev: eigenvalue estimates: min = 0.0994294, max =
1.09372
Chebyshev: eigenvalues estimated using gmres with
translations [0. 0.1; 0. 1.1]
KSP Object:
(solver_fieldsplit_1_mg_levels_3_esteig_) 1 MPI processes
type: gmres
GMRES: restart=30, using Classical (unmodified)
Gram-Schmidt Orthogonalization with no iterative refinement
GMRES: happy breakdown tolerance 1e-30
maximum iterations=10, initial guess is zero
tolerances: relative=1e-12, absolute=1e-50,
divergence=10000.
left preconditioning
using PRECONDITIONED norm type for convergence test
maximum iterations=2
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (solver_fieldsplit_1_mg_levels_3_)
1 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations
= 1, omega = 1.
linear system matrix = precond matrix:
Mat Object: 1 MPI processes
type: seqaij
rows=52147, cols=52147
total: nonzeros=38604909, allocated nonzeros=38604909
total number of mallocs used during MatSetValues calls =2
not using I-node routines
Up solver (post-smoother) same as down solver (pre-smoother)
Down solver (pre-smoother) on level 4
-------------------------------
KSP Object: (solver_fieldsplit_1_mg_levels_4_)
1 MPI processes
type: chebyshev
Chebyshev: eigenvalue estimates: min = 0.158979, max =
1.74876
Chebyshev: eigenvalues estimated using gmres with
translations [0. 0.1; 0. 1.1]
KSP Object:
(solver_fieldsplit_1_mg_levels_4_esteig_) 1 MPI processes
type: gmres
GMRES: restart=30, using Classical (unmodified)
Gram-Schmidt Orthogonalization with no iterative refinement
GMRES: happy breakdown tolerance 1e-30
maximum iterations=10, initial guess is zero
tolerances: relative=1e-12, absolute=1e-50,
divergence=10000.
left preconditioning
using PRECONDITIONED norm type for convergence test
maximum iterations=2
tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
left preconditioning
using nonzero initial guess
using NONE norm type for convergence test
PC Object: (solver_fieldsplit_1_mg_levels_4_)
1 MPI processes
type: sor
SOR: type = local_symmetric, iterations = 1, local iterations
= 1, omega = 1.
linear system matrix followed by preconditioner matrix:
Mat Object: (solver_fieldsplit_1_) 1 MPI
processes
type: schurcomplement
rows=384000, cols=384000
Schur complement A11 - A10 inv(A00) A01
A11
Mat Object: (solver_fieldsplit_1_)
1 MPI processes
type: seqaij
rows=384000, cols=384000
total: nonzeros=384000, allocated nonzeros=384000
total number of mallocs used during MatSetValues calls
=0
not using I-node routines
A10
Mat Object: 1 MPI processes
type: seqaij
rows=384000, cols=777600
total: nonzeros=1919999, allocated nonzeros=1919999
total number of mallocs used during MatSetValues calls
=0
not using I-node routines
KSP of A00
KSP Object: (solver_fieldsplit_0_)
1 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50,
divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (solver_fieldsplit_0_)
1 MPI processes
type: bjacobi
block Jacobi: number of blocks = 1
Local solve is same for all blocks, in the following
KSP and PC objects:
KSP Object:
(solver_fieldsplit_0_sub_) 1 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50,
divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object:
(solver_fieldsplit_0_sub_) 1 MPI processes
type: ilu
ILU: out-of-place factorization
0 levels of fill
tolerance for zero pivot 2.22045e-14
matrix ordering: natural
factor fill ratio given 1., needed 1.
Factored matrix follows:
Mat Object: 1
MPI processes
type: seqaij
rows=777600, cols=777600
package used to perform factorization: petsc
total: nonzeros=5385600, allocated
nonzeros=5385600
total number of mallocs used during
MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object:
(solver_fieldsplit_0_) 1 MPI processes
type: seqaij
rows=777600, cols=777600
total: nonzeros=5385600, allocated
nonzeros=5385600
total number of mallocs used during MatSetValues
calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: (solver_fieldsplit_0_)
1 MPI processes
type: seqaij
rows=777600, cols=777600
total: nonzeros=5385600, allocated nonzeros=5385600
total number of mallocs used during MatSetValues
calls =0
not using I-node routines
A01
Mat Object: 1 MPI processes
type: seqaij
rows=777600, cols=384000
total: nonzeros=1919999, allocated nonzeros=1919999
total number of mallocs used during MatSetValues calls
=0
not using I-node routines
Mat Object: 1 MPI processes
type: seqaij
rows=384000, cols=384000
total: nonzeros=3416452, allocated nonzeros=3416452
total number of mallocs used during MatSetValues calls =0
not using I-node routines
Up solver (post-smoother) same as down solver (pre-smoother)
linear system matrix followed by preconditioner matrix:
Mat Object: (solver_fieldsplit_1_) 1 MPI processes
type: schurcomplement
rows=384000, cols=384000
Schur complement A11 - A10 inv(A00) A01
A11
Mat Object: (solver_fieldsplit_1_)
1 MPI processes
type: seqaij
rows=384000, cols=384000
total: nonzeros=384000, allocated nonzeros=384000
total number of mallocs used during MatSetValues calls =0
not using I-node routines
A10
Mat Object: 1 MPI processes
type: seqaij
rows=384000, cols=777600
total: nonzeros=1919999, allocated nonzeros=1919999
total number of mallocs used during MatSetValues calls =0
not using I-node routines
KSP of A00
KSP Object: (solver_fieldsplit_0_)
1 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50,
divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (solver_fieldsplit_0_)
1 MPI processes
type: bjacobi
block Jacobi: number of blocks = 1
Local solve is same for all blocks, in the following KSP
and PC objects:
KSP Object: (solver_fieldsplit_0_sub_)
1 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50,
divergence=10000.
left preconditioning
using NONE norm type for convergence test
PC Object: (solver_fieldsplit_0_sub_)
1 MPI processes
type: ilu
ILU: out-of-place factorization
0 levels of fill
tolerance for zero pivot 2.22045e-14
matrix ordering: natural
factor fill ratio given 1., needed 1.
Factored matrix follows:
Mat Object: 1 MPI
processes
type: seqaij
rows=777600, cols=777600
package used to perform factorization: petsc
total: nonzeros=5385600, allocated
nonzeros=5385600
total number of mallocs used during
MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: (solver_fieldsplit_0_)
1 MPI processes
type: seqaij
rows=777600, cols=777600
total: nonzeros=5385600, allocated nonzeros=5385600
total number of mallocs used during MatSetValues
calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: (solver_fieldsplit_0_)
1 MPI processes
type: seqaij
rows=777600, cols=777600
total: nonzeros=5385600, allocated nonzeros=5385600
total number of mallocs used during MatSetValues calls =0
not using I-node routines
A01
Mat Object: 1 MPI processes
type: seqaij
rows=777600, cols=384000
total: nonzeros=1919999, allocated nonzeros=1919999
total number of mallocs used during MatSetValues calls =0
not using I-node routines
Mat Object: 1 MPI processes
type: seqaij
rows=384000, cols=384000
total: nonzeros=3416452, allocated nonzeros=3416452
total number of mallocs used during MatSetValues calls =0
not using I-node routines
linear system matrix = precond matrix:
Mat Object: 1 MPI processes
type: nest
rows=1161600, cols=1161600
Matrix object:
type=nest, rows=2, cols=2
MatNest structure:
(0,0) : prefix="solver_fieldsplit_0_", type=seqaij, rows=777600,
cols=777600
(0,1) : type=seqaij, rows=777600, cols=384000
(1,0) : type=seqaij, rows=384000, cols=777600
(1,1) : prefix="solver_fieldsplit_1_", type=seqaij, rows=384000,
cols=384000
Any insight into what's happening? By the way, this firedrake/petsc-mapdes build is from back in October 2015 (yes, much has changed since then, but reinstalling/updating Firedrake and PETSc on LANL's firewalled HPC machines is a big pain in the ass).
Thanks,
Justin