I have a scaling problem in KSPSetUp; maybe some of you can help me
fix it. It takes 4.5 seconds on 64 cores and 4.0 seconds on 128 cores.
The matrix has around 11 million rows and is not perfectly balanced, but
the maximum number of rows per core in the 128-core case is exactly half
the number in the 64-core case. Besides the scaling, why
does the setup take so long? I thought that only some objects are
created and no calculation is going on!
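
To put a number on the scaling concern, here is a minimal sketch of the strong-scaling arithmetic for the two timings quoted above. The function name `strong_scaling` is made up for illustration and is not part of PETSc; the only inputs are the 4.5 s / 64 cores and 4.0 s / 128 cores figures from this post.

```python
def strong_scaling(t_base, p_base, t_new, p_new):
    """Return (speedup, parallel efficiency) relative to the base run."""
    speedup = t_base / t_new            # measured speedup
    ideal = p_new / p_base              # ideal speedup for a fixed problem size
    efficiency = speedup / ideal        # fraction of the ideal speedup achieved
    return speedup, efficiency


# KSPSetUp timings from the post: 4.5 s on 64 cores, 4.0 s on 128 cores.
speedup, efficiency = strong_scaling(4.5, 64, 4.0, 128)
print(f"speedup = {speedup}, ideal = 2.0, efficiency = {efficiency}")
```

With perfect strong scaling the setup would drop to about 2.25 s on 128 cores, so most of the 4.0 s appears not to shrink with the core count.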
The KSPView output for the corresponding solver object is as follows:
KSP Object:(ns_) 64 MPI processes
type: fgmres
GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
GMRES: happy breakdown tolerance 1e-30
maximum iterations=100, initial guess is zero
tolerances: relative=1e-06, absolute=1e-08, divergence=10000
right preconditioning
has attached null space
using UNPRECONDITIONED norm type for convergence test
PC Object:(ns_) 64 MPI processes
type: fieldsplit
FieldSplit with Schur preconditioner, factorization FULL
Preconditioner for the Schur complement formed from the block diagonal part of A11
Split info:
Split number 0 Defined by IS
Split number 1 Defined by IS
KSP solver for A00 block
KSP Object: (ns_fieldsplit_velocity_) 64 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000
left preconditioning
using DEFAULT norm type for convergence test
PC Object: (ns_fieldsplit_velocity_) 64 MPI processes
type: none
linear system matrix = precond matrix:
Matrix Object: 64 MPI processes
type: mpiaij
rows=11068107, cols=11068107
total: nonzeros=315206535, allocated nonzeros=315206535
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
KSP solver for S = A11 - A10 inv(A00) A01
KSP Object: (ns_fieldsplit_pressure_) 64 MPI processes
type: gmres
GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
GMRES: happy breakdown tolerance 1e-30
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000
left preconditioning
using DEFAULT norm type for convergence test
PC Object: (ns_fieldsplit_pressure_) 64 MPI processes
type: none
linear system matrix followed by preconditioner matrix:
Matrix Object: 64 MPI processes
type: schurcomplement
rows=469678, cols=469678
Schur complement A11 - A10 inv(A00) A01
A11
Matrix Object: 64 MPI processes
type: mpiaij
rows=469678, cols=469678
total: nonzeros=0, allocated nonzeros=0
total number of mallocs used during MatSetValues calls =0
using I-node (on process 0) routines: found 1304 nodes, limit used is 5
A10
Matrix Object: 64 MPI processes
type: mpiaij
rows=469678, cols=11068107
total: nonzeros=89122957, allocated nonzeros=89122957
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
KSP of A00
KSP Object: (ns_fieldsplit_velocity_) 64 MPI processes
type: preonly
maximum iterations=10000, initial guess is zero
tolerances: relative=1e-05, absolute=1e-50, divergence=10000
left preconditioning
using DEFAULT norm type for convergence test
PC Object: (ns_fieldsplit_velocity_) 64 MPI processes
type: none
linear system matrix = precond matrix:
Matrix Object: 64 MPI processes
type: mpiaij
rows=11068107, cols=11068107
total: nonzeros=315206535, allocated nonzeros=315206535
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
A01
Matrix Object: 64 MPI processes
type: mpiaij
rows=11068107, cols=469678
total: nonzeros=88821041, allocated nonzeros=88821041
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
Matrix Object: 64 MPI processes
type: mpiaij
rows=469678, cols=469678
total: nonzeros=0, allocated nonzeros=0
total number of mallocs used during MatSetValues calls =0
using I-node (on process 0) routines: found 1304 nodes, limit used is 5
linear system matrix = precond matrix:
Matrix Object: 64 MPI processes
type: mpiaij
rows=11537785, cols=11537785
total: nonzeros=493150533, allocated nonzeros=510309207
total number of mallocs used during MatSetValues calls =0
not using I-node (on process 0) routines
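
As a sanity check on the sizes reported in the KSPView output above, the following sketch recomputes a few derived quantities. All numbers are copied from the output; the per-process row counts assume a perfectly even distribution, which is only an approximation, since the matrix is not perfectly balanced.

```python
# Sizes copied from the KSPView output above.
velocity_rows = 11068107   # A00 block (ns_fieldsplit_velocity_)
pressure_rows = 469678     # A11 block (ns_fieldsplit_pressure_)
total_rows = 11537785      # full system matrix
a00_nnz = 315206535        # nonzeros in A00
total_nnz = 493150533      # nonzeros in the full matrix

# The two splits should add up to the full system size.
assert velocity_rows + pressure_rows == total_rows

# Average rows per process, ASSUMING an even distribution (an approximation).
rows_64 = total_rows / 64
rows_128 = total_rows / 128

# Average number of nonzeros per row.
a00_nnz_per_row = a00_nnz / velocity_rows
total_nnz_per_row = total_nnz / total_rows

print(f"avg rows per process: {rows_64:.0f} (64 procs), {rows_128:.0f} (128 procs)")
print(f"avg nonzeros per row: {a00_nnz_per_row:.1f} (A00), {total_nnz_per_row:.1f} (full)")
```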
