[petsc-users] KSPSetUp does not scale
Matthew Knepley
knepley at gmail.com
Mon Nov 19 07:56:31 CST 2012
On Mon, Nov 19, 2012 at 8:40 AM, Thomas Witkowski
<thomas.witkowski at tu-dresden.de> wrote:
> Here are the two files. In this case, maybe you can also give me some hints
> why the solver does not scale here at all. The solver runtime on 64 cores
> is 206 seconds; with the same problem size on 128 cores it takes 172
> seconds. The number of inner and outer solver iterations is the same for
> both runs. I use CG with a Jacobi preconditioner and hypre BoomerAMG for
> the inner solver.
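For reference, a rough sketch of how such a nested solver can be driven from the options database. The prefixes (ns_, ns_fieldsplit_velocity_, ns_fieldsplit_pressure_) come from the KSPView output quoted further down; the function name is a placeholder, which block gets BoomerAMG versus CG/Jacobi is assumed rather than confirmed, and option names can differ slightly between PETSc versions:

#include <petscksp.h>

/* Sketch only: attach the "ns_" prefix, define the two splits, and let the
 * options database pick the nested solvers, e.g.
 *   -ns_ksp_type fgmres
 *   -ns_pc_fieldsplit_type schur
 *   -ns_pc_fieldsplit_schur_factorization_type full
 *   -ns_fieldsplit_velocity_pc_type hypre
 *   -ns_fieldsplit_velocity_pc_hypre_type boomeramg
 *   -ns_fieldsplit_pressure_ksp_type cg
 *   -ns_fieldsplit_pressure_pc_type jacobi
 * Which block gets BoomerAMG and which gets CG/Jacobi is guessed here. */
PetscErrorCode ConfigureNavierStokesSolver(KSP ksp, IS is_velocity, IS is_pressure)
{
  PC             pc;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = KSPSetOptionsPrefix(ksp, "ns_");CHKERRQ(ierr);
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCFIELDSPLIT);CHKERRQ(ierr);
  ierr = PCFieldSplitSetIS(pc, "velocity", is_velocity);CHKERRQ(ierr);
  ierr = PCFieldSplitSetIS(pc, "pressure", is_pressure);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}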
This appears to have nothing at all to do with SetUp(). You have
64 procs
PCSetUp       5 1.0 3.2241e+01 1.0 0.00e+00 0.0 4.6e+03 2.6e+04 1.3e+02 12   0  1  1 10  12   0  1  1 10     0
KSPSolve      1 1.0 2.0766e+02 1.0 6.16e+09 1.3 5.1e+05 1.7e+04 1.1e+03 78 100 97 92 83  78 100 97 92 84  1698
PCApply     100 1.0 1.9821e+02 1.0 7.54e+08 1.4 3.6e+05 8.4e+03 8.1e+02 75  12 69 33 61  75  12 69 33 61   210
128 procs
PCSetUp       5 1.0 3.0170e+01 1.0 0.00e+00 0.0 1.0e+04 1.2e+04 1.3e+02 15   0  1  1  9  15   0  1  1  9     0
KSPSolve      1 1.0 1.7274e+02 1.0 3.24e+09 1.4 1.2e+06 8.7e+03 1.2e+03 85 100 97 92 84  85 100 97 92 84  2040
PCApply     100 1.0 1.6804e+02 1.0 4.04e+08 1.5 8.7e+05 4.4e+03 8.5e+02 83  12 70 33 62  83  12 70 33 62   250
The PCApply time is the non-scalable part, and it looks like it is all Hypre.
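A natural follow-up, sketched here under the assumption that the velocity block is the one carrying BoomerAMG, is to rerun with PETSc's native algebraic multigrid (GAMG) on the same prefix and compare the PCApply rows of the two -log_summary outputs:

/* Sketch: A/B the AMG used inside the velocity split.
 *   run A: -ns_fieldsplit_velocity_pc_type hypre
 *          -ns_fieldsplit_velocity_pc_hypre_type boomeramg
 *   run B: -ns_fieldsplit_velocity_pc_type gamg
 * The same switch can be made in code before KSPSetFromOptions; the
 * two-argument PetscOptionsSetValue below is the form used by PETSc
 * releases of that era (newer releases take a PetscOptions object as
 * the first argument). */
ierr = PetscOptionsSetValue("-ns_fieldsplit_velocity_pc_type", "gamg");CHKERRQ(ierr);
ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);

If hypre is kept, -ns_fieldsplit_velocity_pc_hypre_boomeramg_strong_threshold is one commonly tuned option for the parallel coarsening behavior.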
Matt
> On 19.11.2012 13:41, Jed Brown wrote:
>
> Just have it do one or a few iterations.
>
>
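A minimal sketch of that, assuming the outer solver object is called ksp and b, x are the usual vectors (all placeholder names): cap the iteration count and leave the tolerances alone.

/* Cap the outer solve at one iteration so the run stays cheap while
 * -log_summary still captures the full setup cost.  PETSC_DEFAULT keeps
 * the existing tolerances.  Command-line equivalent with the prefix used
 * in this thread: -ns_ksp_max_it 1 */
ierr = KSPSetTolerances(ksp, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT, 1);CHKERRQ(ierr);
ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);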
> On Mon, Nov 19, 2012 at 1:36 PM, Thomas Witkowski
> <thomas.witkowski at tu-dresden.de> wrote:
>>
>> I can do this! Should I stop the run after KSPSetUp? Or do you want to see
>> the log_summary file from the whole run?
>>
>> Thomas
>>
>> On 19.11.2012 13:33, Jed Brown wrote:
>>
>> Always, always, always send -log_summary when asking about performance.
>>
>>
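For completeness, a sketch of two ways to get that summary, assuming a recent PETSc release (the option was later renamed -log_view, and the viewer-based call below may not exist in older versions):

/* Easiest: append the option to the existing run, e.g.
 *   mpiexec -n 64 ./my_app <usual options> -log_summary
 * The report can also be written out from code at the end of the program: */
ierr = PetscLogView(PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);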
>> On Mon, Nov 19, 2012 at 11:26 AM, Thomas Witkowski
>> <thomas.witkowski at tu-dresden.de> wrote:
>>>
>>> I have a scaling problem in KSPSetUp; maybe some of you can help me to
>>> fix it. It takes 4.5 seconds on 64 cores, and 4.0 seconds on 128 cores. The
>>> matrix has around 11 million rows and is not perfectly balanced, but the
>>> maximum number of rows per core in the 128-core case is exactly half of
>>> the number in the 64-core case. Besides the scaling, why does the setup
>>> take so long? I thought that only some objects are created but no
>>> calculation is going on!
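On the question of why setup costs anything at all: for a fieldsplit preconditioner, PCSetUp typically extracts the submatrices for each split, which involves communication and copying rather than just object creation. A sketch of how to isolate that cost in -log_summary with its own logging stage (function and variable names are placeholders):

#include <petscksp.h>

/* Sketch: give KSPSetUp its own logging stage so that -log_summary reports
 * its cost separately from the solve.  ksp, b, and x are assumed to be
 * created and configured by the application. */
PetscErrorCode SetUpAndSolveWithStages(KSP ksp, Vec b, Vec x)
{
  PetscLogStage  setup_stage;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = PetscLogStageRegister("KSPSetUpOnly", &setup_stage);CHKERRQ(ierr);
  ierr = PetscLogStagePush(setup_stage);CHKERRQ(ierr);
  ierr = KSPSetUp(ksp);CHKERRQ(ierr);        /* all setup cost lands in this stage */
  ierr = PetscLogStagePop();CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);  /* reported in the main stage */
  PetscFunctionReturn(0);
}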
>>>
>>> The KSPView output for the corresponding solver objects is as follows:
>>>
>>> KSP Object:(ns_) 64 MPI processes
>>>   type: fgmres
>>>     GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
>>>     GMRES: happy breakdown tolerance 1e-30
>>>   maximum iterations=100, initial guess is zero
>>>   tolerances: relative=1e-06, absolute=1e-08, divergence=10000
>>>   right preconditioning
>>>   has attached null space
>>>   using UNPRECONDITIONED norm type for convergence test
>>> PC Object:(ns_) 64 MPI processes
>>>   type: fieldsplit
>>>     FieldSplit with Schur preconditioner, factorization FULL
>>>     Preconditioner for the Schur complement formed from the block diagonal part of A11
>>>     Split info:
>>>     Split number 0 Defined by IS
>>>     Split number 1 Defined by IS
>>>     KSP solver for A00 block
>>>       KSP Object: (ns_fieldsplit_velocity_) 64 MPI processes
>>>         type: preonly
>>>         maximum iterations=10000, initial guess is zero
>>>         tolerances: relative=1e-05, absolute=1e-50, divergence=10000
>>>         left preconditioning
>>>         using DEFAULT norm type for convergence test
>>>       PC Object: (ns_fieldsplit_velocity_) 64 MPI processes
>>>         type: none
>>>         linear system matrix = precond matrix:
>>>         Matrix Object: 64 MPI processes
>>>           type: mpiaij
>>>           rows=11068107, cols=11068107
>>>           total: nonzeros=315206535, allocated nonzeros=315206535
>>>           total number of mallocs used during MatSetValues calls =0
>>>           not using I-node (on process 0) routines
>>>     KSP solver for S = A11 - A10 inv(A00) A01
>>>       KSP Object: (ns_fieldsplit_pressure_) 64 MPI processes
>>>         type: gmres
>>>           GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
>>>           GMRES: happy breakdown tolerance 1e-30
>>>         maximum iterations=10000, initial guess is zero
>>>         tolerances: relative=1e-05, absolute=1e-50, divergence=10000
>>>         left preconditioning
>>>         using DEFAULT norm type for convergence test
>>>       PC Object: (ns_fieldsplit_pressure_) 64 MPI processes
>>>         type: none
>>>         linear system matrix followed by preconditioner matrix:
>>>         Matrix Object: 64 MPI processes
>>>           type: schurcomplement
>>>           rows=469678, cols=469678
>>>             Schur complement A11 - A10 inv(A00) A01
>>>             A11
>>>               Matrix Object: 64 MPI processes
>>>                 type: mpiaij
>>>                 rows=469678, cols=469678
>>>                 total: nonzeros=0, allocated nonzeros=0
>>>                 total number of mallocs used during MatSetValues calls =0
>>>                 using I-node (on process 0) routines: found 1304 nodes, limit used is 5
>>>             A10
>>>               Matrix Object: 64 MPI processes
>>>                 type: mpiaij
>>>                 rows=469678, cols=11068107
>>>                 total: nonzeros=89122957, allocated nonzeros=89122957
>>>                 total number of mallocs used during MatSetValues calls =0
>>>                 not using I-node (on process 0) routines
>>>             KSP of A00
>>>               KSP Object: (ns_fieldsplit_velocity_) 64 MPI processes
>>>                 type: preonly
>>>                 maximum iterations=10000, initial guess is zero
>>>                 tolerances: relative=1e-05, absolute=1e-50, divergence=10000
>>>                 left preconditioning
>>>                 using DEFAULT norm type for convergence test
>>>               PC Object: (ns_fieldsplit_velocity_) 64 MPI processes
>>>                 type: none
>>>                 linear system matrix = precond matrix:
>>>                 Matrix Object: 64 MPI processes
>>>                   type: mpiaij
>>>                   rows=11068107, cols=11068107
>>>                   total: nonzeros=315206535, allocated nonzeros=315206535
>>>                   total number of mallocs used during MatSetValues calls =0
>>>                   not using I-node (on process 0) routines
>>>             A01
>>>               Matrix Object: 64 MPI processes
>>>                 type: mpiaij
>>>                 rows=11068107, cols=469678
>>>                 total: nonzeros=88821041, allocated nonzeros=88821041
>>>                 total number of mallocs used during MatSetValues calls =0
>>>                 not using I-node (on process 0) routines
>>>         Matrix Object: 64 MPI processes
>>>           type: mpiaij
>>>           rows=469678, cols=469678
>>>           total: nonzeros=0, allocated nonzeros=0
>>>           total number of mallocs used during MatSetValues calls =0
>>>           using I-node (on process 0) routines: found 1304 nodes, limit used is 5
>>>   linear system matrix = precond matrix:
>>>   Matrix Object: 64 MPI processes
>>>     type: mpiaij
>>>     rows=11537785, cols=11537785
>>>     total: nonzeros=493150533, allocated nonzeros=510309207
>>>     total number of mallocs used during MatSetValues calls =0
>>>     not using I-node (on process 0) routines
>>>
>>>
>>>
>>>
>>> Thomas
>>
>>
>>
>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener