[petsc-users] KSPSetUp does not scale

Matthew Knepley knepley at gmail.com
Mon Nov 19 07:56:31 CST 2012


On Mon, Nov 19, 2012 at 8:40 AM, Thomas Witkowski
<thomas.witkowski at tu-dresden.de> wrote:
> Here are the two files. In this case, maybe you can also give me some hints
> on why the solver does not scale at all here. The solver runtime is 206
> seconds on 64 cores; with the same problem size on 128 cores it takes 172
> seconds. The numbers of inner and outer solver iterations are the same for
> both runs. I use CG with a Jacobi preconditioner and Hypre BoomerAMG for the
> inner solver.
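
For readers of the archive, here is a minimal sketch of how such a solver
combination is typically selected with the PETSc C API. Which of the two
solves plays which role in Thomas's application is not shown in this thread,
so the function name and the "outer"/"inner" split are assumptions; the
equivalent command-line options are noted in the comments.

#include <petscksp.h>

/* Sketch only: CG + Jacobi for one KSP, Hypre BoomerAMG as the
   preconditioner of another.  Equivalent options:
     -ksp_type cg -pc_type jacobi               (outer)
     -pc_type hypre -pc_hypre_type boomeramg    (inner)   */
PetscErrorCode ConfigureSolvers(KSP outer, KSP inner)
{
  PetscErrorCode ierr;
  PC             pc;

  ierr = KSPSetType(outer, KSPCG);CHKERRQ(ierr);        /* CG Krylov method      */
  ierr = KSPGetPC(outer, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCJACOBI);CHKERRQ(ierr);         /* Jacobi preconditioner */

  ierr = KSPGetPC(inner, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCHYPRE);CHKERRQ(ierr);          /* Hypre interface       */
  ierr = PCHYPRESetType(pc, "boomeramg");CHKERRQ(ierr); /* BoomerAMG AMG         */

  ierr = KSPSetFromOptions(outer);CHKERRQ(ierr);        /* honor runtime options */
  ierr = KSPSetFromOptions(inner);CHKERRQ(ierr);
  return 0;
}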

This appears to have nothing at all to do with SetUp(). You have

64 procs
PCSetUp                5 1.0 3.2241e+01 1.0 0.00e+00 0.0 4.6e+03 2.6e+04 1.3e+02 12  0  1  1 10  12  0  1  1 10     0
KSPSolve               1 1.0 2.0766e+02 1.0 6.16e+09 1.3 5.1e+05 1.7e+04 1.1e+03 78100 97 92 83  78100 97 92 84  1698
PCApply              100 1.0 1.9821e+02 1.0 7.54e+08 1.4 3.6e+05 8.4e+03 8.1e+02 75 12 69 33 61  75 12 69 33 61   210

128 procs
PCSetUp                5 1.0 3.0170e+01 1.0 0.00e+00 0.0 1.0e+04 1.2e+04 1.3e+02 15  0  1  1  9  15  0  1  1  9     0
KSPSolve               1 1.0 1.7274e+02 1.0 3.24e+09 1.4 1.2e+06 8.7e+03 1.2e+03 85100 97 92 84  85100 97 92 84  2040
PCApply              100 1.0 1.6804e+02 1.0 4.04e+08 1.5 8.7e+05 4.4e+03 8.5e+02 83 12 70 33 62  83 12 70 33 62   250

The PCApply time is the nonscalable part, and it looks like it is all Hypre.
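
Spelling out the speedups implied by the two tables above (64 -> 128 cores,
so the ideal speedup would be 2x):

  KSPSolve: 207.7 s -> 172.7 s   speedup ~1.20  (parallel efficiency ~60%)
  PCApply:  198.2 s -> 168.0 s   speedup ~1.18  (~59%)
  PCSetUp:   32.2 s ->  30.2 s   speedup ~1.07  (~53%)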

   Matt

> On 19.11.2012 13:41, Jed Brown wrote:
>
> Just have it do one or a few iterations.
>
>
> On Mon, Nov 19, 2012 at 1:36 PM, Thomas Witkowski
> <thomas.witkowski at tu-dresden.de> wrote:
>>
>> I can do this! Should I stop the run after KSPSetUp? Or do you want to see
>> the log_summary file from the whole run?
>>
>> Thomas
>>
>> On 19.11.2012 13:33, Jed Brown wrote:
>>
>> Always, always, always send -log_summary when asking about performance.
>>
>>
>> On Mon, Nov 19, 2012 at 11:26 AM, Thomas Witkowski
>> <thomas.witkowski at tu-dresden.de> wrote:
>>>
>>> I have a scaling problem in KSPSetUp; maybe some of you can help me to
>>> fix it. It takes 4.5 seconds on 64 cores and 4.0 seconds on 128 cores. The
>>> matrix has around 11 million rows and is not perfectly balanced, but the
>>> maximum number of rows per core in the 128-core case is exactly half of
>>> the number in the 64-core case. Besides the scaling, why does the
>>> setup take so long? I thought that just some objects are created but no
>>> calculation is going on!
>>>
>>> The KSPView on the corresponding solver objects is as follows:
>>>
>>> KSP Object:(ns_) 64 MPI processes
>>>   type: fgmres
>>>     GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
>>>     GMRES: happy breakdown tolerance 1e-30
>>>   maximum iterations=100, initial guess is zero
>>>   tolerances:  relative=1e-06, absolute=1e-08, divergence=10000
>>>   right preconditioning
>>>   has attached null space
>>>   using UNPRECONDITIONED norm type for convergence test
>>> PC Object:(ns_) 64 MPI processes
>>>   type: fieldsplit
>>>     FieldSplit with Schur preconditioner, factorization FULL
>>>     Preconditioner for the Schur complement formed from the block diagonal part of A11
>>>     Split info:
>>>     Split number 0 Defined by IS
>>>     Split number 1 Defined by IS
>>>     KSP solver for A00 block
>>>       KSP Object:      (ns_fieldsplit_velocity_)       64 MPI processes
>>>         type: preonly
>>>         maximum iterations=10000, initial guess is zero
>>>         tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
>>>         left preconditioning
>>>         using DEFAULT norm type for convergence test
>>>       PC Object:      (ns_fieldsplit_velocity_)       64 MPI processes
>>>         type: none
>>>         linear system matrix = precond matrix:
>>>         Matrix Object:         64 MPI processes
>>>           type: mpiaij
>>>           rows=11068107, cols=11068107
>>>           total: nonzeros=315206535, allocated nonzeros=315206535
>>>           total number of mallocs used during MatSetValues calls =0
>>>             not using I-node (on process 0) routines
>>>     KSP solver for S = A11 - A10 inv(A00) A01
>>>       KSP Object:      (ns_fieldsplit_pressure_)       64 MPI processes
>>>         type: gmres
>>>           GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
>>>           GMRES: happy breakdown tolerance 1e-30
>>>         maximum iterations=10000, initial guess is zero
>>>         tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
>>>         left preconditioning
>>>         using DEFAULT norm type for convergence test
>>>       PC Object:      (ns_fieldsplit_pressure_)       64 MPI processes
>>>         type: none
>>>         linear system matrix followed by preconditioner matrix:
>>>         Matrix Object:         64 MPI processes
>>>           type: schurcomplement
>>>           rows=469678, cols=469678
>>>             Schur complement A11 - A10 inv(A00) A01
>>>             A11
>>>               Matrix Object:               64 MPI processes
>>>                 type: mpiaij
>>>                 rows=469678, cols=469678
>>>                 total: nonzeros=0, allocated nonzeros=0
>>>                 total number of mallocs used during MatSetValues calls =0
>>>                   using I-node (on process 0) routines: found 1304 nodes, limit used is 5
>>>             A10
>>>               Matrix Object:               64 MPI processes
>>>                 type: mpiaij
>>>                 rows=469678, cols=11068107
>>>                 total: nonzeros=89122957, allocated nonzeros=89122957
>>>                 total number of mallocs used during MatSetValues calls =0
>>>                   not using I-node (on process 0) routines
>>>             KSP of A00
>>>               KSP Object: (ns_fieldsplit_velocity_)               64 MPI processes
>>>                 type: preonly
>>>                 maximum iterations=10000, initial guess is zero
>>>                 tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
>>>                 left preconditioning
>>>                 using DEFAULT norm type for convergence test
>>>               PC Object: (ns_fieldsplit_velocity_)               64 MPI processes
>>>                 type: none
>>>                 linear system matrix = precond matrix:
>>>                 Matrix Object:                 64 MPI processes
>>>                   type: mpiaij
>>>                   rows=11068107, cols=11068107
>>>                   total: nonzeros=315206535, allocated nonzeros=315206535
>>>                   total number of mallocs used during MatSetValues calls =0
>>>                     not using I-node (on process 0) routines
>>>             A01
>>>               Matrix Object:               64 MPI processes
>>>                 type: mpiaij
>>>                 rows=11068107, cols=469678
>>>                 total: nonzeros=88821041, allocated nonzeros=88821041
>>>                 total number of mallocs used during MatSetValues calls =0
>>>                   not using I-node (on process 0) routines
>>>         Matrix Object:         64 MPI processes
>>>           type: mpiaij
>>>           rows=469678, cols=469678
>>>           total: nonzeros=0, allocated nonzeros=0
>>>           total number of mallocs used during MatSetValues calls =0
>>>             using I-node (on process 0) routines: found 1304 nodes, limit used is 5
>>>   linear system matrix = precond matrix:
>>>   Matrix Object:   64 MPI processes
>>>     type: mpiaij
>>>     rows=11537785, cols=11537785
>>>     total: nonzeros=493150533, allocated nonzeros=510309207
>>>     total number of mallocs used during MatSetValues calls =0
>>>       not using I-node (on process 0) routines
>>>
>>>
>>>
>>>
>>> Thomas
>>
>>
>>
>
>
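
On the KSPSetUp timing question quoted above: one way to see the setup cost
on its own in -log_summary is to wrap KSPSetUp in a separate logging stage
and then run as usual (or for just a few iterations, as Jed suggests). A
minimal sketch, assuming a ksp whose operators are already set; the function
and stage names are only illustrative:

#include <petscksp.h>

/* Sketch: give KSPSetUp its own logging stage so that -log_summary reports
   its time, flops, and messages separately from the rest of the solve. */
PetscErrorCode SolveWithStagedSetup(KSP ksp, Vec b, Vec x)
{
  PetscErrorCode ierr;
  PetscLogStage  setupStage;

  ierr = PetscLogStageRegister("KSPSetUp stage", &setupStage);CHKERRQ(ierr);
  ierr = PetscLogStagePush(setupStage);CHKERRQ(ierr);
  ierr = KSPSetUp(ksp);CHKERRQ(ierr);          /* setup work only             */
  ierr = PetscLogStagePop();CHKERRQ(ierr);

  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);    /* logged in the default stage */
  return 0;
}

With -log_summary on the command line, the registered stage then shows up as
its own section in the summary.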



--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener

