[petsc-users] KSPSetUp does not scale

Thomas Witkowski thomas.witkowski at tu-dresden.de
Mon Nov 19 08:47:24 CST 2012


Your assumption is right. I replaced the Hypre preconditioner with an
unpreconditioned iterative solver. The solver is now much slower, but
it scales.

But even in this case, KSPSetUp takes 4.5 seconds on 64 cores and
3.8 seconds on 128 cores!

Thomas
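
A minimal C sketch of how the KSPSetUp cost could be isolated in -log_summary
with a user-defined logging stage; the helper name TimedKSPSetUp and the stage
name "KSPSetUpStage" are illustrative, not taken from the code discussed in
this thread:

    #include <petscksp.h>

    /* Time KSPSetUp() in its own -log_summary stage.  The KSP must already
       have its operators and options set before this is called. */
    PetscErrorCode TimedKSPSetUp(KSP ksp)
    {
      PetscLogStage  setup_stage;
      PetscErrorCode ierr;

      ierr = PetscLogStageRegister("KSPSetUpStage", &setup_stage);CHKERRQ(ierr);
      ierr = PetscLogStagePush(setup_stage);CHKERRQ(ierr);
      ierr = KSPSetUp(ksp);CHKERRQ(ierr);   /* only this call is charged to the stage */
      ierr = PetscLogStagePop();CHKERRQ(ierr);
      return 0;
    }

Running with -log_summary then reports the setup in its own stage, so its time
and load balance across the ranks can be compared directly between the 64-core
and 128-core runs.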

On 19.11.2012 14:56, Matthew Knepley wrote:
> On Mon, Nov 19, 2012 at 8:40 AM, Thomas Witkowski
> <thomas.witkowski at tu-dresden.de> wrote:
>> Here are the two files. In this case, maybe you can also give me some hints
>> as to why the solver does not scale here at all. The solver runtime on 64 cores
>> is 206 seconds; with the same problem size on 128 cores it takes 172
>> seconds. The number of inner and outer solver iterations is the same for
>> both runs. I use CG with a Jacobi preconditioner and Hypre BoomerAMG for the
>> inner solver.
> This appears to have nothing at all to do with SetUp(). You have
>
> 64 procs
> PCSetUp                5 1.0 3.2241e+01 1.0 0.00e+00 0.0 4.6e+03 2.6e+04 1.3e+02 12  0  1  1 10  12  0  1  1 10     0
> KSPSolve               1 1.0 2.0766e+02 1.0 6.16e+09 1.3 5.1e+05 1.7e+04 1.1e+03 78100 97 92 83  78100 97 92 84  1698
> PCApply              100 1.0 1.9821e+02 1.0 7.54e+08 1.4 3.6e+05 8.4e+03 8.1e+02 75 12 69 33 61  75 12 69 33 61   210
>
> 128 procs
> PCSetUp                5 1.0 3.0170e+01 1.0 0.00e+00 0.0 1.0e+04 1.2e+04 1.3e+02 15  0  1  1  9  15  0  1  1  9     0
> KSPSolve               1 1.0 1.7274e+02 1.0 3.24e+09 1.4 1.2e+06 8.7e+03 1.2e+03 85100 97 92 84  85100 97 92 84  2040
> PCApply              100 1.0 1.6804e+02 1.0 4.04e+08 1.5 8.7e+05 4.4e+03 8.5e+02 83 12 70 33 62  83 12 70 33 62   250
>
> The PCApply time is the nonscalable part, and it looks like it is all Hypre.
>
>     Matt
>
>> On 19.11.2012 13:41, Jed Brown wrote:
>>
>> Just have it do one or a few iterations.
>>
>>
>> On Mon, Nov 19, 2012 at 1:36 PM, Thomas Witkowski
>> <thomas.witkowski at tu-dresden.de> wrote:
>>> I can do this! Should I stop the run after KSPSetUp? Or do you want to see
>>> the log_summary file from the whole run?
>>>
>>> Thomas
>>>
>>> On 19.11.2012 13:33, Jed Brown wrote:
>>>
>>> Always, always, always send -log_summary when asking about performance.
>>>
>>>
>>> On Mon, Nov 19, 2012 at 11:26 AM, Thomas Witkowski
>>> <thomas.witkowski at tu-dresden.de> wrote:
>>>> I have a scaling problem in KSPSetUp; maybe some of you can help me to
>>>> fix it. It takes 4.5 seconds on 64 cores and 4.0 seconds on 128 cores. The
>>>> matrix has around 11 million rows and is not perfectly balanced, but the
>>>> maximum number of rows per core in the 128-core case is exactly half of the
>>>> number in the 64-core case. Besides the scaling, why does the
>>>> setup take so long? I thought that only some objects are created and no
>>>> calculation is done!
>>>>
>>>> The KSPView on the corresponding solver objects is as follows:
>>>>
>>>> KSP Object:(ns_) 64 MPI processes
>>>>    type: fgmres
>>>>      GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
>>>>      GMRES: happy breakdown tolerance 1e-30
>>>>    maximum iterations=100, initial guess is zero
>>>>    tolerances:  relative=1e-06, absolute=1e-08, divergence=10000
>>>>    right preconditioning
>>>>    has attached null space
>>>>    using UNPRECONDITIONED norm type for convergence test
>>>> PC Object:(ns_) 64 MPI processes
>>>>    type: fieldsplit
>>>>      FieldSplit with Schur preconditioner, factorization FULL
>>>>      Preconditioner for the Schur complement formed from the block diagonal part of A11
>>>>      Split info:
>>>>      Split number 0 Defined by IS
>>>>      Split number 1 Defined by IS
>>>>      KSP solver for A00 block
>>>>        KSP Object:      (ns_fieldsplit_velocity_)       64 MPI processes
>>>>          type: preonly
>>>>          maximum iterations=10000, initial guess is zero
>>>>          tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
>>>>          left preconditioning
>>>>          using DEFAULT norm type for convergence test
>>>>        PC Object:      (ns_fieldsplit_velocity_)       64 MPI processes
>>>>          type: none
>>>>          linear system matrix = precond matrix:
>>>>          Matrix Object:         64 MPI processes
>>>>            type: mpiaij
>>>>            rows=11068107, cols=11068107
>>>>            total: nonzeros=315206535, allocated nonzeros=315206535
>>>>            total number of mallocs used during MatSetValues calls =0
>>>>              not using I-node (on process 0) routines
>>>>      KSP solver for S = A11 - A10 inv(A00) A01
>>>>        KSP Object:      (ns_fieldsplit_pressure_)       64 MPI processes
>>>>          type: gmres
>>>>            GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
>>>>            GMRES: happy breakdown tolerance 1e-30
>>>>          maximum iterations=10000, initial guess is zero
>>>>          tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
>>>>          left preconditioning
>>>>          using DEFAULT norm type for convergence test
>>>>        PC Object:      (ns_fieldsplit_pressure_)       64 MPI processes
>>>>          type: none
>>>>          linear system matrix followed by preconditioner matrix:
>>>>          Matrix Object:         64 MPI processes
>>>>            type: schurcomplement
>>>>            rows=469678, cols=469678
>>>>              Schur complement A11 - A10 inv(A00) A01
>>>>              A11
>>>>                Matrix Object:               64 MPI processes
>>>>                  type: mpiaij
>>>>                  rows=469678, cols=469678
>>>>                  total: nonzeros=0, allocated nonzeros=0
>>>>                  total number of mallocs used during MatSetValues calls =0
>>>>                    using I-node (on process 0) routines: found 1304 nodes, limit used is 5
>>>>              A10
>>>>                Matrix Object:               64 MPI processes
>>>>                  type: mpiaij
>>>>                  rows=469678, cols=11068107
>>>>                  total: nonzeros=89122957, allocated nonzeros=89122957
>>>>                  total number of mallocs used during MatSetValues calls =0
>>>>                    not using I-node (on process 0) routines
>>>>              KSP of A00
>>>>                KSP Object: (ns_fieldsplit_velocity_)               64 MPI processes
>>>>                  type: preonly
>>>>                  maximum iterations=10000, initial guess is zero
>>>>                  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
>>>>                  left preconditioning
>>>>                  using DEFAULT norm type for convergence test
>>>>                PC Object: (ns_fieldsplit_velocity_)               64 MPI processes
>>>>                  type: none
>>>>                  linear system matrix = precond matrix:
>>>>                  Matrix Object:                 64 MPI processes
>>>>                    type: mpiaij
>>>>                    rows=11068107, cols=11068107
>>>>                    total: nonzeros=315206535, allocated nonzeros=315206535
>>>>                    total number of mallocs used during MatSetValues calls =0
>>>>                      not using I-node (on process 0) routines
>>>>              A01
>>>>                Matrix Object:               64 MPI processes
>>>>                  type: mpiaij
>>>>                  rows=11068107, cols=469678
>>>>                  total: nonzeros=88821041, allocated nonzeros=88821041
>>>>                  total number of mallocs used during MatSetValues calls =0
>>>>                    not using I-node (on process 0) routines
>>>>          Matrix Object:         64 MPI processes
>>>>            type: mpiaij
>>>>            rows=469678, cols=469678
>>>>            total: nonzeros=0, allocated nonzeros=0
>>>>            total number of mallocs used during MatSetValues calls =0
>>>>              using I-node (on process 0) routines: found 1304 nodes, limit used is 5
>>>>    linear system matrix = precond matrix:
>>>>    Matrix Object:   64 MPI processes
>>>>      type: mpiaij
>>>>      rows=11537785, cols=11537785
>>>>      total: nonzeros=493150533, allocated nonzeros=510309207
>>>>      total number of mallocs used during MatSetValues calls =0
>>>>        not using I-node (on process 0) routines
>>>>
>>>>
>>>>
>>>>
>>>> Thomas
>>>
>>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
> -- Norbert Wiener
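
For reference, a minimal C sketch (using the PETSc 3.3-era API current at the
time of this thread) of how a fieldsplit/Schur solver like the one shown in the
KSPView above is typically put together. The function name BuildNSSolver, the
matrix A, and the index sets isU/isP are placeholders, not taken from Thomas's
code; the Schur factorization type and the per-block inner solvers are then
chosen through -ns_-prefixed run-time options:

    #include <petscksp.h>

    /* Build an fgmres + PCFIELDSPLIT (Schur) solver with options prefix "ns_",
       resembling the configuration in the KSPView output quoted above.
       A is the coupled matrix, isU/isP the velocity/pressure index sets. */
    PetscErrorCode BuildNSSolver(Mat A, IS isU, IS isP, KSP *ksp)
    {
      PC             pc;
      PetscErrorCode ierr;

      ierr = KSPCreate(PETSC_COMM_WORLD, ksp);CHKERRQ(ierr);
      ierr = KSPSetOptionsPrefix(*ksp, "ns_");CHKERRQ(ierr);
      ierr = KSPSetOperators(*ksp, A, A, SAME_NONZERO_PATTERN);CHKERRQ(ierr);
      ierr = KSPSetType(*ksp, KSPFGMRES);CHKERRQ(ierr);
      ierr = KSPSetTolerances(*ksp, 1e-6, 1e-8, PETSC_DEFAULT, 100);CHKERRQ(ierr);
      ierr = KSPGetPC(*ksp, &pc);CHKERRQ(ierr);
      ierr = PCSetType(pc, PCFIELDSPLIT);CHKERRQ(ierr);
      ierr = PCFieldSplitSetType(pc, PC_COMPOSITE_SCHUR);CHKERRQ(ierr);
      ierr = PCFieldSplitSetIS(pc, "velocity", isU);CHKERRQ(ierr);
      ierr = PCFieldSplitSetIS(pc, "pressure", isP);CHKERRQ(ierr);
      ierr = KSPSetFromOptions(*ksp);CHKERRQ(ierr);  /* picks up -ns_... options */
      ierr = KSPSetUp(*ksp);CHKERRQ(ierr);           /* the call whose time is discussed above */
      return 0;
    }

In later PETSc releases KSPSetOperators drops the final MatStructure argument;
the remaining calls are unchanged.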


