[petsc-users] Using PCFIELDSPLIT with -pc_fieldsplit_type schur

Wed Jan 11 20:31:24 CST 2017

   Thanks, this is very useful information. It means that 

1) the approximate Sp is actually a very good approximation to the true Schur complement S, since using Sp^-1 to precondition S gives iteration counts from 8 to 13.

2)  using ilu(0) as a preconditioner for Sp is not good, since replacing Sp^-1 with ilu(0) of Sp gives absurd iteration counts. This is actually not super surprising since ilu(0) is generally "not so good" for elasticity.
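
To make the distinction between S and Sp concrete, here is a minimal sketch on a toy 2x2 block system. The numbers are made up for illustration and are not the matrices from this thread. It assumes, as the -ksp_view output quoted below reports, that S = A11 - A10 inv(A00) A01 and that Sp is assembled with the inverse of A00's diagonal in place of inv(A00). The helper names (matmul, sub, inv2) are hypothetical and plain Python, not PETSc calls.

```python
# Toy illustration only: tiny dense blocks, not the 324 / 28476 row blocks in the log.
# S  = A11 - A10 inv(A00) A01         (true Schur complement)
# Sp = A11 - A10 inv(diag(A00)) A01   (assembled approximation using A00's diagonal)

def matmul(X, Y):
    """Dense matrix product of two lists-of-lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def sub(X, Y):
    """Entrywise difference X - Y."""
    return [[X[i][j] - Y[i][j] for j in range(len(X[0]))] for i in range(len(X))]

def inv2(M):
    """Exact inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A00 = [[4.0, 1.0], [1.0, 3.0]]
A01 = [[1.0, 0.0], [0.0, 1.0]]
A10 = [[1.0, 0.0], [0.0, 1.0]]   # transpose of A01, so the full system is symmetric
A11 = [[5.0, 1.0], [1.0, 6.0]]

# Inverse of diag(A00): only the diagonal entries are inverted.
diag_inv = [[1.0 / A00[0][0], 0.0], [0.0, 1.0 / A00[1][1]]]

S = sub(A11, matmul(matmul(A10, inv2(A00)), A01))
Sp = sub(A11, matmul(matmul(A10, diag_inv), A01))

# Largest entrywise difference between the true and approximate Schur complements.
max_diff = max(abs(S[i][j] - Sp[i][j]) for i in range(2) for j in range(2))
print("S  =", S)
print("Sp =", Sp)
print("max |S - Sp| =", max_diff)
```

When the coupling blocks A10 and A01 are small relative to A11, as they appear to be here (936 nonzeros each in the log), S and Sp differ only slightly, which is consistent with Sp^-1 working well as a preconditioner for S.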

So the next step is to try using -fieldsplit_FE_split_ksp_monitor  -fieldsplit_FE_split_pc_type gamg

The one open question is whether any options should be passed to GAMG to tell it that the underlying problem comes from "elasticity"; that is, something about the null space.

   Mark Adams, since the GAMG is coming from inside another preconditioner, it may not be easy for the user to attach the near null space to that inner matrix. Would it make sense for there to be a GAMG command line option to indicate that it is a 3D elasticity problem, so GAMG could set up the near null space for itself? Or does that not make sense?
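
For readers unfamiliar with the term, the near null space of a 3D elasticity operator is spanned by the six rigid body modes (three translations and three rotations). The sketch below builds those six vectors from nodal coordinates in plain Python. It assumes interleaved degrees of freedom (ux, uy, uz per node); the function name rigid_body_modes is hypothetical and this is not PETSc's implementation. In PETSc itself, MatNullSpaceCreateRigidBody() builds such a space from a coordinate vector and MatSetNearNullSpace() attaches it to a matrix; whether that can conveniently be done for the inner Schur complement matrix is exactly the open question above.

```python
def rigid_body_modes(coords):
    """Return the 6 rigid body modes for nodes at the given (x, y, z) positions.

    Each mode is a flat list of length 3 * len(coords), ordered
    (ux, uy, uz) node by node. The modes are not orthonormalized.
    """
    modes = [[] for _ in range(6)]
    for x, y, z in coords:
        modes[0] += [1.0, 0.0, 0.0]   # translation in x
        modes[1] += [0.0, 1.0, 0.0]   # translation in y
        modes[2] += [0.0, 0.0, 1.0]   # translation in z
        modes[3] += [0.0, -z, y]      # rotation about x: e_x cross r
        modes[4] += [z, 0.0, -x]      # rotation about y: e_y cross r
        modes[5] += [-y, x, 0.0]      # rotation about z: e_z cross r
    return modes

# Three example nodes (made-up coordinates).
example_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
example_modes = rigid_body_modes(example_coords)
print(len(example_modes), "modes of length", len(example_modes[0]))
```

An algebraic multigrid method can use these vectors to build coarse spaces that represent rigid body motions, which is the usual reason for supplying them in elasticity problems.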

   Barry

> On Jan 11, 2017, at 7:47 PM, David Knezevic <david.knezevic at akselos.com> wrote:
> 
> I've attached the two log files. Using cholesky for "FE_split" seems to have helped a lot!
> 
> David
> 
> 
> --
> David J. Knezevic | CTO
> Akselos | 210 Broadway, #201 | Cambridge, MA | 02139
> 
> On Wed, Jan 11, 2017 at 8:32 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
>    Can you please run with all the monitoring on? So we can see the convergence of all the inner solvers
> -fieldsplit_FE_split_ksp_monitor
> 
> Then run again with
> 
> -fieldsplit_FE_split_ksp_monitor  -fieldsplit_FE_split_pc_type cholesky
> 
> 
> and send both sets of results
> 
> Barry
> 
> 
> > On Jan 11, 2017, at 6:32 PM, David Knezevic <david.knezevic at akselos.com> wrote:
> >
> > On Wed, Jan 11, 2017 at 5:52 PM, Dave May <dave.mayhem23 at gmail.com> wrote:
> > so I gather that I'll have to look into a user-defined approximation to S.
> >
> > Where does the 2x2 block system come from?
> > Maybe someone on the list knows the right approximation to use for S.
> >
> > The model is 3D linear elasticity using a finite element discretization. I applied substructuring to part of the system to "condense" it, and that results in the small A00 block. The A11 block is just standard 3D elasticity; no substructuring was applied there. There are constraints to connect the degrees of freedom on the interface of the substructured and non-substructured regions.
> >
> > If anyone has suggestions for a good way to precondition this type of system, I'd be most appreciative!
> >
> > Thanks,
> > David
> >
> >
> >
> > -----------------------------------------
> >
> >   0 KSP Residual norm 5.405528187695e+04
> >   1 KSP Residual norm 2.187814910803e+02
> >   2 KSP Residual norm 1.019051577515e-01
> >   3 KSP Residual norm 4.370464012859e-04
> > KSP Object: 1 MPI processes
> >   type: cg
> >   maximum iterations=1000
> >   tolerances:  relative=1e-06, absolute=1e-50, divergence=10000.
> >   left preconditioning
> >   using nonzero initial guess
> >   using PRECONDITIONED norm type for convergence test
> > PC Object: 1 MPI processes
> >   type: fieldsplit
> >     FieldSplit with Schur preconditioner, factorization FULL
> >     Preconditioner for the Schur complement formed from Sp, an assembled approximation to S, which uses (lumped, if requested) A00's diagonal's inverse
> >     Split info:
> >     Split number 0 Defined by IS
> >     Split number 1 Defined by IS
> >     KSP solver for A00 block
> >       KSP Object:      (fieldsplit_RB_split_)       1 MPI processes
> >         type: preonly
> >         maximum iterations=10000, initial guess is zero
> >         tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
> >         left preconditioning
> >         using NONE norm type for convergence test
> >       PC Object:      (fieldsplit_RB_split_)       1 MPI processes
> >         type: cholesky
> >           Cholesky: out-of-place factorization
> >           tolerance for zero pivot 2.22045e-14
> >           matrix ordering: natural
> >           factor fill ratio given 0., needed 0.
> >             Factored matrix follows:
> >               Mat Object:               1 MPI processes
> >                 type: seqaij
> >                 rows=324, cols=324
> >                 package used to perform factorization: mumps
> >                 total: nonzeros=3042, allocated nonzeros=3042
> >                 total number of mallocs used during MatSetValues calls =0
> >                   MUMPS run parameters:
> >                     SYM (matrix type):                   2
> >                     PAR (host participation):            1
> >                     ICNTL(1) (output for error):         6
> >                     ICNTL(2) (output of diagnostic msg): 0
> >                     ICNTL(3) (output for global info):   0
> >                     ICNTL(4) (level of printing):        0
> >                     ICNTL(5) (input mat struct):         0
> >                     ICNTL(6) (matrix prescaling):        7
> >                     ICNTL(7) (sequentia matrix ordering):7
> >                     ICNTL(8) (scalling strategy):        77
> >                     ICNTL(10) (max num of refinements):  0
> >                     ICNTL(11) (error analysis):          0
> >                     ICNTL(12) (efficiency control):                         0
> >                     ICNTL(13) (efficiency control):                         0
> >                     ICNTL(14) (percentage of estimated workspace increase): 20
> >                     ICNTL(18) (input mat struct):                           0
> >                     ICNTL(19) (Shur complement info):                       0
> >                     ICNTL(20) (rhs sparse pattern):                         0
> >                     ICNTL(21) (solution struct):                            0
> >                     ICNTL(22) (in-core/out-of-core facility):               0
> >                     ICNTL(23) (max size of memory can be allocated locally):0
> >                     ICNTL(24) (detection of null pivot rows):               0
> >                     ICNTL(25) (computation of a null space basis):          0
> >                     ICNTL(26) (Schur options for rhs or solution):          0
> >                     ICNTL(27) (experimental parameter):                     -24
> >                     ICNTL(28) (use parallel or sequential ordering):        1
> >                     ICNTL(29) (parallel ordering):                          0
> >                     ICNTL(30) (user-specified set of entries in inv(A)):    0
> >                     ICNTL(31) (factors is discarded in the solve phase):    0
> >                     ICNTL(33) (compute determinant):                        0
> >                     CNTL(1) (relative pivoting threshold):      0.01
> >                     CNTL(2) (stopping criterion of refinement): 1.49012e-08
> >                     CNTL(3) (absolute pivoting threshold):      0.
> >                     CNTL(4) (value of static pivoting):         -1.
> >                     CNTL(5) (fixation for null pivots):         0.
> >                     RINFO(1) (local estimated flops for the elimination after analysis):
> >                       [0] 29394.
> >                     RINFO(2) (local estimated flops for the assembly after factorization):
> >                       [0]  1092.
> >                     RINFO(3) (local estimated flops for the elimination after factorization):
> >                       [0]  29394.
> >                     INFO(15) (estimated size of (in MB) MUMPS internal data for running numerical factorization):
> >                     [0] 1
> >                     INFO(16) (size of (in MB) MUMPS internal data used during numerical factorization):
> >                       [0] 1
> >                     INFO(23) (num of pivots eliminated on this processor after factorization):
> >                       [0] 324
> >                     RINFOG(1) (global estimated flops for the elimination after analysis): 29394.
> >                     RINFOG(2) (global estimated flops for the assembly after factorization): 1092.
> >                     RINFOG(3) (global estimated flops for the elimination after factorization): 29394.
> >                     (RINFOG(12) RINFOG(13))*2^INFOG(34) (determinant): (0.,0.)*(2^0)
> >                     INFOG(3) (estimated real workspace for factors on all processors after analysis): 3888
> >                     INFOG(4) (estimated integer workspace for factors on all processors after analysis): 2067
> >                     INFOG(5) (estimated maximum front size in the complete tree): 12
> >                     INFOG(6) (number of nodes in the complete tree): 53
> >                     INFOG(7) (ordering option effectively use after analysis): 2
> >                     INFOG(8) (structural symmetry in percent of the permuted matrix after analysis): 100
> >                     INFOG(9) (total real/complex workspace to store the matrix factors after factorization): 3888
> >                     INFOG(10) (total integer space store the matrix factors after factorization): 2067
> >                     INFOG(11) (order of largest frontal matrix after factorization): 12
> >                     INFOG(12) (number of off-diagonal pivots): 0
> >                     INFOG(13) (number of delayed pivots after factorization): 0
> >                     INFOG(14) (number of memory compress after factorization): 0
> >                     INFOG(15) (number of steps of iterative refinement after solution): 0
> >                     INFOG(16) (estimated size (in MB) of all MUMPS internal data for factorization after analysis: value on the most memory consuming processor): 1
> >                     INFOG(17) (estimated size of all MUMPS internal data for factorization after analysis: sum over all processors): 1
> >                     INFOG(18) (size of all MUMPS internal data allocated during factorization: value on the most memory consuming processor): 1
> >                     INFOG(19) (size of all MUMPS internal data allocated during factorization: sum over all processors): 1
> >                     INFOG(20) (estimated number of entries in the factors): 3042
> >                     INFOG(21) (size in MB of memory effectively used during factorization - value on the most memory consuming processor): 1
> >                     INFOG(22) (size in MB of memory effectively used during factorization - sum over all processors): 1
> >                     INFOG(23) (after analysis: value of ICNTL(6) effectively used): 5
> >                     INFOG(24) (after analysis: value of ICNTL(12) effectively used): 1
> >                     INFOG(25) (after factorization: number of pivots modified by static pivoting): 0
> >                     INFOG(28) (after factorization: number of null pivots encountered): 0
> >                     INFOG(29) (after factorization: effective number of entries in the factors (sum over all processors)): 3042
> >                     INFOG(30, 31) (after solution: size in Mbytes of memory used during solution phase): 0, 0
> >                     INFOG(32) (after analysis: type of analysis done): 1
> >                     INFOG(33) (value used for ICNTL(8)): -2
> >                     INFOG(34) (exponent of the determinant if determinant is requested): 0
> >         linear system matrix = precond matrix:
> >         Mat Object:        (fieldsplit_RB_split_)         1 MPI processes
> >           type: seqaij
> >           rows=324, cols=324
> >           total: nonzeros=5760, allocated nonzeros=5760
> >           total number of mallocs used during MatSetValues calls =0
> >             using I-node routines: found 108 nodes, limit used is 5
> >     KSP solver for S = A11 - A10 inv(A00) A01
> >       KSP Object:      (fieldsplit_FE_split_)       1 MPI processes
> >         type: cg
> >         maximum iterations=10000, initial guess is zero
> >         tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
> >         left preconditioning
> >         using PRECONDITIONED norm type for convergence test
> >       PC Object:      (fieldsplit_FE_split_)       1 MPI processes
> >         type: bjacobi
> >           block Jacobi: number of blocks = 1
> >           Local solve is same for all blocks, in the following KSP and PC objects:
> >           KSP Object:          (fieldsplit_FE_split_sub_)           1 MPI processes
> >             type: preonly
> >             maximum iterations=10000, initial guess is zero
> >             tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
> >             left preconditioning
> >             using NONE norm type for convergence test
> >           PC Object:          (fieldsplit_FE_split_sub_)           1 MPI processes
> >             type: ilu
> >               ILU: out-of-place factorization
> >               0 levels of fill
> >               tolerance for zero pivot 2.22045e-14
> >               matrix ordering: natural
> >               factor fill ratio given 1., needed 1.
> >                 Factored matrix follows:
> >                   Mat Object:                   1 MPI processes
> >                     type: seqaij
> >                     rows=28476, cols=28476
> >                     package used to perform factorization: petsc
> >                     total: nonzeros=1037052, allocated nonzeros=1037052
> >                     total number of mallocs used during MatSetValues calls =0
> >                       using I-node routines: found 9489 nodes, limit used is 5
> >             linear system matrix = precond matrix:
> >             Mat Object:             1 MPI processes
> >               type: seqaij
> >               rows=28476, cols=28476
> >               total: nonzeros=1037052, allocated nonzeros=1037052
> >               total number of mallocs used during MatSetValues calls =0
> >                 using I-node routines: found 9489 nodes, limit used is 5
> >         linear system matrix followed by preconditioner matrix:
> >         Mat Object:        (fieldsplit_FE_split_)         1 MPI processes
> >           type: schurcomplement
> >           rows=28476, cols=28476
> >             Schur complement A11 - A10 inv(A00) A01
> >             A11
> >               Mat Object:              (fieldsplit_FE_split_)               1 MPI processes
> >                 type: seqaij
> >                 rows=28476, cols=28476
> >                 total: nonzeros=1017054, allocated nonzeros=1017054
> >                 total number of mallocs used during MatSetValues calls =0
> >                   using I-node routines: found 9492 nodes, limit used is 5
> >             A10
> >               Mat Object:               1 MPI processes
> >                 type: seqaij
> >                 rows=28476, cols=324
> >                 total: nonzeros=936, allocated nonzeros=936
> >                 total number of mallocs used during MatSetValues calls =0
> >                   using I-node routines: found 5717 nodes, limit used is 5
> >             KSP of A00
> >               KSP Object:              (fieldsplit_RB_split_)               1 MPI processes
> >                 type: preonly
> >                 maximum iterations=10000, initial guess is zero
> >                 tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
> >                 left preconditioning
> >                 using NONE norm type for convergence test
> >               PC Object:              (fieldsplit_RB_split_)               1 MPI processes
> >                 type: cholesky
> >                   Cholesky: out-of-place factorization
> >                   tolerance for zero pivot 2.22045e-14
> >                   matrix ordering: natural
> >                   factor fill ratio given 0., needed 0.
> >                     Factored matrix follows:
> >                       Mat Object:                       1 MPI processes
> >                         type: seqaij
> >                         rows=324, cols=324
> >                         package used to perform factorization: mumps
> >                         total: nonzeros=3042, allocated nonzeros=3042
> >                         total number of mallocs used during MatSetValues calls =0
> >                           MUMPS run parameters:
> >                             SYM (matrix type):                   2
> >                             PAR (host participation):            1
> >                             ICNTL(1) (output for error):         6
> >                             ICNTL(2) (output of diagnostic msg): 0
> >                             ICNTL(3) (output for global info):   0
> >                             ICNTL(4) (level of printing):        0
> >                             ICNTL(5) (input mat struct):         0
> >                             ICNTL(6) (matrix prescaling):        7
> >                             ICNTL(7) (sequentia matrix ordering):7
> >                             ICNTL(8) (scalling strategy):        77
> >                             ICNTL(10) (max num of refinements):  0
> >                             ICNTL(11) (error analysis):          0
> >                             ICNTL(12) (efficiency control):                         0
> >                             ICNTL(13) (efficiency control):                         0
> >                             ICNTL(14) (percentage of estimated workspace increase): 20
> >                             ICNTL(18) (input mat struct):                           0
> >                             ICNTL(19) (Shur complement info):                       0
> >                             ICNTL(20) (rhs sparse pattern):                         0
> >                             ICNTL(21) (solution struct):                            0
> >                             ICNTL(22) (in-core/out-of-core facility):               0
> >                             ICNTL(23) (max size of memory can be allocated locally):0
> >                             ICNTL(24) (detection of null pivot rows):               0
> >                             ICNTL(25) (computation of a null space basis):          0
> >                             ICNTL(26) (Schur options for rhs or solution):          0
> >                             ICNTL(27) (experimental parameter):                     -24
> >                             ICNTL(28) (use parallel or sequential ordering):        1
> >                             ICNTL(29) (parallel ordering):                          0
> >                             ICNTL(30) (user-specified set of entries in inv(A)):    0
> >                             ICNTL(31) (factors is discarded in the solve phase):    0
> >                             ICNTL(33) (compute determinant):                        0
> >                             CNTL(1) (relative pivoting threshold):      0.01
> >                             CNTL(2) (stopping criterion of refinement): 1.49012e-08
> >                             CNTL(3) (absolute pivoting threshold):      0.
> >                             CNTL(4) (value of static pivoting):         -1.
> >                             CNTL(5) (fixation for null pivots):         0.
> >                             RINFO(1) (local estimated flops for the elimination after analysis):
> >                               [0] 29394.
> >                             RINFO(2) (local estimated flops for the assembly after factorization):
> >                               [0]  1092.
> >                             RINFO(3) (local estimated flops for the elimination after factorization):
> >                               [0]  29394.
> >                             INFO(15) (estimated size of (in MB) MUMPS internal data for running numerical factorization):
> >                             [0] 1
> >                             INFO(16) (size of (in MB) MUMPS internal data used during numerical factorization):
> >                               [0] 1
> >                             INFO(23) (num of pivots eliminated on this processor after factorization):
> >                               [0] 324
> >                             RINFOG(1) (global estimated flops for the elimination after analysis): 29394.
> >                             RINFOG(2) (global estimated flops for the assembly after factorization): 1092.
> >                             RINFOG(3) (global estimated flops for the elimination after factorization): 29394.
> >                             (RINFOG(12) RINFOG(13))*2^INFOG(34) (determinant): (0.,0.)*(2^0)
> >                             INFOG(3) (estimated real workspace for factors on all processors after analysis): 3888
> >                             INFOG(4) (estimated integer workspace for factors on all processors after analysis): 2067
> >                             INFOG(5) (estimated maximum front size in the complete tree): 12
> >                             INFOG(6) (number of nodes in the complete tree): 53
> >                             INFOG(7) (ordering option effectively use after analysis): 2
> >                             INFOG(8) (structural symmetry in percent of the permuted matrix after analysis): 100
> >                             INFOG(9) (total real/complex workspace to store the matrix factors after factorization): 3888
> >                             INFOG(10) (total integer space store the matrix factors after factorization): 2067
> >                             INFOG(11) (order of largest frontal matrix after factorization): 12
> >                             INFOG(12) (number of off-diagonal pivots): 0
> >                             INFOG(13) (number of delayed pivots after factorization): 0
> >                             INFOG(14) (number of memory compress after factorization): 0
> >                             INFOG(15) (number of steps of iterative refinement after solution): 0
> >                             INFOG(16) (estimated size (in MB) of all MUMPS internal data for factorization after analysis: value on the most memory consuming processor): 1
> >                             INFOG(17) (estimated size of all MUMPS internal data for factorization after analysis: sum over all processors): 1
> >                             INFOG(18) (size of all MUMPS internal data allocated during factorization: value on the most memory consuming processor): 1
> >                             INFOG(19) (size of all MUMPS internal data allocated during factorization: sum over all processors): 1
> >                             INFOG(20) (estimated number of entries in the factors): 3042
> >                             INFOG(21) (size in MB of memory effectively used during factorization - value on the most memory consuming processor): 1
> >                             INFOG(22) (size in MB of memory effectively used during factorization - sum over all processors): 1
> >                             INFOG(23) (after analysis: value of ICNTL(6) effectively used): 5
> >                             INFOG(24) (after analysis: value of ICNTL(12) effectively used): 1
> >                             INFOG(25) (after factorization: number of pivots modified by static pivoting): 0
> >                             INFOG(28) (after factorization: number of null pivots encountered): 0
> >                             INFOG(29) (after factorization: effective number of entries in the factors (sum over all processors)): 3042
> >                             INFOG(30, 31) (after solution: size in Mbytes of memory used during solution phase): 0, 0
> >                             INFOG(32) (after analysis: type of analysis done): 1
> >                             INFOG(33) (value used for ICNTL(8)): -2
> >                             INFOG(34) (exponent of the determinant if determinant is requested): 0
> >                 linear system matrix = precond matrix:
> >                 Mat Object:                (fieldsplit_RB_split_)                 1 MPI processes
> >                   type: seqaij
> >                   rows=324, cols=324
> >                   total: nonzeros=5760, allocated nonzeros=5760
> >                   total number of mallocs used during MatSetValues calls =0
> >                     using I-node routines: found 108 nodes, limit used is 5
> >             A01
> >               Mat Object:               1 MPI processes
> >                 type: seqaij
> >                 rows=324, cols=28476
> >                 total: nonzeros=936, allocated nonzeros=936
> >                 total number of mallocs used during MatSetValues calls =0
> >                   using I-node routines: found 67 nodes, limit used is 5
> >         Mat Object:         1 MPI processes
> >           type: seqaij
> >           rows=28476, cols=28476
> >           total: nonzeros=1037052, allocated nonzeros=1037052
> >           total number of mallocs used during MatSetValues calls =0
> >             using I-node routines: found 9489 nodes, limit used is 5
> >   linear system matrix = precond matrix:
> >   Mat Object:  ()   1 MPI processes
> >     type: seqaij
> >     rows=28800, cols=28800
> >     total: nonzeros=1024686, allocated nonzeros=1024794
> >     total number of mallocs used during MatSetValues calls =0
> >       using I-node routines: found 9600 nodes, limit used is 5
> >
> > ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
> >
> > /home/dknez/akselos-dev/scrbe/build/bin/fe_solver-opt_real on a arch-linux2-c-opt named david-Lenovo with 1 processor, by dknez Wed Jan 11 17:22:10 2017
> > Using Petsc Release Version 3.7.3, unknown
> >
> >                          Max       Max/Min        Avg      Total
> > Time (sec):           9.638e+01      1.00000   9.638e+01
> > Objects:              2.030e+02      1.00000   2.030e+02
> > Flops:                1.732e+11      1.00000   1.732e+11  1.732e+11
> > Flops/sec:            1.797e+09      1.00000   1.797e+09  1.797e+09
> > MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
> > MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
> > MPI Reductions:       0.000e+00      0.00000
> >
> > Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
> >                             e.g., VecAXPY() for real vectors of length N --> 2N flops
> >                             and VecAXPY() for complex vectors of length N --> 8N flops
> >
> > Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
> >                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
> >  0:      Main Stage: 9.6379e+01 100.0%  1.7318e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
> >
> > ------------------------------------------------------------------------------------------------------------------------
> > See the 'Profiling' chapter of the users' manual for details on interpreting output.
> > Phase summary info:
> >    Count: number of times phase was executed
> >    Time and Flops: Max - maximum over all processors
> >                    Ratio - ratio of maximum to minimum over all processors
> >    Mess: number of messages sent
> >    Avg. len: average message length (bytes)
> >    Reduct: number of global reductions
> >    Global: entire computation
> >    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
> >       %T - percent time in this phase         %F - percent flops in this phase
> >       %M - percent messages in this phase     %L - percent message lengths in this phase
> >       %R - percent reductions in this phase
> >    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> > ------------------------------------------------------------------------------------------------------------------------
> > Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
> >                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> > ------------------------------------------------------------------------------------------------------------------------
> >
> > --- Event Stage 0: Main Stage
> >
> > VecDot                42 1.0 2.2411e-05 1.0 8.53e+03 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   380
> > VecTDot            77761 1.0 1.4294e+00 1.0 4.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1  3  0  0  0   1  3  0  0  0  3098
> > VecNorm            38894 1.0 9.1002e-01 1.0 2.22e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1  1  0  0  0  2434
> > VecScale           38882 1.0 3.7314e-01 1.0 1.11e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  2967
> > VecCopy            38908 1.0 2.1655e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > VecSet             77887 1.0 3.2034e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > VecAXPY            77777 1.0 1.8382e+00 1.0 4.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2  3  0  0  0   2  3  0  0  0  2409
> > VecAYPX            38875 1.0 1.2884e+00 1.0 2.21e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1  1  0  0  0  1718
> > VecAssemblyBegin      68 1.0 1.9407e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > VecAssemblyEnd        68 1.0 2.6941e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > VecScatterBegin       48 1.0 4.6349e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatMult            38891 1.0 4.3045e+01 1.0 8.03e+10 1.0 0.0e+00 0.0e+00 0.0e+00 45 46  0  0  0  45 46  0  0  0  1866
> > MatMultAdd         38889 1.0 3.5360e+01 1.0 7.91e+10 1.0 0.0e+00 0.0e+00 0.0e+00 37 46  0  0  0  37 46  0  0  0  2236
> > MatSolve           77769 1.0 4.8780e+01 1.0 7.95e+10 1.0 0.0e+00 0.0e+00 0.0e+00 51 46  0  0  0  51 46  0  0  0  1631
> > MatLUFactorNum         1 1.0 1.9575e-02 1.0 2.49e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1274
> > MatCholFctrSym         1 1.0 9.4891e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatCholFctrNum         1 1.0 3.7885e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatILUFactorSym        1 1.0 4.1780e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatConvert             1 1.0 3.0041e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatScale               2 1.0 2.7180e-05 1.0 2.53e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   930
> > MatAssemblyBegin      32 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatAssemblyEnd        32 1.0 1.2032e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatGetRow         114978 1.0 5.9254e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatGetRowIJ            2 1.0 2.1458e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatGetSubMatrice       6 1.0 1.5707e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatGetOrdering         2 1.0 3.2425e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatZeroEntries         6 1.0 3.0580e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatView                7 1.0 3.5119e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatAXPY                1 1.0 1.9384e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatMatMult             1 1.0 2.7120e-03 1.0 3.16e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   117
> > MatMatMultSym          1 1.0 1.8010e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatMatMultNum          1 1.0 6.1703e-04 1.0 3.16e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   513
> > KSPSetUp               4 1.0 9.8944e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > KSPSolve               1 1.0 9.3380e+01 1.0 1.73e+11 1.0 0.0e+00 0.0e+00 0.0e+00 97100  0  0  0  97100  0  0  0  1855
> > PCSetUp                4 1.0 6.6326e-02 1.0 2.53e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   381
> > PCSetUpOnBlocks        5 1.0 2.4082e-02 1.0 2.49e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1036
> > PCApply                5 1.0 9.3376e+01 1.0 1.73e+11 1.0 0.0e+00 0.0e+00 0.0e+00 97100  0  0  0  97100  0  0  0  1855
> > KSPSolve_FS_0          5 1.0 7.0214e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > KSPSolve_FS_Schu       5 1.0 9.3372e+01 1.0 1.73e+11 1.0 0.0e+00 0.0e+00 0.0e+00 97100  0  0  0  97100  0  0  0  1855
> > KSPSolve_FS_Low        5 1.0 2.1377e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > ------------------------------------------------------------------------------------------------------------------------
> >
> > Memory usage is given in bytes:
> >
> > Object Type          Creations   Destructions     Memory  Descendants' Mem.
> > Reports information only for process 0.
> >
> > --- Event Stage 0: Main Stage
> >
> >               Vector    92             92      9698040     0.
> >       Vector Scatter    24             24        15936     0.
> >            Index Set    51             51       537876     0.
> >    IS L to G Mapping     3              3       240408     0.
> >               Matrix    16             16     77377776     0.
> >        Krylov Solver     6              6         7888     0.
> >       Preconditioner     6              6         6288     0.
> >               Viewer     1              0            0     0.
> >     Distributed Mesh     1              1         4624     0.
> > Star Forest Bipartite Graph     2              2         1616     0.
> >      Discrete System     1              1          872     0.
> > ========================================================================================================================
> > Average time to get PetscTime(): 0.
> > #PETSc Option Table entries:
> > -ksp_monitor
> > -ksp_view
> > -log_view
> > #End of PETSc Option Table entries
> > Compiled without FORTRAN kernels
> > Compiled with full precision matrices (default)
> > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> > Configure options: --with-shared-libraries=1 --with-debugging=0 --download-suitesparse --download-blacs --download-ptscotch=yes --with-blas-lapack-dir=/opt/intel/system_studio_2015.2.050/mkl --CXXFLAGS=-Wl,--no-as-needed --download-scalapack --download-mumps --download-metis --prefix=/home/dknez/software/libmesh_install/opt_real/petsc --download-hypre --download-ml
> > -----------------------------------------
> > Libraries compiled on Wed Sep 21 17:38:52 2016 on david-Lenovo
> > Machine characteristics: Linux-4.4.0-38-generic-x86_64-with-Ubuntu-16.04-xenial
> > Using PETSc directory: /home/dknez/software/petsc-src
> > Using PETSc arch: arch-linux2-c-opt
> > -----------------------------------------
> >
> > Using C compiler: mpicc  -fPIC  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fvisibility=hidden -g -O  ${COPTFLAGS} ${CFLAGS}
> > Using Fortran compiler: mpif90  -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O   ${FOPTFLAGS} ${FFLAGS}
> > -----------------------------------------
> >
> > Using include paths: -I/home/dknez/software/petsc-src/arch-linux2-c-opt/include -I/home/dknez/software/petsc-src/include -I/home/dknez/software/petsc-src/include -I/home/dknez/software/petsc-src/arch-linux2-c-opt/include -I/home/dknez/software/libmesh_install/opt_real/petsc/include -I/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent -I/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent/include -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi
> > -----------------------------------------
> >
> > Using C linker: mpicc
> > Using Fortran linker: mpif90
> > Using libraries: -Wl,-rpath,/home/dknez/software/petsc-src/arch-linux2-c-opt/lib -L/home/dknez/software/petsc-src/arch-linux2-c-opt/lib -lpetsc -Wl,-rpath,/home/dknez/software/libmesh_install/opt_real/petsc/lib -L/home/dknez/software/libmesh_install/opt_real/petsc/lib -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord -lmetis -lHYPRE -Wl,-rpath,/usr/lib/openmpi/lib -L/usr/lib/openmpi/lib -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/5 -L/usr/lib/gcc/x86_64-linux-gnu/5 -Wl,-rpath,/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu -Wl,-rpath,/lib/x86_64-linux-gnu -L/lib/x86_64-linux-gnu -lmpi_cxx -lstdc++ -lscalapack -lml -lmpi_cxx -lstdc++ -lumfpack -lklu -lcholmod -lbtf -lccolamd -lcolamd -lcamd -lamd -lsuitesparseconfig -Wl,-rpath,/opt/intel/system_studio_2015.2.050/mkl/lib/intel64 -L/opt/intel/system_studio_2015.2.050/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -lhwloc -lptesmumps -lptscotch -lptscotcherr -lscotch -lscotcherr -lX11 -lm -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lgfortran -lm -lgfortran -lm -lquadmath -lm -lmpi_cxx -lstdc++ -lrt -lm -lpthread -lz -Wl,-rpath,/usr/lib/openmpi/lib -L/usr/lib/openmpi/lib -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/5 -L/usr/lib/gcc/x86_64-linux-gnu/5 -Wl,-rpath,/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu -Wl,-rpath,/lib/x86_64-linux-gnu -L/lib/x86_64-linux-gnu -Wl,-rpath,/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu -ldl -Wl,-rpath,/usr/lib/openmpi/lib -lmpi -lgcc_s -lpthread -ldl
> > -----------------------------------------
> >
> >
> >
> >
> > On Wed, Jan 11, 2017 at 4:49 PM, Dave May <dave.mayhem23 at gmail.com> wrote:
> > It looks like the Schur solve is requiring a huge number of iterations to converge (judging by the number of MatMult calls).
> > This is killing the performance.
> >
> > Are you sure that A11 is a good approximation to S? You might consider trying the selfp option
> >
> > http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/PC/PCFieldSplitSetSchurPre.html#PCFieldSplitSetSchurPre
> >
> > Note that the best approximation to S is likely both problem and discretisation dependent, so if selfp is also terrible, you might want to consider coding up your own approximation to S for your specific system.
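> >
> > As a rough sketch of what trying selfp could look like from the command line: this assumes the solver picks up options through KSPSetFromOptions() and that the second split keeps the fieldsplit_FE_split_ prefix that appears in the -ksp_view output. Both are assumptions about the application code, not something confirmed here.
> >
> > ```
> > -pc_fieldsplit_type schur
> > -pc_fieldsplit_schur_fact_type full
> > -pc_fieldsplit_schur_precondition selfp
> > -fieldsplit_FE_split_ksp_monitor
> > ```
> >
> > With selfp, the preconditioner for the Schur solve is built from an explicitly assembled Sp = A11 - A10 inv(diag(A00)) A01 rather than from A11 alone. The inner monitor makes it easy to see whether the iteration count of the Schur solve actually drops.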
> >
> >
> > Thanks,
> >   Dave
> >
> >
> > On Wed, 11 Jan 2017 at 22:34, David Knezevic <david.knezevic at akselos.com> wrote:
> > I have a definite block 2x2 system and I figured it'd be good to apply the PCFIELDSPLIT functionality with Schur complement, as described in Section 4.5 of the manual.
> >
> > The A00 block of my matrix is very small so I figured I'd specify a direct solver (i.e. MUMPS) for that block.
> >
> > So I did the following:
> > - PCFieldSplitSetIS to specify the indices of the two splits
> > - PCFieldSplitGetSubKSP to get the two KSP objects, and to set the solver and PC types for each (MUMPS for A00, ILU+CG for A11)
> > - I set -pc_fieldsplit_schur_fact_type full
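> >
> > For reference, the setup in the list above could probably also be written as run-time options, roughly as sketched here. This assumes KSPSetFromOptions() is called on the outer solver, that the splits are named RB_split and FE_split (as the prefixes in the -ksp_view output suggest), and a PETSc 3.7-era option name for selecting MUMPS. The index sets themselves still have to be supplied in code through PCFieldSplitSetIS.
> >
> > ```
> > -ksp_type cg
> > -pc_type fieldsplit
> > -pc_fieldsplit_type schur
> > -pc_fieldsplit_schur_fact_type full
> > -fieldsplit_RB_split_ksp_type preonly
> > -fieldsplit_RB_split_pc_type cholesky
> > -fieldsplit_RB_split_pc_factor_mat_solver_package mumps
> > -fieldsplit_FE_split_ksp_type cg
> > -fieldsplit_FE_split_pc_type bjacobi
> > -fieldsplit_FE_split_sub_pc_type ilu
> > ```
> >
> > Setting things through options rather than through PCFieldSplitGetSubKSP makes it easier to swap the inner solvers without recompiling.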
> >
> > Below I have pasted the output of "-ksp_view -ksp_monitor -log_view" for a test case. It seems to converge well, but I'm concerned about the speed (about 90 seconds, vs. about 1 second if I use a direct solver for the entire system). I just wanted to check if I'm setting this up in a good way?
> >
> > Many thanks,
> > David
> >
> > -----------------------------------------------------------------------------------
> >
> >   0 KSP Residual norm 5.405774214400e+04
> >   1 KSP Residual norm 1.849649014371e+02
> >   2 KSP Residual norm 7.462775074989e-02
> >   3 KSP Residual norm 2.680497175260e-04
> > KSP Object: 1 MPI processes
> >   type: cg
> >   maximum iterations=1000
> >   tolerances:  relative=1e-06, absolute=1e-50, divergence=10000.
> >   left preconditioning
> >   using nonzero initial guess
> >   using PRECONDITIONED norm type for convergence test
> > PC Object: 1 MPI processes
> >   type: fieldsplit
> >     FieldSplit with Schur preconditioner, factorization FULL
> >     Preconditioner for the Schur complement formed from A11
> >     Split info:
> >     Split number 0 Defined by IS
> >     Split number 1 Defined by IS
> >     KSP solver for A00 block
> >       KSP Object:      (fieldsplit_RB_split_)       1 MPI processes
> >         type: preonly
> >         maximum iterations=10000, initial guess is zero
> >         tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
> >         left preconditioning
> >         using NONE norm type for convergence test
> >       PC Object:      (fieldsplit_RB_split_)       1 MPI processes
> >         type: cholesky
> >           Cholesky: out-of-place factorization
> >           tolerance for zero pivot 2.22045e-14
> >           matrix ordering: natural
> >           factor fill ratio given 0., needed 0.
> >             Factored matrix follows:
> >               Mat Object:               1 MPI processes
> >                 type: seqaij
> >                 rows=324, cols=324
> >                 package used to perform factorization: mumps
> >                 total: nonzeros=3042, allocated nonzeros=3042
> >                 total number of mallocs used during MatSetValues calls =0
> >                   MUMPS run parameters:
> >                     SYM (matrix type):                   2
> >                     PAR (host participation):            1
> >                     ICNTL(1) (output for error):         6
> >                     ICNTL(2) (output of diagnostic msg): 0
> >                     ICNTL(3) (output for global info):   0
> >                     ICNTL(4) (level of printing):        0
> >                     ICNTL(5) (input mat struct):         0
> >                     ICNTL(6) (matrix prescaling):        7
> >                     ICNTL(7) (sequentia matrix ordering):7
> >                     ICNTL(8) (scalling strategy):        77
> >                     ICNTL(10) (max num of refinements):  0
> >                     ICNTL(11) (error analysis):          0
> >                     ICNTL(12) (efficiency control):                         0
> >                     ICNTL(13) (efficiency control):                         0
> >                     ICNTL(14) (percentage of estimated workspace increase): 20
> >                     ICNTL(18) (input mat struct):                           0
> >                     ICNTL(19) (Shur complement info):                       0
> >                     ICNTL(20) (rhs sparse pattern):                         0
> >                     ICNTL(21) (solution struct):                            0
> >                     ICNTL(22) (in-core/out-of-core facility):               0
> >                     ICNTL(23) (max size of memory can be allocated locally):0
> >                     ICNTL(24) (detection of null pivot rows):               0
> >                     ICNTL(25) (computation of a null space basis):          0
> >                     ICNTL(26) (Schur options for rhs or solution):          0
> >                     ICNTL(27) (experimental parameter):                     -24
> >                     ICNTL(28) (use parallel or sequential ordering):        1
> >                     ICNTL(29) (parallel ordering):                          0
> >                     ICNTL(30) (user-specified set of entries in inv(A)):    0
> >                     ICNTL(31) (factors is discarded in the solve phase):    0
> >                     ICNTL(33) (compute determinant):                        0
> >                     CNTL(1) (relative pivoting threshold):      0.01
> >                     CNTL(2) (stopping criterion of refinement): 1.49012e-08
> >                     CNTL(3) (absolute pivoting threshold):      0.
> >                     CNTL(4) (value of static pivoting):         -1.
> >                     CNTL(5) (fixation for null pivots):         0.
> >                     RINFO(1) (local estimated flops for the elimination after analysis):
> >                       [0] 29394.
> >                     RINFO(2) (local estimated flops for the assembly after factorization):
> >                       [0]  1092.
> >                     RINFO(3) (local estimated flops for the elimination after factorization):
> >                       [0]  29394.
> >                     INFO(15) (estimated size of (in MB) MUMPS internal data for running numerical factorization):
> >                     [0] 1
> >                     INFO(16) (size of (in MB) MUMPS internal data used during numerical factorization):
> >                       [0] 1
> >                     INFO(23) (num of pivots eliminated on this processor after factorization):
> >                       [0] 324
> >                     RINFOG(1) (global estimated flops for the elimination after analysis): 29394.
> >                     RINFOG(2) (global estimated flops for the assembly after factorization): 1092.
> >                     RINFOG(3) (global estimated flops for the elimination after factorization): 29394.
> >                     (RINFOG(12) RINFOG(13))*2^INFOG(34) (determinant): (0.,0.)*(2^0)
> >                     INFOG(3) (estimated real workspace for factors on all processors after analysis): 3888
> >                     INFOG(4) (estimated integer workspace for factors on all processors after analysis): 2067
> >                     INFOG(5) (estimated maximum front size in the complete tree): 12
> >                     INFOG(6) (number of nodes in the complete tree): 53
> >                     INFOG(7) (ordering option effectively use after analysis): 2
> >                     INFOG(8) (structural symmetry in percent of the permuted matrix after analysis): 100
> >                     INFOG(9) (total real/complex workspace to store the matrix factors after factorization): 3888
> >                     INFOG(10) (total integer space store the matrix factors after factorization): 2067
> >                     INFOG(11) (order of largest frontal matrix after factorization): 12
> >                     INFOG(12) (number of off-diagonal pivots): 0
> >                     INFOG(13) (number of delayed pivots after factorization): 0
> >                     INFOG(14) (number of memory compress after factorization): 0
> >                     INFOG(15) (number of steps of iterative refinement after solution): 0
> >                     INFOG(16) (estimated size (in MB) of all MUMPS internal data for factorization after analysis: value on the most memory consuming processor): 1
> >                     INFOG(17) (estimated size of all MUMPS internal data for factorization after analysis: sum over all processors): 1
> >                     INFOG(18) (size of all MUMPS internal data allocated during factorization: value on the most memory consuming processor): 1
> >                     INFOG(19) (size of all MUMPS internal data allocated during factorization: sum over all processors): 1
> >                     INFOG(20) (estimated number of entries in the factors): 3042
> >                     INFOG(21) (size in MB of memory effectively used during factorization - value on the most memory consuming processor): 1
> >                     INFOG(22) (size in MB of memory effectively used during factorization - sum over all processors): 1
> >                     INFOG(23) (after analysis: value of ICNTL(6) effectively used): 5
> >                     INFOG(24) (after analysis: value of ICNTL(12) effectively used): 1
> >                     INFOG(25) (after factorization: number of pivots modified by static pivoting): 0
> >                     INFOG(28) (after factorization: number of null pivots encountered): 0
> >                     INFOG(29) (after factorization: effective number of entries in the factors (sum over all processors)): 3042
> >                     INFOG(30, 31) (after solution: size in Mbytes of memory used during solution phase): 0, 0
> >                     INFOG(32) (after analysis: type of analysis done): 1
> >                     INFOG(33) (value used for ICNTL(8)): -2
> >                     INFOG(34) (exponent of the determinant if determinant is requested): 0
> >         linear system matrix = precond matrix:
> >         Mat Object:        (fieldsplit_RB_split_)         1 MPI processes
> >           type: seqaij
> >           rows=324, cols=324
> >           total: nonzeros=5760, allocated nonzeros=5760
> >           total number of mallocs used during MatSetValues calls =0
> >             using I-node routines: found 108 nodes, limit used is 5
> >     KSP solver for S = A11 - A10 inv(A00) A01
> >       KSP Object:      (fieldsplit_FE_split_)       1 MPI processes
> >         type: cg
> >         maximum iterations=10000, initial guess is zero
> >         tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
> >         left preconditioning
> >         using PRECONDITIONED norm type for convergence test
> >       PC Object:      (fieldsplit_FE_split_)       1 MPI processes
> >         type: bjacobi
> >           block Jacobi: number of blocks = 1
> >           Local solve is same for all blocks, in the following KSP and PC objects:
> >           KSP Object:          (fieldsplit_FE_split_sub_)           1 MPI processes
> >             type: preonly
> >             maximum iterations=10000, initial guess is zero
> >             tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
> >             left preconditioning
> >             using NONE norm type for convergence test
> >           PC Object:          (fieldsplit_FE_split_sub_)           1 MPI processes
> >             type: ilu
> >               ILU: out-of-place factorization
> >               0 levels of fill
> >               tolerance for zero pivot 2.22045e-14
> >               matrix ordering: natural
> >               factor fill ratio given 1., needed 1.
> >                 Factored matrix follows:
> >                   Mat Object:                   1 MPI processes
> >                     type: seqaij
> >                     rows=28476, cols=28476
> >                     package used to perform factorization: petsc
> >                     total: nonzeros=1017054, allocated nonzeros=1017054
> >                     total number of mallocs used during MatSetValues calls =0
> >                       using I-node routines: found 9492 nodes, limit used is 5
> >             linear system matrix = precond matrix:
> >             Mat Object:            (fieldsplit_FE_split_)             1 MPI processes
> >               type: seqaij
> >               rows=28476, cols=28476
> >               total: nonzeros=1017054, allocated nonzeros=1017054
> >               total number of mallocs used during MatSetValues calls =0
> >                 using I-node routines: found 9492 nodes, limit used is 5
> >         linear system matrix followed by preconditioner matrix:
> >         Mat Object:        (fieldsplit_FE_split_)         1 MPI processes
> >           type: schurcomplement
> >           rows=28476, cols=28476
> >             Schur complement A11 - A10 inv(A00) A01
> >             A11
> >               Mat Object:              (fieldsplit_FE_split_)               1 MPI processes
> >                 type: seqaij
> >                 rows=28476, cols=28476
> >                 total: nonzeros=1017054, allocated nonzeros=1017054
> >                 total number of mallocs used during MatSetValues calls =0
> >                   using I-node routines: found 9492 nodes, limit used is 5
> >             A10
> >               Mat Object:               1 MPI processes
> >                 type: seqaij
> >                 rows=28476, cols=324
> >                 total: nonzeros=936, allocated nonzeros=936
> >                 total number of mallocs used during MatSetValues calls =0
> >                   using I-node routines: found 5717 nodes, limit used is 5
> >             KSP of A00
> >               KSP Object:              (fieldsplit_RB_split_)               1 MPI processes
> >                 type: preonly
> >                 maximum iterations=10000, initial guess is zero
> >                 tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
> >                 left preconditioning
> >                 using NONE norm type for convergence test
> >               PC Object:              (fieldsplit_RB_split_)               1 MPI processes
> >                 type: cholesky
> >                   Cholesky: out-of-place factorization
> >                   tolerance for zero pivot 2.22045e-14
> >                   matrix ordering: natural
> >                   factor fill ratio given 0., needed 0.
> >                     Factored matrix follows:
> >                       Mat Object:                       1 MPI processes
> >                         type: seqaij
> >                         rows=324, cols=324
> >                         package used to perform factorization: mumps
> >                         total: nonzeros=3042, allocated nonzeros=3042
> >                         total number of mallocs used during MatSetValues calls =0
> >                           MUMPS run parameters:
> >                             SYM (matrix type):                   2
> >                             PAR (host participation):            1
> >                             ICNTL(1) (output for error):         6
> >                             ICNTL(2) (output of diagnostic msg): 0
> >                             ICNTL(3) (output for global info):   0
> >                             ICNTL(4) (level of printing):        0
> >                             ICNTL(5) (input mat struct):         0
> >                             ICNTL(6) (matrix prescaling):        7
> >                             ICNTL(7) (sequentia matrix ordering):7
> >                             ICNTL(8) (scalling strategy):        77
> >                             ICNTL(10) (max num of refinements):  0
> >                             ICNTL(11) (error analysis):          0
> >                             ICNTL(12) (efficiency control):                         0
> >                             ICNTL(13) (efficiency control):                         0
> >                             ICNTL(14) (percentage of estimated workspace increase): 20
> >                             ICNTL(18) (input mat struct):                           0
> >                             ICNTL(19) (Shur complement info):                       0
> >                             ICNTL(20) (rhs sparse pattern):                         0
> >                             ICNTL(21) (solution struct):                            0
> >                             ICNTL(22) (in-core/out-of-core facility):               0
> >                             ICNTL(23) (max size of memory can be allocated locally):0
> >                             ICNTL(24) (detection of null pivot rows):               0
> >                             ICNTL(25) (computation of a null space basis):          0
> >                             ICNTL(26) (Schur options for rhs or solution):          0
> >                             ICNTL(27) (experimental parameter):                     -24
> >                             ICNTL(28) (use parallel or sequential ordering):        1
> >                             ICNTL(29) (parallel ordering):                          0
> >                             ICNTL(30) (user-specified set of entries in inv(A)):    0
> >                             ICNTL(31) (factors is discarded in the solve phase):    0
> >                             ICNTL(33) (compute determinant):                        0
> >                             CNTL(1) (relative pivoting threshold):      0.01
> >                             CNTL(2) (stopping criterion of refinement): 1.49012e-08
> >                             CNTL(3) (absolute pivoting threshold):      0.
> >                             CNTL(4) (value of static pivoting):         -1.
> >                             CNTL(5) (fixation for null pivots):         0.
> >                             RINFO(1) (local estimated flops for the elimination after analysis):
> >                               [0] 29394.
> >                             RINFO(2) (local estimated flops for the assembly after factorization):
> >                               [0]  1092.
> >                             RINFO(3) (local estimated flops for the elimination after factorization):
> >                               [0]  29394.
> >                             INFO(15) (estimated size of (in MB) MUMPS internal data for running numerical factorization):
> >                             [0] 1
> >                             INFO(16) (size of (in MB) MUMPS internal data used during numerical factorization):
> >                               [0] 1
> >                             INFO(23) (num of pivots eliminated on this processor after factorization):
> >                               [0] 324
> >                             RINFOG(1) (global estimated flops for the elimination after analysis): 29394.
> >                             RINFOG(2) (global estimated flops for the assembly after factorization): 1092.
> >                             RINFOG(3) (global estimated flops for the elimination after factorization): 29394.
> >                             (RINFOG(12) RINFOG(13))*2^INFOG(34) (determinant): (0.,0.)*(2^0)
> >                             INFOG(3) (estimated real workspace for factors on all processors after analysis): 3888
> >                             INFOG(4) (estimated integer workspace for factors on all processors after analysis): 2067
> >                             INFOG(5) (estimated maximum front size in the complete tree): 12
> >                             INFOG(6) (number of nodes in the complete tree): 53
> >                             INFOG(7) (ordering option effectively use after analysis): 2
> >                             INFOG(8) (structural symmetry in percent of the permuted matrix after analysis): 100
> >                             INFOG(9) (total real/complex workspace to store the matrix factors after factorization): 3888
> >                             INFOG(10) (total integer space store the matrix factors after factorization): 2067
> >                             INFOG(11) (order of largest frontal matrix after factorization): 12
> >                             INFOG(12) (number of off-diagonal pivots): 0
> >                             INFOG(13) (number of delayed pivots after factorization): 0
> >                             INFOG(14) (number of memory compress after factorization): 0
> >                             INFOG(15) (number of steps of iterative refinement after solution): 0
> >                             INFOG(16) (estimated size (in MB) of all MUMPS internal data for factorization after analysis: value on the most memory consuming processor): 1
> >                             INFOG(17) (estimated size of all MUMPS internal data for factorization after analysis: sum over all processors): 1
> >                             INFOG(18) (size of all MUMPS internal data allocated during factorization: value on the most memory consuming processor): 1
> >                             INFOG(19) (size of all MUMPS internal data allocated during factorization: sum over all processors): 1
> >                             INFOG(20) (estimated number of entries in the factors): 3042
> >                             INFOG(21) (size in MB of memory effectively used during factorization - value on the most memory consuming processor): 1
> >                             INFOG(22) (size in MB of memory effectively used during factorization - sum over all processors): 1
> >                             INFOG(23) (after analysis: value of ICNTL(6) effectively used): 5
> >                             INFOG(24) (after analysis: value of ICNTL(12) effectively used): 1
> >                             INFOG(25) (after factorization: number of pivots modified by static pivoting): 0
> >                             INFOG(28) (after factorization: number of null pivots encountered): 0
> >                             INFOG(29) (after factorization: effective number of entries in the factors (sum over all processors)): 3042
> >                             INFOG(30, 31) (after solution: size in Mbytes of memory used during solution phase): 0, 0
> >                             INFOG(32) (after analysis: type of analysis done): 1
> >                             INFOG(33) (value used for ICNTL(8)): -2
> >                             INFOG(34) (exponent of the determinant if determinant is requested): 0
> >                 linear system matrix = precond matrix:
> >                 Mat Object:                (fieldsplit_RB_split_)                 1 MPI processes
> >                   type: seqaij
> >                   rows=324, cols=324
> >                   total: nonzeros=5760, allocated nonzeros=5760
> >                   total number of mallocs used during MatSetValues calls =0
> >                     using I-node routines: found 108 nodes, limit used is 5
> >             A01
> >               Mat Object:               1 MPI processes
> >                 type: seqaij
> >                 rows=324, cols=28476
> >                 total: nonzeros=936, allocated nonzeros=936
> >                 total number of mallocs used during MatSetValues calls =0
> >                   using I-node routines: found 67 nodes, limit used is 5
> >         Mat Object:        (fieldsplit_FE_split_)         1 MPI processes
> >           type: seqaij
> >           rows=28476, cols=28476
> >           total: nonzeros=1017054, allocated nonzeros=1017054
> >           total number of mallocs used during MatSetValues calls =0
> >             using I-node routines: found 9492 nodes, limit used is 5
> >   linear system matrix = precond matrix:
> >   Mat Object:  ()   1 MPI processes
> >     type: seqaij
> >     rows=28800, cols=28800
> >     total: nonzeros=1024686, allocated nonzeros=1024794
> >     total number of mallocs used during MatSetValues calls =0
> >       using I-node routines: found 9600 nodes, limit used is 5
> >
> >
> > ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
> >
> > /home/dknez/akselos-dev/scrbe/build/bin/fe_solver-opt_real on a arch-linux2-c-opt named david-Lenovo with 1 processor, by dknez Wed Jan 11 16:16:47 2017
> > Using Petsc Release Version 3.7.3, unknown
> >
> >                          Max       Max/Min        Avg      Total
> > Time (sec):           9.179e+01      1.00000   9.179e+01
> > Objects:              1.990e+02      1.00000   1.990e+02
> > Flops:                1.634e+11      1.00000   1.634e+11  1.634e+11
> > Flops/sec:            1.780e+09      1.00000   1.780e+09  1.780e+09
> > MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
> > MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
> > MPI Reductions:       0.000e+00      0.00000
> >
> > Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
> >                             e.g., VecAXPY() for real vectors of length N --> 2N flops
> >                             and VecAXPY() for complex vectors of length N --> 8N flops
> >
> > Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
> >                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
> >  0:      Main Stage: 9.1787e+01 100.0%  1.6336e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
> >
> > ------------------------------------------------------------------------------------------------------------------------
> > See the 'Profiling' chapter of the users' manual for details on interpreting output.
> > Phase summary info:
> >    Count: number of times phase was executed
> >    Time and Flops: Max - maximum over all processors
> >                    Ratio - ratio of maximum to minimum over all processors
> >    Mess: number of messages sent
> >    Avg. len: average message length (bytes)
> >    Reduct: number of global reductions
> >    Global: entire computation
> >    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
> >       %T - percent time in this phase         %F - percent flops in this phase
> >       %M - percent messages in this phase     %L - percent message lengths in this phase
> >       %R - percent reductions in this phase
> >    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> > ------------------------------------------------------------------------------------------------------------------------
> > Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
> >                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> > ------------------------------------------------------------------------------------------------------------------------
> >
> > --- Event Stage 0: Main Stage
> >
> > VecDot                42 1.0 2.4080e-05 1.0 8.53e+03 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   354
> > VecTDot            74012 1.0 1.2440e+00 1.0 4.22e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1  3  0  0  0   1  3  0  0  0  3388
> > VecNorm            37020 1.0 8.3580e-01 1.0 2.11e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1  1  0  0  0  2523
> > VecScale           37008 1.0 3.5800e-01 1.0 1.05e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  2944
> > VecCopy            37034 1.0 2.5754e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > VecSet             74137 1.0 3.0537e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > VecAXPY            74029 1.0 1.7233e+00 1.0 4.22e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2  3  0  0  0   2  3  0  0  0  2446
> > VecAYPX            37001 1.0 1.2214e+00 1.0 2.11e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1  1  0  0  0  1725
> > VecAssemblyBegin      68 1.0 2.0432e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > VecAssemblyEnd        68 1.0 2.5988e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > VecScatterBegin       48 1.0 4.6921e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatMult            37017 1.0 4.1269e+01 1.0 7.65e+10 1.0 0.0e+00 0.0e+00 0.0e+00 45 47  0  0  0  45 47  0  0  0  1853
> > MatMultAdd         37015 1.0 3.3638e+01 1.0 7.53e+10 1.0 0.0e+00 0.0e+00 0.0e+00 37 46  0  0  0  37 46  0  0  0  2238
> > MatSolve           74021 1.0 4.6602e+01 1.0 7.42e+10 1.0 0.0e+00 0.0e+00 0.0e+00 51 45  0  0  0  51 45  0  0  0  1593
> > MatLUFactorNum         1 1.0 1.7209e-02 1.0 2.44e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1420
> > MatCholFctrSym         1 1.0 8.8310e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatCholFctrNum         1 1.0 3.6907e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatILUFactorSym        1 1.0 3.7372e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatAssemblyBegin      29 1.0 2.1458e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatAssemblyEnd        29 1.0 9.9473e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatGetRow          58026 1.0 2.8155e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatGetRowIJ            2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatGetSubMatrice       6 1.0 1.5399e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatGetOrdering         2 1.0 3.0112e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatZeroEntries         6 1.0 2.9490e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatView                7 1.0 3.4356e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > KSPSetUp               4 1.0 9.4891e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > KSPSolve               1 1.0 8.8793e+01 1.0 1.63e+11 1.0 0.0e+00 0.0e+00 0.0e+00 97 100  0  0  0  97 100  0  0  0  1840
> > PCSetUp                4 1.0 3.8375e-02 1.0 2.44e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   637
> > PCSetUpOnBlocks        5 1.0 2.1250e-02 1.0 2.44e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1150
> > PCApply                5 1.0 8.8789e+01 1.0 1.63e+11 1.0 0.0e+00 0.0e+00 0.0e+00 97 100  0  0  0  97 100  0  0  0  1840
> > KSPSolve_FS_0          5 1.0 7.5364e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > KSPSolve_FS_Schu       5 1.0 8.8785e+01 1.0 1.63e+11 1.0 0.0e+00 0.0e+00 0.0e+00 97 100  0  0  0  97 100  0  0  0  1840
> > KSPSolve_FS_Low        5 1.0 2.1019e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > ------------------------------------------------------------------------------------------------------------------------
> >
> > Memory usage is given in bytes:
> >
> > Object Type          Creations   Destructions     Memory  Descendants' Mem.
> > Reports information only for process 0.
> >
> > --- Event Stage 0: Main Stage
> >
> >               Vector    91             91      9693912     0.
> >       Vector Scatter    24             24        15936     0.
> >            Index Set    51             51       537888     0.
> >    IS L to G Mapping     3              3       240408     0.
> >               Matrix    13             13     64097868     0.
> >        Krylov Solver     6              6         7888     0.
> >       Preconditioner     6              6         6288     0.
> >               Viewer     1              0            0     0.
> >     Distributed Mesh     1              1         4624     0.
> > Star Forest Bipartite Graph     2              2         1616     0.
> >      Discrete System     1              1          872     0.
> > ========================================================================================================================
> > Average time to get PetscTime(): 0.
> > #PETSc Option Table entries:
> > -ksp_monitor
> > -ksp_view
> > -log_view
> > #End of PETSc Option Table entries
> > Compiled without FORTRAN kernels
> > Compiled with full precision matrices (default)
> > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> > Configure options: --with-shared-libraries=1 --with-debugging=0 --download-suitesparse --download-blacs --download-ptscotch=yes --with-blas-lapack-dir=/opt/intel/system_studio_2015.2.050/mkl --CXXFLAGS=-Wl,--no-as-needed --download-scalapack --download-mumps --download-metis --prefix=/home/dknez/software/libmesh_install/opt_real/petsc --download-hypre --download-ml
> > -----------------------------------------
> > Libraries compiled on Wed Sep 21 17:38:52 2016 on david-Lenovo
> > Machine characteristics: Linux-4.4.0-38-generic-x86_64-with-Ubuntu-16.04-xenial
> > Using PETSc directory: /home/dknez/software/petsc-src
> > Using PETSc arch: arch-linux2-c-opt
> > -----------------------------------------
> >
> > Using C compiler: mpicc  -fPIC  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fvisibility=hidden -g -O  ${COPTFLAGS} ${CFLAGS}
> > Using Fortran compiler: mpif90  -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O   ${FOPTFLAGS} ${FFLAGS}
> > -----------------------------------------
> >
> > Using include paths: -I/home/dknez/software/petsc-src/arch-linux2-c-opt/include -I/home/dknez/software/petsc-src/include -I/home/dknez/software/petsc-src/include -I/home/dknez/software/petsc-src/arch-linux2-c-opt/include -I/home/dknez/software/libmesh_install/opt_real/petsc/include -I/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent -I/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent/include -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi
> > -----------------------------------------
> >
> > Using C linker: mpicc
> > Using Fortran linker: mpif90
> > Using libraries: -Wl,-rpath,/home/dknez/software/petsc-src/arch-linux2-c-opt/lib -L/home/dknez/software/petsc-src/arch-linux2-c-opt/lib -lpetsc -Wl,-rpath,/home/dknez/software/libmesh_install/opt_real/petsc/lib -L/home/dknez/software/libmesh_install/opt_real/petsc/lib -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord -lmetis -lHYPRE -Wl,-rpath,/usr/lib/openmpi/lib -L/usr/lib/openmpi/lib -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/5 -L/usr/lib/gcc/x86_64-linux-gnu/5 -Wl,-rpath,/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu -Wl,-rpath,/lib/x86_64-linux-gnu -L/lib/x86_64-linux-gnu -lmpi_cxx -lstdc++ -lscalapack -lml -lmpi_cxx -lstdc++ -lumfpack -lklu -lcholmod -lbtf -lccolamd -lcolamd -lcamd -lamd -lsuitesparseconfig -Wl,-rpath,/opt/intel/system_studio_2015.2.050/mkl/lib/intel64 -L/opt/intel/system_studio_2015.2.050/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -lhwloc -lptesmumps -lptscotch -lptscotcherr -lscotch -lscotcherr -lX11 -lm -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lgfortran -lm -lgfortran -lm -lquadmath -lm -lmpi_cxx -lstdc++ -lrt -lm -lpthread -lz -Wl,-rpath,/usr/lib/openmpi/lib -L/usr/lib/openmpi/lib -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/5 -L/usr/lib/gcc/x86_64-linux-gnu/5 -Wl,-rpath,/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu -Wl,-rpath,/lib/x86_64-linux-gnu -L/lib/x86_64-linux-gnu -Wl,-rpath,/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu -ldl -Wl,-rpath,/usr/lib/openmpi/lib -lmpi -lgcc_s -lpthread -ldl
> > -----------------------------------------
> >
> 
> 
> <logfile_1.txt><logfile_2.txt>