understanding the output from -info
Barry Smith
bsmith at mcs.anl.gov
Sat Feb 10 21:26:07 CST 2007
My recommendation is just to try to optimize the sequential runs
by using the most appropriate solver algorithms and the best sequential
processor with the fastest memory and the slickest code.
Parallel computing is for solving big problems, not for solving little problems
fast. (Anything less than 100k unknowns, or even more, is in my opinion small.)
Barry
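
A rough sketch of the kind of experimentation this suggests, using the ex2f driver
discussed below (the -ksp_type/-pc_type choices here are only illustrative, and the
-m/-n options are assumed to be the ones ex2f reads for the grid size):

    # serial run of the 800x800 case, profiling one Krylov method / preconditioner
    ./ex2f -m 800 -n 800 -ksp_type gmres -pc_type ilu -log_summary

    # the same case with a different combination, for comparison
    ./ex2f -m 800 -n 800 -ksp_type bcgs -pc_type sor -log_summary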
On Sun, 11 Feb 2007, Ben Tay wrote:
> Hi,
>
> In other words, for my CFD code, it is not possible to parallelize it
> effectively because the problem is too small?
>
> Is this true for all parallel solvers, or just PETSc? I was hoping to reduce
> the runtime, since mine is an unsteady problem which requires many time steps
> to reach a periodic state, and it takes many hours to get there.
>
> Lastly, if I run on 2 processors, is an improvement likely?
>
> Thank you.
>
>
> On 2/11/07, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >
> >
> > On Sat, 10 Feb 2007, Ben Tay wrote:
> >
> > > Hi,
> > >
> > > I've repeated the test with n,m = 800. Now serial takes around 11 mins while
> > > parallel with 4 processors took 6 mins. Does it mean that the problem must be
> > > pretty large before it is advantageous to use parallel? Moreover, 800x800
> > > means there's 640000 unknowns. My problem is a 2D CFD code which typically
> > > has 200x80=16000 unknowns. Does it mean that I won't be able to benefit from
> >                                                                  ^^^^^^^^^^^
> >    You'll never get much performance past 2 processors; it's not even worth
> > all the work of having a parallel code in this case. I'd just optimize the
> > heck out of the serial code.
> >
> > Barry
> >
> >
> >
> > > running in parallel?
> > >
> > > Btw, this is the parallel run's log_summary:
> > >
> > >
> > > Event                Count      Time (sec)     Flops/sec                         --- Global ---  --- Stage ---   Total
> > >                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> > > ------------------------------------------------------------------------------------------------------------------------
> > >
> > > --- Event Stage 0: Main Stage
> > >
> > > MatMult          1265 1.0 7.0615e+01 1.2 3.22e+07 1.2 7.6e+03 6.4e+03 0.0e+00 16 11100100 0 16 11100100 0 103
> > > MatSolve         1265 1.0 4.7820e+01 1.2 4.60e+07 1.2 0.0e+00 0.0e+00 0.0e+00 11 11 0 0 0 11 11 0 0 0 152
> > > MatLUFactorNum      1 1.0 2.5703e-01 2.3 1.27e+07 2.3 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 22
> > > MatILUFactorSym     1 1.0 1.8933e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > MatAssemblyBegin    1 1.0 4.2153e-01 3.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > MatAssemblyEnd      1 1.0 3.6475e-01 1.5 0.00e+00 0.0 6.0e+00 3.2e+03 1.3e+01 0 0 0 0 0 0 0 0 0 0 0
> > > MatGetOrdering      1 1.0 1.2088e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > VecMDot          1224 1.0 1.5314e+02 1.2 4.63e+07 1.2 0.0e+00 0.0e+00 1.2e+03 36 36 0 0 31 36 36 0 0 31 158
> > > VecNorm          1266 1.0 1.0215e+02 1.1 4.31e+06 1.1 0.0e+00 0.0e+00 1.3e+03 24 2 0 0 33 24 2 0 0 33 16
> > > VecScale         1265 1.0 3.7467e+00 1.5 8.34e+07 1.5 0.0e+00 0.0e+00 0.0e+00 1 1 0 0 0 1 1 0 0 0 216
> > > VecCopy            41 1.0 2.5530e-01 2.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > VecSet           1308 1.0 3.2717e+00 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> > > VecAXPY            82 1.0 5.3338e-01 2.8 1.40e+08 2.8 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 197
> > > VecMAXPY         1265 1.0 4.6234e+01 1.2 1.74e+08 1.2 0.0e+00 0.0e+00 0.0e+00 10 38 0 0 0 10 38 0 0 0 557
> > > VecScatterBegin  1265 1.0 1.5684e-01 1.6 0.00e+00 0.0 7.6e+03 6.4e+03 0.0e+00 0 0100100 0 0 0100100 0 0
> > > VecScatterEnd    1265 1.0 4.3167e+01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 9 0 0 0 0 9 0 0 0 0 0
> > > VecNormalize     1265 1.0 1.0459e+02 1.1 6.21e+06 1.1 0.0e+00 0.0e+00 1.3e+03 25 4 0 0 32 25 4 0 0 32 23
> > > KSPGMRESOrthog   1224 1.0 1.9035e+02 1.1 7.00e+07 1.1 0.0e+00 0.0e+00 1.2e+03 45 72 0 0 31 45 72 0 0 31 254
> > > KSPSetup            2 1.0 5.1674e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01 0 0 0 0 0 0 0 0 0 0 0
> > > KSPSolve            1 1.0 4.0269e+02 1.0 4.16e+07 1.0 7.6e+03 6.4e+03 3.9e+03 99100100100 99 99100100100 99 166
> > > PCSetUp             2 1.0 4.5924e-01 2.6 8.23e+06 2.6 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 12
> > > PCSetUpOnBlocks     1 1.0 4.5847e-01 2.6 8.26e+06 2.6 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 13
> > > PCApply          1265 1.0 5.0990e+01 1.2 4.33e+07 1.2 0.0e+00 0.0e+00 1.3e+03 12 11 0 0 32 12 11 0 0 32 143
> > >
> > > ------------------------------------------------------------------------------------------------------------------------
> > >
> > > Memory usage is given in bytes:
> > >
> > > Object Type          Creations   Destructions   Memory  Descendants' Mem.
> > >
> > > --- Event Stage 0: Main Stage
> > >
> > > Matrix 4 4 643208 0
> > > Index Set 5 5 1924296 0
> > > Vec 41 41 47379984 0
> > > Vec Scatter 1 1 0 0
> > > Krylov Solver 2 2 16880 0
> > > Preconditioner 2 2 196 0
> > >
> > ========================================================================================================================
> > > Average time to get PetscTime(): 1.00136e-06
> > > Average time for MPI_Barrier(): 4.00066e-05
> > > Average time for zero size MPI_Send(): 1.70469e-05
> > > OptionTable: -log_summary
> > > Compiled without FORTRAN kernels
> > > Compiled with full precision matrices (default)
> > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4
> > > sizeof(PetscScalar) 8
> > > Configure run at: Thu Jan 18 12:23:31 2007
> > > Configure options: --with-vendor-compilers=intel --with-x=0 --with-shared
> > > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32
> > > --with-mpi-dir=/opt/mpich/myrinet/intel/
> > > -----------------------------------------
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On 2/10/07, Ben Tay <zonexo at gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I tried to use ex2f.F as a test code. I've changed the numbers n,m from 3
> > > > to 500 each. I ran the code using 1 processor and then with 4 processors. I
> > > > then repeated the same with the following modification:
> > > >
> > > > do i=1,10
> > > >
> > > >    call KSPSolve(ksp,b,x,ierr)
> > > >
> > > > end do
> > > >
> > > > I've added a do loop to make the solve repeat 10 times.
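
For reference, a minimal sketch of how such a 1-versus-4-process comparison is
usually launched; the mpirun path is assumed from the --with-mpi-dir reported
further down, and a batch system may be required on the actual cluster:

    # one process
    /opt/mpich/myrinet/intel/bin/mpirun -np 1 ./ex2f -log_summary

    # four processes on the same problem
    /opt/mpich/myrinet/intel/bin/mpirun -np 4 ./ex2f -log_summary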
> > > >
> > > > In both cases, the serial code is faster, e.g. one takes 2.4 min while the
> > > > other takes 3.3 min.
> > > >
> > > > Here's the log_summary:
> > > >
> > > >
> > > > ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
> > > >
> > > > ./ex2f on a linux-mpi named atlas12.nus.edu.sg with 4 processors, by
> > > > g0306332 Sat Feb 10 16:21:36 2007
> > > > Using Petsc Release Version 2.3.2, Patch 8, Tue Jan 2 14:33:59 PST 2007
> > > > HG revision: ebeddcedcc065e32fc252af32cf1d01ed4fc7a80
> > > >
> > > > Max Max/Min Avg Total
> > > > Time (sec): 2.213e+02 1.00051 2.212e+02
> > > > Objects: 5.500e+01 1.00000 5.500e+01
> > > > Flops: 4.718e+09 1.00019 4.718e+09 1.887e+10
> > > > Flops/sec: 2.134e+07 1.00070 2.133e+07 8.531e+07
> > > >
> > > > Memory: 3.186e+07 1.00069 1.274e+08
> > > > MPI Messages: 1.832e+03 2.00000 1.374e+03 5.496e+03
> > > > MPI Message Lengths: 7.324e+06 2.00000 3.998e+03 2.197e+07
> > > > MPI Reductions: 7.112e+02 1.00000
> > > >
> > > > Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
> > > >                           e.g., VecAXPY() for real vectors of length N --> 2N flops
> > > >                           and VecAXPY() for complex vectors of length N --> 8N flops
> > > >
> > > > Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
> > > >                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
> > > >  0:      Main Stage: 2.2120e+02 100.0%  1.8871e+10 100.0%  5.496e+03 100.0%  3.998e+03      100.0%  2.845e+03 100.0%
> > > >
> > > >
> > > >
> > > >
> >
> > ------------------------------------------------------------------------------------------------------------------------
> > > > See the 'Profiling' chapter of the users' manual for details on interpreting output.
> > > > Phase summary info:
> > > >    Count: number of times phase was executed
> > > >    Time and Flops/sec: Max - maximum over all processors
> > > >                        Ratio - ratio of maximum to minimum over all processors
> > > >    Mess: number of messages sent
> > > >    Avg. len: average message length
> > > >    Reduct: number of global reductions
> > > >    Global: entire computation
> > > >    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
> > > >       %T - percent time in this phase         %F - percent flops in this phase
> > > >       %M - percent messages in this phase     %L - percent message lengths in this phase
> > > >       %R - percent reductions in this phase
> > > >    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
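
To make the Total Mflop/s formula above concrete with numbers from this run: the
KSPSolve row further down reports a maximum time of 2.1892e+02 s, and the global
flop total above is 1.8871e+10, so 1.8871e+10 / 2.1892e+02 is about 8.6e+07 flops/s,
i.e. the 86 Mflop/s shown for KSPSolve, or roughly 22 Mflop/s per process on 4
processes.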
> > > >
> > > >
> > > >
> >
> > ------------------------------------------------------------------------------------------------------------------------
> > > >
> > > > ##########################################################
> > > > # #
> > > > # WARNING!!! #
> > > > # #
> > > > # This code was compiled with a debugging option, #
> > > > # To get timing results run config/configure.py #
> > > > # using --with-debugging=no, the performance will #
> > > > # be generally two or three times faster. #
> > > > # #
> > > > ##########################################################
> > > >
> > > >
> > > >
> > > >
> > > > ##########################################################
> > > > # #
> > > > # WARNING!!! #
> > > > # #
> > > > # This code was run without the PreLoadBegin() #
> > > > # macros. To get timing results we always recommend #
> > > > # preloading. otherwise timing numbers may be #
> > > > # meaningless. #
> > > > ##########################################################
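
An aside on the first warning above: a sketch of the reconfigure it asks for,
simply reusing the configure options reported at the end of this log and adding
the debugging switch (the directory paths are specific to this installation):

    ./config/configure.py --with-vendor-compilers=intel --with-x=0 --with-shared \
        --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 \
        --with-mpi-dir=/opt/mpich/myrinet/intel/ --with-debugging=no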
> > > >
> > > >
> > > > Event                Count      Time (sec)     Flops/sec                         --- Global ---  --- Stage ---   Total
> > > >                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> > > > ------------------------------------------------------------------------------------------------------------------------
> > > >
> > > > --- Event Stage 0: Main Stage
> > > >
> > > > MatMult           915 1.0 4.4291e+01 1.3 1.50e+07 1.3 5.5e+03 4.0e+03 0.0e+00 18 11100100 0 18 11100100 0 46
> > > > MatSolve          915 1.0 1.5684e+01 1.1 3.56e+07 1.1 0.0e+00 0.0e+00 0.0e+00 7 11 0 0 0 7 11 0 0 0 131
> > > > MatLUFactorNum      1 1.0 5.1654e-02 1.4 1.48e+07 1.4 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 43
> > > > MatILUFactorSym     1 1.0 1.6838e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > MatAssemblyBegin    1 1.0 3.2428e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > MatAssemblyEnd      1 1.0 1.3120e+00 1.1 0.00e+00 0.0 6.0e+00 2.0e+03 1.3e+01 1 0 0 0 0 1 0 0 0 0 0
> > > > MatGetOrdering      1 1.0 4.1590e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > VecMDot           885 1.0 8.5091e+01 1.1 2.27e+07 1.1 0.0e+00 0.0e+00 8.8e+02 36 36 0 0 31 36 36 0 0 31 80
> > > > VecNorm           916 1.0 6.6747e+01 1.1 1.81e+06 1.1 0.0e+00 0.0e+00 9.2e+02 29 2 0 0 32 29 2 0 0 32 7
> > > > VecScale          915 1.0 1.1430e+00 2.2 1.12e+08 2.2 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 200
> > > > VecCopy            30 1.0 1.2816e-01 5.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > VecSet            947 1.0 7.8979e-01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> > > > VecAXPY            60 1.0 5.5332e-02 1.1 1.51e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 542
> > > > VecMAXPY          915 1.0 1.5004e+01 1.3 1.54e+08 1.3 0.0e+00 0.0e+00 0.0e+00 6 38 0 0 0 6 38 0 0 0 483
> > > > VecScatterBegin   915 1.0 9.0358e-02 1.4 0.00e+00 0.0 5.5e+03 4.0e+03 0.0e+00 0 0100100 0 0 0100100 0 0
> > > > VecScatterEnd     915 1.0 3.5136e+01 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 14 0 0 0 0 14 0 0 0 0 0
> > > > VecNormalize      915 1.0 6.7272e+01 1.0 2.68e+06 1.0 0.0e+00 0.0e+00 9.2e+02 30 4 0 0 32 30 4 0 0 32 10
> > > > KSPGMRESOrthog    885 1.0 9.8478e+01 1.1 3.87e+07 1.1 0.0e+00 0.0e+00 8.8e+02 42 72 0 0 31 42 72 0 0 31 138
> > > > KSPSetup            2 1.0 6.1918e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01 0 0 0 0 0 0 0 0 0 0 0
> > > > KSPSolve            1 1.0 2.1892e+02 1.0 2.15e+07 1.0 5.5e+03 4.0e+03 2.8e+03 99100100100 99 99100100100 99 86
> > > > PCSetUp             2 1.0 7.3292e-02 1.3 9.84e+06 1.3 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 30
> > > > PCSetUpOnBlocks     1 1.0 7.2706e-02 1.3 9.97e+06 1.3 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 31
> > > > PCApply           915 1.0 1.6508e+01 1.1 3.27e+07 1.1 0.0e+00 0.0e+00 9.2e+02 7 11 0 0 32 7 11 0 0 32 124
> > > >
> > > > ------------------------------------------------------------------------------------------------------------------------
> > > >
> > > >
> > > > Memory usage is given in bytes:
> > > >
> > > > Object Type          Creations   Destructions   Memory  Descendants' Mem.
> > > >
> > > > --- Event Stage 0: Main Stage
> > > >
> > > > Matrix 4 4 252008 0
> > > > Index Set 5 5 753096 0
> > > > Vec 41 41 18519984 0
> > > > Vec Scatter 1 1 0 0
> > > > Krylov Solver 2 2 16880 0
> > > > Preconditioner 2 2 196 0
> > > >
> > ========================================================================================================================
> > > >
> > > > Average time to get PetscTime(): 1.09673e-06
> > > > Average time for MPI_Barrier(): 4.18186e-05
> > > > Average time for zero size MPI_Send(): 2.62856e-05
> > > > OptionTable: -log_summary
> > > > Compiled without FORTRAN kernels
> > > > Compiled with full precision matrices (default)
> > > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4
> > > > sizeof(PetscScalar) 8
> > > > Configure run at: Thu Jan 18 12:23:31 2007
> > > > Configure options: --with-vendor-compilers=intel --with-x=0 --with-shared
> > > > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32
> > > > --with-mpi-dir=/opt/mpich/myrinet/intel/
> > > > -----------------------------------------
> > > > Libraries compiled on Thu Jan 18 12:24:41 SGT 2007 on atlas1.nus.edu.sg
> > > > Machine characteristics: Linux atlas1.nus.edu.sg 2.4.21-20.ELsmp #1 SMP
> > > > Wed Sep 8 17:29:34 GMT 2004 i686 i686 i386 GNU/Linux
> > > > Using PETSc directory: /nas/lsftmp/g0306332/petsc-2.3.2-p8
> > > > Using PETSc arch: linux-mpif90
> > > > -----------------------------------------
> > > > Using C compiler: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g
> > > > Using Fortran compiler: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g -w90 -w
> > > > -----------------------------------------
> > > > Using include paths: -I/nas/lsftmp/g0306332/petsc-2.3.2-p8
> > > > -I/nas/lsftmp/g0306332/petsc-2.3.2-p8/bmake/linux-mpif90
> > > > -I/nas/lsftmp/g0306332/petsc-2.3.2-p8/include
> > > > -I/opt/mpich/myrinet/intel/include
> > > > ------------------------------------------
> > > > Using C linker: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g
> > > > Using Fortran linker: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g -w90 -w
> > > > Using libraries:
> > > > -Wl,-rpath,/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90
> > > > -L/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 -lpetscts
> > > > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc
> > > > -Wl,-rpath,/lsftmp/g0306332/inter/mkl/lib/32
> > > > -L/lsftmp/g0306332/inter/mkl/lib/32 -lmkl_lapack -lmkl_ia32 -lguide
> > > > -lPEPCF90 -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> > > > -Wl,-rpath,/opt/mpich/myrinet/intel/lib -L/opt/mpich/myrinet/intel/lib
> > > > -Wl,-rpath,-rpath -Wl,-rpath,-ldl -L-ldl -lmpich -Wl,-rpath,-L -lgm
> > > > -lpthread -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> > > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> > -L/opt/intel/compiler70/ia32/lib
> > > > -Wl,-rpath,/usr/lib -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts
> > -lcxa
> > > > -lunwind -ldl -lmpichf90 -Wl,-rpath,/opt/gm/lib -L/opt/gm/lib
> > -lPEPCF90
> > > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> > -L/opt/intel/compiler70/ia32/lib
> > > > -Wl,-rpath,/usr/lib -L/usr/lib -lintrins -lIEPCF90 -lF90
> > -lm -Wl,-rpath,\
> > > > -Wl,-rpath,\ -L\ -ldl -lmpich -Wl,-rpath,\ -L\ -lgm -lpthread
> > > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> > -L/opt/intel/compiler70/ia32/lib
> > > > -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa -lunwind -ldl
> > > > ------------------------------------------
> > > >
> > > > So is there something wrong with the server's MPI implementation?
> > > >
> > > > Thank you.
> > > >
> > > >
> > > >
> > > > On 2/10/07, Satish Balay <balay at mcs.anl.gov> wrote:
> > > > >
> > > > > Looks like MatMult = 24 sec. Out of this, the scatter time is 22 sec.
> > > > > Either something is wrong with your run - or MPI is really broken.
> > > > > Satish
> > > > >
> > > > > > > > MatMult          3927 1.0 2.4071e+01 1.3 6.14e+06 1.4 2.4e+04 1.3e+03
> > > > > > > > VecScatterBegin  3927 1.0 2.8672e-01 3.9 0.00e+00 0.0 2.4e+04 1.3e+03
> > > > > > > > VecScatterEnd    3927 1.0 2.2135e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
> > > > >
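
One quick sanity check related to Satish's point above, sketched under the
assumption that mpirun lives under the --with-mpi-dir reported by configure:
confirm that the job really uses the Myrinet MPICH build PETSc was compiled
against, and look at the raw MPI timings that -log_summary prints.

    # is the mpirun being used the Myrinet MPICH PETSc was configured with?
    which mpirun
    ls /opt/mpich/myrinet/intel/bin/mpirun

    # rerun and inspect the raw MPI timings reported near the end of -log_summary
    /opt/mpich/myrinet/intel/bin/mpirun -np 4 ./ex2f -log_summary | grep "Average time"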
> > > > >
> > > >
> > >
> >
> >
>