Analysis of performance of parallel code as processors increase
Barry Smith
bsmith at mcs.anl.gov
Fri Jun 6 21:23:15 CDT 2008
You are not using hypre; you are using block Jacobi with ILU on the
blocks. The number of iterations grows from around 4000 to around 5000
in going from 4 to 8 processes, which is why you do not see a great
speedup.
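
(For reference, a minimal sketch of how hypre's BoomerAMG could be
selected explicitly and the iteration count checked from the code.
This is illustrative user code following the PETSc 2.3.3 C interface,
not the poster's actual program; CHKERRQ() error checking is omitted.)

    #include "petscksp.h"

    /* Sketch: explicitly request hypre BoomerAMG instead of the default
       block Jacobi/ILU, and print the iteration count per solve. */
    void solve_with_boomeramg(KSP ksp, Mat A, Vec b, Vec x)
    {
      PC       pc;
      PetscInt its;

      KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);
      KSPGetPC(ksp, &pc);
      PCSetType(pc, PCHYPRE);            /* the "hypre" preconditioner */
      PCHYPRESetType(pc, "boomeramg");
      KSPSetFromOptions(ksp);            /* run-time options may still override */
      KSPSolve(ksp, b, x);
      KSPGetIterationNumber(ksp, &its);
      PetscPrintf(PETSC_COMM_WORLD, "KSP iterations: %d\n", its);
    }

The same switch can be made purely from the command line with
-pc_type hypre -pc_hypre_type boomeramg, and -ksp_monitor (or
-ksp_converged_reason) shows how the iteration counts behave per solve.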
Barry
On Jun 6, 2008, at 8:07 PM, Ben Tay wrote:
> Hi,
>
> I have written a parallel code using PETSc and Hypre. I found that
> going from 1 to 4 processors gives an almost 4 times speedup. However,
> going from 4 to 8 processors only improves performance by a factor of
> 1.2-1.5 instead of 2.
>
> Is the slowdown due to the matrix not being large enough? Currently I
> am using 600x2160 for the benchmark. Even when I increase the size to
> 900x3240 or 1200x2160, the performance gain is still small. Is it
> possible to use -log_summary to find out what is wrong? I have
> attached the log files for the 4- and 8-processor runs; I found that
> some events, such as VecScatterEnd, VecNorm and MatAssemblyBegin, have
> much higher ratios. Does that indicate something? Another strange
> thing is that MatAssemblyBegin for the 4-processor run has a much
> higher ratio than for the 8-processor run. I thought there should be
> less communication in the 4-processor case, so the ratio should be
> lower. Does it mean there is some communication problem at that point?
>
> Thank you very much.
>
> Regards
>
>
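(A side note on narrowing this down with -log_summary: wrapping the
assembly phase and the solve phase in separate user-defined logging
stages makes each show up as its own table in the output. The sketch
below is illustrative user code, not the poster's; the stage names are
made up, and the exact PetscLogStageRegister() calling sequence should
be checked against the installed PETSc release, since it has changed
between versions. The PreLoadBegin() warning printed in the logs
essentially means that one-time setup costs are mixed into the first
timings; repeating the timed phase gives cleaner numbers.)

    #include "petscksp.h"

    /* Sketch: report assembly and the linear solve as separate
       -log_summary stages.  Check PetscLogStageRegister()'s argument
       order against the local man pages. */
    void assemble_and_solve(Mat A, Vec b, Vec x, KSP ksp)
    {
      PetscLogStage assembly, solve;

      PetscLogStageRegister(&assembly, "Assembly");
      PetscLogStageRegister(&solve, "KSP Solve");

      PetscLogStagePush(assembly);
      /* ... MatSetValues()/VecSetValues() calls, then
         MatAssemblyBegin/End() and VecAssemblyBegin/End() ... */
      PetscLogStagePop();

      PetscLogStagePush(solve);
      KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);
      KSPSolve(ksp, b, x);
      PetscLogStagePop();
    }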
> ************************************************************************************************************************
> ***          WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document              ***
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> ./a.out on a atlas3-mp named atlas3-c43 with 4 processors, by g0306332 Fri Jun 6 17:29:26 2008
> Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b
>
>                          Max       Max/Min        Avg      Total
> Time (sec):           1.750e+03      1.00043   1.750e+03
> Objects:              4.200e+01      1.00000   4.200e+01
> Flops:                6.961e+10      1.00074   6.959e+10  2.784e+11
> Flops/sec:            3.980e+07      1.00117   3.978e+07  1.591e+08
> MPI Messages:         8.168e+03      2.00000   6.126e+03  2.450e+04
> MPI Message Lengths:  5.525e+07      2.00000   6.764e+03  1.658e+08
> MPI Reductions:       3.203e+03      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>                           and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 1.7495e+03 100.0%  2.7837e+11 100.0%  2.450e+04 100.0%  6.764e+03      100.0%  1.281e+04 100.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops/sec: Max - maximum over all processors
>                        Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flops in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was run without the PreLoadBegin() #
> # macros. To get timing results we always recommend #
> # preloading. otherwise timing numbers may be #
> # meaningless. #
> ##########################################################
>
>
> Event                Count      Time (sec)       Flops/sec                           --- Global ---   --- Stage ---    Total
>                    Max Ratio  Max       Ratio   Max      Ratio  Mess   Avg len  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult           4082 1.0 8.2037e+01 1.5  4.67e+08 1.5  2.4e+04 6.8e+03 0.0e+00  4 37 100 100  0   4 37 100 100  0  1240
> MatSolve          1976 1.0 1.3250e+02 1.5  2.52e+08 1.5  0.0e+00 0.0e+00 0.0e+00  6 31   0   0  0   6 31   0   0  0   655
> MatLUFactorNum     300 1.0 3.8260e+01 1.2  2.07e+08 1.2  0.0e+00 0.0e+00 0.0e+00  2  9   0   0  0   2  9   0   0  0   668
> MatILUFactorSym      1 1.0 2.2550e-01 2.7  0.00e+00 0.0  0.0e+00 0.0e+00 1.0e+00  0  0   0   0  0   0  0   0   0  0     0
> MatConvert           1 1.0 2.9182e-01 1.0  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0   0   0  0   0  0   0   0  0     0
> MatAssemblyBegin   301 1.0 1.0776e+02 1228.9  0.00e+00 0.0  0.0e+00 0.0e+00 6.0e+02  4  0   0   0  5   4  0   0   0  5     0
> MatAssemblyEnd     301 1.0 9.6146e+00 1.1  0.00e+00 0.0  1.2e+01 3.6e+03 3.1e+02  1  0   0   0  2   1  0   0   0  2     0
> MatGetRow       324000 1.0 1.2161e-01 1.4  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0   0   0  0   0  0   0   0  0     0
> MatGetRowIJ          3 1.0 5.0068e-06 1.3  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0   0   0  0   0  0   0   0  0     0
> MatGetOrdering       1 1.0 2.1279e-02 2.3  0.00e+00 0.0  0.0e+00 0.0e+00 2.0e+00  0  0   0   0  0   0  0   0   0  0     0
> KSPSetup           601 1.0 2.5108e-02 1.2  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0   0   0  0   0  0   0   0  0     0
> KSPSolve           600 1.0 1.2353e+03 1.0  5.64e+07 1.0  2.4e+04 6.8e+03 8.3e+03 71 100 100 100 65  71 100 100 100 65   225
> PCSetUp            601 1.0 4.0116e+01 1.2  1.96e+08 1.2  0.0e+00 0.0e+00 5.0e+00  2  9   0   0  0   2  9   0   0  0   637
> PCSetUpOnBlocks    300 1.0 3.8513e+01 1.2  2.06e+08 1.2  0.0e+00 0.0e+00 3.0e+00  2  9   0   0  0   2  9   0   0  0   664
> PCApply           4682 1.0 1.0566e+03 1.0  2.12e+07 1.0  0.0e+00 0.0e+00 0.0e+00 59 31   0   0  0  59 31   0   0  0    82
> VecDot            4812 1.0 8.2762e+00 1.1  4.00e+08 1.1  0.0e+00 0.0e+00 4.8e+03  0  4   0   0 38   0  4   0   0 38  1507
> VecNorm           3479 1.0 9.2739e+01 8.3  3.15e+08 8.3  0.0e+00 0.0e+00 3.5e+03  4  5   0   0 27   4  5   0   0 27   152
> VecCopy            900 1.0 2.0819e+00 1.4  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0   0   0  0   0  0   0   0  0     0
> VecSet            5882 1.0 9.4626e+00 1.5  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0   0   0  0   0  0   0   0  0     0
> VecAXPY           5585 1.0 1.5397e+01 1.5  4.67e+08 1.5  0.0e+00 0.0e+00 0.0e+00  1  7   0   0  0   1  7   0   0  0  1273
> VecAYPX           2879 1.0 1.0303e+01 1.6  4.45e+08 1.6  0.0e+00 0.0e+00 0.0e+00  0  4   0   0  0   0  4   0   0  0  1146
> VecWAXPY          2406 1.0 7.7902e+00 1.6  3.14e+08 1.6  0.0e+00 0.0e+00 0.0e+00  0  2   0   0  0   0  2   0   0  0   801
> VecAssemblyBegin  1200 1.0 8.4259e+00 3.8  0.00e+00 0.0  0.0e+00 0.0e+00 3.6e+03  0  0   0   0 28   0  0   0   0 28     0
> VecAssemblyEnd    1200 1.0 2.4173e-03 1.3  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0   0   0  0   0  0   0   0  0     0
> VecScatterBegin   4082 1.0 1.2512e-01 1.5  0.00e+00 0.0  2.4e+04 6.8e+03 0.0e+00  0  0 100 100  0   0  0 100 100  0     0
> VecScatterEnd     4082 1.0 2.0954e+01 53.3  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0   0   0  0   0  0   0   0  0     0
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions   Memory   Descendants' Mem.
>
> --- Event Stage 0: Main Stage
>
> Matrix 7 7 321241092 0
> Krylov Solver 3 3 8 0
> Preconditioner 3 3 528 0
> Index Set 7 7 7785600 0
> Vec 20 20 46685344 0
> Vec Scatter 2 2 0 0
> ========================================================================================================================
> Average time to get PetscTime(): 1.90735e-07
> Average time for MPI_Barrier(): 1.45912e-05
> Average time for zero size MPI_Send(): 7.27177e-06
> OptionTable: -log_summary test4_600
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Tue Jan 8 22:22:08 2008
> Configure options: --with-memcmp-ok --sizeof_char=1 --
> sizeof_void_p=8 --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --
> sizeof_long_long=8 --sizeof_float=4 --sizeof_double=8 --
> bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with-
> vendor-compilers=intel --with-x=0 --with-hypre-dir=/home/enduser/
> g0306332/lib/hypre --with-debugging=0 --with-batch=1 --with-mpi-
> shared=0 --with-mpi-include=/usr/local/topspin/mpi/mpich/include --
> with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a --with-
> mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun --with-blas-lapack-
> dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0
> -----------------------------------------
> Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01
> Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed
> Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
> Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8
> Using PETSc arch: atlas3-mpi
> -----------------------------------------
> Using C compiler: mpicc -fPIC -O
> Using Fortran compiler: mpif90 -I. -fPIC -O
> -----------------------------------------
> Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 -I/
> nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi -I/nfs/
> home/enduser/g0306332/petsc-2.3.3-p8/include -I/home/enduser/
> g0306332/lib/hypre/include -I/usr/local/topspin/mpi/mpich/include
> ------------------------------------------
> Using C linker: mpicc -fPIC -O
> Using Fortran linker: mpif90 -I. -fPIC -O
> Using libraries: -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-
> p8/lib/atlas3-mpi -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/
> atlas3-mpi -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -
> lpetscvec -lpetsc -Wl,-rpath,/home/enduser/g0306332/lib/hypre/
> lib -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE -Wl,-rpath,/opt/
> mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/
> opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-
> linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/
> opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-
> rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-
> redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/usr/local/
> topspin/mpi/mpich/lib -L/usr/local/topspin/mpi/mpich/lib -lmpich -
> Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/
> lib/em64t -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/
> local/ofed/lib64 -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/
> 0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -
> libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/
> intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/
> 3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/
> lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -
> lmpichf90nc -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/
> local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/
> usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-
> rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -
> lifport -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-
> rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -
> Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/
> lib64 -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/
> local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/
> usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc+
> + -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/
> local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/
> usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-
> rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -
> Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-
> redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-
> rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -
> Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-
> redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/
> 0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -Wl,-
> rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs -
> libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/
> intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/
> 3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/
> lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc
> ------------------------------------------
> ************************************************************************************************************************
> ***          WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document              ***
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> ./a.out on a atlas3-mp named atlas3-c18 with 8 processors, by g0306332 Fri Jun 6 17:23:25 2008
> Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b
>
>                          Max       Max/Min        Avg      Total
> Time (sec):           1.140e+03      1.00019   1.140e+03
> Objects:              4.200e+01      1.00000   4.200e+01
> Flops:                4.620e+10      1.00158   4.619e+10  3.695e+11
> Flops/sec:            4.053e+07      1.00177   4.051e+07  3.241e+08
> MPI Messages:         9.954e+03      2.00000   8.710e+03  6.968e+04
> MPI Message Lengths:  7.224e+07      2.00000   7.257e+03  5.057e+08
> MPI Reductions:       1.716e+03      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>                           and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 1.1402e+03 100.0%  3.6953e+11 100.0%  6.968e+04 100.0%  7.257e+03      100.0%  1.372e+04 100.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops/sec: Max - maximum over all processors
>                        Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flops in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was run without the PreLoadBegin() #
> # macros. To get timing results we always recommend #
> # preloading. otherwise timing numbers may be #
> # meaningless. #
> ##########################################################
>
>
> Event                Count      Time (sec)       Flops/sec                           --- Global ---   --- Stage ---    Total
>                    Max Ratio  Max       Ratio   Max      Ratio  Mess   Avg len  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult           4975 1.0 7.8154e+01 1.9  4.19e+08 1.9  7.0e+04 7.3e+03 0.0e+00  5 38 100 100  0   5 38 100 100  0  1798
> MatSolve          2855 1.0 1.0870e+02 1.8  2.57e+08 1.8  0.0e+00 0.0e+00 0.0e+00  7 34   0   0  0   7 34   0   0  0  1153
> MatLUFactorNum     300 1.0 2.3238e+01 1.5  2.07e+08 1.5  0.0e+00 0.0e+00 0.0e+00  2  7   0   0  0   2  7   0   0  0  1099
> MatILUFactorSym      1 1.0 6.1973e-02 1.5  0.00e+00 0.0  0.0e+00 0.0e+00 1.0e+00  0  0   0   0  0   0  0   0   0  0     0
> MatConvert           1 1.0 1.4168e-01 1.0  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0   0   0  0   0  0   0   0  0     0
> MatAssemblyBegin   301 1.0 6.9683e+01 8.6  0.00e+00 0.0  0.0e+00 0.0e+00 6.0e+02  4  0   0   0  4   4  0   0   0  4     0
> MatAssemblyEnd     301 1.0 6.2247e+00 1.2  0.00e+00 0.0  2.8e+01 3.6e+03 3.1e+02  0  0   0   0  2   0  0   0   0  2     0
> MatGetRow       162000 1.0 6.0330e-02 1.4  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0   0   0  0   0  0   0   0  0     0
> MatGetRowIJ          3 1.0 9.0599e-06 3.2  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0   0   0  0   0  0   0   0  0     0
> MatGetOrdering       1 1.0 5.6710e-03 1.4  0.00e+00 0.0  0.0e+00 0.0e+00 2.0e+00  0  0   0   0  0   0  0   0   0  0     0
> KSPSetup           601 1.0 1.5631e-02 1.1  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0   0   0  0   0  0   0   0  0     0
> KSPSolve           600 1.0 8.1668e+02 1.0  5.66e+07 1.0  7.0e+04 7.3e+03 9.2e+03 72 100 100 100 67  72 100 100 100 67   452
> PCSetUp            601 1.0 2.4372e+01 1.5  1.93e+08 1.5  0.0e+00 0.0e+00 5.0e+00  2  7   0   0  0   2  7   0   0  0  1048
> PCSetUpOnBlocks    300 1.0 2.3303e+01 1.5  2.07e+08 1.5  0.0e+00 0.0e+00 3.0e+00  2  7   0   0  0   2  7   0   0  0  1096
> PCApply           5575 1.0 6.5344e+02 1.1  2.57e+07 1.1  0.0e+00 0.0e+00 0.0e+00 55 34   0   0  0  55 34   0   0  0   192
> VecDot            4840 1.0 6.8932e+00 1.3  3.07e+08 1.3  0.0e+00 0.0e+00 4.8e+03  1  3   0   0 35   1  3   0   0 35  1820
> VecNorm           4365 1.0 1.2250e+02 3.6  6.82e+07 3.6  0.0e+00 0.0e+00 4.4e+03  8  5   0   0 32   8  5   0   0 32   153
> VecCopy            900 1.0 1.4297e+00 1.8  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0   0   0  0   0  0   0   0  0     0
> VecSet            6775 1.0 8.1405e+00 1.8  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0   0   0  0   0  0   0   0  0     0
> VecAXPY           6485 1.0 1.0003e+01 1.9  5.73e+08 1.9  0.0e+00 0.0e+00 0.0e+00  1  7   0   0  0   1  7   0   0  0  2420
> VecAYPX           3765 1.0 7.8289e+00 2.0  5.17e+08 2.0  0.0e+00 0.0e+00 0.0e+00  0  4   0   0  0   0  4   0   0  0  2092
> VecWAXPY          2420 1.0 3.8504e+00 1.9  3.80e+08 1.9  0.0e+00 0.0e+00 0.0e+00  0  2   0   0  0   0  2   0   0  0  1629
> VecAssemblyBegin  1200 1.0 9.2808e+00 3.4  0.00e+00 0.0  0.0e+00 0.0e+00 3.6e+03  1  0   0   0 26   1  0   0   0 26     0
> VecAssemblyEnd    1200 1.0 2.3313e-03 1.3  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  0  0   0   0  0   0  0   0   0  0     0
> VecScatterBegin   4975 1.0 2.2727e-01 2.6  0.00e+00 0.0  7.0e+04 7.3e+03 0.0e+00  0  0 100 100  0   0  0 100 100  0     0
> VecScatterEnd     4975 1.0 2.7557e+01 68.1  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  1  0   0   0  0   1  0   0   0  0     0
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions   Memory   Descendants' Mem.
>
> --- Event Stage 0: Main Stage
>
> Matrix 7 7 160595412 0
> Krylov Solver 3 3 8 0
> Preconditioner 3 3 528 0
> Index Set 7 7 3897600 0
> Vec 20 20 23357344 0
> Vec Scatter 2 2 0 0
> ========================================================================================================================
> Average time to get PetscTime(): 1.19209e-07
> Average time for MPI_Barrier(): 2.10285e-05
> Average time for zero size MPI_Send(): 7.59959e-06
> OptionTable: -log_summary test8_600
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Tue Jan 8 22:22:08 2008
> Configure options: --with-memcmp-ok --sizeof_char=1 --
> sizeof_void_p=8 --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --
> sizeof_long_long=8 --sizeof_float=4 --sizeof_double=8 --
> bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with-
> vendor-compilers=intel --with-x=0 --with-hypre-dir=/home/enduser/
> g0306332/lib/hypre --with-debugging=0 --with-batch=1 --with-mpi-
> shared=0 --with-mpi-include=/usr/local/topspin/mpi/mpich/include --
> with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a --with-
> mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun --with-blas-lapack-
> dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0
> -----------------------------------------
> Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01
> Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed
> Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
> Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8
> Using PETSc arch: atlas3-mpi
> -----------------------------------------
> Using C compiler: mpicc -fPIC -O
> Using Fortran compiler: mpif90 -I. -fPIC -O
> -----------------------------------------
> Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 -I/
> nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi -I/nfs/
> home/enduser/g0306332/petsc-2.3.3-p8/include -I/home/enduser/
> g0306332/lib/hypre/include -I/usr/local/topspin/mpi/mpich/include
> ------------------------------------------
> Using C linker: mpicc -fPIC -O
> Using Fortran linker: mpif90 -I. -fPIC -O
> Using libraries: -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-
> p8/lib/atlas3-mpi -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/
> atlas3-mpi -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -
> lpetscvec -lpetsc -Wl,-rpath,/home/enduser/g0306332/lib/hypre/
> lib -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE -Wl,-rpath,/opt/
> mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/
> opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-
> linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-rpath,/
> opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-
> rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-
> redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/usr/local/
> topspin/mpi/mpich/lib -L/usr/local/topspin/mpi/mpich/lib -lmpich -
> Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/
> lib/em64t -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/
> local/ofed/lib64 -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/
> 0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -
> libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/
> intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/
> 3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/
> lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -
> lmpichf90nc -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/
> local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/
> usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-
> rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -
> lifport -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-
> rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -
> Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/
> lib64 -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/
> local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/
> usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc+
> + -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/
> local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/
> usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-
> rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -
> Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-
> redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard -Wl,-
> rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -
> Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-
> redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -Wl,-rpath,/opt/mvapich/
> 0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -Wl,-
> rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs -
> libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/
> intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/
> 3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/
> lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc
> ------------------------------------------