[petsc-users] Very poor speed up performance

Matthew Knepley knepley at gmail.com
Mon Dec 20 13:21:17 CST 2010


On Mon, Dec 20, 2010 at 10:38 AM, Yongjun Chen <yjxd.chen at gmail.com> wrote:

> Hi Matt,
>
> Thanks for your reply. I have just carried out a series of tests with
> k=2, 4, 8, 12 and 16 cores on the first server again, this time with the
> -log_summary option. From 8 cores to 12 cores there is a small speed-up,
> but from 12 cores to 16 cores the computation time increases!
> Please find the 5 log files attached. Thank you very much!
>

It's very clear from these that Barry was right in his reply. These are
memory-bandwidth-limited computations, so if you don't get any more bandwidth
you will not speed up. This is rarely mentioned in sales pitches for multicore
computers. LAMMPS is not limited by bandwidth for most computations.
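
A rough check from your own numbers (just a sketch, assuming roughly 12 bytes
per stored nonzero for the value plus column index of the sbaij matrix): the
matrix has about 5.0e7 stored nonzeros, so each MatMult has to stream roughly
0.6 GB of matrix data from memory. Dividing that by the per-call MatMult times
in your logs gives about 2.6 GB/s with 2 processes, 6.2 GB/s with 8, 6.3 GB/s
with 12, and only 4.8 GB/s with 16 (where the max/min time ratio of 1.7 also
shows imbalance). The sustained bandwidth saturates around 6 GB/s no matter how
many cores you add, which is exactly the signature of a bandwidth-limited kernel.

If you want to confirm that ceiling independently of PETSc, here is a minimal,
hypothetical triad benchmark (not a PETSc tool, just a sketch) you can run with
increasing process counts; the reported rate should plateau the same way:

  /* bwtest.c: each rank streams its own arrays; the aggregate rate shows how
     memory bandwidth scales with the number of MPI processes. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
    const long n = 20000000;            /* 3 arrays x 8 bytes = 480 MB/rank */
    double *a, *b, *c, s = 3.0, t0, t1, dt, maxdt;
    long i;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    a = (double *)malloc(n * sizeof(double));
    b = (double *)malloc(n * sizeof(double));
    c = (double *)malloc(n * sizeof(double));
    for (i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < n; i++) a[i] = b[i] + s * c[i];    /* STREAM-like triad */
    t1 = MPI_Wtime();
    if (a[n - 1] != 7.0) printf("unexpected result\n"); /* keep the loop live */

    dt = t1 - t0;
    MPI_Reduce(&dt, &maxdt, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (!rank)
      printf("%d ranks: %.2f GB/s aggregate\n", size,
             size * 3.0 * n * sizeof(double) / maxdt / 1.0e9);

    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
  }

Compile with mpicc -O2 bwtest.c -o bwtest and run mpiexec -n k ./bwtest for
k = 1, 2, 4, 8, 12, 16; once the GB/s number stops growing, so will your solver.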

   Matt


> mpiexec -n k ./AMG_Solver_MPI -pc_type jacobi -ksp_type bicg -log_summary
> Here I use the bicg KSP instead of gmres, because the two KSPs give almost
> the same speed-up performance, as I have tried many times.
> ----------------------
> (1) k=2
> ----------------------
> Process 1 of total 2 on wmss04
> Process 0 of total 2 on wmss04
> The dimension of Matrix A is n = 1177754
> Begin Assembly:
> Begin Assembly:
> End Assembly.
> End Assembly.
> =========================================================
> Begin the solving:
> =========================================================
> The current time is: Mon Dec 20 17:42:23 2010
>
> KSP Object:
>   type: bicg
>   maximum iterations=10000, initial guess is zero
>   tolerances:  relative=1e-07, absolute=1e-50, divergence=10000
>   left preconditioning
>   using PRECONDITIONED norm type for convergence test
> PC Object:
>   type: jacobi
>   linear system matrix = precond matrix:
>   Matrix Object:
>     type=mpisbaij, rows=1177754, cols=1177754
>     total: nonzeros=49908476, allocated nonzeros=49908476
>         block size is 1
>
> norm(b-Ax)=1.25862e-06
> Norm of error 1.25862e-06, Iterations 1475
> =========================================================
> The solver has finished successfully!
> =========================================================
> The solving time is 762.874 seconds.
> The time accuracy is 1e-06 second.
> The current time is Mon Dec 20 17:55:06 2010
>
>
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r
> -fCourier9' to print this document            ***
>
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> ./AMG_Solver_MPI on a linux-gnu named wmss04 with 2 processors, by cheny
> Mon Dec 20 18:55:06 2010
> Using Petsc Release Version 3.1.0, Patch 5, Mon Sep 27 11:51:54 CDT 2010
>
>                          Max       Max/Min        Avg      Total
> Time (sec):           8.160e+02      1.00000   8.160e+02
> Objects:              3.000e+01      1.00000   3.000e+01
> Flops:                3.120e+11      1.04720   3.050e+11  6.100e+11
> Flops/sec:            3.824e+08      1.04720   3.737e+08  7.475e+08
> MPI Messages:         2.958e+03      1.00068   2.958e+03  5.915e+03
> MPI Message Lengths:  9.598e+08      1.00034   3.245e+05  1.919e+09
> MPI Reductions:       4.483e+03      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N
> --> 2N flops
>                             and VecAXPY() for complex vectors of length N
> --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages
> ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts
> %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 8.1603e+02 100.0%  6.0997e+11 100.0%  5.915e+03
> 100.0%  3.245e+05      100.0%  4.467e+03  99.6%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops: Max - maximum over all processors
>                    Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flops in this
> phase
>       %M - percent messages in this phase     %L - percent message lengths
> in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over
> all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)
> Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult             1476 1.0 3.4220e+02 1.0 1.48e+11 1.0 3.0e+03 3.2e+05
> 0.0e+00 41 47 50 50  0  41 47 50 50  0   846
> MatMultTranspose    1475 1.0 3.4208e+02 1.0 1.48e+11 1.0 3.0e+03 3.2e+05
> 0.0e+00 42 47 50 50  0  42 47 50 50  0   846
> MatAssemblyBegin       1 1.0 1.5492e-0281.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 3.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 8.1615e-02 1.0 0.00e+00 0.0 1.0e+01 1.1e+05
> 1.2e+01  0  0  0  0  0   0  0  0  0  0     0
> MatView                1 1.0 1.5807e-04 3.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecView                1 1.0 1.0809e+01 2.1 0.00e+00 0.0 2.0e+00 2.4e+06
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecDot              2950 1.0 2.0457e+01 1.9 3.47e+09 1.0 0.0e+00 0.0e+00
> 3.0e+03  2  1  0  0 66   2  1  0  0 66   340
> VecNorm             1477 1.0 1.2103e+01 1.7 1.74e+09 1.0 0.0e+00 0.0e+00
> 1.5e+03  1  1  0  0 33   1  1  0  0 33   287
> VecCopy                4 1.0 1.0110e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet              8855 1.0 6.0069e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPY             4426 1.0 1.8430e+01 1.2 5.21e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  2  2  0  0  0   2  2  0  0  0   566
> VecAYPX             2948 1.0 1.3610e+01 1.2 3.47e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  2  1  0  0  0   2  1  0  0  0   510
> VecAssemblyBegin       6 1.0 9.1116e-0317.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.8e+01  0  0  0  0  0   0  0  0  0  0     0
> VecAssemblyEnd         6 1.0 1.7405e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecPointwiseMult    2952 1.0 1.7966e+01 1.1 1.74e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  2  1  0  0  0   2  1  0  0  0   194
> VecScatterBegin     2951 1.0 8.6552e-01 1.1 0.00e+00 0.0 5.9e+03 3.2e+05
> 0.0e+00  0  0100100  0   0  0100100  0     0
> VecScatterEnd       2951 1.0 2.7126e+01 8.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> KSPSetup               1 1.0 3.9254e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 7.5170e+02 1.0 3.12e+11 1.0 5.9e+03 3.2e+05
> 4.4e+03 92100100100 99  92100100100 99   811
> PCSetUp                1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> PCApply             2952 1.0 1.8043e+01 1.1 1.74e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  2  1  0  0  0   2  1  0  0  0   193
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Matrix     3              3    339744648     0
>                  Vec    18             18     62239872     0
>          Vec Scatter     2              2         1736     0
>            Index Set     4              4       974736     0
>        Krylov Solver     1              1          832     0
>       Preconditioner     1              1          872     0
>               Viewer     1              1          544     0
>
> ========================================================================================================================
> Average time to get PetscTime(): 1.21593e-06
> Average time for MPI_Barrier(): 1.44005e-05
> Average time for zero size MPI_Send(): 1.94311e-05
> #PETSc Option Table entries:
> -ksp_type bicg
> -log_summary
> -pc_type jacobi
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Tue Nov 23 15:54:45 2010
> Configure options: --known-level1-dcache-size=65536
> --known-level1-dcache-linesize=64 --known-level1-dcache-assoc=2
> --known-memcmp-ok=1 --known-sizeof-char=1 --known-sizeof-void-p=8
> --known-sizeof-short=2 --known-sizeof-int=4 --known-sizeof-long=8
> --known-sizeof-long-long=8 --known-sizeof-float=4 --known-sizeof-double=8
> --known-sizeof-size_t=8 --known-bits-per-byte=8 --known-sizeof-MPI_Comm=4
> --known-sizeof-MPI_Fint=4 --known-mpi-long-double=1 --with-cc=gcc
> --with-cxx=g++ --with-F77=ifort --with-FC=ifort --download-f-blas-lapack=1
> --download-superlu-dist=1 --download-hypre=1 --download-trilinos=1
> --download-parmetis=1 --download-mumps=1 --download-scalapack=1
> --download-blacs=1 --download-mpich=1 --with-debugging=0 --with-batch
> --known-mpi-shared=1
> -----------------------------------------
> Libraries compiled on Tue Nov 23 15:57:11 CET 2010 on wmss04
> Machine characteristics: Linux wmss04 2.6.16.60-0.21-smp #1 SMP Tue May 6
> 12:41:02 UTC 2008 x86_64 x86_64 x86_64 GNU/Linux
> Using PETSc directory: /sun42/cheny/petsc-3.1-p5-optimized
> Using PETSc arch: linux-gnu-c-opt
> -----------------------------------------
> Using C compiler:
> /sun42/cheny/petsc-3.1-p5-optimized/linux-gnu-c-opt/bin/mpicc -Wall
> -Wwrite-strings -Wno-strict-aliasing -O
> Using Fortran compiler:
> /sun42/cheny/petsc-3.1-p5-optimized/linux-gnu-c-opt/bin/mpif90 -Wall
> -Wno-unused-variable -O
> -----------------------------------------
> Using include paths:
> -I/sun42/cheny/petsc-3.1-p5-optimized/linux-gnu-c-opt/include
> -I/sun42/cheny/petsc-3.1-p5-optimized/include
> -I/sun42/cheny/petsc-3.1-p5-optimized/linux-gnu-c-opt/include
> ------------------------------------------
> Using C linker:
> /sun42/cheny/petsc-3.1-p5-optimized/linux-gnu-c-opt/bin/mpicc -Wall
> -Wwrite-strings -Wno-strict-aliasing -O
> Using Fortran linker:
> /sun42/cheny/petsc-3.1-p5-optimized/linux-gnu-c-opt/bin/mpif90 -Wall
> -Wno-unused-variable -O
> Using libraries:
> -Wl,-rpath,/sun42/cheny/petsc-3.1-p5-optimized/linux-gnu-c-opt/lib
> -L/sun42/cheny/petsc-3.1-p5-optimized/linux-gnu-c-opt/lib -lpetsc
> -Wl,-rpath,/sun42/cheny/petsc-3.1-p5-optimized/linux-gnu-c-opt/lib
> -L/sun42/cheny/petsc-3.1-p5-optimized/linux-gnu-c-opt/lib -lHYPRE -lmpichcxx
> -lstdc++ -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord
> -lparmetis -lmetis -lscalapack -lblacs -lflapack -lfblas -lnsl -laio -lrt
> -L/sun42/cheny/petsc-3.1-p5-optimized/linux-gnu-c-opt/lib
> -L/usr/lib64/gcc/x86_64-suse-linux/4.1.2
> -L/opt/intel/Compiler/11.0/083/ipp/em64t/lib
> -L/opt/intel/Compiler/11.0/083/mkl/lib/em64t
> -L/opt/intel/Compiler/11.0/083/tbb/em64t/cc4.1.0_libc2.4_kernel2.6.16.21/lib
> -L/usr/x86_64-suse-linux/lib -ldl -lmpich -lpthread -lrt -lgcc_s -lmpichf90
> -lgfortran -lm -lm -lmpichcxx -lstdc++ -lmpichcxx -lstdc++ -ldl -lmpich
> -lpthread -lrt -lgcc_s -ldl
> ------------------------------------------
>
>
> ----------------------
> (2) k=4
> ----------------------
> Process 0 of total 4 on wmss04
> Process 2 of total 4 on wmss04
> Process 3 of total 4 on wmss04
> Process 1 of total 4 on wmss04
> The dimension of Matrix A is n = 1177754
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> =========================================================
> Begin the solving:
> =========================================================
> The current time is: Mon Dec 20 17:33:24 2010
>
> KSP Object:
>   type: bicg
>   maximum iterations=10000, initial guess is zero
>   tolerances:  relative=1e-07, absolute=1e-50, divergence=10000
>   left preconditioning
>   using PRECONDITIONED norm type for convergence test
> PC Object:
>   type: jacobi
>   linear system matrix = precond matrix:
>   Matrix Object:
>     type=mpisbaij, rows=1177754, cols=1177754
>     total: nonzeros=49908476, allocated nonzeros=49908476
>         block size is 1
>
> norm(b-Ax)=1.28342e-06
> Norm of error 1.28342e-06, Iterations 1473
> =========================================================
> The solver has finished successfully!
> =========================================================
> The solving time is 450.583 seconds.
> The time accuracy is 1e-06 second.
> The current time is Mon Dec 20 17:40:55 2010
>
>
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r
> -fCourier9' to print this document            ***
>
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> ./AMG_Solver_MPI on a linux-gnu named wmss04 with 4 processors, by cheny
> Mon Dec 20 18:40:55 2010
> Using Petsc Release Version 3.1.0, Patch 5, Mon Sep 27 11:51:54 CDT 2010
>
>                          Max       Max/Min        Avg      Total
> Time (sec):           4.807e+02      1.00000   4.807e+02
> Objects:              3.000e+01      1.00000   3.000e+01
> Flops:                1.558e+11      1.06872   1.523e+11  6.091e+11
> Flops/sec:            3.241e+08      1.06872   3.168e+08  1.267e+09
> MPI Messages:         5.906e+03      2.00017   4.430e+03  1.772e+04
> MPI Message Lengths:  1.727e+09      2.74432   2.658e+05  4.710e+09
> MPI Reductions:       4.477e+03      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N
> --> 2N flops
>                             and VecAXPY() for complex vectors of length N
> --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages
> ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts
> %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 4.8066e+02 100.0%  6.0914e+11 100.0%  1.772e+04
> 100.0%  2.658e+05      100.0%  4.461e+03  99.6%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops: Max - maximum over all processors
>                    Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flops in this
> phase
>       %M - percent messages in this phase     %L - percent message lengths
> in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over
> all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)
> Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult             1474 1.0 1.9344e+02 1.1 7.40e+10 1.1 8.8e+03 2.7e+05
> 0.0e+00 39 47 50 50  0  39 47 50 50  0  1494
> MatMultTranspose    1473 1.0 1.9283e+02 1.0 7.40e+10 1.1 8.8e+03 2.7e+05
> 0.0e+00 40 47 50 50  0  40 47 50 50  0  1498
> MatAssemblyBegin       1 1.0 1.5624e-0263.8 0.00e+00 0.0 0.0e+00 0.0e+00
> 3.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 6.3599e-02 1.0 0.00e+00 0.0 3.0e+01 9.3e+04
> 1.2e+01  0  0  0  0  0   0  0  0  0  0     0
> MatView                1 1.0 1.8096e-04 2.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecView                1 1.0 1.1063e+01 4.7 0.00e+00 0.0 6.0e+00 1.2e+06
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecDot              2946 1.0 2.5350e+01 2.7 1.73e+09 1.0 0.0e+00 0.0e+00
> 2.9e+03  3  1  0  0 66   3  1  0  0 66   274
> VecNorm             1475 1.0 1.1197e+01 3.0 8.69e+08 1.0 0.0e+00 0.0e+00
> 1.5e+03  1  1  0  0 33   1  1  0  0 33   310
> VecCopy                4 1.0 6.0010e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet              8843 1.0 3.6737e+00 1.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPY             4420 1.0 1.4221e+01 1.4 2.60e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  3  2  0  0  0   3  2  0  0  0   732
> VecAYPX             2944 1.0 1.1377e+01 1.1 1.73e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  2  1  0  0  0   2  1  0  0  0   610
> VecAssemblyBegin       6 1.0 2.8596e-0223.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.8e+01  0  0  0  0  0   0  0  0  0  0     0
> VecAssemblyEnd         6 1.0 2.4796e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecPointwiseMult    2948 1.0 1.7210e+01 1.2 8.68e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  3  1  0  0  0   3  1  0  0  0   202
> VecScatterBegin     2947 1.0 1.9806e+00 2.4 0.00e+00 0.0 1.8e+04 2.7e+05
> 0.0e+00  0  0100100  0   0  0100100  0     0
> VecScatterEnd       2947 1.0 4.3833e+01 7.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  6  0  0  0  0   6  0  0  0  0     0
> KSPSetup               1 1.0 2.1496e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 4.3931e+02 1.0 1.56e+11 1.1 1.8e+04 2.7e+05
> 4.4e+03 91100100100 99  91100100100 99  1386
> PCSetUp                1 1.0 3.0994e-06 1.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> PCApply             2948 1.0 1.7256e+01 1.2 8.68e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  3  1  0  0  0   3  1  0  0  0   201
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Matrix     3              3    169902696     0
>                  Vec    18             18     31282096     0
>          Vec Scatter     2              2         1736     0
>            Index Set     4              4       638616     0
>        Krylov Solver     1              1          832     0
>       Preconditioner     1              1          872     0
>               Viewer     1              1          544     0
>
> ========================================================================================================================
> Average time to get PetscTime(): 1.5974e-06
> Average time for MPI_Barrier(): 3.48091e-05
> Average time for zero size MPI_Send(): 1.8537e-05
> #PETSc Option Table entries:
> -ksp_type bicg
> -log_summary
> -pc_type jacobi
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Tue Nov 23 15:54:45 2010
> Configure options: --known-level1-dcache-size=65536
> --known-level1-dcache-linesize=64 --known-level1-dcache-assoc=2
> --known-memcmp-ok=1 --known-sizeof-char=1 --known-sizeof-void-p=8
> --known-sizeof-short=2 --known-sizeof-int=4 --known-sizeof-long=8
> --known-sizeof-long-long=8 --known-sizeof-float=4 --known-sizeof-double=8
> --known-sizeof-size_t=8 --known-bits-per-byte=8 --known-sizeof-MPI_Comm=4
> --known-sizeof-MPI_Fint=4 --known-mpi-long-double=1 --with-cc=gcc
> --with-cxx=g++ --with-F77=ifort --with-FC=ifort --download-f-blas-lapack=1
> --download-superlu-dist=1 --download-hypre=1 --download-trilinos=1
> --download-parmetis=1 --download-mumps=1 --download-scalapack=1
> --download-blacs=1 --download-mpich=1 --with-debugging=0 --with-batch
> --known-mpi-shared=1
> -----------------------------------------
>
>
>
> ----------------------
> (3) k=8
> ----------------------
> Process 0 of total 8 on wmss04
> Process 4 of total 8 on wmss04
> Process 2 of total 8 on wmss04
> Process 6 of total 8 on wmss04
> Process 3 of total 8 on wmss04
> Process 7 of total 8 on wmss04
> Process 1 of total 8 on wmss04
> Process 5 of total 8 on wmss04
> The dimension of Matrix A is n = 1177754
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> =========================================================
> Begin the solving:
> =========================================================
> The current time is: Mon Dec 20 18:14:59 2010
>
> KSP Object:
>   type: bicg
>   maximum iterations=10000, initial guess is zero
>   tolerances:  relative=1e-07, absolute=1e-50, divergence=10000
>   left preconditioning
>   using PRECONDITIONED norm type for convergence test
> PC Object:
>   type: jacobi
>   linear system matrix = precond matrix:
>   Matrix Object:
>     type=mpisbaij, rows=1177754, cols=1177754
>     total: nonzeros=49908476, allocated nonzeros=49908476
>         block size is 1
>
> norm(b-Ax)=1.32502e-06
> Norm of error 1.32502e-06, Iterations 1473
> =========================================================
> The solver has finished successfully!
> =========================================================
> The solving time is 311.937 seconds.
> The time accuracy is 1e-06 second.
> The current time is Mon Dec 20 18:20:11 2010
>
>
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r
> -fCourier9' to print this document            ***
>
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> ./AMG_Solver_MPI on a linux-gnu named wmss04 with 8 processors, by cheny
> Mon Dec 20 19:20:11 2010
> Using Petsc Release Version 3.1.0, Patch 5, Mon Sep 27 11:51:54 CDT 2010
>
>                          Max       Max/Min        Avg      Total
> Time (sec):           3.330e+02      1.00000   3.330e+02
> Objects:              3.000e+01      1.00000   3.000e+01
> Flops:                7.792e+10      1.09702   7.614e+10  6.091e+11
> Flops/sec:            2.340e+08      1.09702   2.286e+08  1.829e+09
> MPI Messages:         5.906e+03      2.00017   5.169e+03  4.135e+04
> MPI Message Lengths:  1.866e+09      4.61816   2.430e+05  1.005e+10
> MPI Reductions:       4.477e+03      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N
> --> 2N flops
>                             and VecAXPY() for complex vectors of length N
> --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages
> ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts
> %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 3.3302e+02 100.0%  6.0914e+11 100.0%  4.135e+04
> 100.0%  2.430e+05      100.0%  4.461e+03  99.6%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops: Max - maximum over all processors
>                    Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flops in this
> phase
>       %M - percent messages in this phase     %L - percent message lengths
> in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over
> all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)
> Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult             1474 1.0 1.4230e+02 1.4 3.70e+10 1.1 2.1e+04 2.4e+05
> 0.0e+00 38 47 50 50  0  38 47 50 50  0  2031
> MatMultTranspose    1473 1.0 1.3627e+02 1.1 3.70e+10 1.1 2.1e+04 2.4e+05
> 0.0e+00 38 47 50 50  0  38 47 50 50  0  2120
> MatAssemblyBegin       1 1.0 8.0800e-0324.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 3.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 5.3647e-02 1.0 0.00e+00 0.0 7.0e+01 8.5e+04
> 1.2e+01  0  0  0  0  0   0  0  0  0  0     0
> MatView                1 1.0 2.1791e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecView                1 1.0 1.0902e+0112.1 0.00e+00 0.0 1.4e+01 5.9e+05
> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> VecDot              2946 1.0 3.5689e+01 7.6 8.67e+08 1.0 0.0e+00 0.0e+00
> 2.9e+03  6  1  0  0 66   6  1  0  0 66   194
> VecNorm             1475 1.0 8.1093e+00 4.0 4.34e+08 1.0 0.0e+00 0.0e+00
> 1.5e+03  1  1  0  0 33   1  1  0  0 33   428
> VecCopy                4 1.0 5.2011e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet              8843 1.0 3.0491e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPY             4420 1.0 9.2421e+00 1.6 1.30e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00  2  2  0  0  0   2  2  0  0  0  1127
> VecAYPX             2944 1.0 6.8297e+00 1.5 8.67e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  2  1  0  0  0   2  1  0  0  0  1015
> VecAssemblyBegin       6 1.0 2.6218e-0210.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.8e+01  0  0  0  0  0   0  0  0  0  0     0
> VecAssemblyEnd         6 1.0 3.6240e-05 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecPointwiseMult    2948 1.0 9.6646e+00 1.4 4.34e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  3  1  0  0  0   3  1  0  0  0   359
> VecScatterBegin     2947 1.0 2.2599e+00 2.3 0.00e+00 0.0 4.1e+04 2.4e+05
> 0.0e+00  1  0100100  0   1  0100100  0     0
> VecScatterEnd       2947 1.0 7.7004e+0120.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  9  0  0  0  0   9  0  0  0  0     0
> KSPSetup               1 1.0 1.4287e-02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 3.0090e+02 1.0 7.79e+10 1.1 4.1e+04 2.4e+05
> 4.4e+03 90100100100 99  90100100100 99  2024
> PCSetUp                1 1.0 4.0531e-06 2.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> PCApply             2948 1.0 9.7001e+00 1.4 4.34e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  3  1  0  0  0   3  1  0  0  0   358
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Matrix     3              3     84944064     0
>                  Vec    18             18     15741712     0
>          Vec Scatter     2              2         1736     0
>            Index Set     4              4       409008     0
>        Krylov Solver     1              1          832     0
>       Preconditioner     1              1          872     0
>               Viewer     1              1          544     0
>
> ========================================================================================================================
> Average time to get PetscTime(): 3.38554e-06
> Average time for MPI_Barrier(): 7.40051e-05
> Average time for zero size MPI_Send(): 1.88947e-05
> #PETSc Option Table entries:
> -ksp_type bicg
> -log_summary
> -pc_type jacobi
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Tue Nov 23 15:54:45 2010
> Configure options: --known-level1-dcache-size=65536
> --known-level1-dcache-linesize=64 --known-level1-dcache-assoc=2
> --known-memcmp-ok=1 --known-sizeof-char=1 --known-sizeof-void-p=8
> --known-sizeof-short=2 --known-sizeof-int=4 --known-sizeof-long=8
> --known-sizeof-long-long=8 --known-sizeof-float=4 --known-sizeof-double=8
> --known-sizeof-size_t=8 --known-bits-per-byte=8 --known-sizeof-MPI_Comm=4
> --known-sizeof-MPI_Fint=4 --known-mpi-long-double=1 --with-cc=gcc
> --with-cxx=g++ --with-F77=ifort --with-FC=ifort --download-f-blas-lapack=1
> --download-superlu-dist=1 --download-hypre=1 --download-trilinos=1
> --download-parmetis=1 --download-mumps=1 --download-scalapack=1
> --download-blacs=1 --download-mpich=1 --with-debugging=0 --with-batch
> --known-mpi-shared=1
> -----------------------------------------
>
>
>
> ----------------------
> (4) k=12
> ----------------------
> Process 1 of total 12 on wmss04
> Process 5 of total 12 on wmss04
> Process 2 of total 12 on wmss04
> Process 9 of total 12 on wmss04
> Process 6 of total 12 on wmss04
> Process 7 of total 12 on wmss04
> Process 10 of total 12 on wmss04
> Process 3 of total 12 on wmss04
> Process 11 of total 12 on wmss04
> Process 4 of total 12 on wmss04
> Process 8 of total 12 on wmss04
> Process 0 of total 12 on wmss04
> The dimension of Matrix A is n = 1177754
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> =========================================================
> Begin the solving:
> =========================================================
> The current time is: Mon Dec 20 17:56:36 2010
>
> KSP Object:
>   type: bicg
>   maximum iterations=10000, initial guess is zero
>   tolerances:  relative=1e-07, absolute=1e-50, divergence=10000
>   left preconditioning
>   using PRECONDITIONED norm type for convergence test
> PC Object:
>   type: jacobi
>   linear system matrix = precond matrix:
>   Matrix Object:
>     type=mpisbaij, rows=1177754, cols=1177754
>     total: nonzeros=49908476, allocated nonzeros=49908476
>         block size is 1
>
> norm(b-Ax)=1.28414e-06
> Norm of error 1.28414e-06, Iterations 1473
> =========================================================
> The solver has finished successfully!
> =========================================================
> The solving time is 291.503 seconds.
> The time accuracy is 1e-06 second.
> The current time is Mon Dec 20 18:01:28 2010
>
>
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r
> -fCourier9' to print this document            ***
>
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> ./AMG_Solver_MPI on a linux-gnu named wmss04 with 12 processors, by cheny
> Mon Dec 20 19:01:28 2010
> Using Petsc Release Version 3.1.0, Patch 5, Mon Sep 27 11:51:54 CDT 2010
>
>                          Max       Max/Min        Avg      Total
> Time (sec):           3.089e+02      1.00012   3.089e+02
> Objects:              3.000e+01      1.00000   3.000e+01
> Flops:                5.197e+10      1.11689   5.074e+10  6.089e+11
> Flops/sec:            1.683e+08      1.11689   1.643e+08  1.971e+09
> MPI Messages:         5.906e+03      2.00017   5.415e+03  6.498e+04
> MPI Message Lengths:  1.887e+09      6.23794   2.345e+05  1.524e+10
> MPI Reductions:       4.477e+03      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N
> --> 2N flops
>                             and VecAXPY() for complex vectors of length N
> --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages
> ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts
> %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 3.0887e+02 100.0%  6.0890e+11 100.0%  6.498e+04
> 100.0%  2.345e+05      100.0%  4.461e+03  99.6%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops: Max - maximum over all processors
>                    Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flops in this
> phase
>       %M - percent messages in this phase     %L - percent message lengths
> in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over
> all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)
> Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult             1474 1.0 1.4069e+02 2.1 2.47e+10 1.1 3.2e+04 2.3e+05
> 0.0e+00 35 47 50 50  0  35 47 50 50  0  2054
> MatMultTranspose    1473 1.0 1.3272e+02 1.8 2.47e+10 1.1 3.2e+04 2.3e+05
> 0.0e+00 34 47 50 50  0  34 47 50 50  0  2175
> MatAssemblyBegin       1 1.0 6.4070e-0314.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 3.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 6.2698e-02 1.0 0.00e+00 0.0 1.1e+02 8.2e+04
> 1.2e+01  0  0  0  0  0   0  0  0  0  0     0
> MatView                1 1.0 2.4605e-04 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecView                1 1.0 1.1164e+0182.6 0.00e+00 0.0 2.2e+01 3.9e+05
> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> VecDot              2946 1.0 1.1499e+0234.8 5.78e+08 1.0 0.0e+00 0.0e+00
> 2.9e+03 13  1  0  0 66  13  1  0  0 66    60
> VecNorm             1475 1.0 1.0804e+01 7.7 2.90e+08 1.0 0.0e+00 0.0e+00
> 1.5e+03  2  1  0  0 33   2  1  0  0 33   322
> VecCopy                4 1.0 6.9451e-03 2.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet              8843 1.0 2.9336e+00 2.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPY             4420 1.0 1.0803e+01 2.3 8.68e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  2  2  0  0  0   2  2  0  0  0   964
> VecAYPX             2944 1.0 6.6637e+00 2.1 5.78e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  2  1  0  0  0   2  1  0  0  0  1041
> VecAssemblyBegin       6 1.0 3.7719e-0214.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.8e+01  0  0  0  0  0   0  0  0  0  0     0
> VecAssemblyEnd         6 1.0 5.3883e-05 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecPointwiseMult    2948 1.0 8.7972e+00 2.3 2.89e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  2  1  0  0  0   2  1  0  0  0   395
> VecScatterBegin     2947 1.0 3.3624e+00 4.3 0.00e+00 0.0 6.5e+04 2.3e+05
> 0.0e+00  1  0100100  0   1  0100100  0     0
> VecScatterEnd       2947 1.0 8.0508e+0119.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 12  0  0  0  0  12  0  0  0  0     0
> KSPSetup               1 1.0 1.1752e-02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 2.8016e+02 1.0 5.20e+10 1.1 6.5e+04 2.3e+05
> 4.4e+03 91100100100 99  91100100100 99  2173
> PCSetUp                1 1.0 5.9605e-06 2.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> PCApply             2948 1.0 8.8313e+00 2.3 2.89e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  2  1  0  0  0   2  1  0  0  0   393
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Matrix     3              3     56593044     0
>                  Vec    18             18     10534536     0
>          Vec Scatter     2              2         1736     0
>            Index Set     4              4       305424     0
>        Krylov Solver     1              1          832     0
>       Preconditioner     1              1          872     0
>               Viewer     1              1          544     0
>
> ========================================================================================================================
> Average time to get PetscTime(): 6.48499e-06
> Average time for MPI_Barrier(): 0.000102377
> Average time for zero size MPI_Send(): 2.15967e-05
> #PETSc Option Table entries:
> -ksp_type bicg
> -log_summary
> -pc_type jacobi
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Tue Nov 23 15:54:45 2010
> Configure options: --known-level1-dcache-size=65536
> --known-level1-dcache-linesize=64 --known-level1-dcache-assoc=2
> --known-memcmp-ok=1 --known-sizeof-char=1 --known-sizeof-void-p=8
> --known-sizeof-short=2 --known-sizeof-int=4 --known-sizeof-long=8
> --known-sizeof-long-long=8 --known-sizeof-float=4 --known-sizeof-double=8
> --known-sizeof-size_t=8 --known-bits-per-byte=8 --known-sizeof-MPI_Comm=4
> --known-sizeof-MPI_Fint=4 --known-mpi-long-double=1 --with-cc=gcc
> --with-cxx=g++ --with-F77=ifort --with-FC=ifort --download-f-blas-lapack=1
> --download-superlu-dist=1 --download-hypre=1 --download-trilinos=1
> --download-parmetis=1 --download-mumps=1 --download-scalapack=1
> --download-blacs=1 --download-mpich=1 --with-debugging=0 --with-batch
> --known-mpi-shared=1
> -----------------------------------------
>
>
> ----------------------
> (5) k=16
> ----------------------
> Process 0 of total 16 on wmss04
> Process 8 of total 16 on wmss04
> Process 4 of total 16 on wmss04
> Process 12 of total 16 on wmss04
> Process 2 of total 16 on wmss04
> Process 6 of total 16 on wmss04
> Process 5 of total 16 on wmss04
> Process 11 of total 16 on wmss04
> Process 14 of total 16 on wmss04
> Process 7 of total 16 on wmss04
> Process 15 of total 16 on wmss04
> Process 3 of total 16 on wmss04
> Process 13 of total 16 on wmss04
> Process 10 of total 16 on wmss04
> Process 9 of total 16 on wmss04
> Process 1 of total 16 on wmss04
> The dimension of Matrix A is n = 1177754
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> Begin Assembly:
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> End Assembly.
> =========================================================
> Begin the solving:
> =========================================================
> The current time is: Mon Dec 20 18:02:28 2010
>
> KSP Object:
>   type: bicg
>   maximum iterations=10000, initial guess is zero
>   tolerances:  relative=1e-07, absolute=1e-50, divergence=10000
>   left preconditioning
>   using PRECONDITIONED norm type for convergence test
> PC Object:
>   type: jacobi
>   linear system matrix = precond matrix:
>   Matrix Object:
>     type=mpisbaij, rows=1177754, cols=1177754
>     total: nonzeros=49908476, allocated nonzeros=49908476
>         block size is 1
>
> norm(b-Ax)=1.15892e-06
> Norm of error 1.15892e-06, Iterations 1497
> =========================================================
> The solver has finished successfully!
> =========================================================
> The solving time is 337.91 seconds.
> The time accuracy is 1e-06 second.
> The current time is Mon Dec 20 18:08:06 2010
>
>
> ************************************************************************************************************************
> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r
> -fCourier9' to print this document            ***
>
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> ./AMG_Solver_MPI on a linux-gnu named wmss04 with 16 processors, by cheny
> Mon Dec 20 19:08:06 2010
> Using Petsc Release Version 3.1.0, Patch 5, Mon Sep 27 11:51:54 CDT 2010
>
>                          Max       Max/Min        Avg      Total
> Time (sec):           3.534e+02      1.00001   3.534e+02
> Objects:              3.000e+01      1.00000   3.000e+01
> Flops:                3.964e+10      1.13060   3.864e+10  6.182e+11
> Flops/sec:            1.122e+08      1.13060   1.093e+08  1.749e+09
> MPI Messages:         1.200e+04      3.99917   7.127e+03  1.140e+05
> MPI Message Lengths:  1.950e+09      7.80999   1.819e+05  2.074e+10
> MPI Reductions:       4.549e+03      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N
> --> 2N flops
>                             and VecAXPY() for complex vectors of length N
> --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages
> ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts
> %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 3.5342e+02 100.0%  6.1820e+11 100.0%  1.140e+05
> 100.0%  1.819e+05      100.0%  4.533e+03  99.6%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops: Max - maximum over all processors
>                    Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flops in this
> phase
>       %M - percent messages in this phase     %L - percent message lengths
> in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over
> all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)
> Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult             1498 1.0 1.8860e+02 1.7 1.88e+10 1.1 5.7e+04 1.8e+05
> 0.0e+00 40 47 50 50  0  40 47 50 50  0  1555
> MatMultTranspose    1497 1.0 1.4165e+02 1.3 1.88e+10 1.1 5.7e+04 1.8e+05
> 0.0e+00 35 47 50 50  0  35 47 50 50  0  2069
> MatAssemblyBegin       1 1.0 1.0044e-0217.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 3.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 7.3835e-02 1.0 0.00e+00 0.0 1.8e+02 6.7e+04
> 1.2e+01  0  0  0  0  0   0  0  0  0  0     0
> MatView                1 1.0 2.6107e-04 2.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecView                1 1.0 1.1282e+01109.0 0.00e+00 0.0 3.0e+01 2.9e+05
> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> VecDot              2994 1.0 6.7490e+0119.6 4.41e+08 1.0 0.0e+00 0.0e+00
> 3.0e+03 10  1  0  0 66  10  1  0  0 66   104
> VecNorm             1499 1.0 1.3431e+0110.8 2.21e+08 1.0 0.0e+00 0.0e+00
> 1.5e+03  2  1  0  0 33   2  1  0  0 33   263
> VecCopy                4 1.0 7.3178e-03 2.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet              8987 1.0 3.1772e+00 3.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPY             4492 1.0 1.1361e+01 3.1 6.61e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  2  2  0  0  0   2  2  0  0  0   931
> VecAYPX             2992 1.0 7.3248e+00 2.5 4.40e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  1  1  0  0  0   1  1  0  0  0   962
> VecAssemblyBegin       6 1.0 3.6338e-0212.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.8e+01  0  0  0  0  0   0  0  0  0  0     0
> VecAssemblyEnd         6 1.0 7.2002e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecPointwiseMult    2996 1.0 9.7892e+00 2.4 2.21e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  2  1  0  0  0   2  1  0  0  0   360
> VecScatterBegin     2995 1.0 4.0570e+00 5.5 0.00e+00 0.0 1.1e+05 1.8e+05
> 0.0e+00  1  0100100  0   1  0100100  0     0
> VecScatterEnd       2995 1.0 1.7309e+0251.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 22  0  0  0  0  22  0  0  0  0     0
> KSPSetup               1 1.0 1.3058e-02 2.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 3.2641e+02 1.0 3.96e+10 1.1 1.1e+05 1.8e+05
> 4.5e+03 92100100100 99  92100100100 99  1893
> PCSetUp                1 1.0 8.1062e-06 1.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> PCApply             2996 1.0 9.8336e+00 2.4 2.21e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00  2  1  0  0  0   2  1  0  0  0   359
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Matrix     3              3     42424600     0
>                  Vec    18             18      7924896     0
>          Vec Scatter     2              2         1736     0
>            Index Set     4              4       247632     0
>        Krylov Solver     1              1          832     0
>       Preconditioner     1              1          872     0
>               Viewer     1              1          544     0
>
> ========================================================================================================================
> Average time to get PetscTime(): 6.10352e-06
> Average time for MPI_Barrier(): 0.000129986
> Average time for zero size MPI_Send(): 2.08169e-05
> #PETSc Option Table entries:
> -ksp_type bicg
> -log_summary
> -pc_type jacobi
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Tue Nov 23 15:54:45 2010
> Configure options: --known-level1-dcache-size=65536
> --known-level1-dcache-linesize=64 --known-level1-dcache-assoc=2
> --known-memcmp-ok=1 --known-sizeof-char=1 --known-sizeof-void-p=8
> --known-sizeof-short=2 --known-sizeof-int=4 --known-sizeof-long=8
> --known-sizeof-long-long=8 --known-sizeof-float=4 --known-sizeof-double=8
> --known-sizeof-size_t=8 --known-bits-per-byte=8 --known-sizeof-MPI_Comm=4
> --known-sizeof-MPI_Fint=4 --known-mpi-long-double=1 --with-cc=gcc
> --with-cxx=g++ --with-F77=ifort --with-FC=ifort --download-f-blas-lapack=1
> --download-superlu-dist=1 --download-hypre=1 --download-trilinos=1
> --download-parmetis=1 --download-mumps=1 --download-scalapack=1
> --download-blacs=1 --download-mpich=1 --with-debugging=0 --with-batch
> --known-mpi-shared=1
> -----------------------------------------
>
>
>
>
> On Mon, Dec 20, 2010 at 6:06 PM, Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Mon, Dec 20, 2010 at 8:46 AM, Yongjun Chen <yjxd.chen at gmail.com> wrote:
>>
>>>
>>> Hi everyone,
>>>
>>>
>>> I use PETSc (version 3.1-p5) to solve a linear problem Ax=b. The matrix A
>>> and the right-hand-side vector b are read from files. The dimension of A is
>>> 1.2 million x 1.2 million. I am pretty sure the matrix A and vector b have
>>> been read correctly.
>>>
>>> I compiled the program as an optimized build (--with-debugging=0) and
>>> tested the speed-up performance on two servers, and I have found that the
>>> performance is very poor.
>>>
>>> Of the two servers, one has 4 CPUs with 4 cores per CPU, i.e., 16 cores in
>>> total. The other has 4 CPUs with 12 cores per CPU, i.e., 48 cores in total.
>>>
>>> On each of them, as the number of cores k increases from 1 to 8
>>> (mpiexec -n k ./Solver_MPI -pc_type jacobi -ksp_type gmres), the speed-up
>>> increases from 1 to 6; but when k increases from 9 to 16 (on the first
>>> server) or up to 48 (on the second server), the speed-up first decreases
>>> and then levels off at a constant value of about 5.0 (first server) or 4.5
>>> (second server).
>>>
>>
>> We cannot say anything at all without -log_summary data for your runs.
>>
>>    Matt
>>
>>
>>> Actually, the program LAMMPS speeds up excellently on these two servers.
>>>
>>> Any comments are very much appreciated! Thanks!
>>>
>>>
>>>
>>>
>>> --------------------------------------------------------------------------------------------------------------------------
>>>
>>> PS: the related code is as follows:
>>>
>>> // first, read A and b from files
>>> ...
>>> // then
>>>
>>>     ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
>>>     ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
>>>     ierr = VecAssemblyBegin(b); CHKERRQ(ierr);
>>>     ierr = VecAssemblyEnd(b); CHKERRQ(ierr);
>>>
>>>     ierr = MatSetOption(A,MAT_SYMMETRIC,PETSC_TRUE); CHKERRQ(ierr);
>>>     ierr = MatGetRowUpperTriangular(A); CHKERRQ(ierr);
>>>     ierr = KSPCreate(PETSC_COMM_WORLD,&ksp); CHKERRQ(ierr);
>>>
>>>     ierr = KSPSetOperators(ksp,A,A,DIFFERENT_NONZERO_PATTERN); CHKERRQ(ierr);
>>>     ierr = KSPGetPC(ksp,&pc); CHKERRQ(ierr);
>>>     ierr = KSPSetTolerances(ksp,1.e-7,PETSC_DEFAULT,PETSC_DEFAULT,PETSC_DEFAULT); CHKERRQ(ierr);
>>>     ierr = KSPSetFromOptions(ksp); CHKERRQ(ierr);
>>>
>>>     ierr = KSPSolve(ksp,b,x); CHKERRQ(ierr);
>>>
>>>     ierr = KSPView(ksp,PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr);
>>>     ierr = KSPGetSolution(ksp,&x); CHKERRQ(ierr);
>>>
>>>     ierr = VecAssemblyBegin(x); CHKERRQ(ierr);
>>>     ierr = VecAssemblyEnd(x); CHKERRQ(ierr);
>>> ...
>>>
>>>
>>>
>>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>
>
>
> --
> Dr.Yongjun Chen
> Room 2507, Building M
> Institute of Materials Science and Technology
> Technical University of Hamburg-Harburg
> Eißendorfer Straße 42, 21073 Hamburg, Germany.
> Tel:  +49 (0)40-42878-4386
> Fax: +49 (0)40-42878-4070
> E-mail: yjxd.chen at gmail.com
>
>


-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener