Slow speed after changing from serial to parallel (with ex2f.F)

Matthew Knepley knepley at gmail.com
Tue Apr 15 22:08:02 CDT 2008


On Tue, Apr 15, 2008 at 10:01 PM, Ben Tay <zonexo at gmail.com> wrote:
>
>  Hi Matthew,
>
>  You mention that the unbalanced events take 0.01% of the time and speedup
> is terrible. Where did you get this information? Are you referring to Global

1) Compare the time of the events you point out (about 1.0e-2 s) with the
total time, or with the time for KSPSolve (about 1.0e2 s).

2) Compare the time for KSPSolve on 1 and 2 procs (a quick sketch of how to
time it directly is below).
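
For 2), a minimal sketch of timing KSPSolve directly (shown with the C
interface and the PETSc 2.3.x name PetscGetTime(); your ex2f.F is Fortran,
but the calls are analogous):

    PetscLogDouble t0, t1;
    PetscErrorCode ierr;

    ierr = PetscGetTime(&t0);CHKERRQ(ierr);
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);   /* ksp, b, x assumed set up as in the example */
    ierr = PetscGetTime(&t1);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD, "KSPSolve took %g s\n", t1 - t0);CHKERRQ(ierr);

Run once on 1 process and once on 2 and compare the two numbers; the KSPSolve
line in the -log_summary output below reports the same quantity.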

> %T? As for the speedup, do you look at the time reported by the "time"
> command, i.e. 63.17user 0.05system 1:04.31elapsed 98%CPU (0avgtext+0avgdata
> 0maxresident)?
>
>  I think you may be right. My school uses:
>
> The Supercomputing & Visualisation Unit, Computer Centre is pleased to
> announce the addition of a new cluster of Linux-based compute servers,
> consisting of a total of 64 servers (60 dual-core and 4 quad-core systems).
> Each of the compute nodes in the cluster is equipped with the following
> configuration:
>
>    No. of nodes   Processor                    Qty per node   Total cores per node   Memory per node
>    4              Quad-Core Intel Xeon X5355   2              8                      16 GB
>    60             Dual-Core Intel Xeon 5160    2              4                      8 GB
>  When I run on 2 processors, it states I'm running on 2*atlas3-c45. Does that
> mean the two processes share memory bandwidth? And if I run on 4 processors,
> is that equivalent to using 2 memory pipes?
>
>  I also got a reply from my school's engineer:
>
>  For the queue mcore_parallel, LSF assigns the compute nodes automatically.
> For most applications, running with 2*atlas3-c45 and 2*atlas3-c50 may be
> faster. However, it is not certain whether 2*atlas3-c45 means the job runs on
> the two cores of one dual-core CPU, or on two separate CPUs. This is not
> controllable.
>
>  So what can I do on my side to ensure speedup? I hope I do not have to
> switch from PETSc to other solvers.

Switching solvers will do you no good at all. The easiest thing to do is to get
these guys to improve the scheduler. Every half-decent scheduler can ensure that
you get separate processors. There is no excuse for forcing you onto dual cores.
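
One check you can do yourself: a small MPI program (plain C, standard MPI calls
only) that prints which node each rank lands on. If both ranks report the same
hostname, you are on one node and the two processes share its memory bandwidth:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      int  rank, len;
      char name[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Get_processor_name(name, &len);   /* hostname of the node this rank runs on */
      printf("rank %d runs on %s\n", rank, name);
      MPI_Finalize();
      return 0;
    }

Submit it through the same LSF queue as the real job to see how the scheduler
actually places the processes.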

   Matt

>  Thanks a lot!
>
>
>
>  Matthew Knepley wrote:
>  On Tue, Apr 15, 2008 at 9:08 PM, Ben Tay <zonexo at gmail.com> wrote:
>
>
>  Hi,
>
>  I just tested the ex2f.F example, changing m and n to 600. Here are the
> results for 1, 2 and 4 processors. Interestingly, MatAssemblyBegin,
> MatGetOrdering and KSPSetup have ratios >> 1. The time taken decreases as the
> number of processors increases, although the speedup is not 1:1. I thought
> this example should scale well, shouldn't it? Is there something wrong with
> my installation, then?
>
>  1) Notice that the events that are unbalanced take 0.01% of the time.
> Not important.
>
> 2) The speedup really stinks, even though this is a small problem. Are you
> sure that you are actually running on two processors with separate memory
> pipes, and not on one dual core?
>
>  Matt
>
>
>
>  Thank you.
>
>  1 processor:
>
>  Norm of error 0.3371E+01 iterations 1153
>
> ************************************************************************************************************************
>  *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r
> -fCourier9' to print this document ***
>
> ************************************************************************************************************************
>
>  ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
>  ./a.out on a atlas3-mp named atlas3-c58 with 1 processor, by g0306332 Wed
> Apr 16 10:03:12 2008
>  Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG
> revision: 414581156e67e55c761739b0deb119f7590d0f4b
>
>  Max Max/Min Avg Total
>  Time (sec): 1.222e+02 1.00000 1.222e+02
>  Objects: 4.400e+01 1.00000 4.400e+01
>  Flops: 3.547e+10 1.00000 3.547e+10 3.547e+10
>  Flops/sec: 2.903e+08 1.00000 2.903e+08 2.903e+08
>  MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
>  MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
>  MPI Reductions: 2.349e+03 1.00000
>
>  Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>  e.g., VecAXPY() for real vectors of length N
> --> 2N flops
>  and VecAXPY() for complex vectors of length N
> --> 8N flops
>
>  Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages ---
> -- Message Lengths -- -- Reductions --
>  Avg %Total Avg %Total counts %Total
> Avg %Total counts %Total
>  0: Main Stage: 1.2216e+02 100.0% 3.5466e+10 100.0% 0.000e+00 0.0%
> 0.000e+00 0.0% 2.349e+03 100.0%
>
>
> ------------------------------------------------------------------------------------------------------------------------
>  See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
>  Phase summary info:
>  Count: number of times phase was executed
>  Time and Flops/sec: Max - maximum over all processors
>  Ratio - ratio of maximum to minimum over all
> processors
>  Mess: number of messages sent
>  Avg. len: average message length
>  Reduct: number of global reductions
>  Global: entire computation
>  Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
>  %T - percent time in this phase %F - percent flops in this
> phase
>  %M - percent messages in this phase %L - percent message lengths
> in this phase
>  %R - percent reductions in this phase
>  Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over
> all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
>
>
>  ##########################################################
>  # #
>  # WARNING!!! #
>  # #
>  # This code was run without the PreLoadBegin() #
>  # macros. To get timing results we always recommend #
>  # preloading. otherwise timing numbers may be #
>  # meaningless. #
>  ##########################################################
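
The warning above refers to PETSc's preloading macros. A minimal sketch of the
idea, assuming the macro spellings of PETSc 2.3.x (PreLoadBegin/PreLoadEnd;
later releases rename them PetscPreLoadBegin/PetscPreLoadEnd):

    PreLoadBegin(PETSC_TRUE, "Solve");
      /* The body runs twice; the first (preload) pass keeps one-time costs
         such as paging in the executable out of the reported timings. */
      ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
    PreLoadEnd();

For a solve that runs for ~100 s, as here, preloading mainly affects the small
one-time events, not the KSPSolve numbers themselves.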
>
>  Event Count Time (sec) Flops/sec
> --- Global --- --- Stage --- Total
>  Max Ratio Max Ratio Max Ratio Mess Avg len
> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
>  --- Event Stage 0: Main Stage
>
>  MatMult 1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 13 11 0 0 0 13 11 0 0 0 239
>  MatSolve 1192 1.0 3.1017e+01 1.0 1.24e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 25 11 0 0 0 25 11 0 0 0 124
>  MatLUFactorNum 1 1.0 3.6166e-02 1.0 8.94e+07 1.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 89
>  MatILUFactorSym 1 1.0 1.9690e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  MatAssemblyBegin 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  MatAssemblyEnd 1 1.0 2.6258e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  MatGetOrdering 1 1.0 5.4259e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  VecMDot 1153 1.0 3.2664e+01 1.0 3.92e+08 1.0 0.0e+00 0.0e+00
> 1.2e+03 27 36 0 0 49 27 36 0 0 49 392
>  VecNorm 1193 1.0 2.0344e+00 1.0 4.22e+08 1.0 0.0e+00 0.0e+00
> 1.2e+03 2 2 0 0 51 2 2 0 0 51 422
>  VecScale 1192 1.0 6.9107e-01 1.0 6.21e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 1 1 0 0 0 1 1 0 0 0 621
>  VecCopy 39 1.0 3.4571e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  VecSet 41 1.0 1.1397e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  VecAXPY 78 1.0 6.9354e-01 1.0 8.10e+07 1.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 81
>  VecMAXPY 1192 1.0 3.7492e+01 1.0 3.63e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 31 38 0 0 0 31 38 0 0 0 363
>  VecNormalize 1192 1.0 2.7284e+00 1.0 4.72e+08 1.0 0.0e+00 0.0e+00
> 1.2e+03 2 4 0 0 51 2 4 0 0 51 472
>  KSPGMRESOrthog 1153 1.0 6.7939e+01 1.0 3.76e+08 1.0 0.0e+00 0.0e+00
> 1.2e+03 56 72 0 0 49 56 72 0 0 49 376
>  KSPSetup 1 1.0 1.1651e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  KSPSolve 1 1.0 1.2159e+02 1.0 2.92e+08 1.0 0.0e+00 0.0e+00
> 2.3e+03100100 0 0100 100100 0 0100 292
>  PCSetUp 1 1.0 2.3852e-01 1.0 1.36e+07 1.0 0.0e+00 0.0e+00
> 3.0e+00 0 0 0 0 0 0 0 0 0 0 14
>  PCApply 1192 1.0 3.1021e+01 1.0 1.24e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 25 11 0 0 0 25 11 0 0 0 124
>
> ------------------------------------------------------------------------------------------------------------------------
>
>  Memory usage is given in bytes:
>
>  Object Type Creations Destructions Memory Descendants' Mem.
>
>  --- Event Stage 0: Main Stage
>
>  Matrix 2 2 54691212 0
>  Index Set 3 3 4321032 0
>  Vec 37 37 103708408 0
>  Krylov Solver 1 1 17216 0
>  Preconditioner 1 1 168 0
>
> ========================================================================================================================
>  Average time to get PetscTime(): 1.90735e-07
>  OptionTable: -log_summary
>  Compiled without FORTRAN kernels
>  Compiled with full precision matrices (default)
>  sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
>  Configure run at: Tue Jan 8 22:22:08 2008
>  Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8
> --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8
> --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4
> --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0
> --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0
> --with-batch=1 --with-mpi-shared=0
> --with-mpi-include=/usr/local/topspin/mpi/mpich/include
> --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a
> --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun
> --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0
>  -----------------------------------------
>  Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01
>  Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12
> 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
>  Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8
>  Using PETSc arch: atlas3-mpi
>  -----------------------------------------
>  85.53user 1.22system 2:02.65elapsed 70%CPU (0avgtext+0avgdata
> 0maxresident)k
>  0inputs+0outputs (16major+46429minor)pagefaults 0swaps
>  Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary
>
>
>  2 processors:
>
>  Norm of error 0.3231E+01 iterations 1177
>
> ************************************************************************************************************************
>  *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r
> -fCourier9' to print this document ***
>
> ************************************************************************************************************************
>
>  ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
>  ./a.out on a atlas3-mp named atlas3-c58 with 2 processors, by g0306332 Wed
> Apr 16 09:48:37 2008
>  Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG
> revision: 414581156e67e55c761739b0deb119f7590d0f4b
>
>  Max Max/Min Avg Total
>  Time (sec): 1.034e+02 1.00000 1.034e+02
>  Objects: 5.500e+01 1.00000 5.500e+01
>  Flops: 1.812e+10 1.00000 1.812e+10 3.625e+10
>  Flops/sec: 1.752e+08 1.00000 1.752e+08 3.504e+08
>  MPI Messages: 1.218e+03 1.00000 1.218e+03 2.436e+03
>  MPI Message Lengths: 5.844e+06 1.00000 4.798e+03 1.169e+07
>  MPI Reductions: 1.204e+03 1.00000
>
>  Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>  e.g., VecAXPY() for real vectors of length N
> --> 2N flops
>  and VecAXPY() for complex vectors of length N
> --> 8N flops
>
>  Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages ---
> -- Message Lengths -- -- Reductions --
>  Avg %Total Avg %Total counts %Total
> Avg %Total counts %Total
>  0: Main Stage: 1.0344e+02 100.0% 3.6250e+10 100.0% 2.436e+03 100.0%
> 4.798e+03 100.0% 2.407e+03 100.0%
>
>
> ------------------------------------------------------------------------------------------------------------------------
>  See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
>  Phase summary info:
>  Count: number of times phase was executed
>  Time and Flops/sec: Max - maximum over all processors
>  Ratio - ratio of maximum to minimum over all
> processors
>  Mess: number of messages sent
>  Avg. len: average message length
>  Reduct: number of global reductions
>  Global: entire computation
>  Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
>  %T - percent time in this phase %F - percent flops in this
> phase
>  %M - percent messages in this phase %L - percent message lengths
> in this phase
>  %R - percent reductions in this phase
>  Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over
> all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
>
>
>  ##########################################################
>  # #
>  # WARNING!!! #
>  # #
>  # This code was run without the PreLoadBegin() #
>  # macros. To get timing results we always recommend #
>  # preloading. otherwise timing numbers may be #
>  # meaningless. #
>  ##########################################################
>
>  Event Count Time (sec) Flops/sec
> --- Global --- --- Stage --- Total
>  Max Ratio Max Ratio Max Ratio Mess Avg len
> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
>  --- Event Stage 0: Main Stage
>
>  MatMult 1217 1.0 1.2502e+01 1.2 1.88e+08 1.2 2.4e+03 4.8e+03
> 0.0e+00 11 11100100 0 11 11100100 0 315
>  MatSolve 1217 1.0 2.1088e+01 1.2 1.10e+08 1.2 0.0e+00 0.0e+00
> 0.0e+00 19 11 0 0 0 19 11 0 0 0 187
>  MatLUFactorNum 1 1.0 8.2862e-02 2.9 5.58e+07 2.9 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 39
>  MatILUFactorSym 1 1.0 3.3310e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  MatAssemblyBegin 1 1.0 1.5567e-011854.8 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  MatAssemblyEnd 1 1.0 1.0352e-01 1.0 0.00e+00 0.0 2.0e+00 2.4e+03
> 7.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  MatGetRowIJ 1 1.0 3.0994e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  MatGetOrdering 1 1.0 5.0953e-0210.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  VecMDot 1177 1.0 4.0427e+01 1.1 1.85e+08 1.1 0.0e+00 0.0e+00
> 1.2e+03 37 36 0 0 49 37 36 0 0 49 323
>  VecNorm 1218 1.0 1.5475e+01 1.9 5.25e+07 1.9 0.0e+00 0.0e+00
> 1.2e+03 12 2 0 0 51 12 2 0 0 51 57
>  VecScale 1217 1.0 5.7866e-01 1.0 3.97e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 1 1 0 0 0 1 1 0 0 0 757
>  VecCopy 40 1.0 6.6697e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  VecSet 1259 1.0 1.5276e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>  VecAXPY 80 1.0 2.1163e-01 2.4 3.21e+08 2.4 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 272
>  VecMAXPY 1217 1.0 2.2980e+01 1.4 4.28e+08 1.4 0.0e+00 0.0e+00
> 0.0e+00 19 38 0 0 0 19 38 0 0 0 606
>  VecScatterBegin 1217 1.0 3.6620e-02 1.4 0.00e+00 0.0 2.4e+03 4.8e+03
> 0.0e+00 0 0100100 0 0 0100100 0 0
>  VecScatterEnd 1217 1.0 8.1980e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>  VecNormalize 1217 1.0 1.6030e+01 1.8 7.36e+07 1.8 0.0e+00 0.0e+00
> 1.2e+03 12 4 0 0 51 12 4 0 0 51 82
>  KSPGMRESOrthog 1177 1.0 5.7248e+01 1.0 2.35e+08 1.0 0.0e+00 0.0e+00
> 1.2e+03 55 72 0 0 49 55 72 0 0 49 457
>  KSPSetup 2 1.0 1.0363e-0110.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  KSPSolve 1 1.0 1.0289e+02 1.0 1.76e+08 1.0 2.4e+03 4.8e+03
> 2.4e+03 99100100100100 99100100100100 352
>  PCSetUp 2 1.0 1.5685e-01 2.3 2.40e+07 2.3 0.0e+00 0.0e+00
> 3.0e+00 0 0 0 0 0 0 0 0 0 0 21
>  PCSetUpOnBlocks 1 1.0 1.5668e-01 2.3 2.41e+07 2.3 0.0e+00 0.0e+00
> 3.0e+00 0 0 0 0 0 0 0 0 0 0 21
>  PCApply 1217 1.0 2.2625e+01 1.2 1.02e+08 1.2 0.0e+00 0.0e+00
> 0.0e+00 20 11 0 0 0 20 11 0 0 0 174
>
> ------------------------------------------------------------------------------------------------------------------------
>
>  Memory usage is given in bytes:
>
>  Object Type Creations Destructions Memory Descendants' Mem.
>
>  --- Event Stage 0: Main Stage
>
>  Matrix 4 4 34540820 0
>  Index Set 5 5 2164120 0
>  Vec 41 41 53315992 0
>  Vec Scatter 1 1 0 0
>  Krylov Solver 2 2 17216 0
>  Preconditioner 2 2 256 0
>
> ========================================================================================================================
>  Average time to get PetscTime(): 1.90735e-07
>  Average time for MPI_Barrier(): 8.10623e-07
>  Average time for zero size MPI_Send(): 2.98023e-06
>  OptionTable: -log_summary
>  Compiled without FORTRAN kernels
>  Compiled with full precision matrices (default)
>  sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
>  Configure run at: Tue Jan 8 22:22:08 2008
>
>  42.64user 0.28system 1:08.08elapsed 63%CPU (0avgtext+0avgdata
> 0maxresident)k
>  0inputs+0outputs (18major+28609minor)pagefaults 0swaps
>  1:08.08elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
>  0inputs+0outputs (18major+23666minor)pagefaults 0swaps
>
>
>  4 processors:
>
>  Norm of error 0.3090E+01 iterations 937
>  63.17user 0.05system 1:04.31elapsed 98%CPU (0avgtext+0avgdata
> 0maxresident)k
>  0inputs+0outputs (16major+13520minor)pagefaults 0swaps
>  53.13user 0.06system 1:04.31elapsed 82%CPU (0avgtext+0avgdata
> 0maxresident)k
>  0inputs+0outputs (15major+13414minor)pagefaults 0swaps
>  58.55user 0.23system 1:04.31elapsed 91%CPU (0avgtext+0avgdata
> 0maxresident)k
>  0inputs+0outputs (17major+18383minor)pagefaults 0swaps
>  20.36user 0.67system 1:04.33elapsed 32%CPU (0avgtext+0avgdata
> 0maxresident)k
>  0inputs+0outputs (14major+18392minor)pagefaults 0swaps
>  Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary
>
>
>
>
> ************************************************************************************************************************
>  *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r
> -fCourier9' to print this document ***
>
> ************************************************************************************************************************
>
>  ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
>  ./a.out on a atlas3-mp named atlas3-c45 with 4 processors, by g0306332 Wed
> Apr 16 09:55:16 2008
>  Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG
> revision: 414581156e67e55c761739b0deb119f7590d0f4b
>
>  Max Max/Min Avg Total
>  Time (sec): 6.374e+01 1.00001 6.374e+01
>  Objects: 5.500e+01 1.00000 5.500e+01
>  Flops: 7.209e+09 1.00016 7.208e+09 2.883e+10
>  Flops/sec: 1.131e+08 1.00017 1.131e+08 4.524e+08
>  MPI Messages: 1.940e+03 2.00000 1.455e+03 5.820e+03
>  MPI Message Lengths: 9.307e+06 2.00000 4.798e+03 2.792e+07
>  MPI Reductions: 4.798e+02 1.00000
>
>  Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>  e.g., VecAXPY() for real vectors of length N
> --> 2N flops
>  and VecAXPY() for complex vectors of length N
> --> 8N flops
>
>  Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages ---
> -- Message Lengths -- -- Reductions --
>  Avg %Total Avg %Total counts %Total
> Avg %Total counts %Total
>  0: Main Stage: 6.3737e+01 100.0% 2.8832e+10 100.0% 5.820e+03 100.0%
> 4.798e+03 100.0% 1.919e+03 100.0%
>
>
> ------------------------------------------------------------------------------------------------------------------------
>  See the 'Profiling' chapter of the users' manual for details on
> interpreting output.
>  Phase summary info:
>  Count: number of times phase was executed
>  Time and Flops/sec: Max - maximum over all processors
>  Ratio - ratio of maximum to minimum over all
> processors
>  Mess: number of messages sent
>  Avg. len: average message length
>  Reduct: number of global reductions
>  Global: entire computation
>  Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
>  %T - percent time in this phase %F - percent flops in this
> phase
>  %M - percent messages in this phase %L - percent message lengths
> in this phase
>  %R - percent reductions in this phase
>  Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over
> all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
>
>
>  ##########################################################
>  # #
>  # WARNING!!! #
>  # #
>  # This code was run without the PreLoadBegin() #
>  # macros. To get timing results we always recommend #
>  # preloading. otherwise timing numbers may be #
>  # meaningless. #
>  ##########################################################
>
>
>  Event Count Time (sec) Flops/sec
> --- Global --- --- Stage --- Total
>  Max Ratio Max Ratio Max Ratio Mess Avg len
> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
>  --- Event Stage 0: Main Stage
>
>  MatMult 969 1.0 9.7564e+00 3.6 2.87e+08 3.6 5.8e+03 4.8e+03
> 0.0e+00 8 11100100 0 8 11100100 0 321
>  MatSolve 969 1.0 1.4244e+01 3.3 1.79e+08 3.3 0.0e+00 0.0e+00
> 0.0e+00 11 11 0 0 0 11 11 0 0 0 220
>  MatLUFactorNum 1 1.0 5.2070e-02 6.2 9.63e+07 6.2 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 62
>  MatILUFactorSym 1 1.0 1.7911e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  MatAssemblyBegin 1 1.0 2.1741e-01164.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  MatAssemblyEnd 1 1.0 3.5663e-02 1.0 0.00e+00 0.0 6.0e+00 2.4e+03
> 7.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  MatGetRowIJ 1 1.0 2.1458e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  MatGetOrdering 1 1.0 1.2779e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  VecMDot 937 1.0 3.5634e+01 2.1 1.52e+08 2.1 0.0e+00 0.0e+00
> 9.4e+02 48 36 0 0 49 48 36 0 0 49 292
>  VecNorm 970 1.0 1.4387e+01 2.9 3.55e+07 2.9 0.0e+00 0.0e+00
> 9.7e+02 18 2 0 0 51 18 2 0 0 51 49
>  VecScale 969 1.0 1.5714e-01 2.1 1.14e+09 2.1 0.0e+00 0.0e+00
> 0.0e+00 0 1 0 0 0 0 1 0 0 0 2220
>  VecCopy 32 1.0 1.8988e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  VecSet 1003 1.0 1.1690e+00 3.8 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>  VecAXPY 64 1.0 2.1091e-02 1.1 6.07e+08 1.1 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 2185
>  VecMAXPY 969 1.0 1.4823e+01 3.4 6.26e+08 3.4 0.0e+00 0.0e+00
> 0.0e+00 11 38 0 0 0 11 38 0 0 0 747
>  VecScatterBegin 969 1.0 2.3238e-02 2.1 0.00e+00 0.0 5.8e+03 4.8e+03
> 0.0e+00 0 0100100 0 0 0100100 0 0
>  VecScatterEnd 969 1.0 1.4613e+0083.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>  VecNormalize 969 1.0 1.4468e+01 2.8 5.15e+07 2.8 0.0e+00 0.0e+00
> 9.7e+02 18 4 0 0 50 18 4 0 0 50 72
>  KSPGMRESOrthog 937 1.0 3.9924e+01 1.3 1.68e+08 1.3 0.0e+00 0.0e+00
> 9.4e+02 59 72 0 0 49 59 72 0 0 49 521
>  KSPSetup 2 1.0 2.6190e-02 8.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>  KSPSolve 1 1.0 6.2496e+01 1.0 1.15e+08 1.0 5.8e+03 4.8e+03
> 1.9e+03 98100100100 99 98100100100 99 461
>  PCSetUp 2 1.0 7.1320e-02 4.1 4.59e+07 4.1 0.0e+00 0.0e+00
> 3.0e+00 0 0 0 0 0 0 0 0 0 0 45
>  PCSetUpOnBlocks 1 1.0 7.1230e-02 4.1 4.62e+07 4.1 0.0e+00 0.0e+00
> 3.0e+00 0 0 0 0 0 0 0 0 0 0 45
>  PCApply 969 1.0 1.5379e+01 3.3 1.66e+08 3.3 0.0e+00 0.0e+00
> 0.0e+00 12 11 0 0 0 12 11 0 0 0 203
>
> ------------------------------------------------------------------------------------------------------------------------
>
>  Memory usage is given in bytes:
>
>  Object Type Creations Destructions Memory Descendants' Mem.
>
>  --- Event Stage 0: Main Stage
>
>  Matrix 4 4 17264420 0
>  Index Set 5 5 1084120 0
>  Vec 41 41 26675992 0
>  Vec Scatter 1 1 0 0
>  Krylov Solver 2 2 17216 0
>  Preconditioner 2 2 256 0
>
> ========================================================================================================================
>  Average time to get PetscTime(): 1.90735e-07
>  Average time for MPI_Barrier(): 6.00815e-06
>  Average time for zero size MPI_Send(): 5.42402e-05
>  OptionTable: -log_summary
>  Compiled without FORTRAN kernels
>  Compiled with full precision matrices (default)
>  sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
>  Configure run at: Tue Jan 8 22:22:08 2008
>
>
>
>  Matthew Knepley wrote:
>  The convergence here is just horrendous. Have you tried using LU to check
> your implementation? All the time is in the solve right now. I would first
> try a direct method (at least on a small problem) and then try to understand
> the convergence behavior. MUMPS can actually scale very well for big
> problems.
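
A minimal sketch of what "try a direct method" could look like (C interface;
KSPPREONLY and PCLU are the PETSc names, and the same thing can be done without
recompiling via the runtime options -ksp_type preonly -pc_type lu; parallel LU
additionally needs an external package such as MUMPS):

    PC pc;
    ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
    ierr = KSPSetType(ksp, KSPPREONLY);CHKERRQ(ierr);  /* no Krylov iterations, just apply the PC */
    ierr = PCSetType(pc, PCLU);CHKERRQ(ierr);          /* full LU factorization as the "preconditioner" */
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

If the direct solve gives the expected answer on a small problem, the slow runs
above point to a convergence/preconditioning issue rather than a bug in the
assembly.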
>
>  Matt
>



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener



