profiling PETSc code
Matt Funk
mafunk at nmsu.edu
Tue Aug 1 16:45:22 CDT 2006
Hi,
I need to profile my code, specifically the KSPSolve(...) call. Basically, I am
(just for testing) setting up the identity matrix and passing in the solution
and RHS vectors. I solve the system 4000 times or so (400 time steps, that
is). Anyway, when I run on 1 processor it takes essentially no time (ca. 7
secs); when I run on 4 procs it takes 96 secs. I use an external timer to
profile just the call to KSPSolve, which is where these times come from.
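In essence, the timed loop looks roughly like this (just a sketch; ksp, b, and x
are placeholder names for my solver and vectors, and PetscGetTime() from 2.3.x
is one way to read a wall clock -- newer releases call it PetscTime()):

  PetscLogDouble t0, t1;
  PetscErrorCode ierr;
  PetscInt       i;

  ierr = PetscGetTime(&t0);CHKERRQ(ierr);             /* start external timer             */
  for (i = 0; i < 4000; i++) {                        /* ~4000 solves, as described above */
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);         /* A is the identity, so x == b     */
  }
  ierr = PetscGetTime(&t1);CHKERRQ(ierr);             /* stop external timer              */
  ierr = PetscPrintf(PETSC_COMM_WORLD, "KSPSolve time: %g s\n", t1 - t0);CHKERRQ(ierr);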
So I read through chapter 12. Unfortunately I cannot find ex21.c in
src/ksp/ksp/examples/tutorials, so I couldn't look at the code.
So what I do is the following. At the end of my program, right before calling
PetscFinalize(), I call PetscLogPrintSummary(...). I registered stage 0, and
then right before the KSPSolve() call I do: m_ierr = PetscLogStagePush(0);
and right after I do: m_ierr = PetscLogStagePop(). There are also barriers
right before the push call and right after the pop call.
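In case it helps, a minimal sketch of that instrumentation looks roughly like
this (ksp, b, and x are placeholder names, the stage name is illustrative, and
the argument order of PetscLogStageRegister differs between PETSc releases):

  PetscLogStage  stage;                               /* a plain int in older PETSc       */
  PetscErrorCode ierr;

  /* modern argument order shown; 2.3.x takes (&stage, "name") instead */
  ierr = PetscLogStageRegister("KSPSolve stage", &stage);CHKERRQ(ierr);

  ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr); /* barrier right before the push    */
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = PetscLogStagePop();CHKERRQ(ierr);
  ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr); /* barrier right after the pop      */

  /* ...and at the very end, right before finalizing: */
  ierr = PetscLogPrintSummary(PETSC_COMM_WORLD, PETSC_NULL);CHKERRQ(ierr);
  ierr = PetscFinalize();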
The output I get (for the single-processor run) is:
...
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
/home2/users/mafunk/AMR/Chombo.2.0/example/node/maskExec/testNodePoisson2d.Linux.g++.g77.MPI32.ex on a linux-gnu named .1 with 1 processor, by mafunk Tue Aug 1 14:47:43 2006
Using Petsc Release Version 2.3.1, Patch 12, Wed Apr 5 17:55:50 CDT 2006
BK revision: balay at asterix.mcs.anl.gov|ChangeSet|20060405225457|13540
                         Max       Max/Min        Avg      Total
Time (sec):           4.167e+01      1.00000   4.167e+01
Objects:              1.500e+01      1.00000   1.500e+01
Flops:                2.094e+08      1.00000   2.094e+08  2.094e+08
Flops/sec:            5.025e+06      1.00000   5.025e+06  5.025e+06
Memory:               4.700e+05      1.00000              4.700e+05
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       2.402e+04      1.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flops
                          and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 4.1647e+01  99.9%  2.0937e+08 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.402e+04 100.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops/sec: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option, #
# To get timing results run config/configure.py #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
##########################################################
# #
# WARNING!!! #
# #
# This code was run without the PreLoadBegin() #
# macros. To get timing results we always recommend #
# preloading. otherwise timing numbers may be #
# meaningless. #
##########################################################
Event                Count      Time (sec)     Flops/sec                          --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type          Creations   Destructions    Memory   Descendants' Mem.
--- Event Stage 0: Main Stage
Index Set                3             3          35976          0
Vec                      8             8         169056          0
Matrix                   2             2          23304          0
Krylov Solver            1             1          17216          0
Preconditioner           1             1            168          0
========================================================================================================================
Average time to get PetscTime(): 2.14577e-07
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Tue Aug 1 11:28:46 2006
Configure options: --with-debugging=1 --with-blas-lapack-dir=/usr/local --with-mpi-dir=/usr --with-log=1 --with-shared=0
....
So I want to find out why KSPSolve() takes so long in parallel. However, there
is no summary for stage 0. Does someone know why this is? Am I using it the
wrong way? I compiled the libraries with --with-log=1 --with-debugging=1.
thanks
mat