profiling PETSc code

Tue Aug 1 16:57:22 CDT 2006

Take out your stage push/pop for the moment, and the log_summary
call. Just run with -log_summary and send the output as a test.

  Thanks,

     Matt

On 8/1/06, Matt Funk <mafunk at nmsu.edu> wrote:
> Hi,
>
> i need to profile my code. Specifically the KSPSolve(...) call. Basically i am
> (just for testing) setting up the identity matrix and pass in the solution
> and RHS vectors. I solve the system 4000 times or so (400 times steps that
> is). Anyway, when i run on 1 processor it takes essentially no time (ca 7
> secs). When i run on 4 procs it takes 96 secs. I use an external timer to
> profile just the call to KSPSolve which is where the times come from.
>
> So i read through chap 12. Unfortunately i cannot find ex21.c in
> src/ksp/ksp/examples/tutorials so i couldn't look at the code.
>
> So what i do is the following. At the end of my program, right before calling
> PetscFinalize() i call PetscLogPrintSummary(...). I registered stage 0 and
> then right before the KSPSolve() call i do :   m_ierr = PetscLogStagePush(0);
> and right after i do: m_ierr = PetscLogStagePop(). There is also barriers
> right before the push call and right after the pop call.
>
> The output i get (for the single proc run) is:
>
> ...
>
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> /home2/users/mafunk/AMR/Chombo.2.0/example/node/maskExec/testNodePoisson2d.Linux.g++.g77.MPI32.ex
> on a linux-gnu named .1 with 1 processor, by mafunk Tue Aug  1 14:47:43 2006
> Using Petsc Release Version 2.3.1, Patch 12, Wed Apr  5 17:55:50 CDT 2006
> BK revision: balay at asterix.mcs.anl.gov|ChangeSet|20060405225457|13540
>
>                          Max       Max/Min        Avg      Total
> Time (sec):           4.167e+01      1.00000   4.167e+01
> Objects:              1.500e+01      1.00000   1.500e+01
> Flops:                2.094e+08      1.00000   2.094e+08  2.094e+08
> Flops/sec:            5.025e+06      1.00000   5.025e+06  5.025e+06
> Memory:               4.700e+05      1.00000              4.700e+05
> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
> MPI Reductions:       2.402e+04      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N -->
> 2N flops
>                             and VecAXPY() for complex vectors of length N -->
> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---
> -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total
> Avg         %Total   counts   %Total
>  0:      Main Stage: 4.1647e+01  99.9%  2.0937e+08 100.0%  0.000e+00   0.0%
> 0.000e+00        0.0%  2.402e+04 100.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting
> output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops/sec: Max - maximum over all processors
>                        Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flops in this phase
>       %M - percent messages in this phase     %L - percent message lengths in
> this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over
> all processors)
> ------------------------------------------------------------------------------------------------------------------------
>
>
>       ##########################################################
>       #                                                        #
>       #                          WARNING!!!                    #
>       #                                                        #
>       #   This code was compiled with a debugging option,      #
>       #   To get timing results run config/configure.py        #
>       #   using --with-debugging=no, the performance will      #
>       #   be generally two or three times faster.              #
>       #                                                        #
>       ##########################################################
>
>
>
>
>       ##########################################################
>       #                                                        #
>       #                          WARNING!!!                    #
>       #                                                        #
>       #   This code was run without the PreLoadBegin()         #
>       #   macros. To get timing results we always recommend    #
>       #   preloading. otherwise timing numbers may be          #
>       #   meaningless.                                         #
>       ##########################################################
>
>
> Event                Count      Time (sec)     Flops/sec
> --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions   Memory  Descendants' Mem.
>
> --- Event Stage 0: Main Stage
>
>            Index Set     3              3      35976     0
>                  Vec     8              8     169056     0
>               Matrix     2              2      23304     0
>        Krylov Solver     1              1      17216     0
>       Preconditioner     1              1        168     0
> ========================================================================================================================
> Average time to get PetscTime(): 2.14577e-07
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8
> Configure run at: Tue Aug  1 11:28:46 2006
> Configure options: --with-debugging=1 --with-blas-lapack-dir=/usr/local
> --with-mpi-dir=/usr --with-log=1 --with-shared=0
>
>
> ....
>
>
> So i want to find out why KSPSolve() takes so long in parallel. However, there
> i no summary for stage 0. Does someone know why this is? Am i using it in a
> wrong way? I compiled the libraries with -with_log=1 -with-debugging=1
>
> thanks
> mat
>
>

-- 
"Failure has a thousand explanations. Success doesn't need one" -- Sir
Alec Guiness