profiling PETSc code
Matthew Knepley
knepley at gmail.com
Tue Aug 1 16:57:22 CDT 2006
Take out your stage push/pop for the moment, and the PetscLogPrintSummary()
call. Just run with -log_summary and send the output as a test.
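
That is, with no explicit logging calls at all. A minimal sketch (the
NULL spelling is the current one; older releases use PETSC_NULL):

  #include <petscksp.h>

  int main(int argc, char **argv)
  {
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
    /* ... set up the system and call KSPSolve() as usual; no PetscLog* calls ... */
    ierr = PetscFinalize(); /* -log_summary makes this print the performance table */
    return ierr;
  }

Run it as, e.g., mpirun -np 4 ./ex -log_summary (the executable name ./ex
is just a placeholder).
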
Thanks,
Matt
On 8/1/06, Matt Funk <mafunk at nmsu.edu> wrote:
> Hi,
>
> I need to profile my code, specifically the KSPSolve(...) call. Basically I am
> (just for testing) setting up the identity matrix and passing in the solution
> and RHS vectors. I solve the system 4000 times or so (400 time steps, that
> is). Anyway, when I run on 1 processor it takes essentially no time (ca. 7
> secs). When I run on 4 procs it takes 96 secs. I use an external timer to
> profile just the call to KSPSolve, which is where the times come from.
>
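> Schematically, the test looks like this (a trimmed, self-contained sketch,
> not my actual Chombo code; n and the loop count are placeholders, and the
> calls are spelled in current-PETSc form, while 2.3.x differs in a few
> signatures, e.g. KSPSetOperators() also took a MatStructure argument and
> XXXDestroy() took the object rather than its address):
>
>   #include <petscksp.h>
>
>   int main(int argc, char **argv)
>   {
>     Mat            A;
>     Vec            x, b;
>     KSP            ksp;
>     PetscInt       i, rstart, rend, n = 1000, step;
>     PetscErrorCode ierr;
>
>     ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
>
>     /* identity matrix: a unit entry on each locally owned diagonal row */
>     ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
>     ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
>     ierr = MatSetFromOptions(A);CHKERRQ(ierr);
>     ierr = MatSetUp(A);CHKERRQ(ierr);
>     ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
>     for (i = rstart; i < rend; i++) {
>       ierr = MatSetValue(A, i, i, 1.0, INSERT_VALUES);CHKERRQ(ierr);
>     }
>     ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>     ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>
>     ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, n, &b);CHKERRQ(ierr);
>     ierr = VecDuplicate(b, &x);CHKERRQ(ierr);
>     ierr = VecSet(b, 1.0);CHKERRQ(ierr);
>
>     ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
>     ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
>     ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
>
>     /* repeated solves, as in the 400-time-step test */
>     for (step = 0; step < 400; step++) {
>       ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
>     }
>
>     ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
>     ierr = MatDestroy(&A);CHKERRQ(ierr);
>     ierr = VecDestroy(&b);CHKERRQ(ierr);
>     ierr = VecDestroy(&x);CHKERRQ(ierr);
>     ierr = PetscFinalize();
>     return ierr;
>   }
>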
> So I read through chapter 12. Unfortunately I cannot find ex21.c in
> src/ksp/ksp/examples/tutorials, so I couldn't look at the code.
>
> So what I do is the following. At the end of my program, right before calling
> PetscFinalize(), I call PetscLogPrintSummary(...). I registered stage 0, and
> right before the KSPSolve() call I do m_ierr = PetscLogStagePush(0); right
> after it I do m_ierr = PetscLogStagePop(). There are also barriers right
> before the push call and right after the pop call, as sketched below.
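>
> In code the pattern is roughly this (a sketch; m_ierr is my error-code
> member, and m_ksp, m_rhs, m_phi are placeholder names, not the real
> Chombo ones):
>
>   m_ierr = MPI_Barrier(PETSC_COMM_WORLD);   /* barrier before the push */
>   m_ierr = PetscLogStagePush(0);            /* the stage I registered */
>   m_ierr = KSPSolve(m_ksp, m_rhs, m_phi);
>   m_ierr = PetscLogStagePop();
>   m_ierr = MPI_Barrier(PETSC_COMM_WORLD);   /* barrier after the pop */
>
>   /* ... and at the very end, right before PetscFinalize(): */
>   m_ierr = PetscLogPrintSummary(PETSC_COMM_WORLD, PETSC_NULL);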
>
> The output I get (for the single-proc run) is:
>
> ...
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> /home2/users/mafunk/AMR/Chombo.2.0/example/node/maskExec/testNodePoisson2d.Linux.g++.g77.MPI32.ex on a linux-gnu named .1 with 1 processor, by mafunk Tue Aug 1 14:47:43 2006
> Using Petsc Release Version 2.3.1, Patch 12, Wed Apr 5 17:55:50 CDT 2006
> BK revision: balay at asterix.mcs.anl.gov|ChangeSet|20060405225457|13540
>
>                          Max       Max/Min    Avg        Total
> Time (sec):           4.167e+01   1.00000   4.167e+01
> Objects:              1.500e+01   1.00000   1.500e+01
> Flops:                2.094e+08   1.00000   2.094e+08  2.094e+08
> Flops/sec:            5.025e+06   1.00000   5.025e+06  5.025e+06
> Memory:               4.700e+05   1.00000   4.700e+05
> MPI Messages:         0.000e+00   0.00000   0.000e+00  0.000e+00
> MPI Message Lengths:  0.000e+00   0.00000   0.000e+00  0.000e+00
> MPI Reductions:       2.402e+04   1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>                           and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:  ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                       Avg     %Total     Avg     %Total    counts  %Total     Avg        %Total    counts  %Total
>  0:  Main Stage:  4.1647e+01  99.9%  2.0937e+08 100.0%  0.000e+00   0.0%  0.000e+00      0.0%  2.402e+04 100.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops/sec: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> Avg. len: average message length
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>    %T - percent time in this phase         %F - percent flops in this phase
>    %M - percent messages in this phase     %L - percent message lengths in this phase
>    %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
>
>
> ##########################################################
> #                                                        #
> #                       WARNING!!!                       #
> #                                                        #
> #   This code was compiled with a debugging option,      #
> #   To get timing results run config/configure.py        #
> #   using --with-debugging=no, the performance will      #
> #   be generally two or three times faster.              #
> #                                                        #
> ##########################################################
>
>
>
>
> ##########################################################
> #                                                        #
> #                       WARNING!!!                       #
> #                                                        #
> #   This code was run without the PreLoadBegin()         #
> #   macros. To get timing results we always recommend    #
> #   preloading. otherwise timing numbers may be          #
> #   meaningless.                                         #
> ##########################################################
>
>
> Event                Count      Time (sec)     Flops/sec                         --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions   Memory   Descendants' Mem.
>
> --- Event Stage 0: Main Stage
>
>       Index Set          3             3         35976          0
>             Vec          8             8        169056          0
>          Matrix          2             2         23304          0
>   Krylov Solver          1             1         17216          0
>  Preconditioner          1             1           168          0
> ========================================================================================================================
> Average time to get PetscTime(): 2.14577e-07
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
> Configure run at: Tue Aug 1 11:28:46 2006
> Configure options: --with-debugging=1 --with-blas-lapack-dir=/usr/local --with-mpi-dir=/usr --with-log=1 --with-shared=0
>
>
> ....
>
>
> So I want to find out why KSPSolve() takes so long in parallel. However, there
> is no summary for stage 0. Does someone know why this is? Am I using it in the
> wrong way? I compiled the libraries with --with-log=1 --with-debugging=1.
>
> thanks
> mat
>
>
--
"Failure has a thousand explanations. Success doesn't need one" -- Sir
Alec Guiness