profiling PETSc code
Matt Funk
mafunk at nmsu.edu
Tue Aug 1 16:45:22 CDT 2006
Hi,
I need to profile my code, specifically the KSPSolve(...) call. Basically, I am
(just for testing) setting up the identity matrix and passing in the solution
and RHS vectors. I solve the system 4000 times or so (400 time steps, that
is). Anyway, when I run on 1 processor it takes essentially no time (ca. 7
secs); when I run on 4 procs it takes 96 secs. I use an external timer to
profile just the call to KSPSolve, which is where these times come from.
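In essence, the timed loop looks roughly like this (just a sketch; ksp, b, and x
are placeholder names for my solver and vectors, and PetscGetTime() from 2.3.x
is one way to read a wall clock -- newer releases call it PetscTime()):

  PetscLogDouble t0, t1;
  PetscErrorCode ierr;
  PetscInt       i;

  ierr = PetscGetTime(&t0);CHKERRQ(ierr);             /* start external timer             */
  for (i = 0; i < 4000; i++) {                        /* ~4000 solves, as described above */
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);         /* A is the identity, so x == b     */
  }
  ierr = PetscGetTime(&t1);CHKERRQ(ierr);             /* stop external timer              */
  ierr = PetscPrintf(PETSC_COMM_WORLD, "KSPSolve time: %g s\n", t1 - t0);CHKERRQ(ierr);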
So I read through chapter 12. Unfortunately I cannot find ex21.c in
src/ksp/ksp/examples/tutorials, so I couldn't look at the code.
So what I do is the following. At the end of my program, right before calling
PetscFinalize(), I call PetscLogPrintSummary(...). I registered stage 0, and
then right before the KSPSolve() call I do: m_ierr = PetscLogStagePush(0);
and right after I do: m_ierr = PetscLogStagePop(). There are also barriers
right before the push call and right after the pop call.
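In case it helps, a minimal sketch of that instrumentation looks roughly like
this (ksp, b, and x are placeholder names, the stage name is illustrative, and
the argument order of PetscLogStageRegister differs between PETSc releases):

  PetscLogStage  stage;                               /* a plain int in older PETSc       */
  PetscErrorCode ierr;

  /* modern argument order shown; 2.3.x takes (&stage, "name") instead */
  ierr = PetscLogStageRegister("KSPSolve stage", &stage);CHKERRQ(ierr);

  ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr); /* barrier right before the push    */
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = PetscLogStagePop();CHKERRQ(ierr);
  ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr); /* barrier right after the pop      */

  /* ...and at the very end, right before finalizing: */
  ierr = PetscLogPrintSummary(PETSC_COMM_WORLD, PETSC_NULL);CHKERRQ(ierr);
  ierr = PetscFinalize();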
The output I get (for the single-processor run) is:
...
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
/home2/users/mafunk/AMR/Chombo.2.0/example/node/maskExec/testNodePoisson2d.Linux.g++.g77.MPI32.ex on a linux-gnu named .1 with 1 processor, by mafunk Tue Aug 1 14:47:43 2006
Using Petsc Release Version 2.3.1, Patch 12, Wed Apr 5 17:55:50 CDT 2006
BK revision: balay at asterix.mcs.anl.gov|ChangeSet|20060405225457|13540
                         Max       Max/Min        Avg      Total
Time (sec):           4.167e+01      1.00000   4.167e+01
Objects:              1.500e+01      1.00000   1.500e+01
Flops:                2.094e+08      1.00000   2.094e+08  2.094e+08
Flops/sec:            5.025e+06      1.00000   5.025e+06  5.025e+06
Memory:               4.700e+05      1.00000              4.700e+05
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       2.402e+04      1.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flops
                          and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 4.1647e+01  99.9%  2.0937e+08 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.402e+04 100.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops/sec: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option, #
# To get timing results run config/configure.py #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
##########################################################
# #
# WARNING!!! #
# #
# This code was run without the PreLoadBegin() #
# macros. To get timing results we always recommend #
# preloading. otherwise timing numbers may be #
# meaningless. #
##########################################################
Event                Count      Time (sec)     Flops/sec                          --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type          Creations   Destructions    Memory   Descendants' Mem.
--- Event Stage 0: Main Stage
Index Set                3             3          35976          0
Vec                      8             8         169056          0
Matrix                   2             2          23304          0
Krylov Solver            1             1          17216          0
Preconditioner           1             1            168          0
========================================================================================================================
Average time to get PetscTime(): 2.14577e-07
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
Configure run at: Tue Aug 1 11:28:46 2006
Configure options: --with-debugging=1 --with-blas-lapack-dir=/usr/local --with-mpi-dir=/usr --with-log=1 --with-shared=0
....
So I want to find out why KSPSolve() takes so long in parallel. However, there
is no summary for stage 0. Does someone know why this is? Am I using it the
wrong way? I compiled the libraries with --with-log=1 --with-debugging=1.
thanks
mat