[petsc-users] Petsc Profiling for Each Function
Barry Smith
bsmith at mcs.anl.gov
Wed Nov 23 18:24:22 CST 2016
The efficiency of the actual code looks OK; that is, you are not going to suddenly make it go much faster by optimizing some piece of it. Two things are making it slower than ideal:
1) 6960/91 ≈ 76, so the linear solver is using on average 76 iterations per solve. This is high. What problem are you solving? Have you tried -pc_type gamg to reduce the iteration counts? (A sample run line is sketched after point 2 below.)
2) TSStep() is taking only 68% of the runtime, meaning something else in your code is taking 32% of the time. Are you doing a lot of IO or something? Ideally the code of yours that calls TSStep() should take just a few percent of the time. Even if you halved the time of TSStep(), your code would still take a lot of time due to the 32% spent in your own code.
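As a concrete starting point for 1), something like the following (the executable name is simply copied from your -log_view header; -ksp_monitor and -ksp_converged_reason just print the iteration counts and the convergence reason, so you can see whether GAMG actually helps):

  ./pStreamer -pc_type gamg -ksp_monitor -ksp_converged_reason -log_view

Whether GAMG pays off depends on the operator; it is designed for elliptic-type problems, which is why I ask what you are solving.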
To specifically answer your question: you can easily run the code on one process with Apple's Instruments tool or with Intel's profiling tools to see which functions take the most time, but, to repeat the point above, this won't help you much because the individual PETSc routines are already optimized.
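What does help is adding your own logging stages and events around the pieces of your driver that run outside TSStep(); they then get their own rows in the -log_view table, with max, ratio, and average over the processes, exactly like the PETSc events in your output. Below is a minimal, self-contained sketch; the stage/event names and the dummy work loop are placeholders, not anything taken from your code.

#include <petscsys.h>

int main(int argc, char **argv)
{
  PetscErrorCode ierr;
  PetscClassId   classid;
  PetscLogStage  userStage;
  PetscLogEvent  userWork;
  PetscInt       i;
  PetscReal      s = 0.0;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);

  /* Register a class id, a stage, and an event for your own code */
  ierr = PetscClassIdRegister("UserCode", &classid);CHKERRQ(ierr);
  ierr = PetscLogStageRegister("User code", &userStage);CHKERRQ(ierr);
  ierr = PetscLogEventRegister("UserWork", classid, &userWork);CHKERRQ(ierr);

  /* Wrap whatever runs outside TSStep(); the loop below is just dummy work */
  ierr = PetscLogStagePush(userStage);CHKERRQ(ierr);
  ierr = PetscLogEventBegin(userWork, 0, 0, 0, 0);CHKERRQ(ierr);
  for (i = 0; i < 1000000; i++) s += 1.0/(1.0 + (PetscReal)i);
  ierr = PetscLogEventEnd(userWork, 0, 0, 0, 0);CHKERRQ(ierr);
  ierr = PetscLogStagePop();CHKERRQ(ierr);

  ierr = PetscPrintf(PETSC_COMM_WORLD, "sum = %g\n", (double)s);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}

Run with -log_view as before and the "UserWork" event (under the "User code" stage) shows where the remaining 32% goes; the Max and Ratio columns then tell you how that time is distributed across the processes.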
Barry
> On Nov 23, 2016, at 6:04 PM, Sharp Stone <thronesf at gmail.com> wrote:
>
> Thanks, Barry, for the reply. When I use the option -log_view, I get the PETSc API function profiling posted below. But I wonder: is there any way for me to get the performance/execution profiling of the functions in my own code, like in Xcode profiling, so that I would know which parts need special optimization? I badly need your instructions now. Thank you very much in advance!
>
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> ./pStreamer on a arch-darwin-c-opt named rcc-dhcp-249-233.sr.unh.edu with 4 processors, by GPL Wed Nov 23 18:49:59 2016
> Using Petsc Release Version 3.5.4, May, 23, 2015
>
> Max Max/Min Avg Total
> Time (sec): 1.026e+02 1.00004 1.026e+02
> Objects: 3.420e+02 1.00000 3.420e+02
> Flops: 2.509e+10 1.00200 2.506e+10 1.002e+11
> Flops/sec: 2.445e+08 1.00202 2.441e+08 9.764e+08
> MPI Messages: 1.387e+04 1.99971 1.041e+04 4.163e+04
> MPI Message Lengths: 2.783e+07 1.99971 2.006e+03 8.350e+07
> MPI Reductions: 1.425e+04 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N --> 2N flops
> and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total counts %Total Avg %Total counts %Total
> 0: Main Stage: 1.0264e+02 100.0% 1.0022e+11 100.0% 4.163e+04 100.0% 2.006e+03 100.0% 1.425e+04 100.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> Avg. len: average message length (bytes)
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
> %T - percent time in this phase %F - percent flops in this phase
> %M - percent messages in this phase %L - percent message lengths in this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event Count Time (sec) Flops --- Global --- --- Stage --- Total
> Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> VecDot 6778 1.0 5.0633e+00 1.1 1.70e+09 1.0 0.0e+00 0.0e+00 6.8e+03 5 7 0 0 48 5 7 0 0 48 1345
> VecDotNorm2 3389 1.0 3.1751e+00 1.3 1.70e+09 1.0 0.0e+00 0.0e+00 3.4e+03 3 7 0 0 24 3 7 0 0 24 2144
> VecNorm 3571 1.0 2.1366e+00 1.1 8.98e+08 1.0 0.0e+00 0.0e+00 3.6e+03 2 4 0 0 25 2 4 0 0 25 1679
> VecScale 49 1.0 1.2606e-02 1.2 6.16e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1952
> VecCopy 157 1.0 4.5638e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet 7140 1.0 5.1525e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAXPY 146 1.0 7.0036e-02 1.2 3.97e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 2266
> VecAXPBYCZ 6778 1.0 4.9426e+00 1.2 3.41e+09 1.0 0.0e+00 0.0e+00 0.0e+00 4 14 0 0 0 4 14 0 0 0 2755
> VecWAXPY 6785 1.0 3.9687e+00 1.0 1.71e+09 1.0 0.0e+00 0.0e+00 0.0e+00 4 7 0 0 0 4 7 0 0 0 1716
> VecScatterBegin 6918 1.0 3.4263e-01 1.7 0.00e+00 0.0 4.2e+04 2.0e+03 0.0e+00 0 0100100 0 0 0100100 0 0
> VecScatterEnd 6918 1.0 8.6652e+0029.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 4 0 0 0 0 4 0 0 0 0 0
> MatMult 6869 1.0 2.6045e+01 1.5 7.76e+09 1.0 4.1e+04 2.0e+03 0.0e+00 21 31 99 99 0 21 31 99 99 0 1190
> MatSolve 6960 1.0 1.7961e+01 1.0 7.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 17 31 0 0 0 17 31 0 0 0 1747
> MatLUFactorNum 7 1.0 5.3501e-02 1.1 9.62e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 719
> MatILUFactorSym 7 1.0 5.1423e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatAssemblyBegin 14 1.0 3.7722e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.8e+01 0 0 0 0 0 0 0 0 0 0 0
> MatAssemblyEnd 14 1.0 6.4600e-02 1.0 0.00e+00 0.0 8.4e+01 5.0e+02 5.6e+01 0 0 0 0 0 0 0 0 0 0 0
> MatGetRowIJ 7 1.0 3.4150e-06 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatGetOrdering 7 1.0 5.8947e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSetUp 14 1.0 3.9437e+00132.6 0.00e+00 0.0 0.0e+00 0.0e+00 8.4e+01 2 0 0 0 1 2 0 0 0 1 0
> KSPSolve 91 1.0 6.6894e+01 1.2 2.51e+10 1.0 4.1e+04 2.0e+03 1.4e+04 59100 99 99 98 59100 99 99 98 1497
> PCSetUp 14 1.0 1.1119e-01 1.1 9.62e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 346
> PCSetUpOnBlocks 91 1.0 1.1087e-01 1.1 9.62e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 347
> PCApply 6960 1.0 1.8604e+01 1.0 7.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 18 31 0 0 0 18 31 0 0 0 1687
> TSStep 6 1.0 7.0114e+01 1.0 2.01e+10 1.0 3.3e+04 2.0e+03 1.1e+04 68 80 80 80 78 68 80 80 80 78 1146
> TSFunctionEval 6 1.0 7.0102e+01 1.0 2.01e+10 1.0 3.3e+04 2.0e+03 1.1e+04 68 80 80 80 78 68 80 80 80 78 1146
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
> Vector 212 207 180438384 0
> Vector Scatter 9 8 1014488 0
> Matrix 28 28 147732508 0
> Distributed Mesh 4 3 13776 0
> Star Forest Bipartite Graph 8 6 4800 0
> Discrete System 4 3 2376 0
> Index Set 39 39 8087676 0
> IS L to G Mapping 2 1 504596 0
> Krylov Solver 15 15 17456 0
> Preconditioner 15 15 14208 0
> TSAdapt 1 1 1208 0
> TS 1 1 1168 0
> DMTS 1 1 712 0
> SNES 1 1 1336 0
> DMSNES 1 1 672 0
> Viewer 1 0 0 0
> ========================================================================================================================
> Average time to get PetscTime(): 3.82e-08
> Average time for MPI_Barrier(): 6.2797e-05
> Average time for zero size MPI_Send(): 1.98967e-05
> #PETSc Option Table entries:
> -log_view
> #End of PETSc Option Table entries
>
> On Wed, Oct 19, 2016 at 8:33 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
> To start, you should build PETSc with the ./configure option --with-debugging=0 and run the code with the option -log_view; this will give you a high-level view of where it is spending its time. Feel free to email back the output. From that you can focus on the parts that are both taking a lot of time AND running slowly, which gives a good idea of what needs to be optimized.
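> A minimal sketch of that sequence, with placeholder names (your actual configure options, launcher, and executable will differ):
>
> ./configure --with-debugging=0   # plus the options you normally configure PETSc with
> make
> mpiexec -n 4 ./your_code -log_view   # your_code is a placeholder for your executable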
>
> Barry
>
> > On Oct 19, 2016, at 3:15 PM, Sharp Stone <thronesf at gmail.com> wrote:
> >
> > Dear all,
> >
> > I am now using a PETSc code which needs to be optimized. But after some attempts, I still don't know how to get the profiling for each function on each process. I mean, for each process, how can I find the execution time of each function?
> >
> > Thanks!
> >
> > --
> > Best regards,
> >
> > Feng
>
>
>
>
> --
> Best regards,
>
> Feng