[petsc-users] Petsc Profiling for Each Function

Barry Smith bsmith at mcs.anl.gov
Wed Nov 23 18:24:22 CST 2016


   The efficiency of the actual code looks ok; that is, you are not going to suddenly make it go much faster by optimizing some piece of it. There are two things making it slower than ideal:

1) 6960/91 = 76.5, so the linear solver (6960 PCApply calls over 91 KSPSolve calls) is averaging about 76 iterations per solve. This is high. What is the problem you are solving? Have you tried -pc_type gamg to reduce the iteration counts?
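
For example (adjust for however you normally launch the code; -pc_type gamg and -ksp_monitor are standard PETSc options, the rest is just your existing run line):

   mpiexec -n 4 ./pStreamer -pc_type gamg -ksp_monitor -log_view

-ksp_monitor will show whether the iteration counts actually drop; whether gamg is appropriate depends on what operator you are solving with, hence the question above.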


2) TSStep() is only taking 68% of the runtime, meaning something else in your code is taking 32% of the time. Are you doing a lot of IO or something? Ideally the code that calls TSStep() should take only a few percent of the total time. Even if you halved the time spent in TSStep(), the overall runtime would still be dominated by that other 32% in your code.
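
If you want to see where that other 32% goes inside -log_view itself, you can wrap the suspect parts of your driver in user-defined logging stages (this is what the "Set stages with PetscLogStagePush() and PetscLogStagePop()" note in your output refers to). A minimal sketch, where MyWriteOutput() is just a placeholder for whatever your code does outside TSStep():

   PetscLogStage iostage;
   PetscLogStageRegister("User IO/setup", &iostage);

   PetscLogStagePush(iostage);
   MyWriteOutput();          /* placeholder for your code outside TSStep() */
   PetscLogStagePop();

Each registered stage then gets its own section in the -log_view summary, so you can see at a glance how much of the 32% it accounts for.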

To answer your question specifically: you can easily run the code on one processor under Apple's Instruments or Intel's profiling tools to see which functions take the most time, but, to repeat the point above, that won't help you much because the individual routines are already optimized.
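
If you do want per-routine numbers from PETSc itself rather than from an external profiler, you can also register your own logging events around your functions, e.g. (sketch only; MyFunction() and the names are made up):

   PetscClassId  classid;
   PetscLogEvent USER_EVENT;
   PetscClassIdRegister("MyApp", &classid);
   PetscLogEventRegister("MyFunction", classid, &USER_EVENT);

   PetscLogEventBegin(USER_EVENT, 0, 0, 0, 0);
   MyFunction();             /* the routine you want timed */
   PetscLogEventEnd(USER_EVENT, 0, 0, 0, 0);

Each such event then appears as its own line in the -log_view event table, with max time and max/min ratio over the processes, just like the built-in PETSc events.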

   Barry


> On Nov 23, 2016, at 6:04 PM, Sharp Stone <thronesf at gmail.com> wrote:
> 
> Thanks Barry for the reply. When I use the option -log_view, I get the PETSc API function profiling posted below. But is there any way for me to get performance/execution profiling of the functions in my own code, like Xcode profiling, so that I would know which part needs special optimization? I would greatly appreciate your instructions. Thank you very much in advance!
> 
> 
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
> 
> ./pStreamer on a arch-darwin-c-opt named rcc-dhcp-249-233.sr.unh.edu with 4 processors, by GPL Wed Nov 23 18:49:59 2016
> Using Petsc Release Version 3.5.4, May, 23, 2015 
> 
>                          Max       Max/Min        Avg      Total 
> Time (sec):           1.026e+02      1.00004   1.026e+02
> Objects:              3.420e+02      1.00000   3.420e+02
> Flops:                2.509e+10      1.00200   2.506e+10  1.002e+11
> Flops/sec:            2.445e+08      1.00202   2.441e+08  9.764e+08
> MPI Messages:         1.387e+04      1.99971   1.041e+04  4.163e+04
> MPI Message Lengths:  2.783e+07      1.99971   2.006e+03  8.350e+07
> MPI Reductions:       1.425e+04      1.00000
> 
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N --> 2N flops
>                             and VecAXPY() for complex vectors of length N --> 8N flops
> 
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
>  0:      Main Stage: 1.0264e+02 100.0%  1.0022e+11 100.0%  4.163e+04 100.0%  2.006e+03      100.0%  1.425e+04 100.0% 
> 
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops: Max - maximum over all processors
>                    Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flops in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> 
> --- Event Stage 0: Main Stage
> 
> VecDot              6778 1.0 5.0633e+00 1.1 1.70e+09 1.0 0.0e+00 0.0e+00 6.8e+03  5  7  0  0 48   5  7  0  0 48  1345
> VecDotNorm2         3389 1.0 3.1751e+00 1.3 1.70e+09 1.0 0.0e+00 0.0e+00 3.4e+03  3  7  0  0 24   3  7  0  0 24  2144
> VecNorm             3571 1.0 2.1366e+00 1.1 8.98e+08 1.0 0.0e+00 0.0e+00 3.6e+03  2  4  0  0 25   2  4  0  0 25  1679
> VecScale              49 1.0 1.2606e-02 1.2 6.16e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1952
> VecCopy              157 1.0 4.5638e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet              7140 1.0 5.1525e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY              146 1.0 7.0036e-02 1.2 3.97e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2266
> VecAXPBYCZ          6778 1.0 4.9426e+00 1.2 3.41e+09 1.0 0.0e+00 0.0e+00 0.0e+00  4 14  0  0  0   4 14  0  0  0  2755
> VecWAXPY            6785 1.0 3.9687e+00 1.0 1.71e+09 1.0 0.0e+00 0.0e+00 0.0e+00  4  7  0  0  0   4  7  0  0  0  1716
> VecScatterBegin     6918 1.0 3.4263e-01 1.7 0.00e+00 0.0 4.2e+04 2.0e+03 0.0e+00  0  0100100  0   0  0100100  0     0
> VecScatterEnd       6918 1.0 8.6652e+0029.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
> MatMult             6869 1.0 2.6045e+01 1.5 7.76e+09 1.0 4.1e+04 2.0e+03 0.0e+00 21 31 99 99  0  21 31 99 99  0  1190
> MatSolve            6960 1.0 1.7961e+01 1.0 7.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 17 31  0  0  0  17 31  0  0  0  1747
> MatLUFactorNum         7 1.0 5.3501e-02 1.1 9.62e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   719
> MatILUFactorSym        7 1.0 5.1423e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyBegin      14 1.0 3.7722e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.8e+01  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd        14 1.0 6.4600e-02 1.0 0.00e+00 0.0 8.4e+01 5.0e+02 5.6e+01  0  0  0  0  0   0  0  0  0  0     0
> MatGetRowIJ            7 1.0 3.4150e-06 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         7 1.0 5.8947e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSetUp              14 1.0 3.9437e+00132.6 0.00e+00 0.0 0.0e+00 0.0e+00 8.4e+01  2  0  0  0  1   2  0  0  0  1     0
> KSPSolve              91 1.0 6.6894e+01 1.2 2.51e+10 1.0 4.1e+04 2.0e+03 1.4e+04 59100 99 99 98  59100 99 99 98  1497
> PCSetUp               14 1.0 1.1119e-01 1.1 9.62e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   346
> PCSetUpOnBlocks       91 1.0 1.1087e-01 1.1 9.62e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   347
> PCApply             6960 1.0 1.8604e+01 1.0 7.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 18 31  0  0  0  18 31  0  0  0  1687
> TSStep                 6 1.0 7.0114e+01 1.0 2.01e+10 1.0 3.3e+04 2.0e+03 1.1e+04 68 80 80 80 78  68 80 80 80 78  1146
> TSFunctionEval         6 1.0 7.0102e+01 1.0 2.01e+10 1.0 3.3e+04 2.0e+03 1.1e+04 68 80 80 80 78  68 80 80 80 78  1146
> ------------------------------------------------------------------------------------------------------------------------
> 
> Memory usage is given in bytes:
> 
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
> 
> --- Event Stage 0: Main Stage
> 
>                      Vector   212            207    180438384     0
>              Vector Scatter     9              8      1014488     0
>                      Matrix    28             28    147732508     0
>            Distributed Mesh     4              3        13776     0
> Star Forest Bipartite Graph     8              6         4800     0
>             Discrete System     4              3         2376     0
>                   Index Set    39             39      8087676     0
>           IS L to G Mapping     2              1       504596     0
>               Krylov Solver    15             15        17456     0
>              Preconditioner    15             15        14208     0
>                     TSAdapt     1              1         1208     0
>                          TS     1              1         1168     0
>                        DMTS     1              1          712     0
>                        SNES     1              1         1336     0
>                      DMSNES     1              1          672     0
>                      Viewer     1              0            0     0
> ========================================================================================================================
> Average time to get PetscTime(): 3.82e-08
> Average time for MPI_Barrier(): 6.2797e-05
> Average time for zero size MPI_Send(): 1.98967e-05
> #PETSc Option Table entries:
> -log_view
> #End of PETSc Option Table entries
> 
> On Wed, Oct 19, 2016 at 8:33 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
>    To start, you should build the code with the ./configure option --with-debugging=0 and run it with the option -log_view; this will give you a high-level view of where it is spending the time. Feel free to email back the output. From that you can focus on which parts are both taking a lot of time AND running slowly, and that gives a good idea of what needs to be optimized.
> 
>    Barry
> 
> > On Oct 19, 2016, at 3:15 PM, Sharp Stone <thronesf at gmail.com> wrote:
> >
> > Dear all,
> >
> > Now I'm using a PETSc code which needs to be optimized. But after trying, I still don't know how to get profiling for each function in each process. I mean, for each process, how can I find the execution time of each function?
> >
> > Thanks!
> >
> > --
> > Best regards,
> >
> > Feng
> 
> 
> 
> 
> -- 
> Best regards,
> 
> Feng


