[petsc-users] analysing petsc log information and standard execution time for Stokes like equations
Matthew Knepley
knepley at gmail.com
Wed Oct 29 11:35:14 CDT 2014
On Wed, Oct 29, 2014 at 11:18 AM, Bishesh Khanal <bisheshkh at gmail.com>
wrote:
> Dear all,
> The computer cluster I'm using to execute my Petsc based code (Stokes-like
> equation solver) is providing me results with a big variation on the
> execution times for almost the same problems.
> When I look at the -log_summary, and the time taken and Mflop/s for Petsc
> routines it looks like the issue is with cluster.
> I'd like to confirm this before looking into cluster related issues.
> I've provided below (after the questions) the relevant -log_summary
> outputs for two problems (also a file attached if the email does not show
> up the outputs nicely).
>
> The two problems, P1 and P2, solve a Stokes-like equation of the same size
> and with the same combination of KSP and PC, using 64 processors (8 nodes
> with 8 processes/node).
> P1 solves the equation 6 times while P2 solves it only once.
> The operators are slightly different because of slightly different
> boundaries, but since the number of iterations reported by ksp_monitor was
> almost the same, I guess this is not an issue.
>
> Now the problem is that the P2 run was much slower than P1. In other
> experiments too, the execution is sometimes quite fast but slow most of
> the time.
>
> I can see in the log_summary that various PETSc routines run much slower
> for P2 and have lower Mflop/s rates.
>
> My two questions:
> 1. Do these outputs confirm that the issue is with the cluster and not
> with my code? If yes, what kinds of things should I focus on or learn
> about when submitting jobs to the cluster? Any pointer would be helpful.
>
There are a lot of variables controlling performance on a cluster. You will
have to narrow this down little by little. However, there are enormous
differences below in the simplest operations, e.g. VecScale, even though
roughly the same number of flops were computed. Thus, something is going
very wrong. My first two guesses, and things you should eliminate:
1) Someone else was running at the same time
2) The mapping from processes to nodes was very different
You can usually control 2) with your submission system or mpiexec call; a
rough sketch is below.
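For example, with an Open MPI style mpiexec the mapping can be pinned down
explicitly (a sketch only; the exact flags depend on your MPI implementation
and batch system, and the executable name is just taken from your logs):

  # place 8 ranks on each of the 8 nodes and bind each rank to a core
  mpiexec -n 64 --map-by ppr:8:node --bind-to core ./AdLemMain -log_summary <other options>

It is also worth comparing the "Average time for MPI_Barrier()" lines at the
bottom of the two logs (1.3e-05 s for P1 vs 2.7e-04 s for P2); a difference
that large usually points at the network, node placement, or interference
from other jobs rather than at your code.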
Thanks,
Matt
> 2. If I set the equations for constant viscosity as:
> div(grad(u)) - grad(p) = f
> div(u) + kp = g
> with k=1 in some regions and 0 in most other regions, and with f and g
> spatially varying functions,
> and solve the system with 64 to 128 processors using the KSP and PC
> options below (a sketch of the full invocation follows this question):
> -pc_fieldsplit_type schur -pc_fieldsplit_schur_precondition self
> -pc_fieldsplit_dm_splits 0 -pc_fieldsplit_0_fields 0,1,2
> -pc_fieldsplit_1_fields 3 -fieldsplit_0_pc_type hypre
>
> What order of magnitude of execution time should I target as reasonable
> for solving this system with, say, around 128 processors?
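>
> Putting these options together, the solve is launched roughly as follows
> (a sketch only: -pc_type fieldsplit is implied by the options above, the
> outer KSP options and grid-size arguments are omitted, and the mpiexec
> syntax depends on the MPI implementation):
>
>   mpiexec -n 128 ./AdLemMain -pc_type fieldsplit \
>     -pc_fieldsplit_type schur -pc_fieldsplit_schur_precondition self \
>     -pc_fieldsplit_dm_splits 0 -pc_fieldsplit_0_fields 0,1,2 \
>     -pc_fieldsplit_1_fields 3 -fieldsplit_0_pc_type hypre \
>     -ksp_monitor -log_summary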
>
> -log_summary output for P1
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> /epi/asclepios2/bkhanal/works/AdLemModel/build/src/AdLemMain on a arch-linux2-cxx-opt named nef017 with 64 processors, by bkhanal Tue Oct 28 05:52:24 2014
> Using Petsc Release Version 3.4.3, Oct, 15, 2013
>
> Max Max/Min Avg Total
> Time (sec): 4.221e+04 1.00780 4.201e+04
> Objects: 9.980e+02 1.00000 9.980e+02
> Flops: 2.159e+11 1.08499 2.106e+11 1.348e+13
> Flops/sec: 5.154e+06 1.08499 5.013e+06 3.208e+08
> MPI Messages: 1.316e+05 3.69736 7.413e+04 4.744e+06
> MPI Message Lengths: 1.986e+09 2.61387 1.581e+04 7.502e+10
> MPI Reductions: 8.128e+03 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N --> 2N flops
> and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total counts %Total Avg %Total counts %Total
> 0: Main Stage: 4.2010e+04 100.0% 1.3477e+13 100.0% 4.744e+06 100.0% 1.581e+04 100.0% 8.127e+03 100.0%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> Avg. len: average message length (bytes)
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
> %T - percent time in this phase %f - percent flops in this phase
> %M - percent messages in this phase %L - percent message lengths in this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
> Event Count Time (sec) Flops --- Global --- --- Stage --- Total
> Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %f %M %L %R %T %f %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> VecView 6 1.0 3.8704e+01424.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01 0 0 0 0 0 0 0 0 0 0 0
> VecMDot 3228 1.0 1.0614e+01 1.7 5.93e+09 1.1 0.0e+00 0.0e+00 3.2e+03 0 3 0 0 40 0 3 0 0 40 35184
> VecNorm 4383 1.0 4.0579e+01 9.8 2.73e+09 1.1 0.0e+00 0.0e+00 4.4e+03 0 1 0 0 54 0 1 0 0 54 4239
> VecScale 4680 1.0 1.2393e+00 1.3 1.39e+09 1.1 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 70518
> VecCopy 1494 1.0 1.1592e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet 5477 1.0 9.7614e+02288.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAXPY 1113 1.0 7.9877e-01 2.3 6.17e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 48645
> VecAYPX 375 1.0 7.9671e-02 1.4 4.99e+07 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 39491
> VecMAXPY 4317 1.0 7.3185e+00 2.1 8.89e+09 1.1 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 76569
> VecScatterBegin 5133 1.0 6.7937e+00 2.4 0.00e+00 0.0 4.7e+06 1.5e+04 1.2e+01 0 0100 96 0 0 0100 96 0 0
> VecScatterEnd 5121 1.0 2.5840e+02113.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecNormalize 3984 1.0 7.2473e+00 1.6 3.92e+09 1.1 0.0e+00 0.0e+00 4.0e+03 0 2 0 0 49 0 2 0 0 49 34118
> MatMult 910 1.0 7.5977e+03 1.0 2.11e+11 1.1 4.7e+06 1.5e+04 6.2e+03 18 98 99 92 76 18 98 99 92 76 1736
> MatMultAdd 702 1.0 5.5198e+01 6.3 4.98e+09 1.1 6.6e+05 6.2e+03 0.0e+00 0 2 14 5 0 0 2 14 5 0 5624
> MatConvert 6 1.0 3.3739e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> MatAssemblyBegin 55 1.0 5.7320e+0319.7 0.00e+00 0.0 0.0e+00 0.0e+00 6.2e+01 11 0 0 0 1 11 0 0 0 1 0
> MatAssemblyEnd 55 1.0 1.6179e+00 1.3 0.00e+00 0.0 9.4e+03 3.6e+03 4.0e+01 0 0 0 0 0 0 0 0 0 0 0
> MatGetRowIJ 12 1.0 1.3590e-05 2.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatZeroEntries 20 1.0 6.9117e-01 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatView 18 1.0 4.7298e-02 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 3.6e+01 0 0 0 0 0 0 0 0 0 0 0
> KSPGMRESOrthog 3228 1.0 1.4727e+01 1.7 1.19e+10 1.1 0.0e+00 0.0e+00 3.2e+03 0 6 0 0 40 0 6 0 0 40 50718
> KSPSetUp 18 1.0 5.4130e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01 0 0 0 0 0 0 0 0 0 0 0
> KSPSolve 6 1.0 3.2363e+04 1.0 2.15e+11 1.1 4.7e+06 1.5e+04 7.7e+03 77100 99 92 95 77100 99 92 95 415
> PCSetUp 18 1.0 2.5724e+04 1.0 0.00e+00 0.0 7.7e+03 3.3e+04 2.0e+02 61 0 0 0 2 61 0 0 0 2 0
> PCApply 18 1.0 3.2248e+04 1.0 2.12e+11 1.1 4.7e+06 1.5e+04 7.6e+03 77 98 99 91 93 77 98 99 91 93 411
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
> Vector 859 849 1446200816 0
> Vector Scatter 26 16 16352 0
> Matrix 20 20 1832984916 0
> Distributed Mesh 3 3 5457592 0
> Bipartite Graph 6 6 4848 0
> Index Set 61 61 1882648 0
> IS L to G Mapping 5 5 3961468 0
> Krylov Solver 5 5 57888 0
> DMKSP interface 1 1 656 0
> Preconditioner 5 5 4440 0
> Viewer 7 6 4272 0
>
> ========================================================================================================================
> Average time to get PetscTime(): 0
> Average time for MPI_Barrier(): 1.27792e-05
> Average time for zero size MPI_Send(): 5.20423e-06
> #PETSc Option Table entries:
>
>
>
>
> -log_summary output for P2:
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> /epi/asclepios2/bkhanal/works/AdLemModel/build/src/AdLemMain on a arch-linux2-cxx-opt named nef001 with 64 processors, by bkhanal Wed Oct 29 14:24:36 2014
> Using Petsc Release Version 3.4.3, Oct, 15, 2013
>
> Max Max/Min Avg Total
> Time (sec): 1.958e+04 1.00194 1.955e+04
> Objects: 3.190e+02 1.00000 3.190e+02
> Flops: 3.638e+10 1.08499 3.548e+10 2.271e+12
> Flops/sec: 1.861e+06 1.08676 1.815e+06 1.161e+08
> MPI Messages: 2.253e+04 3.68455 1.270e+04 8.131e+05
> MPI Message Lengths: 3.403e+08 2.51345 1.616e+04 1.314e+10
> MPI Reductions: 1.544e+03 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N --> 2N flops
> and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total counts %Total Avg %Total counts %Total
> 0: Main Stage: 1.9554e+04 100.0% 2.2709e+12 100.0% 8.131e+05 100.0% 1.616e+04 100.0% 1.543e+03 99.9%
>
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> Avg. len: average message length (bytes)
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
> %T - percent time in this phase %f - percent flops in this phase
> %M - percent messages in this phase %L - percent message lengths in this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
> Event Count Time (sec) Flops --- Global --- --- Stage --- Total
> Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %f %M %L %R %T %f %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> VecView 1 1.0 4.4869e+02189.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 1 0 0 0 0 1 0 0 0 0 0
> VecMDot 544 1.0 1.8271e+01 2.2 1.00e+09 1.1 0.0e+00 0.0e+00 5.4e+02 0 3 0 0 35 0 3 0 0 35 3456
> VecNorm 738 1.0 2.0433e+0218.1 4.60e+08 1.1 0.0e+00 0.0e+00 7.4e+02 1 1 0 0 48 1 1 0 0 48 142
> VecScale 788 1.0 4.1195e+00 9.0 2.34e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 3573
> VecCopy 251 1.0 7.6140e+0046.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet 926 1.0 3.9087e+0141.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAXPY 187 1.0 6.0848e+0032.3 1.04e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1073
> VecAYPX 63 1.0 4.6702e-0116.2 8.38e+06 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1131
> VecMAXPY 727 1.0 1.0997e+01 4.9 1.50e+09 1.1 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 8610
> VecScatterBegin 864 1.0 2.0978e+0234.1 0.00e+00 0.0 8.0e+05 1.5e+04 2.0e+00 0 0 98 92 0 0 0 98 92 0 0
> VecScatterEnd 862 1.0 5.4781e+02114.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> VecNormalize 671 1.0 1.6922e+01 2.2 6.61e+08 1.1 0.0e+00 0.0e+00 6.7e+02 0 2 0 0 43 0 2 0 0 43 2461
> MatMult 152 1.0 6.3271e+03 1.0 3.56e+10 1.1 7.9e+05 1.5e+04 1.0e+03 32 98 98 89 68 32 98 98 89 68 351
> MatMultAdd 118 1.0 4.5234e+02183.7 8.36e+08 1.1 1.1e+05 6.2e+03 0.0e+00 1 2 14 5 0 1 2 14 5 0 115
> MatConvert 1 1.0 3.6065e+02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
> MatAssemblyBegin 10 1.0 1.0849e+03 3.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01 5 0 0 0 1 5 0 0 0 1 0
> MatAssemblyEnd 10 1.0 1.3957e+01 1.1 0.00e+00 0.0 9.4e+03 3.6e+03 4.0e+01 0 0 1 0 3 0 0 1 0 3 0
> MatGetRowIJ 2 1.0 2.2221e-03582.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatView 3 1.0 3.7378e-01 9.8 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPGMRESOrthog 544 1.0 2.0370e+01 1.7 2.00e+09 1.1 0.0e+00 0.0e+00 5.4e+02 0 6 0 0 35 0 6 0 0 35 6200
> KSPSetUp 3 1.0 4.2598e+01 3.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01 0 0 0 0 1 0 0 0 0 1 0
> KSPSolve 1 1.0 1.5113e+04 1.0 3.63e+10 1.1 7.9e+05 1.5e+04 1.3e+03 77100 98 89 84 77100 98 89 84 150
> PCSetUp 3 1.0 1.1794e+04 1.0 0.00e+00 0.0 7.7e+03 3.3e+04 1.3e+02 60 0 1 2 8 60 0 1 2 8 0
> PCApply 3 1.0 1.4940e+04 1.0 3.58e+10 1.1 7.9e+05 1.5e+04 1.3e+03 76 98 97 88 83 76 98 97 88 83 149
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
> Vector 215 215 743212688 0
> Vector Scatter 16 16 16352 0
> Matrix 20 20 1832984916 0
> Distributed Mesh 3 3 5457592 0
> Bipartite Graph 6 6 4848 0
> Index Set 41 41 1867368 0
> IS L to G Mapping 5 5 3961468 0
> Krylov Solver 5 5 57888 0
> DMKSP interface 1 1 656 0
> Preconditioner 5 5 4440 0
> Viewer 2 1 712 0
>
> ========================================================================================================================
> Average time to get PetscTime(): 9.53674e-08
> Average time for MPI_Barrier(): 0.000274992
> Average time for zero size MPI_Send(): 1.67042e-05
> #PETSc Option Table entries:
>
>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener