On Wed, Oct 29, 2014 at 11:18 AM, Bishesh Khanal <bisheshkh@gmail.com> wrote:

> Dear all,
> The computer cluster I'm using to execute my PETSc-based code (a Stokes-like equation solver) is giving me results with a big variation in execution time for almost the same problems.
> When I look at -log_summary, the time taken and the Mflop/s for the PETSc routines suggest that the issue is with the cluster.
> I'd like to confirm this before looking into cluster-related issues.
> I've provided below (after the questions) the relevant -log_summary outputs for the two problems (a file is also attached in case the email does not show the outputs nicely).
>
> The two problems, P1 and P2, solve a Stokes-like equation of the same size, using the same combination of KSP and PC, on 64 processors (8 nodes with 8 processes/node).
> P1 solves the equation 6 times while P2 solves it only once.
> The operator is slightly different due to slightly different boundaries, but since the number of iterations reported by -ksp_monitor was almost the same, I guess this is not an issue.
>
> Now the problem is that the P2 case was much slower than P1. In other experiments too, the execution is quite fast sometimes but slow most of the other times.
>
> I can see in the -log_summary output that various PETSc routines run much slower for P2 and have a smaller Mflop/s rate.
>
> My two questions:
> 1. Do these outputs confirm that the issue is with the cluster and not with my code? If yes, what kinds of things should I focus on or learn about when submitting jobs to the cluster? Any pointer would be helpful.

There are a lot of variables controlling performance on a cluster. You will have to narrow this down little by little. However, there are enormous differences below in the simplest operations, e.g. VecScale, but roughly the same number of flops was computed. Thus, something is going very wrong. My first two guesses, and the things you should eliminate first, are:

  1) Someone else was running at the same time

  2) The mapping from processes to nodes was very different

You can usually control 2) with your submission system or mpiexec call.
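
For example, if your cluster uses a PBS/Torque-style scheduler and Open MPI or MPICH, something along these lines pins the layout to 8 ranks per node and shows where every rank actually landed (the exact flag names vary between MPI stacks and sites, so treat this only as a sketch):

  # PBS/Torque-style resource request: exactly 8 nodes with 8 cores each
  #PBS -l nodes=8:ppn=8

  # Open MPI: 8 ranks per node, and print the actual bindings at startup
  mpiexec -n 64 --map-by ppr:8:node --report-bindings ./AdLemMain <your options>

  # MPICH (Hydra) equivalent
  mpiexec -n 64 -ppn 8 ./AdLemMain <your options>

Comparing the reported bindings (or the node file) from a fast run against a slow run will tell you quickly whether the placement changed between them.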

  Thanks,

     Matt

> 2. If I set the equations, for the constant-viscosity case, as:
>
>    div(grad(u)) - grad(p) = f
>    div(u) + kp = g
>
> with k = 1 in some regions and 0 in most of the other regions, and with f and g spatially varying functions,
> and solve the system with 64 to 128 processors using the KSP and PC options
> -pc_fieldsplit_type schur -pc_fieldsplit_schur_precondition self -pc_fieldsplit_dm_splits 0 -pc_fieldsplit_0_fields 0,1,2 -pc_fieldsplit_1_fields 3 -fieldsplit_0_pc_type hypre
>
> what order of execution time for solving this system should I target as reasonable with, say, around 128 processors?
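
For reference, together with -pc_type fieldsplit (which the split options above imply) and the executable name taken from the logs below, the corresponding run line would look roughly like this; it is only a sketch, not necessarily the exact invocation used:

  mpiexec -n 128 ./AdLemMain -pc_type fieldsplit -pc_fieldsplit_type schur \
      -pc_fieldsplit_schur_precondition self -pc_fieldsplit_dm_splits 0 \
      -pc_fieldsplit_0_fields 0,1,2 -pc_fieldsplit_1_fields 3 \
      -fieldsplit_0_pc_type hypre -ksp_monitor -log_summary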

> -log_summary output for P1:
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> /epi/asclepios2/bkhanal/works/AdLemModel/build/src/AdLemMain on a arch-linux2-cxx-opt named nef017 with 64 processors, by bkhanal Tue Oct 28 05:52:24 2014
> Using Petsc Release Version 3.4.3, Oct, 15, 2013
>
>                          Max       Max/Min        Avg      Total
> Time (sec):           4.221e+04      1.00780   4.201e+04
> Objects:              9.980e+02      1.00000   9.980e+02
> Flops:                2.159e+11      1.08499   2.106e+11  1.348e+13
> Flops/sec:            5.154e+06      1.08499   5.013e+06  3.208e+08
> MPI Messages:         1.316e+05      3.69736   7.413e+04  4.744e+06
> MPI Message Lengths:  1.986e+09      2.61387   1.581e+04  7.502e+10
> MPI Reductions:       8.128e+03      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N --> 2N flops
>                             and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 4.2010e+04 100.0%  1.3477e+13 100.0%  4.744e+06 100.0%  1.581e+04      100.0%  8.127e+03 100.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops: Max - maximum over all processors
>                    Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %f - percent flops in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> VecView              6 1.0 3.8704e+01424.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01 0 0 0 0 0 0 0 0 0 0 0
> VecMDot           3228 1.0 1.0614e+01 1.7 5.93e+09 1.1 0.0e+00 0.0e+00 3.2e+03 0 3 0 0 40 0 3 0 0 40 35184
> VecNorm           4383 1.0 4.0579e+01 9.8 2.73e+09 1.1 0.0e+00 0.0e+00 4.4e+03 0 1 0 0 54 0 1 0 0 54 4239
> VecScale          4680 1.0 1.2393e+00 1.3 1.39e+09 1.1 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 70518
> VecCopy           1494 1.0 1.1592e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet            5477 1.0 9.7614e+02288.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAXPY           1113 1.0 7.9877e-01 2.3 6.17e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 48645
> VecAYPX            375 1.0 7.9671e-02 1.4 4.99e+07 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 39491
> VecMAXPY          4317 1.0 7.3185e+00 2.1 8.89e+09 1.1 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 76569
> VecScatterBegin   5133 1.0 6.7937e+00 2.4 0.00e+00 0.0 4.7e+06 1.5e+04 1.2e+01 0 0100 96 0 0 0100 96 0 0
> VecScatterEnd     5121 1.0 2.5840e+02113.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecNormalize      3984 1.0 7.2473e+00 1.6 3.92e+09 1.1 0.0e+00 0.0e+00 4.0e+03 0 2 0 0 49 0 2 0 0 49 34118
> MatMult            910 1.0 7.5977e+03 1.0 2.11e+11 1.1 4.7e+06 1.5e+04 6.2e+03 18 98 99 92 76 18 98 99 92 76 1736
> MatMultAdd         702 1.0 5.5198e+01 6.3 4.98e+09 1.1 6.6e+05 6.2e+03 0.0e+00 0 2 14 5 0 0 2 14 5 0 5624
> MatConvert           6 1.0 3.3739e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> MatAssemblyBegin    55 1.0 5.7320e+0319.7 0.00e+00 0.0 0.0e+00 0.0e+00 6.2e+01 11 0 0 0 1 11 0 0 0 1 0
> MatAssemblyEnd      55 1.0 1.6179e+00 1.3 0.00e+00 0.0 9.4e+03 3.6e+03 4.0e+01 0 0 0 0 0 0 0 0 0 0 0
> MatGetRowIJ         12 1.0 1.3590e-05 2.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatZeroEntries      20 1.0 6.9117e-01 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatView             18 1.0 4.7298e-02 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 3.6e+01 0 0 0 0 0 0 0 0 0 0 0
> KSPGMRESOrthog    3228 1.0 1.4727e+01 1.7 1.19e+10 1.1 0.0e+00 0.0e+00 3.2e+03 0 6 0 0 40 0 6 0 0 40 50718
> KSPSetUp            18 1.0 5.4130e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01 0 0 0 0 0 0 0 0 0 0 0
> KSPSolve             6 1.0 3.2363e+04 1.0 2.15e+11 1.1 4.7e+06 1.5e+04 7.7e+03 77100 99 92 95 77100 99 92 95 415
> PCSetUp             18 1.0 2.5724e+04 1.0 0.00e+00 0.0 7.7e+03 3.3e+04 2.0e+02 61 0 0 0 2 61 0 0 0 2 0
> PCApply             18 1.0 3.2248e+04 1.0 2.12e+11 1.1 4.7e+06 1.5e+04 7.6e+03 77 98 99 91 93 77 98 99 91 93 411
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Vector   859            849   1446200816     0
>       Vector Scatter    26             16        16352     0
>               Matrix    20             20   1832984916     0
>     Distributed Mesh     3              3      5457592     0
>      Bipartite Graph     6              6         4848     0
>            Index Set    61             61      1882648     0
>    IS L to G Mapping     5              5      3961468     0
>        Krylov Solver     5              5        57888     0
>      DMKSP interface     1              1          656     0
>       Preconditioner     5              5         4440     0
>               Viewer     7              6         4272     0
> ========================================================================================================================
> Average time to get PetscTime(): 0
> Average time for MPI_Barrier(): 1.27792e-05
> Average time for zero size MPI_Send(): 5.20423e-06
> #PETSc Option Table entries:
>
>
> -log_summary output for P2:
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> /epi/asclepios2/bkhanal/works/AdLemModel/build/src/AdLemMain on a arch-linux2-cxx-opt named nef001 with 64 processors, by bkhanal Wed Oct 29 14:24:36 2014
> Using Petsc Release Version 3.4.3, Oct, 15, 2013
>
>                          Max       Max/Min        Avg      Total
> Time (sec):           1.958e+04      1.00194   1.955e+04
> Objects:              3.190e+02      1.00000   3.190e+02
> Flops:                3.638e+10      1.08499   3.548e+10  2.271e+12
> Flops/sec:            1.861e+06      1.08676   1.815e+06  1.161e+08
> MPI Messages:         2.253e+04      3.68455   1.270e+04  8.131e+05
> MPI Message Lengths:  3.403e+08      2.51345   1.616e+04  1.314e+10
> MPI Reductions:       1.544e+03      1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N --> 2N flops
>                             and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 1.9554e+04 100.0%  2.2709e+12 100.0%  8.131e+05 100.0%  1.616e+04      100.0%  1.543e+03  99.9%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops: Max - maximum over all processors
>                    Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %f - percent flops in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> VecView              1 1.0 4.4869e+02189.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00 1 0 0 0 0 1 0 0 0 0 0
> VecMDot            544 1.0 1.8271e+01 2.2 1.00e+09 1.1 0.0e+00 0.0e+00 5.4e+02 0 3 0 0 35 0 3 0 0 35 3456
> VecNorm            738 1.0 2.0433e+0218.1 4.60e+08 1.1 0.0e+00 0.0e+00 7.4e+02 1 1 0 0 48 1 1 0 0 48 142
> VecScale           788 1.0 4.1195e+00 9.0 2.34e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 3573
> VecCopy            251 1.0 7.6140e+0046.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet             926 1.0 3.9087e+0141.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAXPY            187 1.0 6.0848e+0032.3 1.04e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1073
> VecAYPX             63 1.0 4.6702e-0116.2 8.38e+06 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 1131
> VecMAXPY           727 1.0 1.0997e+01 4.9 1.50e+09 1.1 0.0e+00 0.0e+00 0.0e+00 0 4 0 0 0 0 4 0 0 0 8610
> VecScatterBegin    864 1.0 2.0978e+0234.1 0.00e+00 0.0 8.0e+05 1.5e+04 2.0e+00 0 0 98 92 0 0 0 98 92 0 0
> VecScatterEnd      862 1.0 5.4781e+02114.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> VecNormalize       671 1.0 1.6922e+01 2.2 6.61e+08 1.1 0.0e+00 0.0e+00 6.7e+02 0 2 0 0 43 0 2 0 0 43 2461
> MatMult            152 1.0 6.3271e+03 1.0 3.56e+10 1.1 7.9e+05 1.5e+04 1.0e+03 32 98 98 89 68 32 98 98 89 68 351
> MatMultAdd         118 1.0 4.5234e+02183.7 8.36e+08 1.1 1.1e+05 6.2e+03 0.0e+00 1 2 14 5 0 1 2 14 5 0 115
> MatConvert           1 1.0 3.6065e+02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
> MatAssemblyBegin    10 1.0 1.0849e+03 3.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01 5 0 0 0 1 5 0 0 0 1 0
> MatAssemblyEnd      10 1.0 1.3957e+01 1.1 0.00e+00 0.0 9.4e+03 3.6e+03 4.0e+01 0 0 1 0 3 0 0 1 0 3 0
> MatGetRowIJ          2 1.0 2.2221e-03582.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatView              3 1.0 3.7378e-01 9.8 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPGMRESOrthog     544 1.0 2.0370e+01 1.7 2.00e+09 1.1 0.0e+00 0.0e+00 5.4e+02 0 6 0 0 35 0 6 0 0 35 6200
> KSPSetUp             3 1.0 4.2598e+01 3.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01 0 0 0 0 1 0 0 0 0 1 0
> KSPSolve             1 1.0 1.5113e+04 1.0 3.63e+10 1.1 7.9e+05 1.5e+04 1.3e+03 77100 98 89 84 77100 98 89 84 150
> PCSetUp              3 1.0 1.1794e+04 1.0 0.00e+00 0.0 7.7e+03 3.3e+04 1.3e+02 60 0 1 2 8 60 0 1 2 8 0
> PCApply              3 1.0 1.4940e+04 1.0 3.58e+10 1.1 7.9e+05 1.5e+04 1.3e+03 76 98 97 88 83 76 98 97 88 83 149
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Vector   215            215    743212688     0
>       Vector Scatter    16             16        16352     0
>               Matrix    20             20   1832984916     0
>     Distributed Mesh     3              3      5457592     0
>      Bipartite Graph     6              6         4848     0
>            Index Set    41             41      1867368     0
>    IS L to G Mapping     5              5      3961468     0
>        Krylov Solver     5              5        57888     0
>      DMKSP interface     1              1          656     0
>       Preconditioner     5              5         4440     0
>               Viewer     2              1          712     0
> ========================================================================================================================
> Average time to get PetscTime(): 9.53674e-08
> Average time for MPI_Barrier(): 0.000274992
> Average time for zero size MPI_Send(): 1.67042e-05
> #PETSc Option Table entries:

-- 
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener