[petsc-users] Tuning performance for simple solver

Michael Becker Michael.Becker at physik.uni-giessen.de
Mon Jun 6 07:10:07 CDT 2016


For clarification (as it seems I may have misused certain terms): 
Poisson's equation is meant to be solved ~1E6 times with a successively 
changing right-hand side. Assembling the matrix won't be a dominating 
part of the solution process as long as the matrix can be reused. My 
initial concern was possible flaws in the set-up, or optimizations I 
wouldn't have come up with myself.

A certain number of nodes within the grid define walls of arbitrary 
shape by having a fixed value in the solution vector (e.g. capacitor 
plates at a fixed potential). I account for those by replacing the 
corresponding matrix rows with rows of the identity matrix and by 
adjusting the rhs vector.
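
To make sure we mean the same thing, here is a minimal sketch of that 
step (not my actual code; ApplyWallBC, wallRows and wallPhi are made-up 
names for the routine, the wall node indices and their fixed potentials):

   #include <petscmat.h>

   /* Sketch: replace the rows of the nWall wall nodes by identity rows
      and write the fixed potentials into the rhs. */
   PetscErrorCode ApplyWallBC(Mat A, Vec b, PetscInt nWall,
                              const PetscInt wallRows[],
                              const PetscScalar wallPhi[])
   {
     PetscErrorCode ierr;
     /* zero the wall rows and put 1.0 on the diagonal; the optional x/b
        arguments are left NULL because the rhs is set explicitly below */
     ierr = MatZeroRows(A, nWall, wallRows, 1.0, NULL, NULL);CHKERRQ(ierr);
     /* write the prescribed potentials into the corresponding rhs entries */
     ierr = VecSetValues(b, nWall, wallRows, wallPhi, INSERT_VALUES);CHKERRQ(ierr);
     ierr = VecAssemblyBegin(b);CHKERRQ(ierr);
     ierr = VecAssemblyEnd(b);CHKERRQ(ierr);
     return 0;
   }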

Nonetheless, algebraic multigrid seems to work really well compared to 
other preconditioners. Can I expect further runtime improvements from 
additional options such as -pc_mg_smoothdown <1>, or are the default 
values generally sufficient?
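
To be concrete about what I mean by "additional options", these are the 
kind of settings I was thinking of trying (my own guesses, not anything 
I have verified to help):

   -pc_type gamg
   -pc_mg_smoothdown 1
   -pc_mg_smoothup 1
   -mg_levels_ksp_type chebyshev
   -mg_levels_pc_type sor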

Thanks,
Michael


On 03.06.2016 at 16:56, Dave May wrote:
>
>
> On 3 June 2016 at 15:34, Michael Becker 
> <Michael.Becker at physik.uni-giessen.de 
> <mailto:Michael.Becker at physik.uni-giessen.de>> wrote:
>
>     So using -log_summary helps me find out how much time is actually
>     spent on the PETSc routines that are repeatedly called. Since that
>     part of my code is fairly simple:
>
>       PetscScalar *barray;
>       VecGetArray(b,&barray);
>       for (int i=0; i<Nall; i++) {
>         if (bound[i]==0)
>           barray[i] = charge[i]*ih*iepsilon0;
>         else
>           barray[i] = phi[i];
>       }
>       VecRestoreArray(b,&barray);
>
>       KSPSolve(ksp,b,x);
>
>       KSPGetSolution(ksp,&x);
>       PetscScalar *xarray;
>       VecGetArray(x,&xarray);
>       for (int i=0; i<Nall; i++)
>         phi[i] = xarray[i];
>       VecRestoreArray(x,&xarray);
>
>
> Well - you also have user code which assembles a matrix...
>
> However, it seems assembly is not taking much time.
> Note that this is not always the case, for instance if the 
> preallocation was done incorrectly.
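>
> (A minimal sketch of how preallocation is usually taken care of when 
> the matrix comes from a DMDA - 'da' being whatever you called your DMDA:)
>
>   Mat A;
>   DMCreateMatrix(da, &A);  /* preallocated for the DMDA stencil */
>   /* optional: turn any unexpected allocation into an error rather than a slowdown */
>   MatSetOption(A, MAT_NEW_NONZERO_ALLOCATION_ERR, PETSC_TRUE);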
>
>     , I don't see how additional log stages would help me.
>
>
> The point of profiling is to precisely identify calls which consume 
> the most time,
> rather than just _assuming_ which functions consume the largest 
> fraction of time.
>
> Now we are all on the same page and can correctly state that the solve 
> is the problem.
>
> Without knowing anything other than that you are solving Poisson, the 
> simplest preconditioner to try out which can yield scalable and 
> optimal results is algebraic multigrid.
> Try the option:
>   -pc_type gamg
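> or, equivalently, set it in the code (a sketch; error checking omitted):
>
>   PC pc;
>   KSPGetPC(ksp, &pc);
>   PCSetType(pc, PCGAMG);
>   KSPSetFromOptions(ksp);  /* command line options still take precedence */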
>
>     So I would then still just test which KSP method is the fastest?
>
>     I ran a test over 1000 iterations; this is the output:
>>     Max       Max/Min        Avg      Total
>>     Time (sec):           1.916e+02      1.00055 1.915e+02
>>     Objects:              1.067e+03      1.00000 1.067e+03
>>     Flops:                5.730e+10      1.22776 5.360e+10  1.158e+13
>>     Flops/sec:            2.992e+08      1.22792 2.798e+08  6.044e+10
>>     MPI Messages:         1.900e+06      3.71429 1.313e+06  2.835e+08
>>     MPI Message Lengths:  1.138e+09      2.38189 6.824e+02  1.935e+11
>>     MPI Reductions:       1.462e+05      1.00000
>>
>>     Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>                                 e.g., VecAXPY() for real vectors of length N --> 2N flops
>>                                 and VecAXPY() for complex vectors of length N --> 8N flops
>>
>>     Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>>                             Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>>      0:      Main Stage: 1.9154e+02 100.0%  1.1577e+13 100.0%  2.835e+08 100.0%  6.824e+02      100.0%  1.462e+05 100.0%
>>
>>     ------------------------------------------------------------------------------------------------------------------------
>>     See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>     Phase summary info:
>>        Count: number of times phase was executed
>>        Time and Flops: Max - maximum over all processors
>>                        Ratio - ratio of maximum to minimum over all processors
>>        Mess: number of messages sent
>>        Avg. len: average message length (bytes)
>>        Reduct: number of global reductions
>>        Global: entire computation
>>        Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>           %T - percent time in this phase         %F - percent flops in this phase
>>           %M - percent messages in this phase     %L - percent message lengths in this phase
>>           %R - percent reductions in this phase
>>        Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
>>     ------------------------------------------------------------------------------------------------------------------------
>>     Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>>                        Max Ratio  Max     Ratio   Max Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>     ------------------------------------------------------------------------------------------------------------------------
>>
>>     --- Event Stage 0: Main Stage
>>
>>     KSPGMRESOrthog     70070 1.0 7.8035e+01 2.3 1.94e+10 1.2 0.0e+00 0.0e+00 7.0e+04 29 34  0  0 48  29 34  0 0 48 50538
>>     KSPSetUp               2 1.0 1.5209e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01  0  0  0  0  0   0  0  0 0  0     0
>>     KSPSolve            1001 1.0 1.9097e+02 1.0 5.73e+10 1.2 2.8e+08 6.8e+02 1.5e+05100100100100100 100100100100100 60621
>>     VecMDot            70070 1.0 6.9833e+01 2.8 9.69e+09 1.2 0.0e+00 0.0e+00 7.0e+04 25 17  0  0 48  25 17  0 0 48 28235
>>     VecNorm            74074 1.0 1.1570e+01 1.7 7.28e+08 1.2 0.0e+00 0.0e+00 7.4e+04  5  1  0  0 51   5  1  0 0 51 12804
>>     VecScale           73073 1.0 5.6676e-01 1.3 3.59e+08 1.2 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0 0  0 128930
>>     VecCopy             3003 1.0 1.0008e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0 0  0     0
>>     VecSet             77080 1.0 1.3647e+00 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0 0  0     0
>>     VecAXPY             6006 1.0 1.0779e-01 1.7 5.90e+07 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0 0  0 111441
>>     VecMAXPY           73073 1.0 9.2155e+00 1.3 1.04e+10 1.2 0.0e+00 0.0e+00 0.0e+00  4 18  0  0  0   4 18  0 0  0 229192
>>     VecScatterBegin    73073 1.0 7.0538e+00 4.4 0.00e+00 0.0 2.8e+08 6.8e+02 0.0e+00  2  0100100  0   2 0100100  0     0
>>     VecScatterEnd      73073 1.0 7.8382e+00 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   3  0  0 0  0     0
>>     VecNormalize       73073 1.0 1.1774e+01 1.6 1.08e+09 1.2 0.0e+00 0.0e+00 7.3e+04  5  2  0  0 50   5  2  0 0 50 18619
>>     MatMult            73073 1.0 8.6056e+01 1.7 1.90e+10 1.3 2.8e+08 6.8e+02 0.0e+00 36 33100100  0  36 33100100  0 44093
>>     MatSolve           74074 1.0 5.4865e+01 1.2 1.71e+10 1.2 0.0e+00 0.0e+00 0.0e+00 27 30  0  0  0  27 30  0 0  0 63153
>>     MatLUFactorNum         1 1.0 4.1230e-03 2.6 9.89e+05241.4 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0 0  0  0  0  0 36155
>>     MatILUFactorSym        1 1.0 2.1942e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0 0  0     0
>>     MatAssemblyBegin       2 1.0 5.6112e-03 4.8 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0 0  0     0
>>     MatAssemblyEnd         2 1.0 6.3889e-03 1.0 0.00e+00 0.0 7.8e+03 1.7e+02 8.0e+00  0  0  0  0  0   0  0  0 0  0     0
>>     MatGetRowIJ            1 1.0 2.8849e-0515.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0 0  0     0
>>     MatGetOrdering         1 1.0 1.2279e-04 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0 0  0     0
>>     PCSetUp                2 1.0 6.6662e-03 1.8 9.89e+05241.4 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0 0  0  0  0  0 22361
>>     PCSetUpOnBlocks     1001 1.0 7.5164e-03 1.7 9.89e+05241.4 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0 0  0  0  0  0 19832
>>     PCApply            74074 1.0 5.9613e+01 1.2 1.71e+10 1.2 0.0e+00 0.0e+00 0.0e+00 29 30  0  0  0  29 30  0 0  0 58123
>>     ------------------------------------------------------------------------------------------------------------------------
>>
>>     Memory usage is given in bytes:
>>
>>     Object Type          Creations   Destructions     Memory  Descendants' Mem.
>>     Reports information only for process 0.
>>
>>     --- Event Stage 0: Main Stage
>>
>>            Krylov Solver     2              2 19576     0.
>>          DMKSP interface     1              1 656     0.
>>                   Vector  1043           1043 42492328     0.
>>           Vector Scatter     2              2 41496     0.
>>                   Matrix     4              4 3163588     0.
>>         Distributed Mesh     1              1 5080     0.
>>     Star Forest Bipartite Graph     2 2         1728     0.
>>          Discrete System     1              1 872     0.
>>                Index Set     7              7 71796     0.
>>        IS L to G Mapping     1              1 28068     0.
>>           Preconditioner     2              2 1912     0.
>>                   Viewer     1              0 0     0.
>>     ========================================================================================================================
>>     Average time to get PetscTime(): 1.90735e-07
>>     Average time for MPI_Barrier(): 0.000184202
>>     Average time for zero size MPI_Send(): 1.03469e-05
>>     #PETSc Option Table entries:
>>     -log_summary
>>     #End of PETSc Option Table entries
>>     Compiled without FORTRAN kernels
>>     Compiled with full precision matrices (default)
>>     sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>     sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>>     Configure options: --download-f2cblaslapack --with-fc=0
>>     --with-debugging=0 COPTFLAGS=-O3 CXXOPTFLAGS=-O3
>
>     Regarding Matt's answer: It's generally a rectangular grid (3D) of
>     predetermined size (not necessarily a cube). Additionally, objects
>     of arbitrary shape can be defined by Dirichlet boundary
>     conditions. Is geometric MG still viable?
>
>     Thanks,
>     Michael
>
>
>
>     On 03.06.2016 at 14:32, Matthew Knepley wrote:
>>     On Fri, Jun 3, 2016 at 5:56 AM, Dave May <dave.mayhem23 at gmail.com
>>     <mailto:dave.mayhem23 at gmail.com>> wrote:
>>
>>         On 3 June 2016 at 11:37, Michael Becker
>>         <Michael.Becker at physik.uni-giessen.de
>>         <mailto:Michael.Becker at physik.uni-giessen.de>> wrote:
>>
>>             Dear all,
>>
>>             I have a few questions regarding possible performance
>>             enhancements for the PETSc solver I included in my project.
>>
>>             It's a particle-in-cell plasma simulation written in C++,
>>             where Poisson's equation needs to be solved repeatedly on
>>             every timestep.
>>             The simulation domain is discretized using finite
>>             differences, so the solver needs to be able to solve the
>>             linear system A x = b efficiently and repeatedly with a
>>             changing b. The solution x of the previous timestep is
>>             generally a good initial guess.
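>>             (As a sketch, this is how I understand the guess is
>>             handed to the solver - please correct me if that is not
>>             the right way:)
>>
>>               // keep the previous solution in x and tell KSP to use
>>               // it as the initial guess instead of starting from zero
>>               KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);
>>               KSPSolve(ksp, b, x);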
>>
>>             I wrote a class PETScSolver that holds all PETSc objects
>>             and necessary information about domain size and
>>             decomposition. To solve the linear system, two arrays,
>>             'phi' and 'charge', are passed to a member function
>>             solve(), where they are copied to PETSc vectors, and
>>             KSPSolve() is called. After convergence, the solution is
>>             then copied back to the phi array so that other parts of 
>>             the program can use it.
>>
>>             The matrix is created using DMDA. An array 'bound' is
>>             used to determine whether a node is a Dirichlet BC node
>>             or holds a charge.
>>
>>             I attached three files, petscsolver.h, petscsolver.cpp
>>             and main.cpp, that contain a shortened version of the
>>             solver class and a set-up to initialize and run a simple
>>             problem.
>>
>>             Is there anything I can change to generally make the
>>             program run faster?
>>
>>
>>         Before changing anything, you should profile your code to see
>>         where time is being spent.
>>
>>         To that end, you should compile an optimized build of petsc,
>>         link it to your application and run your code with the option
>>         -log_summary. The -log_summary flag will generate a
>>         performance profile of specific functionality within petsc
>>         (KSPSolve, MatMult etc) so you can see where all the time is
>>         being spent.
>>
>>         As a second round of profiling, you should consider
>>         registering specific functionality in your code that you think is
>>         performance critical.
>>         You can do this using the function PetscLogStageRegister()
>>
>>         http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Profiling/PetscLogStageRegister.html
>>
>>
>>         Check out the examples listed at the bottom of this web page
>>         to see how to log stages. Once you've registered stages,
>>         these will appear in the report provided by -log_summary
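>>
>>         A minimal sketch of what that looks like (the stage name is
>>         arbitrary):
>>
>>           PetscLogStage stage;
>>           PetscLogStageRegister("PoissonSolve", &stage);
>>
>>           PetscLogStagePush(stage);
>>           KSPSolve(ksp, b, x);  /* everything in here is logged under the stage */
>>           PetscLogStagePop();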
>>
>>
>>     Do everything Dave said. I will also note that since you are
>>     using FD, I am guessing you are solving on a square. Then
>>     you should really be using geometric MG. We support this through
>>     the DMDA object.
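>>     A rough sketch of that route (ComputeRHS, ComputeMatrix and user
>>     are placeholder names for your own callbacks and context):
>>
>>       KSPSetDM(ksp, da);                                 /* attach the DMDA */
>>       KSPSetComputeRHS(ksp, ComputeRHS, &user);
>>       KSPSetComputeOperators(ksp, ComputeMatrix, &user);
>>       KSPSetFromOptions(ksp);
>>       /* then run with, e.g., -pc_type mg -pc_mg_levels 3 */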
>>
>>       Thanks,
>>
>>          Matt
>>
>>         Thanks,
>>           Dave
>>
>>
>>             And, since I'm rather inexperienced with KSP methods, how
>>             do I efficiently choose PC and KSP? Just by testing every
>>             combination?
>>             Would multigrid be a viable option as a pure solver
>>             (-ksp_type preonly)?
>>
>>             Thanks,
>>             Michael
>>
>>
>>
>>
>>
>>     -- 
>>     What most experimenters take for granted before they begin their
>>     experiments is infinitely more interesting than any results to
>>     which their experiments lead.
>>     -- Norbert Wiener
>
>
