<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><br>
    Nonetheless, algebraic multigrid seems to work really well compared
    to other preconditioners. Can I expect any further runtime
    improvements by using additional options like -pc_mg_smoothdown
    &lt;1&gt; or are the standard values generally sufficient?<br></div></blockquote><div><br></div><div>Getting almost one digit of residual reduction per iteration is about as good as it gets. If you are (far) away from that, there is some discussion in the documentation on how to improve GAMG. If GAMG is spending much more time in the matrix setup phase (RAP, or 'MatPtAP..' in the log view) than in 'KSPSolve', then you may be coarsening too slowly; you can use the threshold parameters and graph squaring to increase the coarsening rate, which decreases both the setup cost and the cost per iteration and thereby reduces the overall solve time.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
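<div>To make the coarsening controls mentioned above concrete, here is an illustrative option set (a sketch only; the option names follow the PETSc 3.x GAMG interface and the values are starting points to experiment with, not recommendations):</div>

```shell
# Coarsening controls for -pc_type gamg (example values only):
#   -pc_gamg_threshold     drop tolerance for weak edges in the aggregation
#                          graph; reducing it increases the coarsening rate
#   -pc_gamg_square_graph  square the graph before aggregating, which also
#                          coarsens more aggressively
-pc_type gamg -pc_gamg_threshold 0.0 -pc_gamg_square_graph 1
```

<div>Whether these help is problem-dependent; compare the MatPtAP and KSPSolve times in the log before and after changing them.</div>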
    <br>
    Thanks,<br>
    Michael<div><div class="h5"><br>
    <br>
    <br>
    <div>Am 03.06.2016 um 16:56 schrieb Dave
      May:<br>
    </div>
    <blockquote type="cite">
      <div dir="ltr"><br>
        <div class="gmail_extra"><br>
          <div class="gmail_quote">On 3 June 2016 at 15:34, Michael
            Becker <span dir="ltr"><<a href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank">Michael.Becker@physik.uni-giessen.de</a>></span>
            wrote:<br>
            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000"> So using
                -log_summary helps me find out how much time is actually
                spent on the PETSc routines that are repeatedly called.
                Since that part of my code is fairly simple:<br>
                <br>
                  PetscScalar *barray;<br>
                  VecGetArray(b,&amp;barray);<br>
                  for (int i=0; i&lt;Nall; i++) {<br>
                    if (bound[i]==0)<br>
                      barray[i] = charge[i]*ih*iepsilon0;<br>
                    else<br>
                      barray[i] = phi[i];<br>
                  }<br>
                  VecRestoreArray(b,&amp;barray);<br>
                <br>
                  KSPSolve(ksp,b,x);<br>
                  <br>
                  KSPGetSolution(ksp,&amp;x);<br>
                  PetscScalar *xarray;<br>
                  VecGetArray(x,&amp;xarray);<br>
                  for (int i=0; i&lt;Nall; i++)<br>
                    phi[i] = xarray[i];<br>
                  VecRestoreArray(x,&amp;xarray);<br>
                <br>
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>Well - you also have user code which assembles a
              matrix...<br>
            </div>
            <div><br>
              However, it seems assembly is not taking much time.<br>
            </div>
            <div>Note that this is not always the case, for instance if
              the preallocation was done incorrectly.<br>
            </div>
            <div><br>
               </div>
            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000"> , I don't see how
                additional log stages would help me. </div>
            </blockquote>
            <div><br>
            </div>
            <div>The point of profiling is to precisely identify calls
              which consume the most time, <br>
              rather than just _assuming_ which functions consume the
              largest fraction of time.<br>
              <br>
            </div>
            <div>Now we are all on the same page and can correctly state
              that the solve is the problem.<br>
            </div>
            <div><br>
              Without knowing anything other than that you are solving
              Poisson, the simplest preconditioner to try out which <br>
              can yield scalable and optimal results is algebraic
              multigrid.<br>
              Try the option:<br>
                -pc_type gamg<br>
            </div>
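            <div>One hypothetical way to try this from the command line (the executable name and process count are placeholders; CG is a common Krylov choice for the symmetric positive definite Poisson operator):</div>

```shell
# Placeholder executable name and process count; the options are
# standard PETSc runtime flags.
mpiexec -n 8 ./mysim -pc_type gamg -ksp_type cg \
    -ksp_monitor_true_residual -log_summary
```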
            <br>
             
            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000">So I would then
                still just test which KSP method is the fastest?<br>
                <br>
                I ran a test over 1000 iterations; this is the output:<br>
                <blockquote type="cite">                        
                  Max       Max/Min        Avg      Total <br>
                  Time (sec):           1.916e+02      1.00055  
                  1.915e+02<br>
                  Objects:              1.067e+03      1.00000  
                  1.067e+03<br>
                  Flops:                5.730e+10      1.22776  
                  5.360e+10  1.158e+13<br>
                  Flops/sec:            2.992e+08      1.22792  
                  2.798e+08  6.044e+10<br>
                  MPI Messages:         1.900e+06      3.71429  
                  1.313e+06  2.835e+08<br>
                  MPI Message Lengths:  1.138e+09      2.38189  
                  6.824e+02  1.935e+11<br>
                  MPI Reductions:       1.462e+05      1.00000<br>
                  <br>
                  Flop counting convention: 1 flop = 1 real number
                  operation of type (multiply/divide/add/subtract)<br>
                                              e.g., VecAXPY() for real
                  vectors of length N --> 2N flops<br>
                                              and VecAXPY() for complex
                  vectors of length N --> 8N flops<br>
                  <br>
                  Summary of Stages:   ----- Time ------  ----- Flops
                  -----  --- Messages ---  -- Message Lengths --  --
                  Reductions --<br>
                                          Avg     %Total     Avg    
                  %Total   counts   %Total     Avg         %Total  
                  counts   %Total <br>
                   0:      Main Stage: 1.9154e+02 100.0%  1.1577e+13
                  100.0%  2.835e+08 100.0%  6.824e+02      100.0% 
                  1.462e+05 100.0% <br>
                  <br>
------------------------------------------------------------------------------------------------------------------------<br>
                  See the 'Profiling' chapter of the users' manual for
                  details on interpreting output.<br>
                  Phase summary info:<br>
                     Count: number of times phase was executed<br>
                     Time and Flops: Max - maximum over all processors<br>
                                     Ratio - ratio of maximum to minimum
                  over all processors<br>
                     Mess: number of messages sent<br>
                     Avg. len: average message length (bytes)<br>
                     Reduct: number of global reductions<br>
                     Global: entire computation<br>
                     Stage: stages of a computation. Set stages with
                  PetscLogStagePush() and PetscLogStagePop().<br>
                        %T - percent time in this phase         %F -
                  percent flops in this phase<br>
                        %M - percent messages in this phase     %L -
                  percent message lengths in this phase<br>
                        %R - percent reductions in this phase<br>
                     Total Mflop/s: 10e-6 * (sum of flops over all
                  processors)/(max time over all processors)<br>
------------------------------------------------------------------------------------------------------------------------<br>
                  Event                Count      Time (sec)    
                  Flops                             --- Global ---  ---
                  Stage ---   Total<br>
                                     Max Ratio  Max     Ratio   Max 
                  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M
                  %L %R Mflop/s<br>
------------------------------------------------------------------------------------------------------------------------<br>
                  <br>
                  --- Event Stage 0: Main Stage<br>
                  <br>
                  KSPGMRESOrthog     70070 1.0 7.8035e+01 2.3 1.94e+10
                  1.2 0.0e+00 0.0e+00 7.0e+04 29 34  0  0 48  29 34  0 
                  0 48 50538<br>
                  KSPSetUp               2 1.0 1.5209e-03 1.1 0.00e+00
                  0.0 0.0e+00 0.0e+00 1.0e+01  0  0  0  0  0   0  0  0 
                  0  0     0<br>
                  KSPSolve            1001 1.0 1.9097e+02 1.0 5.73e+10
                  1.2 2.8e+08 6.8e+02 1.5e+05100100100100100
                  100100100100100 60621<br>
                  VecMDot            70070 1.0 6.9833e+01 2.8 9.69e+09
                  1.2 0.0e+00 0.0e+00 7.0e+04 25 17  0  0 48  25 17  0 
                  0 48 28235<br>
                  VecNorm            74074 1.0 1.1570e+01 1.7 7.28e+08
                  1.2 0.0e+00 0.0e+00 7.4e+04  5  1  0  0 51   5  1  0 
                  0 51 12804<br>
                  VecScale           73073 1.0 5.6676e-01 1.3 3.59e+08
                  1.2 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0 
                  0  0 128930<br>
                  VecCopy             3003 1.0 1.0008e-01 1.6 0.00e+00
                  0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0 
                  0  0     0<br>
                  VecSet             77080 1.0 1.3647e+00 1.4 0.00e+00
                  0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0 
                  0  0     0<br>
                  VecAXPY             6006 1.0 1.0779e-01 1.7 5.90e+07
                  1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0 
                  0  0 111441<br>
                  VecMAXPY           73073 1.0 9.2155e+00 1.3 1.04e+10
                  1.2 0.0e+00 0.0e+00 0.0e+00  4 18  0  0  0   4 18  0 
                  0  0 229192<br>
                  VecScatterBegin    73073 1.0 7.0538e+00 4.4 0.00e+00
                  0.0 2.8e+08 6.8e+02 0.0e+00  2  0100100  0   2 
                  0100100  0     0<br>
                  VecScatterEnd      73073 1.0 7.8382e+00 2.6 0.00e+00
                  0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   3  0  0 
                  0  0     0<br>
                  VecNormalize       73073 1.0 1.1774e+01 1.6 1.08e+09
                  1.2 0.0e+00 0.0e+00 7.3e+04  5  2  0  0 50   5  2  0 
                  0 50 18619<br>
                  MatMult            73073 1.0 8.6056e+01 1.7 1.90e+10
                  1.3 2.8e+08 6.8e+02 0.0e+00 36 33100100  0  36
                  33100100  0 44093<br>
                  MatSolve           74074 1.0 5.4865e+01 1.2 1.71e+10
                  1.2 0.0e+00 0.0e+00 0.0e+00 27 30  0  0  0  27 30  0 
                  0  0 63153<br>
                  MatLUFactorNum         1 1.0 4.1230e-03 2.6
                  9.89e+05241.4 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  
                  0  0  0  0  0 36155<br>
                  MatILUFactorSym        1 1.0 2.1942e-03 1.3 0.00e+00
                  0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0 
                  0  0     0<br>
                  MatAssemblyBegin       2 1.0 5.6112e-03 4.8 0.00e+00
                  0.0 0.0e+00 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0 
                  0  0     0<br>
                  MatAssemblyEnd         2 1.0 6.3889e-03 1.0 0.00e+00
                  0.0 7.8e+03 1.7e+02 8.0e+00  0  0  0  0  0   0  0  0 
                  0  0     0<br>
                  MatGetRowIJ            1 1.0 2.8849e-0515.1 0.00e+00
                  0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0 
                  0  0     0<br>
                  MatGetOrdering         1 1.0 1.2279e-04 1.6 0.00e+00
                  0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0 
                  0  0     0<br>
                  PCSetUp                2 1.0 6.6662e-03 1.8
                  9.89e+05241.4 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  
                  0  0  0  0  0 22361<br>
                  PCSetUpOnBlocks     1001 1.0 7.5164e-03 1.7
                  9.89e+05241.4 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  
                  0  0  0  0  0 19832<br>
                  PCApply            74074 1.0 5.9613e+01 1.2 1.71e+10
                  1.2 0.0e+00 0.0e+00 0.0e+00 29 30  0  0  0  29 30  0 
                  0  0 58123<br>
------------------------------------------------------------------------------------------------------------------------<br>
                  <br>
                  Memory usage is given in bytes:<br>
                  <br>
                  Object Type          Creations   Destructions    
                  Memory  Descendants' Mem.<br>
                  Reports information only for process 0.<br>
                  <br>
                  --- Event Stage 0: Main Stage<br>
                  <br>
                         Krylov Solver     2              2       
                  19576     0.<br>
                       DMKSP interface     1              1         
                  656     0.<br>
                                Vector  1043           1043    
                  42492328     0.<br>
                        Vector Scatter     2              2       
                  41496     0.<br>
                                Matrix     4              4     
                  3163588     0.<br>
                      Distributed Mesh     1              1        
                  5080     0.<br>
                  Star Forest Bipartite Graph     2             
                  2         1728     0.<br>
                       Discrete System     1              1         
                  872     0.<br>
                             Index Set     7              7       
                  71796     0.<br>
                     IS L to G Mapping     1              1       
                  28068     0.<br>
                        Preconditioner     2              2        
                  1912     0.<br>
                                Viewer     1              0           
                  0     0.<br>
========================================================================================================================<br>
                  Average time to get PetscTime(): 1.90735e-07<br>
                  Average time for MPI_Barrier(): 0.000184202<br>
                  Average time for zero size MPI_Send(): 1.03469e-05<br>
                  #PETSc Option Table entries:<br>
                  -log_summary<br>
                  #End of PETSc Option Table entries<br>
                  Compiled without FORTRAN kernels<br>
                  Compiled with full precision matrices (default)<br>
                  sizeof(short) 2 sizeof(int) 4 sizeof(long) 8
                  sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt)
                  4<br>
                  Configure options: --download-f2cblaslapack
                  --with-fc=0 --with-debugging=0 COPTFLAGS=-O3
                  CXXOPTFLAGS=-O3</blockquote>
                <br>
                Regarding Matt's answer: It's generally a rectangular
                grid (3D) of predetermined size (not necessarily a
                cube). Additionally, objects of arbitrary shape can be
                defined by Dirichlet boundary conditions. Is geometric
                MG still viable?<br>
                <br>
                Thanks,<br>
                Michael
                <div>
                  <div><br>
                    <br>
                    <br>
                    <div>Am 03.06.2016 um 14:32 schrieb Matthew Knepley:<br>
                    </div>
                    <blockquote type="cite">
                      <div dir="ltr">
                        <div class="gmail_extra">
                          <div class="gmail_quote">On Fri, Jun 3, 2016
                            at 5:56 AM, Dave May <span dir="ltr"><<a href="mailto:dave.mayhem23@gmail.com" target="_blank"></a><a href="mailto:dave.mayhem23@gmail.com" target="_blank">dave.mayhem23@gmail.com</a>></span>
                            wrote:<br>
                            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                              <div dir="ltr">
                                <div class="gmail_extra">
                                  <div class="gmail_quote"><span>On 3
                                      June 2016 at 11:37, Michael Becker
                                      <span dir="ltr"><<a href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank"></a><a href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank">Michael.Becker@physik.uni-giessen.de</a>></span>
                                      wrote:<br>
                                      <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Dear
                                        all,<br>
                                        <br>
                                        I have a few questions regarding
                                        possible performance
                                        enhancements for the PETSc
                                        solver I included in my project.<br>
                                        <br>
                                        It's a particle-in-cell plasma
                                        simulation written in C++, where
                                        Poisson's equation needs to be
                                        solved repeatedly on every
                                        timestep.<br>
                                        The simulation domain is
                                        discretized using finite
                                        differences, so the solver
                                        needs to be able to
                                        efficiently solve the linear
                                        system A x = b successively with
                                        changing b. The solution x of
                                        the previous timestep is
                                        generally a good initial guess
                                        for the solution.<br>
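                                        <br>
                                        To benefit from that initial
                                        guess, the KSP must be told not
                                        to zero x on entry (by default
                                        KSPSolve() starts from a zero
                                        initial guess); a minimal
                                        sketch:<br>

```c
/* Sketch: reuse the previous timestep's solution stored in x.
   By default KSPSolve() ignores the incoming contents of x and
   starts from a zero initial guess. */
KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);
KSPSolve(ksp, b, x);
```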
                                        <br>
                                        I wrote a class PETScSolver that
                                        holds all PETSc objects and
                                        necessary information about
                                        domain size and decomposition.
                                        To solve the linear system, two
                                        arrays, 'phi' and 'charge', are
                                        passed to a member function
                                        solve(), where they are copied
                                        to PETSc vectors, and KSPSolve()
                                        is called. After convergence,
                                        the solution is then transferred
                                        again to the phi array so that
                                        other program parts can use it.<br>
                                        <br>
                                        The matrix is created using
                                        DMDA. An array 'bound' is used
                                        to determine whether a node is
                                        either a Dirichlet BC or holds a
                                        charge.<br>
                                        <br>
                                        I attached three files,
                                        petscsolver.h, petscsolver.cpp
                                        and main.cpp, that contain a
                                        shortened version of the solver
                                        class and a set-up to initialize
                                        and run a simple problem.<br>
                                        <br>
                                        Is there anything I can change
                                        to generally make the program
                                        run faster?<br>
                                      </blockquote>
                                      <div><br>
                                      </div>
                                    </span>
                                    <div>Before changing anything, you
                                      should profile your code to see
                                      where time is being spent.<br>
                                      <br>
                                      To that end, you should compile an
                                      optimized build of petsc, link it
                                      to your application and run your
                                      code with the option -log_summary.
                                      The -log_summary flag will
                                      generate a performance profile of
                                      specific functionality within
                                      petsc (KSPSolve, MatMult etc) so
                                      you can see where all the time is
                                      being spent.<br>
                                      <br>
                                    </div>
                                    <div>As a second round of profiling,
                                      you should consider registering
                                      specific functionality in your
                                      code you think is performance
                                      critical. <br>
                                      You can do this using the function
                                      PetscLogStageRegister()<br>
                                      <br>
                                      <a href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Profiling/PetscLogStageRegister.html" target="_blank">http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Profiling/PetscLogStageRegister.html</a>
                                      <br>
                                    </div>
                                    <div><br>
                                    </div>
                                    <div>Check out the examples listed
                                      at the bottom of this web page to
                                      see how to log stages. Once you've
                                      registered stages, these will
                                      appear in the report provided by
                                      -log_summary</div>
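                                    <div>A minimal sketch of what such stage registration might look like (the stage names are illustrative; this assumes an already-initialized PETSc application):</div>

```c
/* Sketch: user-defined profiling stages (names are illustrative). */
PetscLogStage stage_assembly, stage_solve;
PetscLogStageRegister("Assembly", &stage_assembly);
PetscLogStageRegister("Solve", &stage_solve);

PetscLogStagePush(stage_assembly);
/* ... matrix/vector assembly ... */
PetscLogStagePop();

PetscLogStagePush(stage_solve);
KSPSolve(ksp, b, x);
PetscLogStagePop();
```

                                    <div>Each registered stage then gets its own section in the -log_summary report.</div>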
                                  </div>
                                </div>
                              </div>
                            </blockquote>
                            <div><br>
                            </div>
                            <div>Do everything Dave said. I will also
                              note that since you are using FD, I am
                              guessing you are solving on a square. Then</div>
                            <div>you should really be using geometric
                              MG. We support this through the DMDA
                              object.</div>
                            <div><br>
                            </div>
                            <div>  Thanks,</div>
                            <div><br>
                            </div>
                            <div>     Matt</div>
                            <div> </div>
                            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                              <div dir="ltr">
                                <div class="gmail_extra">
                                  <div class="gmail_quote">
                                    <div>Thanks,<br>
                                    </div>
                                    <div>  Dave<br>
                                    </div>
                                    <span>
                                      <div> <br>
                                      </div>
                                      <div><br>
                                      </div>
                                      <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                        And, since I'm rather
                                        inexperienced with KSP methods,
                                        how do I efficiently choose PC
                                        and KSP? Just by testing every
                                        combination?<br>
                                        Would multigrid be a viable
                                        option as a pure solver
                                        (-ksp_type preonly)?<br>
                                        <br>
                                        Thanks,<br>
                                        Michael<br>
                                      </blockquote>
                                    </span></div>
                                  <br>
                                </div>
                              </div>
                            </blockquote>
                          </div>
                          <br>
                          <br clear="all">
                          <div><br>
                          </div>
                          -- <br>
                          <div data-smartmail="gmail_signature">What
                            most experimenters take for granted before
                            they begin their experiments is infinitely
                            more interesting than any results to which
                            their experiments lead.<br>
                            -- Norbert Wiener</div>
                        </div>
                      </div>
                    </blockquote>
                    <br>
                  </div>
                </div>
              </div>
            </blockquote>
          </div>
          <br>
        </div>
      </div>
    </blockquote>
    <br>
  </div></div></div>

</blockquote></div><br></div></div>