[petsc-users] Tuning performance for simple solver

Tue Jun 7 06:36:44 CDT 2016

>
>
> Nonetheless, algebraic multigrid seems to work really well compared to
> other preconditioners. Can I expect any further runtime improvements by
> using additional options like -pc_mg_smoothdown <1> or are the standard
> values generally sufficient?
>

Getting almost one digit reduction in residual per iteration is about as
good as it gets. If you are (far) away from that then there is some
discussion in the document on how to try to improve GAMG.  If GAMG is
spending a lot more time in the matrix setup phase (RAP or 'MatPtAP..' in
log view) than the  'KSPSolve' then you may be coarsening too slowly and
you can use the threshold parameters and graph squaring to increase
coarsening rate and thereby decrease setup cost and cost per iteration, and
reducing overall solve time.

>
> Thanks,
> Michael
>
>
>
> Am 03.06.2016 um 16:56 schrieb Dave May:
>
>
>
> On 3 June 2016 at 15:34, Michael Becker <
> Michael.Becker at physik.uni-giessen.de> wrote:
>
>> So using -log_summary helps me find out how much time is actually spent
>> on the PETSc routines that are repeatedly called. Since that part of my
>> code is fairly simple:
>>
>>   PetscScalar *barray;
>>   VecGetArray(b,&barray);
>>   for (int i=0; i<Nall; i++) {
>>     if (bound[i]==0)
>>       barray[i] = charge[i]*ih*iepsilon0;
>>     else
>>       barray[i] = phi[i];
>>   }
>>   VecRestoreArray(b,&barray);
>>
>>   KSPSolve(ksp,b,x);
>>
>>   KSPGetSolution(ksp,&x);
>>   PetscScalar *xarray;
>>   VecGetArray(x,&xarray);
>>   for (int i=0; i<Nall; i++)
>>     phi[i] = xarray[i];
>>   VecRestoreArray(x,&xarray);
>>
>>
> Well - you also have user code which assembles a matrix...
>
> However it seems assembly is not taking much time.
> Note that this is not always the case, for instance if the preallocation
> was done incorrectly.
>
>
>
>> , I don't see how additional log states would help me.
>>
>
> The point of profiling is to precisely identify calls which consume the
> most time,
> rather than just _assuming_ which functions consume the largest fraction
> of time.
>
> Now we are all on the same page and can correct state that the solve is
> the problem.
>
> Without knowing anything other than you are solving Poisson, the simplest
> preconditioner to try out which
> can yield scalable and optimal results is algebraic multigrid.
> Try the option:
>   -pc_type gamg
>
>
>>
>> So I would then still just test which KSP method is the fastest?
>>
>> I ran a test over 1000 iterations; this is the output:
>>
>>                          Max       Max/Min        Avg      Total
>> Time (sec):           1.916e+02      1.00055   1.915e+02
>> Objects:              1.067e+03      1.00000   1.067e+03
>> Flops:                5.730e+10      1.22776   5.360e+10  1.158e+13
>> Flops/sec:            2.992e+08      1.22792   2.798e+08  6.044e+10
>> MPI Messages:         1.900e+06      3.71429   1.313e+06  2.835e+08
>> MPI Message Lengths:  1.138e+09      2.38189   6.824e+02  1.935e+11
>> MPI Reductions:       1.462e+05      1.00000
>>
>> Flop counting convention: 1 flop = 1 real number operation of type
>> (multiply/divide/add/subtract)
>>                             e.g., VecAXPY() for real vectors of length N
>> --> 2N flops
>>                             and VecAXPY() for complex vectors of length N
>> --> 8N flops
>>
>> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages
>> ---  -- Message Lengths --  -- Reductions --
>>                         Avg     %Total     Avg     %Total   counts
>> %Total     Avg         %Total   counts   %Total
>>  0:      Main Stage: 1.9154e+02 100.0%  1.1577e+13 100.0%  2.835e+08
>> 100.0%  6.824e+02      100.0%  1.462e+05 100.0%
>>
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> See the 'Profiling' chapter of the users' manual for details on
>> interpreting output.
>> Phase summary info:
>>    Count: number of times phase was executed
>>    Time and Flops: Max - maximum over all processors
>>                    Ratio - ratio of maximum to minimum over all processors
>>    Mess: number of messages sent
>>    Avg. len: average message length (bytes)
>>    Reduct: number of global reductions
>>    Global: entire computation
>>    Stage: stages of a computation. Set stages with PetscLogStagePush()
>> and PetscLogStagePop().
>>       %T - percent time in this phase         %F - percent flops in this
>> phase
>>       %M - percent messages in this phase     %L - percent message
>> lengths in this phase
>>       %R - percent reductions in this phase
>>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time
>> over all processors)
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)
>> Flops                             --- Global ---  --- Stage ---   Total
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len
>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>
>> ------------------------------------------------------------------------------------------------------------------------
>>
>> --- Event Stage 0: Main Stage
>>
>> KSPGMRESOrthog     70070 1.0 7.8035e+01 2.3 1.94e+10 1.2 0.0e+00 0.0e+00
>> 7.0e+04 29 34  0  0 48  29 34  0  0 48 50538
>> KSPSetUp               2 1.0 1.5209e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
>> 1.0e+01  0  0  0  0  0   0  0  0  0  0     0
>> KSPSolve            1001 1.0 1.9097e+02 1.0 5.73e+10 1.2 2.8e+08 6.8e+02
>> 1.5e+05100100100100100 100100100100100 60621
>> VecMDot            70070 1.0 6.9833e+01 2.8 9.69e+09 1.2 0.0e+00 0.0e+00
>> 7.0e+04 25 17  0  0 48  25 17  0  0 48 28235
>> VecNorm            74074 1.0 1.1570e+01 1.7 7.28e+08 1.2 0.0e+00 0.0e+00
>> 7.4e+04  5  1  0  0 51   5  1  0  0 51 12804
>> VecScale           73073 1.0 5.6676e-01 1.3 3.59e+08 1.2 0.0e+00 0.0e+00
>> 0.0e+00  0  1  0  0  0   0  1  0  0  0 128930
>> VecCopy             3003 1.0 1.0008e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecSet             77080 1.0 1.3647e+00 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> VecAXPY             6006 1.0 1.0779e-01 1.7 5.90e+07 1.2 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 111441
>> VecMAXPY           73073 1.0 9.2155e+00 1.3 1.04e+10 1.2 0.0e+00 0.0e+00
>> 0.0e+00  4 18  0  0  0   4 18  0  0  0 229192
>> VecScatterBegin    73073 1.0 7.0538e+00 4.4 0.00e+00 0.0 2.8e+08 6.8e+02
>> 0.0e+00  2  0100100  0   2  0100100  0     0
>> VecScatterEnd      73073 1.0 7.8382e+00 2.6 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  3  0  0  0  0   3  0  0  0  0     0
>> VecNormalize       73073 1.0 1.1774e+01 1.6 1.08e+09 1.2 0.0e+00 0.0e+00
>> 7.3e+04  5  2  0  0 50   5  2  0  0 50 18619
>> MatMult            73073 1.0 8.6056e+01 1.7 1.90e+10 1.3 2.8e+08 6.8e+02
>> 0.0e+00 36 33100100  0  36 33100100  0 44093
>> MatSolve           74074 1.0 5.4865e+01 1.2 1.71e+10 1.2 0.0e+00 0.0e+00
>> 0.0e+00 27 30  0  0  0  27 30  0  0  0 63153
>> MatLUFactorNum         1 1.0 4.1230e-03 2.6 9.89e+05241.4 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 36155
>> MatILUFactorSym        1 1.0 2.1942e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatAssemblyBegin       2 1.0 5.6112e-03 4.8 0.00e+00 0.0 0.0e+00 0.0e+00
>> 4.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatAssemblyEnd         2 1.0 6.3889e-03 1.0 0.00e+00 0.0 7.8e+03 1.7e+02
>> 8.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatGetRowIJ            1 1.0 2.8849e-0515.1 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatGetOrdering         1 1.0 1.2279e-04 1.6 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> PCSetUp                2 1.0 6.6662e-03 1.8 9.89e+05241.4 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 22361
>> PCSetUpOnBlocks     1001 1.0 7.5164e-03 1.7 9.89e+05241.4 0.0e+00 0.0e+00
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 19832
>> PCApply            74074 1.0 5.9613e+01 1.2 1.71e+10 1.2 0.0e+00 0.0e+00
>> 0.0e+00 29 30  0  0  0  29 30  0  0  0 58123
>>
>> ------------------------------------------------------------------------------------------------------------------------
>>
>> Memory usage is given in bytes:
>>
>> Object Type          Creations   Destructions     Memory  Descendants'
>> Mem.
>> Reports information only for process 0.
>>
>> --- Event Stage 0: Main Stage
>>
>>        Krylov Solver     2              2        19576     0.
>>      DMKSP interface     1              1          656     0.
>>               Vector  1043           1043     42492328     0.
>>       Vector Scatter     2              2        41496     0.
>>               Matrix     4              4      3163588     0.
>>     Distributed Mesh     1              1         5080     0.
>> Star Forest Bipartite Graph     2              2         1728     0.
>>      Discrete System     1              1          872     0.
>>            Index Set     7              7        71796     0.
>>    IS L to G Mapping     1              1        28068     0.
>>       Preconditioner     2              2         1912     0.
>>               Viewer     1              0            0     0.
>>
>> ========================================================================================================================
>> Average time to get PetscTime(): 1.90735e-07
>> Average time for MPI_Barrier(): 0.000184202
>> Average time for zero size MPI_Send(): 1.03469e-05
>> #PETSc Option Table entries:
>> -log_summary
>> #End of PETSc Option Table entries
>> Compiled without FORTRAN kernels
>> Compiled with full precision matrices (default)
>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>> Configure options: --download-f2cblaslapack --with-fc=0
>> --with-debugging=0 COPTFLAGS=-O3 CXXOPTFLAGS=-O3
>>
>>
>> Regarding Matt's answer: It's generally a rectangular grid (3D) of
>> predetermined size (not necessarily a cube). Additionally, objects of
>> arbitrary shape can be defined by Dirichlet boundary conditions. Is
>> geometric MG still viable?
>>
>> Thanks,
>> Michael
>>
>>
>>
>> Am 03.06.2016 um 14:32 schrieb Matthew Knepley:
>>
>> On Fri, Jun 3, 2016 at 5:56 AM, Dave May < <dave.mayhem23 at gmail.com>
>> dave.mayhem23 at gmail.com> wrote:
>>
>>> On 3 June 2016 at 11:37, Michael Becker <
>>> <Michael.Becker at physik.uni-giessen.de>
>>> Michael.Becker at physik.uni-giessen.de> wrote:
>>>
>>>> Dear all,
>>>>
>>>> I have a few questions regarding possible performance enhancements for
>>>> the PETSc solver I included in my project.
>>>>
>>>> It's a particle-in-cell plasma simulation written in C++, where
>>>> Poisson's equation needs to be solved repeatedly on every timestep.
>>>> The simulation domain is discretized using finite differences, so the
>>>> solver therefore needs to be able to efficiently solve the linear system A
>>>> x = b successively with changing b. The solution x of the previous timestep
>>>> is generally a good initial guess for the solution.
>>>>
>>>> I wrote a class PETScSolver that holds all PETSc objects and necessary
>>>> information about domain size and decomposition. To solve the linear
>>>> system, two arrays, 'phi' and 'charge', are passed to a member function
>>>> solve(), where they are copied to PETSc vectors, and KSPSolve() is called.
>>>> After convergence, the solution is then transferred again to the phi array
>>>> so that other program parts can use it.
>>>>
>>>> The matrix is created using DMDA. An array 'bound' is used to determine
>>>> whether a node is either a Dirichlet BC or holds a charge.
>>>>
>>>> I attached three files, petscsolver.h, petscsolver.cpp and main.cpp,
>>>> that contain a shortened version of the solver class and a set-up to
>>>> initialize and run a simple problem.
>>>>
>>>> Is there anything I can change to generally make the program run faster?
>>>>
>>>
>>> Before changing anything, you should profile your code to see where time
>>> is being spent.
>>>
>>> To that end, you should compile an optimized build of petsc, link it to
>>> you application and run your code with the option -log_summary. The
>>> -log_summary flag will generate a performance profile of specific
>>> functionality within petsc (KSPSolve, MatMult etc) so you can see where all
>>> the time is being spent.
>>>
>>> As a second round of profiling, you should consider registering specific
>>> functionality in your code you think is performance critical.
>>> You can do this using the function PetscLogStageRegister()
>>>
>>>
>>> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Profiling/PetscLogStageRegister.html
>>>
>>> Check out the examples listed at the bottom of this web page to see how
>>> to log stages. Once you've registered stages, these will appear in the
>>> report provided by -log_summary
>>>
>>
>> Do everything Dave said. I will also note that since you are using FD, I
>> am guessing you are solving on a square. Then
>> you should really be using geometric MG. We support this through the DMDA
>> object.
>>
>>   Thanks,
>>
>>      Matt
>>
>>
>>> Thanks,
>>>   Dave
>>>
>>>
>>> And, since I'm rather unexperienced with KSP methods, how do I
>>>> efficiently choose PC and KSP? Just by testing every combination?
>>>> Would multigrid be a viable option as a pure solver (-ksp_type preonly)?
>>>>
>>>> Thanks,
>>>> Michael
>>>>
>>>
>>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20160607/9b85943f/attachment-0001.html>