<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><br>
Nonetheless, algebraic multigrid seems to work really well compared
to other preconditioners. Can I expect any further runtime
improvements by using additional options like -pc_mg_smoothdown
<1>, or are the standard values generally sufficient?<br></div></blockquote><div><br></div><div>Getting almost one digit of reduction in the residual per iteration is about as good as it gets. If you are (far) away from that, there is some discussion in the documentation on how to try to improve GAMG. If GAMG is spending a lot more time in the matrix setup phase (RAP, or 'MatPtAP..' in the log view) than in 'KSPSolve', then you may be coarsening too slowly; you can use the threshold parameters and graph squaring to increase the coarsening rate, which decreases the setup cost and the cost per iteration and so reduces the overall solve time.</div>
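<div><br></div>
<div>For concreteness, a sketch of the kind of options meant here (the exact names and good values depend on the PETSc version and the problem, so treat these as illustrative rather than as recommendations from this thread):<br>
<br>
-pc_type gamg<br>
-pc_gamg_threshold <v>       (relative drop tolerance for edges in the aggregation graph)<br>
-pc_gamg_square_graph <n>    (number of levels on which to square the graph for faster coarsening)<br>
<br>
After changing these, compare the time spent in 'MatPtAP...' against 'KSPSolve' in the log output, and check the number of levels reported by -ksp_view, to see whether the coarsening rate changed as intended.<br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">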
<br>
Thanks,<br>
Michael<div><div class="h5"><br>
<br>
<br>
<div>On 03.06.2016 at 16:56, Dave May wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr"><br>
<div class="gmail_extra"><br>
<div class="gmail_quote">On 3 June 2016 at 15:34, Michael
Becker <span dir="ltr"><<a href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank">Michael.Becker@physik.uni-giessen.de</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> So using
-log_summary helps me find out how much time is actually
spent on the PETSc routines that are repeatedly called.
Since that part of my code is fairly simple:<br>
<br>
PetscScalar *barray;<br>
VecGetArray(b,&barray);<br>
for (int i=0; i<Nall; i++) {<br>
    if (bound[i]==0)<br>
        barray[i] = charge[i]*ih*iepsilon0;<br>
    else<br>
        barray[i] = phi[i];<br>
}<br>
VecRestoreArray(b,&barray);<br>
<br>
KSPSolve(ksp,b,x);<br>
<br>
KSPGetSolution(ksp,&x);<br>
PetscScalar *xarray;<br>
VecGetArray(x,&xarray);<br>
for (int i=0; i<Nall; i++)<br>
    phi[i] = xarray[i];<br>
VecRestoreArray(x,&xarray);<br>
<br>
</div>
</blockquote>
<div><br>
</div>
<div>Well - you also have user code which assembles a
matrix...<br>
</div>
<div><br>
However it seems assembly is not taking much time.<br>
</div>
<div>Note that this is not always the case, for instance if
the preallocation was done incorrectly.<br>
</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> , I don't see how
additional log states would help me. </div>
</blockquote>
<div><br>
</div>
<div>The point of profiling is to precisely identify calls
which consume the most time, <br>
rather than just _assuming_ which functions consume the
largest fraction of time.<br>
<br>
</div>
<div>Now we are all on the same page and can correctly state that the solve is the problem.<br>
</div>
<div><br>
Without knowing anything other than that you are solving Poisson, the simplest preconditioner to try out which can yield scalable and optimal results is algebraic multigrid.<br>
Try the option:<br>
-pc_type gamg<br>
</div>
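<div><br>
As a minimal sketch of such a run (the executable name and process count are placeholders, not taken from this thread):<br>
<br>
mpiexec -n 8 ./yourapp -pc_type gamg -ksp_converged_reason -ksp_monitor_true_residual -log_summary<br>
<br>
-ksp_converged_reason and -ksp_monitor_true_residual report the iteration count and the per-iteration residual reduction, which is what the reply at the top of this thread (roughly one digit of reduction per iteration) refers to.<br>
</div>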
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">So I would then
still just test which KSP method is the fastest?<br>
<br>
I ran a test over 1000 iterations; this is the output:<br>
<blockquote type="cite">
Max       Max/Min   Avg       Total<br>
Time (sec):           1.916e+02  1.00055  1.915e+02<br>
Objects:              1.067e+03  1.00000  1.067e+03<br>
Flops:                5.730e+10  1.22776  5.360e+10  1.158e+13<br>
Flops/sec:            2.992e+08  1.22792  2.798e+08  6.044e+10<br>
MPI Messages:         1.900e+06  3.71429  1.313e+06  2.835e+08<br>
MPI Message Lengths:  1.138e+09  2.38189  6.824e+02  1.935e+11<br>
MPI Reductions:       1.462e+05  1.00000<br>
<br>
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)<br>
e.g., VecAXPY() for real vectors of length N --> 2N flops<br>
and VecAXPY() for complex vectors of length N --> 8N flops<br>
<br>
Summary of Stages:  ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --<br>
Avg %Total  Avg %Total  counts %Total  Avg %Total  counts %Total<br>
0: Main Stage: 1.9154e+02 100.0%  1.1577e+13 100.0%  2.835e+08 100.0%  6.824e+02 100.0%  1.462e+05 100.0%<br>
<br>
------------------------------------------------------------------------------------------------------------------------<br>
See the 'Profiling' chapter of the users' manual for details on interpreting output.<br>
Phase summary info:<br>
Count: number of times phase was executed<br>
Time and Flops: Max - maximum over all processors<br>
Ratio - ratio of maximum to minimum over all processors<br>
Mess: number of messages sent<br>
Avg. len: average message length (bytes)<br>
Reduct: number of global reductions<br>
Global: entire computation<br>
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().<br>
%T - percent time in this phase    %F - percent flops in this phase<br>
%M - percent messages in this phase    %L - percent message lengths in this phase<br>
%R - percent reductions in this phase<br>
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)<br>
------------------------------------------------------------------------------------------------------------------------<br>
Event    Count    Time (sec)    Flops    --- Global ---    --- Stage ---    Total<br>
Max Ratio  Max Ratio  Max Ratio  Mess  Avg len  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s<br>
------------------------------------------------------------------------------------------------------------------------<br>
<br>
--- Event Stage 0: Main Stage<br>
<br>
KSPGMRESOrthog 70070 1.0 7.8035e+01 2.3 1.94e+10 1.2 0.0e+00 0.0e+00 7.0e+04 29 34 0 0 48 29 34 0 0 48 50538<br>
KSPSetUp 2 1.0 1.5209e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+01 0 0 0 0 0 0 0 0 0 0 0<br>
KSPSolve 1001 1.0 1.9097e+02 1.0 5.73e+10 1.2 2.8e+08 6.8e+02 1.5e+05100100100100100 100100100100100 60621<br>
VecMDot 70070 1.0 6.9833e+01 2.8 9.69e+09 1.2 0.0e+00 0.0e+00 7.0e+04 25 17 0 0 48 25 17 0 0 48 28235<br>
VecNorm 74074 1.0 1.1570e+01 1.7 7.28e+08 1.2 0.0e+00 0.0e+00 7.4e+04 5 1 0 0 51 5 1 0 0 51 12804<br>
VecScale 73073 1.0 5.6676e-01 1.3 3.59e+08 1.2 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 128930<br>
VecCopy 3003 1.0 1.0008e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
VecSet 77080 1.0 1.3647e+00 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0<br>
VecAXPY 6006 1.0 1.0779e-01 1.7 5.90e+07 1.2 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 111441<br>
VecMAXPY 73073 1.0 9.2155e+00 1.3 1.04e+10 1.2 0.0e+00 0.0e+00 0.0e+00 4 18 0 0 0 4 18 0 0 0 229192<br>
VecScatterBegin 73073 1.0 7.0538e+00 4.4 0.00e+00 0.0 2.8e+08 6.8e+02 0.0e+00 2 0100100 0 2 0100100 0 0<br>
VecScatterEnd 73073 1.0 7.8382e+00 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 3 0 0 0 0 0<br>
VecNormalize 73073 1.0 1.1774e+01 1.6 1.08e+09 1.2 0.0e+00 0.0e+00 7.3e+04 5 2 0 0 50 5 2 0 0 50 18619<br>
MatMult 73073 1.0 8.6056e+01 1.7 1.90e+10 1.3 2.8e+08 6.8e+02 0.0e+00 36 33100100 0 36 33100100 0 44093<br>
MatSolve 74074 1.0 5.4865e+01 1.2 1.71e+10 1.2 0.0e+00 0.0e+00 0.0e+00 27 30 0 0 0 27 30 0 0 0 63153<br>
MatLUFactorNum 1 1.0 4.1230e-03 2.6 9.89e+05241.4 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 36155<br>
MatILUFactorSym 1 1.0 2.1942e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
MatAssemblyBegin 2 1.0 5.6112e-03 4.8 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
MatAssemblyEnd 2 1.0 6.3889e-03 1.0 0.00e+00 0.0 7.8e+03 1.7e+02 8.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
MatGetRowIJ 1 1.0 2.8849e-0515.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
MatGetOrdering 1 1.0 1.2279e-04 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
PCSetUp 2 1.0 6.6662e-03 1.8 9.89e+05241.4 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 22361<br>
PCSetUpOnBlocks 1001 1.0 7.5164e-03 1.7 9.89e+05241.4 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 19832<br>
PCApply 74074 1.0 5.9613e+01 1.2 1.71e+10 1.2 0.0e+00 0.0e+00 0.0e+00 29 30 0 0 0 29 30 0 0 0 58123<br>
------------------------------------------------------------------------------------------------------------------------<br>
<br>
Memory usage is given in bytes:<br>
<br>
Object Type    Creations    Destructions    Memory    Descendants' Mem.<br>
Reports information only for process 0.<br>
<br>
--- Event Stage 0: Main Stage<br>
<br>
Krylov Solver 2 2 19576 0.<br>
DMKSP interface 1 1 656 0.<br>
Vector 1043 1043 42492328 0.<br>
Vector Scatter 2 2 41496 0.<br>
Matrix 4 4 3163588 0.<br>
Distributed Mesh 1 1 5080 0.<br>
Star Forest Bipartite Graph 2 2 1728 0.<br>
Discrete System 1 1 872 0.<br>
Index Set 7 7 71796 0.<br>
IS L to G Mapping 1 1 28068 0.<br>
Preconditioner 2 2 1912 0.<br>
Viewer 1 0 0 0.<br>
========================================================================================================================<br>
Average time to get PetscTime(): 1.90735e-07<br>
Average time for MPI_Barrier(): 0.000184202<br>
Average time for zero size MPI_Send(): 1.03469e-05<br>
#PETSc Option Table entries:<br>
-log_summary<br>
#End of PETSc Option Table entries<br>
Compiled without FORTRAN kernels<br>
Compiled with full precision matrices (default)<br>
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4<br>
Configure options: --download-f2cblaslapack --with-fc=0 --with-debugging=0 COPTFLAGS=-O3 CXXOPTFLAGS=-O3</blockquote>
<br>
Regarding Matt's answer: It's generally a rectangular
grid (3D) of predetermined size (not necessarily a
cube). Additionally, objects of arbitrary shape can be
defined by Dirichlet boundary conditions. Is geometric
MG still viable?<br>
<br>
Thanks,<br>
Michael
<div>
<div><br>
<br>
<br>
<div>On 03.06.2016 at 14:32, Matthew Knepley wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">On Fri, Jun 3, 2016
at 5:56 AM, Dave May <span dir="ltr"><<a href="mailto:dave.mayhem23@gmail.com" target="_blank">dave.mayhem23@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote"><span>On 3
June 2016 at 11:37, Michael Becker
<span dir="ltr"><<a href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank"></a><a href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank">Michael.Becker@physik.uni-giessen.de</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Dear
all,<br>
<br>
I have a few questions regarding
possible performance
enhancements for the PETSc
solver I included in my project.<br>
<br>
It's a particle-in-cell plasma
simulation written in C++, where
Poisson's equation needs to be
solved repeatedly on every
timestep.<br>
The simulation domain is discretized using finite differences, so the solver needs to be able to efficiently solve the linear system A x = b successively with changing b. The solution x of the previous timestep is generally a good initial guess for the solution.<br>
<br>
I wrote a class PETScSolver that
holds all PETSc objects and
necessary information about
domain size and decomposition.
To solve the linear system, two
arrays, 'phi' and 'charge', are
passed to a member function
solve(), where they are copied
to PETSc vectors, and KSPSolve()
is called. After convergence,
the solution is then transferred back to the phi array so that other parts of the program can use it.<br>
<br>
The matrix is created using
DMDA. An array 'bound' is used to determine whether a node is a Dirichlet BC or holds a charge.<br>
<br>
I attached three files,
petscsolver.h, petscsolver.cpp
and main.cpp, that contain a
shortened version of the solver
class and a set-up to initialize
and run a simple problem.<br>
<br>
Is there anything I can change
to generally make the program
run faster?<br>
</blockquote>
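<div><br>
Regarding the remark above that the previous timestep's solution is a good initial guess: by default KSPSolve starts from a zero initial guess, so that reuse only takes effect if the nonzero-initial-guess flag is set. A minimal sketch (assuming x still holds the previous solution):<br>
<br>
KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);  /* do not zero x before solving */<br>
KSPSolve(ksp, b, x);                         /* x is used as the initial guess */<br>
</div>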
<div><br>
</div>
</span>
<div>Before changing anything, you
should profile your code to see
where time is being spent.<br>
<br>
To that end, you should compile an
optimized build of petsc, link it
to you application and run your
code with the option -log_summary.
The -log_summary flag will
generate a performance profile of
specific functionality within
petsc (KSPSolve, MatMult etc) so
you can see where all the time is
being spent.<br>
<br>
</div>
<div>As a second round of profiling, you should consider registering specific functionality in your code that you think is performance critical.<br>
You can do this using the function
PetscLogStageRegister()<br>
<br>
<a href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Profiling/PetscLogStageRegister.html" target="_blank">http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Profiling/PetscLogStageRegister.html</a>
<br>
</div>
<div><br>
</div>
<div>Check out the examples listed
at the bottom of this web page to
see how to log stages. Once you've
registered stages, these will
appear in the report provided by
-log_summary</div>
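<div><br>
For illustration, a minimal sketch of registering a stage around the solve portion (the stage name is arbitrary):<br>
<br>
PetscLogStage solve_stage;<br>
PetscLogStageRegister("PoissonSolve", &solve_stage);<br>
<br>
PetscLogStagePush(solve_stage);<br>
KSPSolve(ksp, b, x);   /* everything between push and pop is logged in this stage */<br>
PetscLogStagePop();<br>
<br>
The stage then shows up as a separate section in the -log_summary report, so the solve can be separated from setup and from the rest of the application.<br>
</div>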
</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>Do everything Dave said. I will also
note that since you are using FD, I am
guessing you are solving on a square. Then</div>
<div>you should really be using geometric
MG. We support this through the DMDA
object.</div>
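<div><br>
Since the matrix already comes from a DMDA, a minimal sketch of the usual geometric-MG setup looks like the following (ComputeRHS and ComputeMatrix are hypothetical user callbacks, not code from this thread):<br>
<br>
KSPCreate(PETSC_COMM_WORLD, &ksp);<br>
KSPSetDM(ksp, da);                                 /* let the KSP build the grid hierarchy from the DMDA */<br>
KSPSetComputeRHS(ksp, ComputeRHS, &user);<br>
KSPSetComputeOperators(ksp, ComputeMatrix, &user);<br>
KSPSetFromOptions(ksp);<br>
KSPSolve(ksp, NULL, NULL);<br>
<br>
and is then driven by run-time options such as -pc_type mg -pc_mg_levels 4 -da_refine 3 (illustrative values).<br>
</div>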
<div><br>
</div>
<div> Thanks,</div>
<div><br>
</div>
<div> Matt</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>Thanks,<br>
</div>
<div> Dave<br>
</div>
<span>
<div> <br>
</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
And, since I'm rather inexperienced with KSP methods, how do I efficiently choose the PC and KSP? Just by testing every combination?<br>
Would multigrid be a viable
option as a pure solver
(-ksp_type preonly)?<br>
<br>
Thanks,<br>
Michael<br>
</blockquote>
</span></div>
<br>
</div>
</div>
</blockquote>
</div>
<br>
<br clear="all">
<div><br>
</div>
-- <br>
<div data-smartmail="gmail_signature">What
most experimenters take for granted before
they begin their experiments is infinitely
more interesting than any results to which
their experiments lead.<br>
-- Norbert Wiener</div>
</div>
</div>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</blockquote>
<br>
</div></div></div>
</blockquote></div><br></div></div>