<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On 3 June 2016 at 15:34, Michael Becker <span dir="ltr"><<a href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank">Michael.Becker@physik.uni-giessen.de</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
So using -log_summary helps me find out how much time is actually
spent on the PETSc routines that are repeatedly called. Since that
part of my code is fairly simple:<br>
<br>
PetscScalar *barray;<br>
VecGetArray(b,&barray);<br>
for (int i=0; i<Nall; i++) {<br>
if (bound[i]==0)<br>
barray[i] = charge[i]*ih*iepsilon0;<br>
else<br>
barray[i] = phi[i];<br>
}<br>
VecRestoreArray(b,&barray);<br>
<br>
KSPSolve(ksp,b,x);<br>
<br>
KSPGetSolution(ksp,&x);<br>
PetscScalar *xarray;<br>
VecGetArray(x,&xarray);<br>
for (int i=0; i<Nall; i++)<br>
phi[i] = xarray[i];<br>
VecRestoreArray(x,&xarray);<br>
<br></div></blockquote><div><br></div><div>Well - you also have user code which assembles a matrix...<br></div><div><br>However, it seems assembly is not taking much time here.<br></div><div>Note that this is not always the case, for instance when the matrix preallocation is done incorrectly.<br></div><div><br> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
, I don't see how additional log stages would help me. </div></blockquote><div><br></div><div>The point of profiling is to precisely identify the calls which consume the most time, <br>rather than just _assuming_ which functions consume the largest fraction of time.<br><br></div><div>Now we are all on the same page and can correctly state that the solve is the problem.<br></div><div><br>Without knowing anything other than that you are solving Poisson, the simplest preconditioner to try which <br>can yield scalable and optimal results is algebraic multigrid.<br>Try the option:<br> -pc_type gamg<br></div>
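<div><br></div><div>For example, a sketch of a trial run (the executable name and process count here are just placeholders for your own job) could look like:<br><br>  mpiexec -n 8 ./your_app -ksp_type cg -pc_type gamg -ksp_monitor_true_residual -ksp_converged_reason -log_summary<br><br>If the assembled operator is symmetric (a standard FD Poisson discretization usually is), CG is a natural Krylov method to pair with it; otherwise keep the default GMRES.<br></div>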
<br> <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">So I would then still just test which KSP method is the fastest?<br>
<br>
I ran a test over 1000 iterations; this is the output:<br>
<blockquote type="cite"> Max
Max/Min Avg Total <br>
Time (sec): 1.916e+02 1.00055 1.915e+02<br>
Objects: 1.067e+03 1.00000 1.067e+03<br>
Flops: 5.730e+10 1.22776 5.360e+10
1.158e+13<br>
Flops/sec: 2.992e+08 1.22792 2.798e+08
6.044e+10<br>
MPI Messages: 1.900e+06 3.71429 1.313e+06
2.835e+08<br>
MPI Message Lengths: 1.138e+09 2.38189 6.824e+02
1.935e+11<br>
MPI Reductions: 1.462e+05 1.00000<br>
<br>
Flop counting convention: 1 flop = 1 real number operation of type
(multiply/divide/add/subtract)<br>
e.g., VecAXPY() for real vectors of
length N --> 2N flops<br>
and VecAXPY() for complex vectors of
length N --> 8N flops<br>
<br>
Summary of Stages: ----- Time ------ ----- Flops ----- ---
Messages --- -- Message Lengths -- -- Reductions --<br>
Avg %Total Avg %Total
counts %Total Avg %Total counts %Total <br>
0: Main Stage: 1.9154e+02 100.0% 1.1577e+13 100.0%
2.835e+08 100.0% 6.824e+02 100.0% 1.462e+05 100.0% <br>
<br>
------------------------------------------------------------------------------------------------------------------------<br>
See the 'Profiling' chapter of the users' manual for details on
interpreting output.<br>
Phase summary info:<br>
Count: number of times phase was executed<br>
Time and Flops: Max - maximum over all processors<br>
Ratio - ratio of maximum to minimum over all
processors<br>
Mess: number of messages sent<br>
Avg. len: average message length (bytes)<br>
Reduct: number of global reductions<br>
Global: entire computation<br>
Stage: stages of a computation. Set stages with
PetscLogStagePush() and PetscLogStagePop().<br>
%T - percent time in this phase %F - percent flops
in this phase<br>
%M - percent messages in this phase %L - percent message
lengths in this phase<br>
%R - percent reductions in this phase<br>
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max
time over all processors)<br>
------------------------------------------------------------------------------------------------------------------------<br>
Event Count Time (sec)
Flops --- Global --- --- Stage ---
Total<br>
Max Ratio Max Ratio Max Ratio Mess
Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s<br>
------------------------------------------------------------------------------------------------------------------------<br>
<br>
--- Event Stage 0: Main Stage<br>
<br>
KSPGMRESOrthog 70070 1.0 7.8035e+01 2.3 1.94e+10 1.2 0.0e+00
0.0e+00 7.0e+04 29 34 0 0 48 29 34 0 0 48 50538<br>
KSPSetUp 2 1.0 1.5209e-03 1.1 0.00e+00 0.0 0.0e+00
0.0e+00 1.0e+01 0 0 0 0 0 0 0 0 0 0 0<br>
KSPSolve 1001 1.0 1.9097e+02 1.0 5.73e+10 1.2 2.8e+08
6.8e+02 1.5e+05100100100100100 100100100100100 60621<br>
VecMDot 70070 1.0 6.9833e+01 2.8 9.69e+09 1.2 0.0e+00
0.0e+00 7.0e+04 25 17 0 0 48 25 17 0 0 48 28235<br>
VecNorm 74074 1.0 1.1570e+01 1.7 7.28e+08 1.2 0.0e+00
0.0e+00 7.4e+04 5 1 0 0 51 5 1 0 0 51 12804<br>
VecScale 73073 1.0 5.6676e-01 1.3 3.59e+08 1.2 0.0e+00
0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 128930<br>
VecCopy 3003 1.0 1.0008e-01 1.6 0.00e+00 0.0 0.0e+00
0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
VecSet 77080 1.0 1.3647e+00 1.4 0.00e+00 0.0 0.0e+00
0.0e+00 0.0e+00 1 0 0 0 0 1 0 0 0 0 0<br>
VecAXPY 6006 1.0 1.0779e-01 1.7 5.90e+07 1.2 0.0e+00
0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 111441<br>
VecMAXPY 73073 1.0 9.2155e+00 1.3 1.04e+10 1.2 0.0e+00
0.0e+00 0.0e+00 4 18 0 0 0 4 18 0 0 0 229192<br>
VecScatterBegin 73073 1.0 7.0538e+00 4.4 0.00e+00 0.0 2.8e+08
6.8e+02 0.0e+00 2 0100100 0 2 0100100 0 0<br>
VecScatterEnd 73073 1.0 7.8382e+00 2.6 0.00e+00 0.0 0.0e+00
0.0e+00 0.0e+00 3 0 0 0 0 3 0 0 0 0 0<br>
VecNormalize 73073 1.0 1.1774e+01 1.6 1.08e+09 1.2 0.0e+00
0.0e+00 7.3e+04 5 2 0 0 50 5 2 0 0 50 18619<br>
MatMult 73073 1.0 8.6056e+01 1.7 1.90e+10 1.3 2.8e+08
6.8e+02 0.0e+00 36 33100100 0 36 33100100 0 44093<br>
MatSolve 74074 1.0 5.4865e+01 1.2 1.71e+10 1.2 0.0e+00
0.0e+00 0.0e+00 27 30 0 0 0 27 30 0 0 0 63153<br>
MatLUFactorNum 1 1.0 4.1230e-03 2.6 9.89e+05241.4 0.0e+00
0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 36155<br>
MatILUFactorSym 1 1.0 2.1942e-03 1.3 0.00e+00 0.0 0.0e+00
0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
MatAssemblyBegin 2 1.0 5.6112e-03 4.8 0.00e+00 0.0 0.0e+00
0.0e+00 4.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
MatAssemblyEnd 2 1.0 6.3889e-03 1.0 0.00e+00 0.0 7.8e+03
1.7e+02 8.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
MatGetRowIJ 1 1.0 2.8849e-0515.1 0.00e+00 0.0 0.0e+00
0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
MatGetOrdering 1 1.0 1.2279e-04 1.6 0.00e+00 0.0 0.0e+00
0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0<br>
PCSetUp 2 1.0 6.6662e-03 1.8 9.89e+05241.4 0.0e+00
0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 22361<br>
PCSetUpOnBlocks 1001 1.0 7.5164e-03 1.7 9.89e+05241.4 0.0e+00
0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 19832<br>
PCApply 74074 1.0 5.9613e+01 1.2 1.71e+10 1.2 0.0e+00
0.0e+00 0.0e+00 29 30 0 0 0 29 30 0 0 0 58123<br>
------------------------------------------------------------------------------------------------------------------------<br>
<br>
Memory usage is given in bytes:<br>
<br>
Object Type Creations Destructions Memory
Descendants' Mem.<br>
Reports information only for process 0.<br>
<br>
--- Event Stage 0: Main Stage<br>
<br>
Krylov Solver 2 2 19576 0.<br>
DMKSP interface 1 1 656 0.<br>
Vector 1043 1043 42492328 0.<br>
Vector Scatter 2 2 41496 0.<br>
Matrix 4 4 3163588 0.<br>
Distributed Mesh 1 1 5080 0.<br>
Star Forest Bipartite Graph 2 2 1728
0.<br>
Discrete System 1 1 872 0.<br>
Index Set 7 7 71796 0.<br>
IS L to G Mapping 1 1 28068 0.<br>
Preconditioner 2 2 1912 0.<br>
Viewer 1 0 0 0.<br>
========================================================================================================================<br>
Average time to get PetscTime(): 1.90735e-07<br>
Average time for MPI_Barrier(): 0.000184202<br>
Average time for zero size MPI_Send(): 1.03469e-05<br>
#PETSc Option Table entries:<br>
-log_summary<br>
#End of PETSc Option Table entries<br>
Compiled without FORTRAN kernels<br>
Compiled with full precision matrices (default)<br>
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8 sizeof(PetscInt) 4<br>
Configure options: --download-f2cblaslapack --with-fc=0
--with-debugging=0 COPTFLAGS=-O3 CXXOPTFLAGS=-O3</blockquote>
<br>
Regarding Matt's answer: It's generally a rectangular grid (3D) of
predetermined size (not necessarily a cube). Additionally, objects
of arbitrary shape can be defined by Dirichlet boundary conditions.
Is geometric MG still viable?<br>
<br>
Thanks,<br>
Michael<div><div><br>
<br>
<br>
<div>On 03.06.2016 at 14:32, Matthew Knepley wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">On Fri, Jun 3, 2016 at 5:56 AM, Dave
May <span dir="ltr"><<a href="mailto:dave.mayhem23@gmail.com" target="_blank">dave.mayhem23@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote"><span>On 3 June 2016
at 11:37, Michael Becker <span dir="ltr"><<a href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank">Michael.Becker@physik.uni-giessen.de</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Dear all,<br>
<br>
I have a few questions regarding possible
performance enhancements for the PETSc solver I
included in my project.<br>
<br>
It's a particle-in-cell plasma simulation
written in C++, where Poisson's equation needs
to be solved repeatedly on every timestep.<br>
The simulation domain is discretized using
finite differences, so the solver
needs to be able to efficiently solve the linear
system A x = b successively with changing b. The
solution x of the previous timestep is generally
a good initial guess for the solution.<br>
<br>
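A minimal sketch of how to exploit that initial guess (assuming the ksp, b and x objects from the attached solver class) is to mark the guess as nonzero during setup, so that KSPSolve() does not zero x before each solve:<br>
<br>
// sketch: reuse the previous solution as the starting point<br>
KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);<br>
// ... each timestep: refill b, keep x from the previous solve ...<br>
KSPSolve(ksp,b,x);<br>
<br>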
I wrote a class PETScSolver that holds all PETSc
objects and necessary information about domain
size and decomposition. To solve the linear
system, two arrays, 'phi' and 'charge', are
passed to a member function solve(), where they
are copied to PETSc vectors, and KSPSolve() is
called. After convergence, the solution is then
transferred back to the phi array so that other
parts of the program can use it.<br>
<br>
The matrix is created using DMDA. An array
'bound' is used to determine whether a node is
a Dirichlet boundary node or holds a charge.<br>
<br>
I attached three files, petscsolver.h,
petscsolver.cpp and main.cpp, that contain a
shortened version of the solver class and a
set-up to initialize and run a simple problem.<br>
<br>
Is there anything I can change to generally make
the program run faster?<br>
</blockquote>
<div><br>
</div>
</span>
<div>Before changing anything, you should profile
your code to see where time is being spent.<br>
<br>
To that end, you should compile an optimized build
of PETSc, link it to your application and run your
code with the option -log_summary. The
-log_summary flag will generate a performance
profile of specific functionality within PETSc
(KSPSolve, MatMult, etc.) so you can see where all
the time is being spent.<br>
<br>
</div>
<div>As a second round of profiling, you should
consider registering log stages for the specific
functionality in your code that you think is
performance critical. <br>
You can do this using the function
PetscLogStageRegister():<br>
<br>
<a href="http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Profiling/PetscLogStageRegister.html" target="_blank">http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Profiling/PetscLogStageRegister.html</a>
<br>
</div>
<div><br>
</div>
<div>Check out the examples listed at the bottom of
that web page to see how to log stages. Once
you've registered stages, they will appear in the
report produced by -log_summary.</div>
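<div><br></div><div>A minimal sketch of how that looks in code (the stage name and the profiled region are just placeholders) is:<br><br>PetscLogStage stage;<br>PetscLogStageRegister("PoissonSolve", &stage);  // shows up as its own stage in -log_summary<br>PetscLogStagePush(stage);<br>// ... the code to profile separately, e.g. the copy-in, KSPSolve() and copy-out ...<br>PetscLogStagePop();<br></div>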
</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>Do everything Dave said. I will also note that since
you are using FD, I am guessing you are solving on a
square. Then</div>
<div>you should really be using geometric MG. We support
this through the DMDA object.</div>
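<div><br></div><div>A sketch of what that can look like from the command line (assuming the KSP is attached to the DMDA via KSPSetDM() and the operator and right-hand side are provided through KSPSetComputeOperators()/KSPSetComputeRHS(), so PETSc can construct the coarser levels itself):<br><br>  -ksp_type cg -pc_type mg -da_refine 4 -mg_levels_ksp_type chebyshev -mg_levels_pc_type sor<br><br>The number of refinements is only an example; the point is that the grid hierarchy comes from the DMDA rather than from the assembled matrix.</div>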
<div><br>
</div>
<div> Thanks,</div>
<div><br>
</div>
<div> Matt</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>Thanks,<br>
</div>
<div> Dave<br>
</div>
<span>
<div> <br>
</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
And, since I'm rather inexperienced with KSP
methods, how do I efficiently choose the PC and KSP?
Just by testing every combination?<br>
Would multigrid be a viable option as a pure
solver (-ksp_type preonly)?<br>
<br>
Thanks,<br>
Michael<br>
</blockquote>
</span></div>
<br>
</div>
</div>
</blockquote>
</div>
<br>
<br clear="all">
<div><br>
</div>
-- <br>
<div data-smartmail="gmail_signature">What
most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results
to which their experiments lead.<br>
-- Norbert Wiener</div>
</div>
</div>
</blockquote>
<br>
</div></div></div>
</blockquote></div><br></div></div>