<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hello again,</p>
This took me longer than I anticipated, but here we go.<br>
I did reruns of the cases where only half the processes per node
were used (without -log_sync):<br>
<br>
<pre>
                  125 procs, 1st     125 procs, 2nd     1000 procs, 1st    1000 procs, 2nd
Event             Max        Ratio   Max        Ratio   Max        Ratio   Max        Ratio
KSPSolve          1.203E+02    1.0   1.210E+02    1.0   1.399E+02    1.1   1.365E+02    1.0
VecTDot           6.376E+00    3.7   6.551E+00    4.0   7.885E+00    2.9   7.175E+00    3.4
VecNorm           4.579E+00    7.1   5.803E+00   10.2   8.534E+00    6.9   6.026E+00    4.9
VecScale          1.070E-01    2.1   1.129E-01    2.2   1.301E-01    2.5   1.270E-01    2.4
VecCopy           1.123E-01    1.3   1.149E-01    1.3   1.301E-01    1.6   1.359E-01    1.6
VecSet            7.063E-01    1.7   6.968E-01    1.7   7.432E-01    1.8   7.425E-01    1.8
VecAXPY           1.166E+00    1.4   1.167E+00    1.4   1.221E+00    1.5   1.279E+00    1.6
VecAYPX           1.317E+00    1.6   1.290E+00    1.6   1.536E+00    1.9   1.499E+00    2.0
VecScatterBegin   6.142E+00    3.2   5.974E+00    2.8   6.448E+00    3.0   6.472E+00    2.9
VecScatterEnd     3.606E+01    4.2   3.551E+01    4.0   5.244E+01    2.7   4.995E+01    2.7
MatMult           3.561E+01    1.6   3.403E+01    1.5   3.435E+01    1.4   3.332E+01    1.4
MatMultAdd        1.124E+01    2.0   1.130E+01    2.1   2.093E+01    2.9   1.995E+01    2.7
MatMultTranspose  1.372E+01    2.5   1.388E+01    2.6   1.477E+01    2.2   1.381E+01    2.1
MatSolve          1.949E-02    0.0   1.653E-02    0.0   4.789E-02    0.0   4.466E-02    0.0
MatSOR            6.610E+01    1.3   6.673E+01    1.3   7.111E+01    1.3   7.105E+01    1.3
MatResidual       2.647E+01    1.7   2.667E+01    1.7   2.446E+01    1.4   2.467E+01    1.5
PCSetUpOnBlocks   5.266E-03    1.4   5.295E-03    1.4   5.427E-03    1.5   5.289E-03    1.4
PCApply           1.031E+02    1.0   1.035E+02    1.0   1.180E+02    1.0   1.164E+02    1.0
</pre>
I also slimmed down my code and basically wrote a simple weak
scaling test (source files attached) so you can profile it yourself.
I appreciate the offer Junchao, thank you.<br>
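<p>As a rough aid for reading the timing table above, the following Python sketch computes, per event, the ratio of the 1000-process time to the 125-process time. The numbers are the maximum times from the "1st" columns for a handful of events; the names TIMES, weak_scaling_ratio and ratios are hypothetical and not part of the attached test code. A ratio of 1.0 would correspond to ideal weak scaling.</p>
<pre>
```python
# Maximum times in seconds, copied from the "1st" columns of the table above.
# Each entry is (time with 125 procs, time with 1000 procs).
TIMES = {
    "KSPSolve": (1.203e02, 1.399e02),
    "VecScatterEnd": (3.606e01, 5.244e01),
    "MatMult": (3.561e01, 3.435e01),
    "MatMultAdd": (1.124e01, 2.093e01),
    "MatSOR": (6.610e01, 7.111e01),
    "PCApply": (1.031e02, 1.180e02),
}


def weak_scaling_ratio(t_small, t_large):
    """Return t_large / t_small; 1.0 would mean ideal weak scaling."""
    return t_large / t_small


def ratios(times):
    """Map each event name to its 1000-proc / 125-proc time ratio."""
    return {name: weak_scaling_ratio(a, b) for name, (a, b) in times.items()}


if __name__ == "__main__":
    # Print the events sorted so that the worst-scaling one comes first.
    for name, r in sorted(ratios(TIMES).items(), key=lambda kv: -kv[1]):
        print(f"{name:15s} {r:5.2f}")
```
</pre>
<p>Sorting by ratio puts the events that grow most with the process count at the top, which is where a bar chart of the two cases would show the largest gap.</p>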
You can adjust the system size per processor at runtime via
"-nodes_per_proc 30" and the number of repeated calls to the
function containing KSPSolve() via "-iterations 1000". The physical
problem is simply calculating the electric potential from a
homogeneous charge distribution, done multiple times to accumulate
time in KSPSolve().<br>
A job would be started using something like<br>
<blockquote>mpirun -n 125 ~/petsc_ws/ws_test -nodes_per_proc 30
-mesh_size 1E-4 -iterations 1000 \<br>
-ksp_rtol 1E-6 \<br>
-log_view -log_sync \<br>
-pc_type gamg -pc_gamg_type classical \<br>
-ksp_type cg \<br>
-ksp_norm_type unpreconditioned \<br>
-mg_levels_ksp_type richardson \<br>
-mg_levels_ksp_norm_type none \<br>
-mg_levels_pc_type sor \<br>
-mg_levels_ksp_max_it 1 \<br>
-mg_levels_pc_sor_its 1 \<br>
-mg_levels_esteig_ksp_type cg \<br>
-mg_levels_esteig_ksp_max_it 10 \<br>
-gamg_est_ksp_type cg<br>
</blockquote>
The job should ideally be started on a cube number of processes to
obtain a cubical process grid.<br>
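<p>To illustrate the cube-number requirement, here is a small Python sketch that lists cube process counts and assembles a matching mpirun command line. The executable path and the options are taken from the command above (only a subset of the solver options is included for brevity); the helper names cube_counts and build_command are hypothetical.</p>
<pre>
```python
def cube_counts(max_side):
    """Return n**3 for n = 1..max_side, i.e. sizes of n x n x n process grids."""
    return [n ** 3 for n in range(1, max_side + 1)]


def build_command(nprocs, nodes_per_proc=30, iterations=1000):
    """Assemble the mpirun argument list for one weak-scaling run."""
    return [
        "mpirun", "-n", str(nprocs), "~/petsc_ws/ws_test",
        "-nodes_per_proc", str(nodes_per_proc),
        "-mesh_size", "1E-4",
        "-iterations", str(iterations),
        "-ksp_rtol", "1E-6",
        "-log_view",
        "-pc_type", "gamg", "-pc_gamg_type", "classical",
        "-ksp_type", "cg",
    ]


if __name__ == "__main__":
    # One command line per cubical process grid, from 1 up to 10**3 ranks.
    for nprocs in cube_counts(10):
        print(" ".join(build_command(nprocs)))
```
</pre>
<p>The arguments are joined with plain spaces rather than shell-quoted so that the "~" in the path is still expanded when a printed line is pasted into a shell.</p>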
Using 125 processes and 10,000 iterations I get the output in
"log_view_125_new.txt", which shows the same imbalance for me.<br>
<p>Michael<br>
</p>
<br>
<br>
<div class="moz-cite-prefix">On 2 June 2018 at 13:40, Mark Adams wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CADOhEh7VQSq5j1N8Om7XPFQvrmdyGxgDvRS6AU_KUbSNB-3gPA@mail.gmail.com">
<div dir="ltr"><br>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Fri, Jun 1, 2018 at 11:20 PM,
Junchao Zhang <span dir="ltr">&lt;<a
href="mailto:jczhang@mcs.anl.gov" target="_blank"
moz-do-not-send="true">jczhang@mcs.anl.gov</a>&gt;</span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">Hi, Michael,
<div> You can add -log_sync in addition to -log_view, which
adds barriers to certain events but measures barrier
time separately from the events. I find this option
makes it easier to interpret log_view output.</div>
</div>
</blockquote>
<div><br>
</div>
<div>That is great (good to know).</div>
<div><br>
</div>
<div>This should give us a better idea of whether your large
VecScatter costs come from slow communication or whether it
is catching some sort of load imbalance.</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="gmail_extra"><span class="HOEnZb"><font
color="#888888"><br clear="all">
<div>
<div class="m_6325123414585814924gmail_signature"
data-smartmail="gmail_signature">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
</font></span>
<div>
<div class="h5">
<br>
<div class="gmail_quote">On Wed, May 30, 2018 at
3:27 AM, Michael Becker <span dir="ltr">&lt;<a
href="mailto:Michael.Becker@physik.uni-giessen.de"
target="_blank" moz-do-not-send="true">Michael.Becker@physik.uni-<wbr>giessen.de</a>&gt;</span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0
0 .8ex;border-left:1px #ccc
solid;padding-left:1ex">
<div bgcolor="#FFFFFF"> Barry: On its way. Could
take a couple days again.<br>
<br>
Junchao: I unfortunately don't have access to
a cluster with a faster network. This one has
a mixed 4X QDR-FDR InfiniBand 2:1 blocking
fat-tree network, which I realize causes
parallel slowdown if the nodes are not
connected to the same switch. Each node has 24
processors (2x12/socket) and four NUMA domains
(two for each socket).<br>
The ranks are usually not distributed
perfectly evenly, i.e. for 125 processes, of the
six required nodes, five would use 21 cores
and one 20.<br>
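<p>The distribution described above (125 ranks on six 24-core nodes, five with 21 ranks and one with 20) is what an as-even-as-possible split produces. A minimal Python sketch, assuming the scheduler spreads the ranks that way; the function name distribute_ranks is hypothetical:</p>
<pre>
```python
import math


def distribute_ranks(nranks, cores_per_node):
    """Spread nranks as evenly as possible over the fewest nodes that fit them."""
    nnodes = math.ceil(nranks / cores_per_node)
    base, extra = divmod(nranks, nnodes)
    # The first `extra` nodes get one rank more than the remaining ones.
    return [base + 1] * extra + [base] * (nnodes - extra)


if __name__ == "__main__":
    print(distribute_ranks(125, 24))  # [21, 21, 21, 21, 21, 20]
```
</pre>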
Would using another CPU type make a difference
communication-wise? I could switch to faster
ones (on the same network), but I always
assumed this would only improve performance of
the stuff that is unrelated to communication.<span
class="m_6325123414585814924gmail-HOEnZb"><font
color="#888888"><br>
<br>
Michael</font></span>
<div>
<div class="m_6325123414585814924gmail-h5"><br>
<br>
<br>
<blockquote type="cite">
<div
class="m_6325123414585814924gmail-m_-6224261030758995442moz-text-html"
lang="x-unicode">
<div dir="ltr">
<div>The log files have something
like "Average time for zero size
MPI_Send(): 1.84231e-05". It looks like
you ran on a cluster with a very
slow network. A typical machine
should give less than 1/10 of the
latency you have. An easy way to
try is just running the code on a
machine with a faster network and
see what happens.<br>
</div>
<br>
<div>Also, how many cores &amp; NUMA
domains does a compute node have?
I could not figure out how you
distributed the 125 MPI ranks
evenly.</div>
</div>
<div class="gmail_extra"><br
clear="all">
<div>
<div
class="m_6325123414585814924gmail-m_-6224261030758995442gmail_signature">
<div dir="ltr">--Junchao Zhang</div>
</div>
</div>
<br>
<div class="gmail_quote">On Tue, May
29, 2018 at 6:18 AM, Michael
Becker <span dir="ltr">&lt;<a
href="mailto:Michael.Becker@physik.uni-giessen.de"
target="_blank"
moz-do-not-send="true">Michael.Becker@physik.uni-gie<wbr>ssen.de</a>&gt;</span>
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>Hello again,</p>
<p>here are the updated
log_view files for 125 and
1000 processors. I ran both
problems twice, the first
time with all processors per
node allocated ("-1.txt"),
the second with only half on
twice the number of nodes
("-2.txt").<br>
</p>
<span> <br>
<blockquote type="cite">
<blockquote type="cite">
<pre>On May 24, 2018, at 12:24 AM, Michael Becker <a class="m_6325123414585814924gmail-m_-6224261030758995442m_541343124460185301moz-txt-link-rfc2396E" href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank" moz-do-not-send="true">&lt;Michael.Becker@physik.uni-gie<wbr>ssen.de&gt;</a> wrote:
I noticed that for every individual KSP iteration, six vector objects are created and destroyed (with CG, more with e.g. GMRES).
</pre>
</blockquote>
<pre> Hmm, it is certainly not intended that vectors be created and destroyed within each KSPSolve(). Could you please point us to the code that makes you think they are being created and destroyed? We create all the work vectors at KSPSetUp() and destroy them in KSPReset(), not during the solve. Not that this would make a measurable difference.
</pre>
</blockquote>
<br>
</span> I mean this, right in
the log_view output:<br>
<br>
<blockquote type="cite"><font
size="-1">Memory usage is
given in bytes:<br>
</font> <font size="-1"><br>
Object Type Creations
Destructions Memory
Descendants' Mem.<br>
Reports information only
for process 0.<br>
</font> <font size="-1"><br>
--- Event Stage 0: Main
Stage<br>
</font> <font size="-1"><br>
...<br>
</font> <font size="-1"><br>
--- Event Stage 1: First
Solve<br>
</font> <font size="-1"><br>
...<br>
</font> <font size="-1"><br>
--- Event Stage 2:
Remaining Solves<br>
</font> <font size="-1"><br>
Vector 23904 23904
1295501184 0. </font></blockquote>
I logged the exact number of
KSP iterations over the 999
timesteps and it's exactly
23904/6 = 3984. <span>
<p>Michael<br>
</p>
<p><br>
</p>
<br>
<div
class="m_6325123414585814924gmail-m_-6224261030758995442m_541343124460185301moz-cite-prefix">On
24 May 2018 at 19:50,
Smith, Barry F. wrote:<br>
</div>
</span>
<div>
<div
class="m_6325123414585814924gmail-m_-6224261030758995442h5">
<blockquote type="cite">
<pre> Please send the log file for 1000 with cg as the solver.
You should make a bar chart of each event for the two cases to see which ones are taking more time and which are taking less (we cannot tell with the two logs you sent us since they are for different solvers.)
</pre>
<blockquote type="cite">
<pre>On May 24, 2018, at 12:24 AM, Michael Becker <a class="m_6325123414585814924gmail-m_-6224261030758995442m_541343124460185301moz-txt-link-rfc2396E" href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank" moz-do-not-send="true">&lt;Michael.Becker@physik.uni-gie<wbr>ssen.de&gt;</a> wrote:
I noticed that for every individual KSP iteration, six vector objects are created and destroyed (with CG, more with e.g. GMRES).
</pre>
</blockquote>
<pre> Hmm, it is certainly not intended that vectors be created and destroyed within each KSPSolve(). Could you please point us to the code that makes you think they are being created and destroyed? We create all the work vectors at KSPSetUp() and destroy them in KSPReset(), not during the solve. Not that this would make a measurable difference.
</pre>
<blockquote type="cite">
<pre>This seems kind of wasteful, is this supposed to be like this? Is this even the reason for my problems? Apart from that, everything seems quite normal to me (but I'm not the expert here).
Thanks in advance.
Michael
&lt;log_view_125procs.txt&gt;&lt;log_vi<wbr>ew_1000procs.txt&gt;
</pre>
</blockquote>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</blockquote>
<br>
</body>
</html>