<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Yes, the increase in time is the problem. Not because of this 8%
in particular, but because it gets worse with more processes.</p>
<p>Performance does improve drastically on one processor (which I
had never tested before); I attached the log_view file. If
communication speed is the problem, then I assume using fewer
processors per node would improve performance, and I could in
principle investigate that. I assume there's no way to reduce the data volume.<br>
</p>
<p>But thanks either way, this helped a lot.</p>
<p>Michael<br>
</p>
<p><br>
</p>
<br>
<div class="moz-cite-prefix">On 24.05.2018 at 15:22, Mark
Adams wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CADOhEh6b7Sh+rhzbSZEZLnGL0Sp9-JdFGbgBg8mwN29Em95SPw@mail.gmail.com">
<div dir="ltr">
<div>The KSPSolve time goes from 128 to 138 seconds in going
from 125 to 1000 processes. Is this the problem?<br>
</div>
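<div>For reference, the size of this increase can be worked out directly. The following is a minimal sketch in plain C, not PETSc code; the function names are made up for illustration, and "weak-scaling efficiency" is taken here to mean the small-run time divided by the large-run time at fixed work per process.</div>
<pre>
```c
#include <assert.h>

/* Weak-scaling efficiency: time of the small run divided by time of the
   large run, with the work per process held fixed. 1.0 would be ideal.
   (Hypothetical helper, not part of PETSc.) */
double weak_scaling_efficiency(double t_small, double t_large)
{
    assert(t_small > 0.0 && t_large > 0.0);
    return t_small / t_large;
}

/* Relative increase in run time going from the small to the large run. */
double relative_slowdown(double t_small, double t_large)
{
    assert(t_small > 0.0 && t_large > 0.0);
    return (t_large - t_small) / t_small;
}
```
</pre>
<div>With the KSPSolve times quoted above, weak_scaling_efficiency(128.0, 138.0) is about 0.93 and relative_slowdown(128.0, 138.0) is about 0.078, i.e. an increase of roughly 8%.</div>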
<div><br>
</div>
<div>And as Lawrence pointed out there is a lot of "load"
imbalance. (This could come from a poor network.) VecAXPY has
no communication and has significant imbalance. But you seem
to have perfect actual load balance, so the imbalance in the
timings can come from cache effects....</div>
<div><br>
</div>
<div>And you are spending almost half the solve time in
VecScatter. If you really have this nice regular partitioning
of the problem, then your communication is slow, even on 125
processors. (So it is not a scaling issue here, but if you do
a one processor test you should see it).</div>
<div><br>
</div>
<div>Note, AMG coarse grids get bigger as the problem gets
bigger, so it is not perfectly scalable pre-asymptotically.
Nothing is really, because you don't saturate communication
until you have at least a 3^D process grid and various random
things will cause some non-perfect weak speedup.</div>
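<div>One possible way to read the remark about a 3^D process grid, sketched in plain C: on a non-periodic p x p x p process grid, only the (p-2)^3 interior ranks have a neighbour across all six faces, so the share of ranks doing the full ghost exchange keeps growing with p. The helper name is hypothetical and this is only an illustration of that counting argument, not a model of the actual communication cost.</div>
<pre>
```c
#include <assert.h>

/* Number of ranks in a non-periodic p x p x p process grid that have a
   neighbouring rank across all six faces. For p < 3 no rank is interior.
   (Hypothetical helper for illustration only.) */
long interior_ranks(int p)
{
    assert(p >= 1);
    if (p < 3) return 0;
    return (long)(p - 2) * (p - 2) * (p - 2);
}
```
</pre>
<div>For the two runs discussed here, interior_ranks(5) gives 27 of 125 ranks (about 22%) and interior_ranks(10) gives 512 of 1000 ranks (about 51%).</div>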
<div><br>
</div>
<div>Mark</div>
<div><br>
</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Thu, May 24, 2018 at 5:10 AM,
Michael Becker <span dir="ltr"><<a
href="mailto:Michael.Becker@physik.uni-giessen.de"
target="_blank">Michael.Becker@physik.uni-giessen.de</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>CG/GCR: I accidentally kept gcr in the batch file.
That's still from when I was experimenting with the
different methods. The performance is quite similar
though.</p>
<p>I use the following setup for the ksp object and the
vectors:</p>
<blockquote><font size="-1">ierr=PetscInitialize(&argc,
&argv, (char*)0, (char*)0);CHKERRQ(ierr);<br>
<br>
ierr=KSPCreate(PETSC_COMM_WORLD,
&ksp);CHKERRQ(ierr);<br>
<br>
ierr=DMDACreate3d(PETSC_COMM_WORLD,
DM_BOUNDARY_GHOSTED, DM_BOUNDARY_GHOSTED, DM_BOUNDARY_GHOSTED,<br>
DMDA_STENCIL_STAR, g_Nx, g_Ny, g_Nz,
dims[0], dims[1], dims[2], 1, 1, l_Nx, l_Ny, l_Nz,
&da);CHKERRQ(ierr);<br>
ierr=DMSetFromOptions(da);CHKERRQ(ierr);<br>
ierr=DMSetUp(da);CHKERRQ(ierr);<br>
ierr=KSPSetDM(ksp, da);CHKERRQ(ierr);<br>
<br>
ierr=DMCreateGlobalVector(da, &b);CHKERRQ(ierr);<br>
<br>
ierr=VecDuplicate(b, &x);CHKERRQ(ierr);<br>
<br>
ierr=DMCreateLocalVector(da, &l_x);CHKERRQ(ierr);<br>
ierr=VecSet(x,0);CHKERRQ(ierr);<br>
ierr=VecSet(b,0);CHKERRQ(ierr);</font><br>
</blockquote>
<p>For the 125 case the arrays l_Nx, l_Ny, l_Nz have
dimension 5 and every element has value 30.
VecGetLocalSize() returns 27000 for every rank. Is there
something I didn't consider?</p>
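<p>The numbers in this paragraph can be checked against the DMDACreate3d requirement that the entries of each ownership array sum to the global size along that axis. Below is a minimal sketch in plain C, without PETSc; axis_total and local_size are hypothetical helper names, and the global size of 150 points per axis is inferred from 5 x 30 rather than stated in the thread.</p>
<pre>
```c
#include <assert.h>
#include <stddef.h>

/* Sum of the per-rank extents along one axis. For DMDACreate3d this sum
   has to equal the global grid size along that axis.
   (Hypothetical helper, not a PETSc function.) */
long axis_total(const int *extents, int nranks)
{
    long total = 0;
    assert(extents != NULL && nranks > 0);
    for (int i = 0; i < nranks; ++i)
        total += extents[i];
    return total;
}

/* Grid points owned by the rank at position (i, j, k) of the process
   grid, assuming one degree of freedom per point. */
long local_size(const int *lx, const int *ly, const int *lz,
                int i, int j, int k)
{
    assert(lx != NULL && ly != NULL && lz != NULL);
    return (long)lx[i] * ly[j] * lz[k];
}
```
</pre>
<p>With five entries of 30 in each array, axis_total returns 150 per axis and local_size returns 27000 for every rank, which matches the value reported by VecGetLocalSize().</p>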
<span class="HOEnZb"><font color="#888888">
<p>Michael<br>
</p>
</font></span>
<div>
<div class="h5">
<p><br>
</p>
<br>
<div class="m_5673279642174115124moz-cite-prefix">On
24.05.2018 at 09:39, Lawrence Mitchell wrote:<br>
</div>
<blockquote type="cite">
<blockquote type="cite">
<pre>On 24 May 2018, at 06:24, Michael Becker <a class="m_5673279642174115124moz-txt-link-rfc2396E" href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank"><Michael.Becker@physik.uni-giessen.de></a> wrote:
Could you have a look at the attached log_view files and tell me if something is particularly odd? The system size per processor is 30^3 and the simulation ran over 1000 timesteps, which means KSPSolve() was called equally often. I introduced two new logging stages - one for the first solve and the final setup and one for the remaining solves.
</pre>
</blockquote>
<pre>The two attached logs use CG for the 125 proc run, but gcr for the 1000 proc run. Is this deliberate?
125 proc:
-gamg_est_ksp_type cg
-ksp_norm_type unpreconditioned
-ksp_type cg
-log_view
-mg_levels_esteig_ksp_max_it 10
-mg_levels_esteig_ksp_type cg
-mg_levels_ksp_max_it 1
-mg_levels_ksp_norm_type none
-mg_levels_ksp_type richardson
-mg_levels_pc_sor_its 1
-mg_levels_pc_type sor
-pc_gamg_type classical
-pc_type gamg
1000 proc:
-gamg_est_ksp_type cg
-ksp_norm_type unpreconditioned
-ksp_type gcr
-log_view
-mg_levels_esteig_ksp_max_it 10
-mg_levels_esteig_ksp_type cg
-mg_levels_ksp_max_it 1
-mg_levels_ksp_norm_type none
-mg_levels_ksp_type richardson
-mg_levels_pc_sor_its 1
-mg_levels_pc_type sor
-pc_gamg_type classical
-pc_type gamg
That aside, it looks like you have quite a bit of load imbalance. e.g. in the smoother, where you're doing MatSOR, you have:
125 proc:
           Calls         Time         Max/Min time
MatSOR     47808 1.0     6.8888e+01   1.7
1000 proc:
MatSOR     41400 1.0     6.3412e+01   1.6
VecScatters show similar behaviour.
How is your problem distributed across the processes?
Cheers,
Lawrence
</pre>
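<p>The Max/Min column above is the ratio of the largest to the smallest per-rank time for an event. As a rough illustration of how such a ratio is formed, here is a minimal sketch in plain C; max_min_ratio is a hypothetical helper, and any per-rank timings passed to it would have to be collected separately, since the thread only contains the aggregated values.</p>
<pre>
```c
#include <assert.h>
#include <stddef.h>

/* Ratio of the slowest to the fastest rank for one event.
   1.0 means all ranks took the same time; larger values mean imbalance.
   (Hypothetical helper, not a PETSc function.) */
double max_min_ratio(const double *times, size_t n)
{
    assert(times != NULL && n > 0);
    double lo = times[0];
    double hi = times[0];
    for (size_t i = 1; i < n; ++i) {
        if (times[i] < lo) lo = times[i];
        if (times[i] > hi) hi = times[i];
    }
    assert(lo > 0.0); /* a zero minimum would make the ratio meaningless */
    return hi / lo;
}
```
</pre>
<p>For example, made-up timings of 40, 50 and 68 seconds on three ranks would give a ratio of about 1.7, the same order as the MatSOR value reported for the 125 process run.</p>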
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</body>
</html>