<div dir="ltr"><div>The KSPSolve time goes from 128 to 138 seconds in going from 125 to 1000 processes. Is this the problem?<br></div><div><br></div><div>And as Lawrence pointed out, there is a lot of "load" imbalance. (This could come from a poor network.) VecAXPY has no communication and still shows significant imbalance. You seem to have perfect actual load balance, so this can come from cache effects.</div><div><br></div><div>And you are spending almost half the solve time in VecScatter. If you really have this nice regular partitioning of the problem, then your communication is slow, even on 125 processors. (So it is not a scaling issue here, but if you do a one-processor test you should see it.)</div><div><br></div><div>Note that AMG coarse grids get bigger as the problem gets bigger, so it is not perfectly scalable pre-asymptotically. Nothing really is, because you don't saturate communication until you have at least a 3^D process grid, and various random things will cause some non-perfect weak speedup.</div><div><br></div><div>Mark</div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, May 24, 2018 at 5:10 AM, Michael Becker <span dir="ltr"><<a href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank">Michael.Becker@physik.uni-giessen.de</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<p>CG/GCR: I accidentally kept gcr in the batch file. That's still
from when I was experimenting with the different methods. The
performance is quite similar though.</p>
<p>I use the following setup for the ksp object and the vectors:</p>
<blockquote><font size="-1">ierr=PetscInitialize(&argc, &argv, (char*)0, (char*)0);CHKERRQ(ierr);<br>
<br>
ierr=KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);<br>
<br>
ierr=DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_GHOSTED, DM_BOUNDARY_GHOSTED, DM_BOUNDARY_GHOSTED,<br>
DMDA_STENCIL_STAR, g_Nx, g_Ny, g_Nz, dims[0], dims[1], dims[2], 1, 1, l_Nx, l_Ny, l_Nz, &da);CHKERRQ(ierr);<br>
ierr=DMSetFromOptions(da);CHKERRQ(ierr);<br>
ierr=DMSetUp(da);CHKERRQ(ierr);<br>
ierr=KSPSetDM(ksp, da);CHKERRQ(ierr);<br>
<br>
ierr=DMCreateGlobalVector(da, &b);CHKERRQ(ierr);<br>
<br>
ierr=VecDuplicate(b, &x);CHKERRQ(ierr);<br>
<br>
ierr=DMCreateLocalVector(da, &l_x);CHKERRQ(ierr);<br>
ierr=VecSet(x,0);CHKERRQ(ierr);<br>
ierr=VecSet(b,0);CHKERRQ(ierr);</font><br>
</blockquote>
<p>For the 125-process case, the arrays l_Nx, l_Ny, l_Nz each have length 5 and
every element has value 30. VecGetLocalSize() returns 27000 for
every rank. Is there something I didn't consider?</p><span class="HOEnZb"><font color="#888888">
<p>Michael<br>
</p></font></span><div><div class="h5">
<p><br>
</p>
<br>
<div class="m_5673279642174115124moz-cite-prefix">On 24.05.2018 at 09:39, Lawrence Mitchell wrote:<br>
</div>
<blockquote type="cite">
<blockquote type="cite">
<pre>On 24 May 2018, at 06:24, Michael Becker <a class="m_5673279642174115124moz-txt-link-rfc2396E" href="mailto:Michael.Becker@physik.uni-giessen.de" target="_blank"><Michael.Becker@physik.uni-<wbr>giessen.de></a> wrote:
Could you have a look at the attached log_view files and tell me if something is particularly odd? The system size per processor is 30^3 and the simulation ran over 1000 timesteps, which means KSPSolve() was called equally often. I introduced two new logging stages: one for the first solve and the final setup, and one for the remaining solves.
</pre>
</blockquote>
<pre>The two attached logs use CG for the 125 proc run, but gcr for the 1000 proc run. Is this deliberate?
125 proc:
-gamg_est_ksp_type cg
-ksp_norm_type unpreconditioned
-ksp_type cg
-log_view
-mg_levels_esteig_ksp_max_it 10
-mg_levels_esteig_ksp_type cg
-mg_levels_ksp_max_it 1
-mg_levels_ksp_norm_type none
-mg_levels_ksp_type richardson
-mg_levels_pc_sor_its 1
-mg_levels_pc_type sor
-pc_gamg_type classical
-pc_type gamg
1000 proc:
-gamg_est_ksp_type cg
-ksp_norm_type unpreconditioned
-ksp_type gcr
-log_view
-mg_levels_esteig_ksp_max_it 10
-mg_levels_esteig_ksp_type cg
-mg_levels_ksp_max_it 1
-mg_levels_ksp_norm_type none
-mg_levels_ksp_type richardson
-mg_levels_pc_sor_its 1
-mg_levels_pc_type sor
-pc_gamg_type classical
-pc_type gamg
That aside, it looks like you have quite a bit of load imbalance. e.g. in the smoother, where you're doing MatSOR, you have:
125 proc:
            Calls  Ratio  Time (s)    Max/Min time
MatSOR      47808  1.0    6.8888e+01  1.7
1000 proc:
MatSOR      41400  1.0    6.3412e+01  1.6
VecScatters show similar behaviour.
How is your problem distributed across the processes?
Cheers,
Lawrence
</pre>
</blockquote>
<br>
</div></div></div>
</blockquote></div><br></div>