<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Aug 14, 2019 at 2:19 PM Smith, Barry F. <<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

  Mark,<br>

<br>

    This is great, we can study these for months. <br>

<br>

1) At the top of the plots you say SNES, but that can't be right: there is no way it is getting such speedups for the entire SNES solve, since the Jacobians are computed on the CPUs and take much of the time. Do you mean the KSP part of the SNES solve? <br></blockquote><div><br></div><div>It uses KSPONLY, and the solve times are for KSPSolve with KSPSetUp called beforehand.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

2) For the case of a bit more than 1000 processes the speedup with GPUs is fantastic, more than 6?<br></blockquote><div><br></div><div>I did not see that one, but it is plausible and there is some noise in this data. The largest solve had a speedup of about 4x.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

3) People will ask about runs using all 48 CPUs; since they are there, it is a little unfair to compare only 24 CPUs with the GPUs. Presumably, due to memory bandwidth limits, 48 won't be much better than 24, but you need it in your back pocket for completeness.<br>

<br></blockquote><div><br></div><div>Ah, good point. I just cut and pasted, but I should run a little test and see where it saturates.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

4) From the table<br>

<br>

KSPSolve               1 1.0 5.4191e-02 1.0 9.35e+06 7.3 1.3e+04 5.6e+02 8.3e+01  0  0  4  0  3  10 57 97 52 81  1911    3494    114 3.06e-01  129 1.38e-01 84<br>

PCApply               17 1.0 4.5053e-02 1.0 9.22e+06 8.5 1.1e+04 5.6e+02 3.4e+01  0  0  3  0  1   8 49 81 44 33  1968    4007     98 2.58e-01  113 1.19e-01 81<br>

<br>

Only 84 percent of the total flops in the KSPSolve are on the GPU, and only 81 percent for PCApply(). Where are the rest? MatMult() etc. are doing 100 percent on the GPU, and MatSolve on the coarsest level should be tiny, not taking 19 percent of the flops.<br>

<br></blockquote><div><br></div><div>That is the smallest test, with 3465 equations on 24 cores. The R and P and the coarse grid are on the CPU. Look at the larger tests.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  Thanks<br>

<br>

   Barry<br>

<br>

<br>

> On Aug 14, 2019, at 12:45 PM, Mark Adams <<a href="mailto:mfadams@lbl.gov" target="_blank">mfadams@lbl.gov</a>> wrote:<br>

> <br>

> FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU speedup with 98K dof/proc (3D Q2 elasticity).<br>

> <br>

> This is weak scaling of a solve. There is growth in iteration count folded in here. I should put rtol in the title and/or run a fixed number of iterations and make it clear in the title.<br>

> <br>

> Comments welcome.<br>

> &lt;out_cpu_012288&gt;&lt;out_cpu_001536&gt;&lt;out_cuda_012288&gt;&lt;out_cpu_000024&gt;&lt;out_cpu_000192&gt;&lt;out_cuda_001536&gt;&lt;out_cuda_000192&gt;&lt;out_cuda_000024&gt;&lt;weak_scaling_cpu.png&gt;&lt;weak_scaling_cuda.png&gt;<br>

<br>

</blockquote></div></div>
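<div>For reference, the GPU/CPU flop split discussed in item 4 can be read straight off the two quoted log rows. This is only a rough sketch: it assumes the last column of each row is the "GPU %F" figure (which matches the 84 and 81 percent quoted above), and the helper names <code>gpu_flop_split</code> and <code>summarize</code> are made up for illustration and are not part of PETSc.</div>

```python
# Two rows copied verbatim from the log table quoted in item 4.
LOG_ROWS = """\
KSPSolve               1 1.0 5.4191e-02 1.0 9.35e+06 7.3 1.3e+04 5.6e+02 8.3e+01  0  0  4  0  3  10 57 97 52 81  1911    3494    114 3.06e-01  129 1.38e-01 84
PCApply               17 1.0 4.5053e-02 1.0 9.22e+06 8.5 1.1e+04 5.6e+02 3.4e+01  0  0  3  0  1   8 49 81 44 33  1968    4007     98 2.58e-01  113 1.19e-01 81
"""


def gpu_flop_split(row):
    """Return (event, gpu_percent, cpu_percent) for one log row.

    Assumption: the last whitespace-separated field is the GPU %F column,
    so the remainder out of 100 is the share of flops done on the CPU.
    """
    fields = row.split()
    event = fields[0]
    gpu_pct = int(fields[-1])
    return event, gpu_pct, 100 - gpu_pct


def summarize(text):
    """Apply gpu_flop_split to every non-empty line of the pasted table."""
    return [gpu_flop_split(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    for event, gpu, cpu in summarize(LOG_ROWS):
        print(f"{event:10s} GPU {gpu:3d}%  CPU {cpu:3d}%")
```

<div>Running the same helper over the rows from the larger tests would show whether the CPU share (R, P, and the coarse grid) shrinks as the problem grows.</div>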