[petsc-dev] [petsc-maint] running CUDA on SUMMIT

Mark Adams mfadams at lbl.gov
Wed Aug 14 15:46:41 CDT 2019


On Wed, Aug 14, 2019 at 2:19 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:

>
>   Mark,
>
>     This is great, we can study these for months.
>
> 1) At the top of the plots you say SNES, but that can't be right; there is
> no way it is getting such speedups for the entire SNES solve, since the
> Jacobians are computed on the CPU and take much of the time. Do you mean
> the KSP part of the SNES solve?
>

It uses KSPONLY, so the SNES solve is effectively just the linear solve. The
reported solve times are for KSPSolve, with KSPSetUp called beforehand.
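For context, a KSP-only GPU run might be launched with options along these lines. This is a sketch: the executable name and the particular solver choices are illustrative placeholders, though the option names themselves are standard PETSc.

```shell
# SNESKSPONLY makes the SNES solve a single linear solve, so KSPSolve
# timings are not polluted by CPU-side Jacobian assembly.
# aijcusparse/cuda move Mat and Vec operations to the GPU.
# "./app" and the cg/gamg choices are placeholders, not the actual run.
mpiexec -n 24 ./app \
  -snes_type ksponly \
  -ksp_type cg -pc_type gamg \
  -mat_type aijcusparse -vec_type cuda \
  -log_view
```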


>
> 2) For the case of a bit more than 1000 processes the speedup with GPUs is
> fantastic, more than 6?
>

I did not see that one, but it is plausible, and there is some noise in this
data. The largest solve had a speedup of about 4x.
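For reference, the headline speedup is just the ratio of CPU to GPU KSPSolve times; a tiny helper makes the arithmetic explicit. The timings below are made-up placeholders, not the actual SUMMIT numbers.

```python
def speedup(t_cpu: float, t_gpu: float) -> float:
    """CPU-over-GPU speedup for one KSPSolve; > 1 means the GPU run is faster."""
    return t_cpu / t_gpu

# Placeholder timings in seconds, for illustration only.
print(speedup(8.0, 2.0))  # -> 4.0
```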


>
> 3) People will ask about runs using all 48 CPUs; since they are there, it
> is a little unfair to compare only 24 CPUs with the GPUs. Presumably, due
> to memory bandwidth limits, 48 won't be much better than 24, but you need
> it in your back pocket for completeness.
>
>
Ah, good point. I just cut and pasted, but I should run a little test to see
where it saturates.
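On SUMMIT that saturation test could be a pair of jsrun launches; the layout below is a plausible sketch. The jsrun flags are standard Summit usage, but the executable and counts are illustrative, and SUMMIT nodes expose 42 usable cores per node rather than 48.

```shell
# 24 ranks per node, as in the current plots ("./app" is a placeholder).
jsrun -n 1 -a 24 -c 24 -g 0 ./app -log_view
# All usable cores on the node, to see where memory bandwidth saturates.
jsrun -n 1 -a 42 -c 42 -g 0 ./app -log_view
```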


> 4) From the table
>
> KSPSolve               1 1.0 5.4191e-02 1.0 9.35e+06 7.3 1.3e+04 5.6e+02 8.3e+01  0  0  4  0  3  10 57 97 52 81  1911    3494    114 3.06e-01  129 1.38e-01 84
> PCApply               17 1.0 4.5053e-02 1.0 9.22e+06 8.5 1.1e+04 5.6e+02 3.4e+01  0  0  3  0  1   8 49 81 44 33  1968    4007     98 2.58e-01  113 1.19e-01 81
>
> only 84 percent of the total flops in the KSPSolve are on the GPU, and
> only 81 percent for PCApply(). Where are the rest? MatMult() etc. are
> doing 100 percent on the GPU, and MatSolve on the coarsest level should be
> tiny, not taking 19 percent of the flops?
>
>
That is the smallest test, with 3465 equations on 24 cores. The R and P
(restriction and prolongation) and the coarse grid are on the CPU. Look at
the larger tests.
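The 84 percent column is PETSc's own accounting, but the implied CPU-side share can be checked by hand from the KSPSolve row above (9.35e+06 flops, trailing "%F GPU" column of 84):

```python
# Back-of-the-envelope check of the log columns above.
total_flops = 9.35e6        # max flops per rank for KSPSolve
gpu_fraction = 0.84         # fraction of those flops executed on the GPU

cpu_flops = total_flops * (1.0 - gpu_fraction)
print(f"{cpu_flops:.3e} flops on the CPU")  # R, P, and the coarse-grid solve
```

That remaining ~1.5e6 flops is consistent with the restriction, prolongation, and coarse-grid work staying on the CPU in this small test.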


>   Thanks
>
>    Barry
>
>
> > On Aug 14, 2019, at 12:45 PM, Mark Adams <mfadams at lbl.gov> wrote:
> >
> > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU
> speedup with 98K dof/proc (3D Q2 elasticity).
> >
> > This is weak scaling of a solve. There is growth in iteration count
> folded in here. I should put rtol in the title and/or run a fixed number of
> iterations and make it clear in the title.
> >
> > Comments welcome.
> >
> <out_cpu_012288><out_cpu_001536><out_cuda_012288><out_cpu_000024><out_cpu_000192><out_cuda_001536><out_cuda_000192><out_cuda_000024><weak_scaling_cpu.png><weak_scaling_cuda.png>
>
>

