[petsc-dev] [petsc-maint] running CUDA on SUMMIT

Smith, Barry F. bsmith at mcs.anl.gov
Wed Aug 14 13:19:27 CDT 2019


  Mark,

    This is great; we can study these for months. 

1) At the top of the plots you say SNES, but that can't be right; there is no way it is getting such speedups for the entire SNES solve, since the Jacobians are computed on the CPUs and take much of the time. Do you mean the KSP part of the SNES solve? 
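
  (Not from the runs above -- just a minimal sketch, using the standard PETSc logging calls, of how one can put the nonlinear solve in its own log stage so that -log_view reports, per stage, how much of the time and flops go to KSPSolve versus SNESJacobianEval. The function and stage names here are made up, and the SNES/Vec setup is assumed to exist already.)

    #include <petscsnes.h>

    /* Hypothetical helper: wrap the nonlinear solve in a user-defined log stage. */
    PetscErrorCode SolveWithStage(SNES snes, Vec x)
    {
      PetscLogStage  stage;
      PetscErrorCode ierr;

      PetscFunctionBeginUser;
      ierr = PetscLogStageRegister("Nonlinear Solve", &stage);CHKERRQ(ierr);
      ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
      ierr = SNESSolve(snes, NULL, x);CHKERRQ(ierr);  /* KSPSolve and SNESJacobianEval are logged inside */
      ierr = PetscLogStagePop();CHKERRQ(ierr);
      /* In -log_view, the "Stage" columns then show what fraction of this
         stage's time and flops the KSP part of the SNES solve accounts for. */
      PetscFunctionReturn(0);
    }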

2) For the case with a bit more than 1000 processes, the speedup with the GPUs is fantastic, more than 6x?

3) People will ask about runs using all 48 CPU cores; since they are there, it is a little unfair to compare only 24 cores with the GPUs. Presumably, due to memory bandwidth limits, 48 won't be much better than 24, but you need it in your back pocket for completeness.

4) From the table

KSPSolve               1 1.0 5.4191e-02 1.0 9.35e+06 7.3 1.3e+04 5.6e+02 8.3e+01  0  0  4  0  3  10 57 97 52 81  1911    3494    114 3.06e-01  129 1.38e-01 84
PCApply               17 1.0 4.5053e-02 1.0 9.22e+06 8.5 1.1e+04 5.6e+02 3.4e+01  0  0  3  0  1   8 49 81 44 33  1968    4007     98 2.58e-01  113 1.19e-01 81

only 84 percent of the total flops in the KSPSolve are on the GPU (the last column, the percentage of flops performed on the GPU) and only 81 percent for the PCApply(); where are the rest? MatMult() etc. are doing 100 percent of their flops on the GPU, and the MatSolve on the coarsest level should be tiny, so it should not be taking 19 percent of the flops?
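
  (A hedged sketch, not from these runs, of one way to check whether the coarse-level solve is the piece that stays on the CPU: view the coarse KSP that GAMG builds through its PCMG layer and print the type of its operator. The helper name is made up; it assumes pc is the already set up GAMG PC.)

    #include <petscksp.h>

    /* Hypothetical helper: report where the coarsest-level solve lives. */
    PetscErrorCode ViewCoarseSolve(PC pc)
    {
      KSP            coarse;
      Mat            Amat;
      MatType        mtype;
      PetscErrorCode ierr;

      PetscFunctionBeginUser;
      ierr = PCMGGetCoarseSolve(pc, &coarse);CHKERRQ(ierr);            /* GAMG sets up a PCMG hierarchy */
      ierr = KSPView(coarse, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr); /* coarse KSP/PC, typically an LU-type solve */
      ierr = KSPGetOperators(coarse, &Amat, NULL);CHKERRQ(ierr);
      ierr = MatGetType(Amat, &mtype);CHKERRQ(ierr);                   /* e.g. "seqaij" => CPU, "seqaijcusparse" => GPU */
      ierr = PetscPrintf(PETSC_COMM_WORLD, "coarse-level operator type: %s\n", mtype);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

  If the coarse operator reports a plain aij type, its MatSolve flops stay on the CPU, which would show up exactly as a GPU flop fraction below 100 percent in the KSPSolve and PCApply lines.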

  Thanks

   Barry


> On Aug 14, 2019, at 12:45 PM, Mark Adams <mfadams at lbl.gov> wrote:
> 
> FYI, here is some scaling data of GAMG on SUMMIT. I am getting about 4x GPU speedup with 98K dof/proc (3D Q2 elasticity).
> 
> This is weak scaling of a solve; growth in the iteration count is folded in here. I should put rtol in the title, and/or run a fixed number of iterations and make that clear in the title.
> 
> Comments welcome.
> [Attachments: out_cpu_012288, out_cpu_001536, out_cuda_012288, out_cpu_000024, out_cpu_000192, out_cuda_001536, out_cuda_000192, out_cuda_000024, weak_scaling_cpu.png, weak_scaling_cuda.png]


