[petsc-dev] Feedback on report on performance of vector operations on Summit requested

Smith, Barry F. bsmith at mcs.anl.gov
Thu Oct 31 22:22:14 CDT 2019


  Jed,

    Thanks, this is very useful.

  Barry


    

> On Oct 31, 2019, at 11:47 AM, Jed Brown <jed at jedbrown.org> wrote:
> 
> "Smith, Barry F." <bsmith at mcs.anl.gov> writes:
> 
>>> On Oct 23, 2019, at 7:15 PM, Jed Brown <jed at jedbrown.org> wrote:
>>> 
>>> IMO, Figures 2 and 7+ are more interesting when the x axis (vector size)
>>> is replaced by execution time.  
>> 
>> 
>>> We don't scale by fixing the resource
>>> and increasing the problem size, we choose the global problem size based
>>> on accuracy/model complexity and choose a Pareto tradeoff of execution
>>> time with efficiency (1/cost) to decide how many nodes to use.  Most of
>>> those sloping tails on the left become vertical lines under that
>>> transformation.
>> 
>>   I don't see the connection between your first sentence and the other sentences.
>> 
>>   How does the plot with time instead of size tell you what number of processors to use?
> 
> The point is that in the planning stage, you don't care how many
> processors are used, you care whether the machine is capable of solving
> problem P in time T.  After determining that, you want to know how much
> it will cost so you can apply for an allocation.  Only once you have an
> allocation and need to configure input parameters for a particular model
> do you care how many elements per process and how many processes in total.
> 
>>   I don't understand the plots with x as a time axis, so I suspect most potential readers won't either. The only point of the plots is really to give an idea of the scale of the performance, and that performance is low except for large sizes, so I will keep the plot axis as is.
> 
> Compare these two figures.  When plotting versus size, you see a long
> tail to the left, but you can't tell whether it's getting faster.  That
> makes the claim that lower latency is a specific capability squishy and
> imprecise, while plotting versus time is directly relevant.  You can
> say we have n microseconds to do a complex task (a time step of a
> model), and on architecture A each VecDot takes at least k microseconds
> no matter how we scale it.
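> 
> As a concrete illustration, a script along the following lines would
> produce both views from the same data.  This is only a sketch: the
> input file name, column layout, and units here are assumptions, not
> the actual benchmark output.
> 
>     import numpy as np
>     import matplotlib.pyplot as plt
> 
>     # Hypothetical input: one row per run, columns = vector size and
>     # VecDot execution time in seconds.
>     size, time = np.loadtxt("vecdot.dat", unpack=True)
>     rate = size / time          # achieved rate, entries/second
> 
>     # Usual view: rate versus vector size.
>     plt.loglog(size, rate)
>     plt.xlabel("vector size")
>     plt.ylabel("entries/second")
>     plt.savefig("VecDot_size.png")
> 
>     # Same data with execution time on the x axis; a latency floor
>     # shows up as a vertical line instead of a sloping tail.
>     plt.figure()
>     plt.loglog(time, rate)
>     plt.xlabel("execution time (s)")
>     plt.ylabel("entries/second")
>     plt.savefig("VecDot_time.png")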
> 
> In these figures, we can read off that VecDot completes 8x faster on
> the CPU than on the GPU, that the GPU is useless if your time budget is
> less than 100 microseconds, and that it is clearly preferable if you
> have at least 200 microseconds.  We know that intrinsic MPI_Allreduce
> latency is about 15 microseconds on a nice machine at any scale (BG,
> etc.), so if we had an application where MPI_Allreduce was limiting
> performance on a previous problem configuration/architecture, then
> it'll hurt 6x as much here.
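> 
> (To make the 6x explicit, assuming the GPU VecDot latency floor read
> off the time-axis figure is about 90 microseconds: 90 / 15 = 6, i.e.,
> the dot product's intrinsic latency on this architecture is six times
> the reference MPI_Allreduce latency.)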
> 
> [Attached figures: VecDot_CPU_vs_GPU_time.png, VecDot_CPU_vs_GPU_size.png]
> 
> Hannah, could you please give me access to push?  I modified the script
> to make both kinds of plots.


