[petsc-dev] Feed back on report on performance of vector operations on Summit requested

Jed Brown jed at jedbrown.org
Thu Oct 31 11:47:13 CDT 2019


"Smith, Barry F." <bsmith at mcs.anl.gov> writes:

>> On Oct 23, 2019, at 7:15 PM, Jed Brown <jed at jedbrown.org> wrote:
>> 
>> IMO, Figures 2 and 7+ are more interesting when the x axis (vector size)
>> is replaced by execution time.  
>
>
>> We don't scale by fixing the resource
>> and increasing the problem size, we choose the global problem size based
>> on accuracy/model complexity and choose a Pareto tradeoff of execution
>> time with efficiency (1/cost) to decide how many nodes to use.  Most of
>> those sloping tails on the left become vertical lines under that
>> transformation.
>
>    I don't see the connection between your first sentence and the other sentences.
>
>    How does the plot with time instead of size tell you what number of processors to use?

The point is that in the planning stage, you don't care how many
processors are used, you care whether the machine is capable of solving
problem P in time T.  After determining that, you want to know how much
it will cost so you can apply for an allocation.  Only once you have an
allocation and need to configure input parameters for a particular model
do you care about how many elements per process and how many processes to use in total.
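
To make that concrete, here is a rough sketch (in Python, with made-up
numbers) of that planning view: given measured (nodes, time per run)
pairs for a fixed problem, keep the configurations that meet the time
target and compare their cost.  None of the values below come from the
report; they are placeholders.

# Planning view: which configurations solve problem P within time T,
# and what does each cost?  All numbers here are hypothetical.
measurements = [  # (nodes, seconds per run)
    (16, 410.0),
    (32, 220.0),
    (64, 130.0),
    (128, 90.0),
]

T = 150.0  # required time per run, in seconds (made up)

feasible = [(n, t) for n, t in measurements if t <= T]
if not feasible:
    print("No measured configuration meets the time target.")
else:
    for n, t in feasible:
        cost = n * t / 3600.0  # node-hours per run
        print(f"{n:4d} nodes: {t:6.1f} s/run, {cost:.2f} node-hours/run")
    # Cheapest configuration that is still fast enough -- the Pareto choice.
    best = min(feasible, key=lambda nt: nt[0] * nt[1])
    print(f"Cheapest feasible configuration: {best[0]} nodes")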

>    I don't understand the plots with x as a time axis, so I suspect most potential readers won't either. The only point of the plots is really to give an idea of the scale of the performance, and that performance is low except for large sizes, so I will keep the plot axis as is.

Compare these two figures.  When plotting versus size, you see a long
tail to the left, but you can't tell whether it's getting faster.  That
view turns the claim that lower latency is a specific capability into a
squishy and imprecise concept, while plotting versus time is directly
relevant.  You can say we have n microseconds to do a complex task (a
time step of a model), and here each VecDot takes at least k microseconds
on architecture A no matter how we scale it.
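
For reference, here is roughly what the two views look like side by
side.  This matplotlib sketch uses synthetic timing curves (the latency
floors and rates are made up), not the Summit data from the report.

# Two views of the same VecDot data: rate versus vector size, and rate
# versus execution time per call.  The timing model below is synthetic.
import numpy as np
import matplotlib.pyplot as plt

size = np.logspace(3, 8, 30)
t_cpu = 2e-6 + size / 5e9    # ~2 us latency floor, made-up asymptotic rate
t_gpu = 90e-6 + size / 7e10  # ~90 us launch/sync floor, faster asymptotic rate (made up)

fig, (ax_size, ax_time) = plt.subplots(1, 2, figsize=(10, 4))

# Conventional view: rate versus vector size.  The left tail slopes down,
# but you cannot read a latency floor off it directly.
ax_size.loglog(size, size / t_cpu, label="CPU")
ax_size.loglog(size, size / t_gpu, label="GPU")
ax_size.set(xlabel="vector size", ylabel="entries/second")

# Time view: rate versus time per call.  The latency floors become
# vertical walls, so the minimum achievable VecDot time is readable.
ax_time.loglog(t_cpu, size / t_cpu, label="CPU")
ax_time.loglog(t_gpu, size / t_gpu, label="GPU")
ax_time.set(xlabel="seconds per VecDot", ylabel="entries/second")

ax_size.legend(); ax_time.legend()
plt.tight_layout()
plt.show()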

In these figures, we can read off that VecDot can complete 8x faster on
the CPU than on the GPU, that the GPU is useless if your time budget is
less than 100 microseconds, and that it is clearly preferable if you
have at least 200 microseconds.  We know that intrinsic MPI_Allreduce
latency is about 15 microseconds on a nice machine at any scale (BG,
etc.), so if we had an application where MPI_Allreduce was limiting
performance on a previous problem configuration/architecture, it will
hurt 6x as much here.
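
As a quick sanity check on those ratios, using the round numbers above
and assuming a ~90 microsecond VecDot floor on the GPU (an assumption,
not a measurement from the report):

# Back-of-the-envelope check of the latency comparison, with the rough
# numbers quoted above treated as approximate round figures.
allreduce_floor_us = 15.0   # intrinsic MPI_Allreduce latency on a "nice" machine
gpu_vecdot_floor_us = 90.0  # assumed minimum VecDot time on the GPU

budget_us = 100.0           # per-step time budget for the whole task
print(f"VecDots per budget on GPU: {budget_us / gpu_vecdot_floor_us:.1f}")
print(f"GPU floor / Allreduce floor: {gpu_vecdot_floor_us / allreduce_floor_us:.1f}x")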

[Attachment: VecDot_CPU_vs_GPU_time.png (image/png) <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20191031/500e34d9/attachment-0002.png>]
[Attachment: VecDot_CPU_vs_GPU_size.png (image/png) <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20191031/500e34d9/attachment-0003.png>]


Hannah, could you please give me access to push?  I modified the script
to make both kinds of plots.

