[petsc-users] strong-scaling vs weak-scaling

Jed Brown jed at jedbrown.org
Mon Aug 29 00:14:49 CDT 2016


Justin Chang <jychang48 at gmail.com> writes:

> Redid some of those experiments for 8 and 20 cores and scaled it up to even
> larger problems. Attached is the plot.
>
> Looking at this "dynamic plot" (if you ask me, I honestly think there could
> be a better word for this out there), 

"performance spectrum"?

> the lines curve up for the smaller problems, have a "flat line" in the
> middle, then slowly tail down as the problem gets bigger. I am
> guessing these downward curves have to do with either memory bandwidth
> effects or simply the solver requiring more effort to handle larger
> problems (or a combination of both). 

The trailing off to the right typically means suboptimal algorithmic
convergence.  Steps degrading performance to the right usually mean
dropping out of cache.
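
One way to tell the two apart in your data is to plot, next to
dofs/time, a rate normalized by the iteration count, e.g.
(dofs * KSP iterations)/time: if the tail-off disappears in the
normalized curve it is algorithmic; if it persists it is memory/cache.
A minimal sketch, assuming you have (N, iterations, time) triples from
your runs (the numbers below are placeholders, not your measurements):

    import matplotlib.pyplot as plt

    # (N, KSP iterations, solve time in seconds) per run; placeholder values,
    # substitute the numbers from your own -log_view / -ksp_monitor output.
    runs = [(1331, 9, 0.004), (132651, 25, 0.21), (3442951, 48, 6.8)]

    N     = [n for n, it, t in runs]
    rate  = [n / t for n, it, t in runs]        # dofs solved per second
    nrate = [n * it / t for n, it, t in runs]   # iteration-normalized rate

    plt.loglog(N, rate, 'o-', label='dofs/sec')
    plt.loglog(N, nrate, 's--', label='dofs*iterations/sec')
    plt.xlabel('problem size N (dofs)')
    plt.ylabel('rate')
    plt.legend()
    plt.show()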

> I currently only have access to a small 80-node (20 cores per node)
> HPC cluster so obviously I am unable to experiment with 10k cores or
> more.
>
> If our goal is to see how close to flat the lines get, we can easily game the
> system by scaling the problem until we find the "sweet spot(s)". 

This performance spectrum plot spans across those sweet spots, by
design.  I'm not sure how it would be "gamed".
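
For concreteness: the spectrum data comes from sweeping the problem
size at each fixed core count, so the sweep necessarily crosses the
latency-dominated, "sweet spot", and bandwidth/algorithm-limited
regimes rather than sitting at a single favorable point.  Here is a
driver sketch in the spirit of your CG/BoomerAMG study; the executable
name ./ex_diffusion and its -grid_n option are stand-ins for your
application, while the KSP/PC options are standard PETSc:

    import subprocess

    for p in [8, 20]:                    # MPI processes
        for n in range(11, 102, 10):     # cube edge 11..101 -> N = n**3 (1,331 .. 1,030,301)
            # './ex_diffusion' and '-grid_n' are hypothetical; replace with your
            # application and its mesh-size option.
            cmd = ['mpiexec', '-n', str(p), './ex_diffusion', '-grid_n', str(n),
                   '-ksp_type', 'cg', '-pc_type', 'hypre', '-pc_hypre_type', 'boomeramg',
                   '-ksp_converged_reason', '-log_view']
            subprocess.run(cmd, check=True)

The N, iteration count, and solve time for each point can then be read
off the -ksp_converged_reason and -log_view output.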

> In the weak-scaling and strong-scaling studies there are perfect lines
> we can compare to, 

But those "perfect" lines are all relative to a base case.  If the base
case is inefficient, the scaling looks better.  This makes it trivial to
show near-perfect scaling despite having a terrible
algorithm/implementation -- the scaling gets more perfect as the
algorithm gets worse.  (This is perhaps the most critical thing to
recognize about the classical strong and weak scaling plots.)
Demonstrating that the base case is "optimal" is not easy, rarely
actually the case, and often overlooked by authors and readers alike.
Those plots mean nothing without a great deal of context and careful
reading of scales.
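
A toy example with made-up timings shows how easy this is: measured
against its own base case, the slower code below looks nearly ideal
while being seven times slower in absolute terms.

    # Made-up timings purely to illustrate the point: efficiency relative
    # to a base case rewards a slow base case.
    def strong_efficiency(t1, tp, p):
        return t1 / (p * tp)

    # "fast" code: 10 s serial, 1.0 s on 16 cores
    # "slow" code: 100 s serial, 7.0 s on 16 cores
    print(strong_efficiency(10.0, 1.0, 16))    # 0.625 -> "only" 62% efficient
    print(strong_efficiency(100.0, 7.0, 16))   # ~0.89 -> near-ideal, yet 7x slower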

> but there do not seem to be such lines for this type of study, even
> in the seemingly flat regions. Seems these plots are useful if we
> simply compare different solvers/preconditioners/etc or different HPC
> platforms.
>
> Also, the solver iteration count increases with problem size - it went from
> 9 KSP iterations for 1,331 dofs to 48 KSP iterations for 3,442,951 dofs.

That suboptimal algorithmic scaling is important, and it is one of the
areas in which you could most improve performance.
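
From the two endpoints you report (9 iterations at 1,331 dofs, 48 at
3,442,951), the iteration count grows roughly like N^0.2, so total work
grows roughly like N^1.2 instead of the O(N) you would want:

    from math import log

    # Fit iterations ~ N**alpha to the two reported data points.
    alpha = log(48 / 9) / log(3442951 / 1331)
    print(round(alpha, 2))   # ~0.21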

> Algorithmic time-to-solution is not linearly proportional to problem size
> so the RHS of the graph is obviously going to have lower N/time rates at
> some point - similar to what we observe from weak-scaling.
>
> Also, the N/time rate seems very similar to the floating-point rate,
> although I can see why it's more informative.

Not really -- the floating point rate probably doesn't taper off to the
right because you did floating point work on every iteration (and you
need more iterations for the larger problem sizes that appear to the
right).  Meanwhile, the N/time metric is relevant to an end user who
cares about their science or engineering objective rather than bragging
about using flops.
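
The gap between the two metrics is exactly the iteration growth: flops
per solve scale roughly like (flops per dof per iteration) * iterations
* N, so the extra iterations still count as "useful" flops even though
they deliver no additional dofs.  A small sketch with hypothetical
constants, only to show the ratio:

    def rates(N, iters, time, c=100.0):
        # c is a hypothetical flops-per-dof-per-iteration constant; only the
        # ratio between the two rates matters here, not the absolute numbers.
        flop_rate = c * iters * N / time   # roughly what a flop counter reports
        dof_rate  = N / time               # what the end user actually gets
        return flop_rate, dof_rate

    # Hypothetical bandwidth-limited timing: time proportional to iters * N.
    print(rates(1331, 9, 9 * 1331 * 1e-6))          # flop rate 1e8, ~111,000 dofs/sec
    print(rates(3442951, 48, 48 * 3442951 * 1e-6))  # same flop rate, ~21,000 dofs/sec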

> Any thoughts on anything I said or did thus far? Just wanting to make sure
> I understand these correctly :)
>
> On Mon, Aug 22, 2016 at 9:03 PM, Justin Chang <jychang48 at gmail.com> wrote:
>
>> Thanks all. So this issue was one of our ATPESC2015 exam questions, and
>> turned some friends into foes. Most eventually fell into the "strong-scaling
>> is harder" camp, but some of these "friends" also believed PETSc is *not*
>> capable of handling dense matrices and is not portable. Just wanted to hear
>> some expert opinions on this :)
>>
>> Anyway, in one of my applications, I am comparing the performance of some
>> VI solvers (i.e., with variable bounds) with that of just standard linear
>> solves (i.e., no variable bounds) for 3D advection-diffusion equations in
>> highly heterogeneous and anisotropic porous media. The parallel efficiency
>> in the strong-sense is roughly the same but the parallel efficiency in the
>> weak-sense is significantly worse for VI solvers. I suppose one inference
>> that can be made is that VI solvers take longer to solve as the problem
>> size grows. And yes, solver iteration counts do grow, so that has something to do
>> with it.
>>
>> As for these "dynamic range" plots, I tried something like this across 1
>> and 8 MPI processes with the following problem sizes for a 3D anisotropic
>> diffusion problem with CG/BoomerAMG:
>>
>> 1,331
>> 9,261
>> 29,791
>> 68,921
>> 132,651
>> 226,981
>> 357,911
>> 531,441
>> 753,571
>> 1,030,301
>>
>> Using a single Intel Xeon E5-2670 compute node for this. Attached is the
>> plot, but instead of flat or inclined lines, I get concave-down curves. If
>> my problem size gets too big, the N/time rate decreases, whereas for very
>> small problems it increases. I am guessing bandwidth limitations have
>> something to do with the decrease in performance. In that HPGMG
>> presentation you attached the other day, it seems the rate should decrease
>> as problem size decreases. Perhaps this study should be done with more MPI
>> processes?
>>
>>
>> On Mon, Aug 22, 2016 at 4:14 PM, Karl Rupp <rupp at iue.tuwien.ac.at> wrote:
>>
>>> Hi Justin,
>>>
>>>
>>> I have seen some people claim that strong-scaling is harder to achieve
>>>> than weak scaling
>>>> (e.g., https://www.sharcnet.ca/help/index.php/Measuring_Parallel_Scaling_Performance)
>>>> and generally speaking it makes sense - communication overhead increases
>>>> with concurrency.
>>>>
>>>> However, we know that most PETSc solvers/applications are not only
>>>> memory-bandwidth bound, but may not scale as well w.r.t. problem size as
>>>> other solvers (e.g., ILU(0) may beat out GAMG for small elliptic
>>>> problems but GAMG will eventually beat out ILU(0) for larger problems),
>>>> so wouldn't weak-scaling be not only the more interesting but also the more
>>>> difficult performance metric to achieve? Strong-scaling issues arise
>>>> mostly from communication overhead but weak-scaling issues may come from
>>>> that and also solver/algorithmic scalability w.r.t problem size (e.g.,
>>>> problem size N takes 10*T seconds to compute but problem size 2*N takes
>>>> 50*T seconds to compute).
>>>>
>>>> In other words, if one were to propose or design a new algorithm/solver
>>>> capable of handling large-scale problems, would it be equally if not
>>>> more important to show the weak-scaling potential? Because if you really
>>>> think about it, a "truly efficient" algorithm will be less likely to
>>>> scale in the strong sense but computation time will be close to linearly
>>>> proportional to problem size (hence better scaling in the weak-sense).
>>>> It seems if I am trying to convince someone that a proposed
>>>> computational framework is "high performing" without getting too deep
>>>> into performance modeling, a poor parallel efficiency (arising due to
>>>> good sequential efficiency) in the strong sense may not look promising.
>>>>
>>>
>>> These are all valid thoughts. Let me add another perspective: If you are
>>> only interested in the machine aspects of scaling, you could run for a
>>> fixed number of solver iterations. That allows you to focus on the actual
>>> computational work done and your results will exclusively reflect the
>>> machine's performance. Thus, even though fixing solver iterations and hence
>>> not running solvers to convergence is a bad shortcut from the solver point
>>> of view, it can be a handy way of eliminating algorithmic fluctuations.
>>> (Clearly, this simplistic approach has not only been used, but also
>>> abused...)
>>>
>>> Best regards,
>>> Karli
>>>
>>>
>>
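
On Karli's fixed-iteration-count suggestion: with command-line options
this amounts to appending something like the following to the runs in
the sweep sketch further up (these option names should exist in any
recent PETSc, but double-check against your version; 50 iterations is
an arbitrary choice):

    # Run exactly 50 iterations and skip the convergence test, so every
    # problem size does the same kind of per-iteration work.
    fixed_work_args = ['-ksp_max_it', '50', '-ksp_convergence_test', 'skip']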