[petsc-users] Obtaining bytes per second

Matthew Knepley knepley at gmail.com
Wed May 6 10:39:30 CDT 2015


On Wed, May 6, 2015 at 10:29 AM, Justin Chang <jychang48 at gmail.com> wrote:

> Perhaps I should explain what my ultimate goal is.
>
> I have two sets of code. The first is the classical FEM, which gives
> numerically "incorrect" answers; the second is the classical FEM employing
> convex optimization through TAO, which fixes these "incorrect" answers. We
> have shown numerically that the latter should always be the way to solve
> these types of problems, but now I want to compare the computational
> performance of the two methodologies for large problems.
>
> I already have speedup/strong-scaling results that essentially depict the
> difference between KSPSolve() and TaoSolve(). However, I have been told by
> someone that strong scaling isn't enough - that I should somehow include
> something to show the "efficiency" of these two methodologies. That is, how
> much of the wall-clock time reported by these two very different solvers is
> spent doing useful work.
>
> Is such an "efficiency" metric necessary to report in addition to the
> strong-scaling results? The overall computational framework is the same for
> both problems; the only difference is that one uses a linear solver and the
> other an optimization solver. My first thought was to use PAPI to include
> hardware counters, but these are notoriously inaccurate. Then I thought
> about simply reporting the manually counted flops and flop/s from PETSc,
> but those metrics ignore memory bandwidth. And so here I am looking at the
> idea of implementing the Roofline model, but now I am wondering if any of
> this is worth the trouble.
>

I think it's always best to start with a goal, and then explain how to
achieve that goal.

If your goal is "run a particular problem as fast as possible", then strong
scaling is a good metric: it shows you how much the time improves as you add
processors. However, suppose I just put a TON of useless but perfectly
parallel work into my solver, like factoring a bunch of large numbers. Now I
can show perfect strong scaling, but this is not the most _efficient_ way to
implement the solver.
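
In symbols, this is just the standard speedup definition plus my padded
time (the notation is mine, not anything PETSc reports):

  \[
    S(p) = \frac{T(1)}{T(p)}, \qquad
    \tilde{T}(p) = T(p) + \frac{W_{\mathrm{useless}}}{p}
  \]

Because the useless work W scales perfectly, the padded speedup
\tilde{T}(1) / \tilde{T}(p) can look even better than S(p), while
\tilde{T}(p) > T(p) at every processor count.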

Now there are at least two kinds of efficiency: hardware efficiency and
algorithmic efficiency. For both you have to have some conceptual model of
the computation. Then I can say "this is how long I expect things to take in
a perfect world" and "this is how long your algorithm takes". Likewise, I
can say "this is how fast it should run on this hardware in a perfect world"
and "this is how fast it runs on my machine".

Both measures are interesting and useful. As Jed points out, you must have
some kind of model of the computation in order to make claims about
"efficiency". For example, I expect VecDot() for length-N vectors to use 2N
flops, pull 2N doubles (16N bytes), and do one reduction. You can imagine
algorithms which do this slightly differently. You can imagine hardware
where this is flop limited, and hardware where it is memory bandwidth
limited. I can talk about these tradeoffs once I have the simple
computational model.
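
Here is a minimal sketch of that arithmetic for VecDot(). The peak flop
rate and memory bandwidth below are made-up placeholders; substitute
numbers measured on your machine (e.g. with STREAM):

  #include <stdio.h>

  /* Roofline-style estimate for VecDot() on two length-N vectors:
     2N flops, 2N doubles pulled from memory, one reduction.
     "peak" and "bandwidth" are hypothetical machine numbers. */
  int main(void)
  {
    double N         = 1.0e7;           /* vector length */
    double flops     = 2.0 * N;         /* one multiply + one add per entry */
    double bytes     = 2.0 * N * 8.0;   /* two double arrays, 8 bytes/double */
    double peak      = 50.0e9;          /* flop/s, placeholder peak rate */
    double bandwidth = 20.0e9;          /* bytes/s, placeholder STREAM rate */

    double t_flop = flops / peak;       /* time if flop limited */
    double t_mem  = bytes / bandwidth;  /* time if bandwidth limited */

    printf("arithmetic intensity:   %g flop/byte\n", flops / bytes);
    printf("flop-limited time:      %g s\n", t_flop);
    printf("bandwidth-limited time: %g s\n", t_mem);
    printf("roofline model time:    %g s\n", t_flop > t_mem ? t_flop : t_mem);
    return 0;
  }

At 1/8 flop/byte, with numbers like those above the bandwidth term
dominates by more than an order of magnitude, which is why counting bytes,
and not just flops, is the interesting part of the model for an operation
like VecDot().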

Does this help?

  Thanks,

    Matt


> Thanks,
> Justin
>
>
> On Wed, May 6, 2015 at 8:34 AM, Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Wed, May 6, 2015 at 8:29 AM, Jed Brown <jed at jedbrown.org> wrote:
>>
>>> Matthew Knepley <knepley at gmail.com> writes:
>>> > I think the idea is to be explicit. I would just use something like
>>> >
>>> >   # Vecs * Vec size * 8 bytes/double + <same for Mat>
>>> >
>>> > and forget about small stuff. This model is uncached,
>>>
>>> This is for a perfect cache model -- each byte of the data structures
>>> needs to be fetched from DRAM only once.
>>>
>>
>> I meant uncached, in which you count the # Vecs for each operation you
>> do. If you count the # Vecs once for the whole program, then you have a
>> perfect cache.
>>
>>   Matt
>>
>>
>>> Make sure that "# Vecs" actually means the number that are accessed.
>>> For example, GMRES accesses an increasing number of vectors on each
>>> iteration (until a restart).
>>>
>>> > and you just state that up front. Someone else can take the data and
>>> > produce a more nuanced model.
>>> >
>>> > You can use the Flop counts (not Flop/s) from PETSc, and get a rough
>>> > estimate of the flop/byte ratio.
>>>
>>>
>>
>>
>>
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener