[petsc-users] Obtaining bytes per second

Justin Chang jychang48 at gmail.com
Wed May 6 10:29:14 CDT 2015


Perhaps I should explain what my ultimate goal is.

I have two sets of code. The first is classical FEM, which gives numerically
"incorrect" answers; the second is classical FEM augmented with convex
optimization through TAO, which fixes those answers. We have shown
numerically that the latter should always be the way to solve these types
of problems, but now I want to compare the computational performance of the
two methodologies on large problems.

I already have speedup/strong-scaling results that essentially show the
difference between KSPSolve() and TaoSolve(). However, I have been told
that strong scaling alone isn't enough - that I should also include
something showing the "efficiency" of the two methodologies, i.e., how
much of the wall-clock time reported by these two very different solvers is
spent doing useful work.

Is such an "efficiency" metric necessary to report in addition to
strong-scaling results? The overall computational framework is the same for
both problems; the only difference is that one uses a linear solver and the
other an optimization solver. My first thought was to use PAPI hardware
counters, but these are notoriously inaccurate. Then I thought about simply
reporting the manually counted flops and flop rates from PETSc, but those
metrics ignore memory bandwidth. And so here I am looking at implementing
the Roofline model, wondering whether any of this is worth the trouble.
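For concreteness, the Roofline model I have in mind bounds the attainable
flop rate by the smaller of peak compute and bandwidth times arithmetic
intensity. A minimal sketch - the peak and bandwidth numbers below are
made-up placeholders, not measurements of any real machine:

```python
# Roofline bound: attainable flop rate is limited either by the machine's
# peak flop rate or by memory bandwidth times arithmetic intensity
# (flops per byte moved). Machine numbers below are hypothetical.

def roofline_gflops(intensity_flops_per_byte,
                    peak_gflops=500.0,       # placeholder peak compute
                    bandwidth_gbytes=50.0):  # placeholder DRAM bandwidth
    """Attainable GF/s for a kernel with the given flop/byte intensity."""
    return min(peak_gflops, bandwidth_gbytes * intensity_flops_per_byte)

# Sparse kernels typically sit far below the machine balance point
# (peak / bandwidth = 10 flops/byte here), so they are bandwidth bound.
print(roofline_gflops(0.25))   # bandwidth-bound regime
print(roofline_gflops(20.0))   # compute-bound regime
```

Whether a solver's measured flop rate sits near this bound is one way to
quantify the "useful work" question above.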

Thanks,
Justin


On Wed, May 6, 2015 at 8:34 AM, Matthew Knepley <knepley at gmail.com> wrote:

> On Wed, May 6, 2015 at 8:29 AM, Jed Brown <jed at jedbrown.org> wrote:
>
>> Matthew Knepley <knepley at gmail.com> writes:
>> > I think the idea is to be explicit. I would just use something like
>> >
>> >   # Vecs * Vec size * 8 bytes/double + <same for Mat>
>> >
>> > and forget about small stuff. This model is uncached,
>>
>> This is for a perfect cache model -- each byte of the data structures
>> needs to be fetched from DRAM only once.
>>
>
> I meant uncached, in which you count # Vecs for any operation you are
> doing. If you count # Vecs for the whole program, then you have perfect
> cache.
>
>   Matt
>
>
>> Make sure that "# Vecs" actually means the number that are accessed.
>> For example, GMRES accesses an increasing number of vectors on each
>> iteration (until a restart).
>>
>> > and you just state that up front. Someone else can take the data and
>> > produce a more nuanced model.
>> >
>> > You can use the flop counts (not flop/s) from PETSc and get a rough
>> > estimate of the flop/byte ratio.
>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
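The uncached byte model sketched above (count every Vec and Mat an
operation touches, at 8 bytes per double, and forget the small stuff)
might look like the following back-of-envelope script. The problem sizes
are hypothetical, and the GMRES helper just illustrates Jed's point that
the number of vectors accessed grows each iteration until a restart:

```python
# Back-of-envelope byte estimate for the uncached model discussed above:
# count each Vec and Mat accessed per operation, 8 bytes per double.
# All sizes below are hypothetical, for illustration only.

BYTES_PER_DOUBLE = 8
BYTES_PER_INT = 4

def vec_bytes(n_vecs, vec_size):
    """Bytes moved for n_vecs vectors of length vec_size."""
    return n_vecs * vec_size * BYTES_PER_DOUBLE

def aij_matvec_bytes(rows, nnz):
    """Rough bytes for one CSR (AIJ) mat-vec: matrix values, column
    indices, row offsets, plus the input and output vectors."""
    return (nnz * BYTES_PER_DOUBLE          # nonzero values
            + nnz * BYTES_PER_INT           # column indices
            + (rows + 1) * BYTES_PER_INT    # row offsets
            + vec_bytes(2, rows))           # x and y

def gmres_iteration_vec_count(k, restart=30):
    """Vectors accessed in GMRES iteration k (0-based): the new Krylov
    vector plus all previous basis vectors it is orthogonalized against,
    so the count grows until the restart resets it."""
    return (k % restart) + 2

n, nnz = 1_000_000, 7_000_000   # hypothetical mesh: ~7 nonzeros per row
flops = 2 * nnz                 # one mat-vec: a multiply and an add per nonzero
bytes_moved = aij_matvec_bytes(n, nnz)
print(f"mat-vec intensity: {flops / bytes_moved:.3f} flops/byte")
```

Dividing the flop count PETSc logs by such a byte estimate gives the rough
flop/byte ratio mentioned above, which can then be placed on a roofline plot.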


More information about the petsc-users mailing list