[petsc-users] Obtaining bytes per second

Justin Chang jychang48 at gmail.com
Wed May 6 13:28:03 CDT 2015


Jed,

I am working with anisotropic diffusion, and most standard numerical
formulations (e.g., FEM, FVM) are "wrong" in the sense that they violate
the discrete maximum principle; see Nakshatrala & Valocchi (JCP 2009) for
more on this. What we have seen people do is simply "ignore" or chop off
the offending values, but to us that is a complete and utter abomination.
My goal here is to show that our proposed methodologies work by leveraging
the capabilities within PETSc and TAO, and also to show how computationally
expensive they are compared to solving the same problem using the standard
Galerkin method.

Matt,

Okay, so then I guess I still have questions regarding how to obtain the
bytes. How exactly would I count the Vecs and their respective sizes? It
seems that all the DMPlex-related functions create many vectors. Or do I
only count the DM-created vectors used for my solution vector, residual,
lower/upper bounds, optimization routines, etc.?
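
As a rough way to frame the Vec question, one could count only the large,
solver-level vectors and multiply their length by the scalar size. The
sketch below is plain Python, not PETSc API. It assumes 8-byte real
scalars, and the function names and the example size are hypothetical. It
also estimates the traffic of a single AXPY (y <- a*x + y), which reads x
and y and writes y.

```python
SCALAR_BYTES = 8  # assumption: scalars are real doubles


def vec_bytes(global_size, n_vecs=1, scalar_bytes=SCALAR_BYTES):
    """Storage for n_vecs dense vectors, each of length global_size."""
    return n_vecs * global_size * scalar_bytes


def axpy_traffic(n, scalar_bytes=SCALAR_BYTES):
    """Flops and bytes moved for y <- a*x + y on vectors of length n."""
    flops = 2 * n                      # one multiply and one add per entry
    bytes_moved = 3 * n * scalar_bytes  # read x, read y, write y
    return flops, bytes_moved


if __name__ == "__main__":
    n = 1_000_000  # hypothetical number of unknowns
    # Count only the big vectors: solution, residual, lower and upper bound.
    print("bytes in 4 vectors:", vec_bytes(n, n_vecs=4))
    flops, moved = axpy_traffic(n)
    print("AXPY flop/byte ratio:", flops / moved)
```

Small work arrays change these totals very little as long as their size
does not grow with the number of unknowns.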

And when you say "forget about small stuff", does that include all the
DMPlex creation routines, PetscMalloc'ed arrays, pointwise functions, and
all the jazz that goes on within the FE/discretization routines?

Lastly, for a Matrix, wouldn't I just get the number of bytes from the
memory usage section in -log_summary?
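
One caveat on the matrix question: a memory usage figure describes what is
stored, whereas a flop/byte model needs the bytes moved per operation. A
minimal sketch, assuming a CSR (AIJ-like) layout with 8-byte values and
4-byte indices, and the optimistic assumption that the input vector is read
from memory only once per multiply. All names are hypothetical, and the
peak flop rate, bandwidth, and achieved rate are made-up placeholders, not
measurements of any machine.

```python
def aij_bytes(n_rows, nnz, scalar_bytes=8, index_bytes=4):
    """CSR storage: nnz values, nnz column indices, n_rows + 1 row offsets."""
    return nnz * (scalar_bytes + index_bytes) + (n_rows + 1) * index_bytes


def spmv_intensity(n_rows, nnz, scalar_bytes=8, index_bytes=4):
    """Flop/byte ratio (beta) for y = A*x under a best-case traffic model."""
    flops = 2 * nnz  # one multiply and one add per stored nonzero
    # Stream the matrix once, read x once, write y once.
    moved = aij_bytes(n_rows, nnz, scalar_bytes, index_bytes)
    moved += 2 * n_rows * scalar_bytes
    return flops / moved


def roofline_bound(beta, peak_flops, peak_bandwidth):
    """Attainable flop rate: limited by compute or by beta * bandwidth."""
    return min(peak_flops, beta * peak_bandwidth)


if __name__ == "__main__":
    n_rows, nnz = 1_000_000, 7_000_000  # hypothetical matrix
    peak_flops = 100e9                  # placeholder: 100 GF/s
    peak_bandwidth = 25e9               # placeholder: 25 GB/s
    achieved = 2.0e9                    # placeholder measured flop rate

    beta = spmv_intensity(n_rows, nnz)
    bound = roofline_bound(beta, peak_flops, peak_bandwidth)
    print("beta (flop/byte):", beta)
    print("roofline bound (flop/s):", bound)
    print("fraction of bound achieved:", achieved / bound)
```

The last number is the kind of quantity Matt's item b) asks for: how close
the measured rate comes to the best rate the machine allows for that beta.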

Thanks,
Justin

On Wed, May 6, 2015 at 11:48 AM, Matthew Knepley <knepley at gmail.com> wrote:

> On Wed, May 6, 2015 at 11:41 AM, Justin Chang <jychang48 at gmail.com> wrote:
>
>> I suppose I have two objectives that I think are achievable within PETSc
>> means:
>>
>> 1) How much wall-clock time can be reduced as you increase the number of
>> processors. I have strong-scaling and parallel efficiency metrics that
>> convey this.
>>
>> 2) What the "optimal" problem size for these two methods/solvers is. What
>> I mean by this is, at what point do I achieve the maximum FLOPS/s? If I
>> start off with a really small problem, then this metric should increase
>> with problem size. My hypothesis is that as problem size increases, the
>> ratio of wall-clock time spent idle (e.g., waiting for cache to free up,
>> accessing main memory, etc.) to time spent performing work also
>> increases, and the reported FLOPS/s should start decreasing at some
>> point. "Efficiency" in this context simply means the highest possible
>> FLOPS/s.
>>
>> Does that make sense and/or is "interesting" enough?
>>
>
> I think 2) is not really that interesting because
>
>   a) it is so easily gamed. Just stick in high flop count operations, like
> DGEMM.
>
>   b) Time really matters to people who run the code, but flops never do.
>
>   c) Floating point performance is not your limiting factor for time
>
> I think it would be much more interesting, and no more work to
>
>   a) Model the flop/byte \beta ratio simply
>
>   b) Report how close you get to the max performance given \beta on your
> machine
>
>   Thanks,
>
>      Matt
>
>
>> Thanks,
>> Justin
>>
>> On Wed, May 6, 2015 at 11:28 AM, Jed Brown <jed at jedbrown.org> wrote:
>>
>>> Justin Chang <jychang48 at gmail.com> writes:
>>> > I already have speedup/strong scaling results that essentially depict the
>>> > difference between the KSPSolve() and TaoSolve(). However, I have been told
>>> > by someone that strong-scaling isn't enough - that I should somehow include
>>> > something to show the "efficiency" of these two methodologies.
>>>
>>> "Efficiency" is irrelevant if one is wrong.  Can you set up a problem
>>> where both get the right answer and vary a parameter to get to the case
>>> where one fails?  Then you can look at efficiency for a given accuracy
>>> (and you might have to refine the grid differently) as you vary the
>>> parameter.
>>>
>>> It's really hard to demonstrate that an implicit solver is optimal in
>>> terms of mathematical convergence rate.  Improvements there can dwarf
>>> any differences in implementation efficiency.
>>>
>>> > That is, how much of the wall-clock time reported by these two very
>>> > different solvers is spent doing useful work.
>>> >
>>> > Is such an "efficiency" metric necessary to report in addition to
>>> > strong-scaling results? The overall computational framework is the same for
>>> > both problems, the only difference being one uses a linear solver and the
>>> > other uses an optimization solver. My first thought was to use PAPI to
>>> > include hardware counters, but these are notoriously inaccurate. Then I
>>> > thought about simply reporting the manual FLOPS and FLOPS/s via PETSc, but
>>> > these metrics ignore memory bandwidth. And so here I am looking at the idea
>>> > of implementing the Roofline model, but now I am wondering if any of this
>>> > is worth the trouble.
>>>
>>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>