[petsc-users] Obtaining bytes per second

Matthew Knepley knepley at gmail.com
Wed May 6 13:38:09 CDT 2015


On Wed, May 6, 2015 at 1:28 PM, Justin Chang <jychang48 at gmail.com> wrote:

> Jed,
>
> I am working with anisotropic diffusion and most standard numerical
> formulations (e.g., FEM, FVM, etc.) are "wrong" because they violate the
> discrete maximum principle; see Nakshatrala & Valocchi (JCP 2009) for more
> on this. What we have seen people do is simply "ignore" or chop off these
> values but to us that is a complete and utter abomination. My goal here is
> to show that our proposed methodologies work by leveraging the
> capabilities within PETSc and TAO and to also show how computationally
> expensive it is compared to solving the same problem using the standard
> Galerkin method.
>
> Matt,
>
> Okay, so then I guess I still have questions regarding how to obtain the
> bytes. How exactly would I count all the Vecs and their respective sizes,
> given that the DMPlex-related functions seem to create many vectors? Or do
> I only count the DM-created vectors used for my solution vector, residual,
> lower/upper bounds, optimization routines, etc.?
>

This is laborious. You would build up from the small stuff. So a Krylov
solver has MatMult, for which there is an analysis in the paper with
Dinesh/Bill/Barry/David, and Vec ops, which are easy. This is a lot of
counting, especially if you have a TAO solver in there. I would make sure
you really care.
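
As a rough illustration of the kind of counting involved, here is a sketch
under a simple perfect-streaming model (8-byte scalars, 4-byte column
indices); the helpers are illustrative, not PETSc routines:

  #include <petscsys.h>

  /* Bytes moved per call under a perfect-streaming model (no cache reuse). */
  static double VecAXPYBytes(PetscInt N) { return 3.0*N*8.0; } /* read x, read y, write y */
  static double VecDotBytes(PetscInt N)  { return 2.0*N*8.0; } /* read x, read y */
  static double MatMultAIJBytes(PetscInt m, double nz)
  { /* nonzero values + column indices, row offsets, read x, read/write y (square matrix) */
    return nz*(8.0 + 4.0) + (m + 1)*4.0 + 3.0*m*8.0;
  }

Summing estimates like these over the call counts PETSc logs for each event
gives a crude total byte count to pair with the flops it already reports.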


> And when you say "forget about small stuff", does that include all the
> DMPlex creation routines, PetscMalloc'ed arrays, pointwise functions, and
> all the jazz that goes on within the FE/discretization routines?
>

Yep.


> Lastly, for a Matrix, wouldn't I just get the number of bytes from the
> memory usage section in -log_summary?
>

That is a good way. You can also ask MatInfo how many nonzeros the matrix
has.
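
For example, assuming A is your assembled Mat, a minimal sketch of querying
that information with MatGetInfo() looks like:

  MatInfo        info;
  PetscErrorCode ierr;

  /* MAT_GLOBAL_SUM sums the per-process counts over the communicator */
  ierr = MatGetInfo(A, MAT_GLOBAL_SUM, &info);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "nonzeros used %g, allocated %g\n",
                     info.nz_used, info.nz_allocated);CHKERRQ(ierr);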

   Matt


> Thanks,
> Justin
>
> On Wed, May 6, 2015 at 11:48 AM, Matthew Knepley <knepley at gmail.com>
> wrote:
>
>> On Wed, May 6, 2015 at 11:41 AM, Justin Chang <jychang48 at gmail.com>
>> wrote:
>>
>>> I suppose I have two objectives that I think are achievable within PETSc
>>> means:
>>>
>>> 1) How much the wall-clock time can be reduced as you increase the number
>>> of processors. I have strong-scaling and parallel efficiency metrics that
>>> convey this.
>>>
>>> 2) What the "optimal" problem size for these two methods/solvers is. What
>>> I mean by this is: at what point do I achieve the maximum FLOPS/s?
>>> Starting from a really small problem, this metric should increase with
>>> problem size. My hypothesis is that as the problem size increases, the
>>> ratio of wall-clock time spent idle (e.g., waiting for cache to free up,
>>> accessing main memory, etc.) to time spent performing work also increases,
>>> and the reported FLOPS/s should start decreasing at some point.
>>> "Efficiency" in this context simply means the highest possible FLOPS/s.
>>>
>>> Does that make sense and/or is "interesting" enough?
>>>
>>
>> I think 2) is not really that interesting because
>>
>>   a) it is so easily gamed. Just stick in high flop count operations,
>> like DGEMM.
>>
>>   b) Time really matters to people who run the code, but flops never do.
>>
>>   c) Floating point performance is not your limiting factor for time.
>>
>> I think it would be much more interesting, and no more work, to
>>
>>   a) Model the flop/byte \beta ratio simply
>>
>>   b) Report how close you get to the max performance given \beta on your
>> machine
>>
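
To make b) concrete, a minimal sketch of the roofline-style bound, with
placeholder numbers you would replace by your machine's measured peak and
(e.g. STREAM) bandwidth:

  #include <stdio.h>

  int main(void)
  {
    double peak_gflops = 10.0;    /* peak flop rate, GF/s (placeholder) */
    double bw_gbps     = 5.0;     /* measured memory bandwidth, GB/s (placeholder) */
    double beta        = 1.0/6.0; /* flop/byte ratio, e.g. ~2 flops per 12 bytes for AIJ MatMult */
    double attainable  = beta*bw_gbps < peak_gflops ? beta*bw_gbps : peak_gflops;

    printf("attainable rate ~ %.2f GF/s\n", attainable);
    return 0;
  }

The number to report is then your achieved GF/s divided by this attainable
rate, which measures how close you get to the bandwidth limit rather than
to peak flops.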
>>   Thanks,
>>
>>      Matt
>>
>>
>>> Thanks,
>>> Justin
>>>
>>> On Wed, May 6, 2015 at 11:28 AM, Jed Brown <jed at jedbrown.org> wrote:
>>>
>>>> Justin Chang <jychang48 at gmail.com> writes:
>>>> > I already have speedup/strong scaling results that essentially depict
>>>> > the difference between the KSPSolve() and TaoSolve(). However, I have
>>>> > been told by someone that strong-scaling isn't enough - that I should
>>>> > somehow include something to show the "efficiency" of these two
>>>> > methodologies.
>>>>
>>>> "Efficiency" is irrelevant if one is wrong.  Can you set up a problem
>>>> where both get the right answer and vary a parameter to get to the case
>>>> where one fails?  Then you can look at efficiency for a given accuracy
>>>> (and you might have to refine the grid differently) as you vary the
>>>> parameter.
>>>>
>>>> It's really hard to demonstrate that an implicit solver is optimal in
>>>> terms of mathematical convergence rate.  Improvements there can dwarf
>>>> any differences in implementation efficiency.
>>>>
>>>> > That is, how much of the wall-clock time reported by these two very
>>>> > different solvers is spent doing useful work.
>>>> >
>>>> > Is such an "efficiency" metric necessary to report in addition to
>>>> > strong-scaling results? The overall computational framework is the
>>>> > same for both problems, the only difference being one uses a linear
>>>> > solver and the other uses an optimization solver. My first thought
>>>> > was to use PAPI to include hardware counters, but these are
>>>> > notoriously inaccurate. Then I thought about simply reporting the
>>>> > manual FLOPS and FLOPS/s via PETSc, but these metrics ignore memory
>>>> > bandwidth. And so here I am looking at the idea of implementing the
>>>> > Roofline model, but now I am wondering if any of this is worth the
>>>> > trouble.
>>>>
>>>>
>>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener