[petsc-users] Obtaining bytes per second
Justin Chang
jychang48 at gmail.com
Wed May 6 14:28:47 CDT 2015
So basically I just need these types of operations:
VecTDot             60 1.0 1.6928e-05 1.0 2.69e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0 14  0  0  0   0 14  0  0  0  1591
VecNorm             31 1.0 9.0599e-06 1.0 1.39e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  7  0  0  0   0  7  0  0  0  1536
VecCopy              3 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet              36 1.0 9.4175e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecAXPY             60 1.0 1.7166e-05 1.0 2.70e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0 14  0  0  0   0 14  0  0  0  1573
VecAYPX             29 1.0 1.9312e-05 1.0 1.30e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  7  0  0  0   0  7  0  0  0   676
VecWAXPY             1 1.0 1.9073e-06 1.0 2.25e+02 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   118
VecPointwiseMult    31 1.0 1.8358e-05 1.0 6.98e+03 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   380
MatMult             30 1.0 7.5340e-05 1.0 8.07e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0 42  0  0  0   0 42  0  0  0  1071
Given the matrix size, the number of nonzeros, the size of my solution
vector, and the number of calls to each of these Vec ops, can I estimate
the total bytes transferred (loads/stores)?
For multiple processors, should I use the local or the global sizes? If I
use the local sizes, do I then need an MPI_Allreduce (with MPI_SUM) of the
total bytes transferred across all processors, just as is done for total
flops?
Do these vec/mat ops account for Dirichlet constraints? That is, should the
global/local size include those constraints?
Also, is there a way to extract the event count for these operations
besides dumping -log_summary each time?
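For what it's worth, one route I have seen for pulling an event's count programmatically is the logging API; this is a sketch from my reading of petsclog.h, so the signatures should be checked against the installed PETSc version:

```c
#include <petscsys.h>

/* Sketch (assumed API): look up a logged event by name and read its
 * accumulated count/flops/time for the main stage, instead of parsing
 * -log_summary output each run. */
static PetscErrorCode ReportEvent(const char *name)
{
  PetscLogEvent      event;
  PetscEventPerfInfo info;

  PetscFunctionBeginUser;
  PetscLogEventGetId(name, &event);
  /* stage 0 is the main stage; PetscLogStageGetId() handles named stages */
  PetscLogEventGetPerfInfo(0, event, &info);
  PetscPrintf(PETSC_COMM_SELF, "%s: count %d, flops %g, time %g\n",
              name, info.count, info.flops, info.time);
  PetscFunctionReturn(0);
}
```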
Thanks,
Justin
On Wed, May 6, 2015 at 1:38 PM, Matthew Knepley <knepley at gmail.com> wrote:
> On Wed, May 6, 2015 at 1:28 PM, Justin Chang <jychang48 at gmail.com> wrote:
>
>> Jed,
>>
>> I am working with anisotropic diffusion, and most standard numerical
>> formulations (e.g., FEM, FVM) are "wrong" because they violate the
>> discrete maximum principle; see Nakshatrala & Valocchi (JCP 2009) for
>> more on this. What we have seen people do is simply "ignore" or chop off
>> these values, but to us that is a complete and utter abomination. My goal
>> here is to show that our proposed methodologies work by leveraging the
>> capabilities within PETSc and TAO, and also to show how computationally
>> expensive they are compared to solving the same problem using the
>> standard Galerkin method.
>>
>> Matt,
>>
>> Okay, so then I guess I still have questions about how to obtain the
>> bytes. How exactly would I count all the Vecs and their respective
>> sizes? It seems the DMPlex-related functions create many vectors. Or do
>> I only count the DM-created vectors used for my solution vector,
>> residual, lower/upper bounds, optimization routines, etc.?
>>
>
> This is laborious. You would build up from the small stuff. So a Krylov
> solver has a MatMult, for which there is an analysis in the paper with
> Dinesh/Bill/Barry/David, and Vec ops, which are easy. This is a lot of
> counting, especially if you have a TAO solver in there. I would make sure
> you really care.
>
>
>> And when you say "forget about small stuff", does that include all the
>> DMPlex creation routines, PetscMalloc'ed arrays, pointwise functions, and
>> all the jazz that goes on within the FE/discretization routines?
>>
>
> Yep.
>
>
>> Lastly, for a Matrix, wouldn't I just get the number of bytes from the
>> memory usage section in -log_summary?
>>
>
> That is a good way. You can also ask MatInfo how many nonzeros the matrix
> has.
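[A sketch of the MatInfo route; the field names here are from my reading of the PETSc headers and should be verified against the installed version:]

```c
#include <petscmat.h>

/* Sketch (assumed API): read the nonzero count and memory footprint from
 * MatInfo instead of the -log_summary memory section. */
static PetscErrorCode ReportMatSize(Mat A)
{
  MatInfo info;

  PetscFunctionBeginUser;
  MatGetInfo(A, MAT_GLOBAL_SUM, &info);
  PetscPrintf(PETSC_COMM_WORLD, "nz_used %g, memory %g bytes\n",
              info.nz_used, info.memory);
  PetscFunctionReturn(0);
}
```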
>
> Matt
>
>
>> Thanks,
>> Justin
>>
>> On Wed, May 6, 2015 at 11:48 AM, Matthew Knepley <knepley at gmail.com>
>> wrote:
>>
>>> On Wed, May 6, 2015 at 11:41 AM, Justin Chang <jychang48 at gmail.com>
>>> wrote:
>>>
>>>> I suppose I have two objectives that I think are achievable within
>>>> PETSc means:
>>>>
>>>> 1) How much wall-clock time can be reduced as you increase the number
>>>> of processors. I have strong-scaling and parallel efficiency metrics that
>>>> convey this.
>>>>
>>>> 2) What the "optimal" problem size for these two methods/solvers is.
>>>> What I mean is: at what point do I achieve the maximum FLOPS/s?
>>>> Starting from a really small problem, this metric should increase with
>>>> problem size. My hypothesis is that as the problem size increases, the
>>>> ratio of wall-clock time spent idle (e.g., waiting for cache to free
>>>> up, accessing main memory, etc.) to time spent performing work also
>>>> increases, so the reported FLOPS/s should start decreasing at some
>>>> point. "Efficiency" in this context simply means the highest possible
>>>> FLOPS/s.
>>>>
>>>> Does that make sense and/or is "interesting" enough?
>>>>
>>>
>>> I think 2) is not really that interesting because
>>>
>>> a) it is so easily gamed. Just stick in high flop count operations,
>>> like DGEMM.
>>>
>>> b) Time really matters to people who run the code, but flops never do.
>>>
>>> c) Floating-point performance is not your limiting factor for time.
>>>
>>> I think it would be much more interesting, and no more work to
>>>
>>> a) Model the flop/byte \beta ratio simply
>>>
>>> b) Report how close you get to the max performance given \beta on your
>>> machine
>>>
>>> Thanks,
>>>
>>> Matt
>>>
>>>
>>>> Thanks,
>>>> Justin
>>>>
>>>> On Wed, May 6, 2015 at 11:28 AM, Jed Brown <jed at jedbrown.org> wrote:
>>>>
>>>>> Justin Chang <jychang48 at gmail.com> writes:
>>>>> > I already have speedup/strong-scaling results that essentially
>>>>> > depict the difference between KSPSolve() and TaoSolve(). However, I
>>>>> > have been told by someone that strong scaling isn't enough - that I
>>>>> > should somehow include something to show the "efficiency" of these
>>>>> > two methodologies.
>>>>>
>>>>> "Efficiency" is irrelevant if one is wrong. Can you set up a problem
>>>>> where both get the right answer and vary a parameter to get to the case
>>>>> where one fails? Then you can look at efficiency for a given accuracy
>>>>> (and you might have to refine the grid differently) as you vary the
>>>>> parameter.
>>>>>
>>>>> It's really hard to demonstrate that an implicit solver is optimal in
>>>>> terms of mathematical convergence rate. Improvements there can dwarf
>>>>> any differences in implementation efficiency.
>>>>>
>>>>> > That is, how much of the wall-clock time reported by these two very
>>>>> > different solvers is spent doing useful work.
>>>>> >
>>>>> > Is such an "efficiency" metric necessary to report in addition to
>>>>> > strong-scaling results? The overall computational framework is the
>>>>> > same for both problems, the only difference being one uses a linear
>>>>> > solver and the other uses an optimization solver. My first thought
>>>>> > was to use PAPI to include hardware counters, but these are
>>>>> > notoriously inaccurate. Then I thought about simply reporting the
>>>>> > manual FLOPS and FLOPS/s via PETSc, but these metrics ignore memory
>>>>> > bandwidth. And so here I am looking at the idea of implementing the
>>>>> > Roofline model, but now I am wondering if any of this is worth the
>>>>> > trouble.
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>
>>
>
>