[petsc-users] Obtaining bytes per second
Matthew Knepley
knepley at gmail.com
Wed May 6 14:34:13 CDT 2015
On Wed, May 6, 2015 at 2:28 PM, Justin Chang <jychang48 at gmail.com> wrote:
> So basically I just need these types of operations:
>
> VecTDot            60 1.0 1.6928e-05 1.0 2.69e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0 14  0  0  0   0 14  0  0  0  1591
> VecNorm            31 1.0 9.0599e-06 1.0 1.39e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  7  0  0  0   0  7  0  0  0  1536
> VecCopy             3 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet             36 1.0 9.4175e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPY            60 1.0 1.7166e-05 1.0 2.70e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0 14  0  0  0   0 14  0  0  0  1573
> VecAYPX            29 1.0 1.9312e-05 1.0 1.30e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  7  0  0  0   0  7  0  0  0   676
> VecWAXPY            1 1.0 1.9073e-06 1.0 2.25e+02 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   118
> VecPointwiseMult   31 1.0 1.8358e-05 1.0 6.98e+03 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   380
> MatMult            30 1.0 7.5340e-05 1.0 8.07e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0 42  0  0  0   0 42  0  0  0  1071
>
> Given the matrix size, the number of nonzeros, the vector size of my
> solution, and the number of calls to each of these Vec ops, can I estimate
> the total bytes transferred (loads/stores)?
>
Yes, exactly.
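As a rough illustration of that counting, here is a sketch (not taken from your
run) that assumes 8-byte scalars, 4-byte column indices for an AIJ matrix, an
input vector that stays in cache during MatMult, and placeholder local sizes
(N, m, nnz) that you would replace with your own; the call counts are the ones
from the -log_summary output above:

  #include <stdio.h>

  int main(void)
  {
    const double bs = 8.0, bi = 4.0; /* assumed bytes per scalar and per index */
    double N   = 1000.0;             /* placeholder: local vector length       */
    double m   = 1000.0;             /* placeholder: local matrix rows         */
    double nnz = 5000.0;             /* placeholder: local nonzeros            */

    /* Streaming vector kernels: bytes = (vectors read + vectors written) * N * bs */
    double tdot = 60 * (2.0 * N * bs);  /* VecTDot: read x and y                   */
    double norm = 31 * (1.0 * N * bs);  /* VecNorm: read x                         */
    double axpy = 60 * (3.0 * N * bs);  /* VecAXPY: read x, read and write y       */
    double aypx = 29 * (3.0 * N * bs);  /* VecAYPX: same traffic as AXPY           */
    double pmul = 31 * (3.0 * N * bs);  /* VecPointwiseMult: read x and y, write w */

    /* AIJ MatMult: matrix values + column indices + row offsets, plus x in, y out */
    double mult = 30 * (nnz * (bs + bi) + (m + 1.0) * bi + (N + m) * bs);

    printf("estimated bytes transferred: %g\n", tdot + norm + axpy + aypx + pmul + mult);
    return 0;
  }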
> For multiple processors, do I calculate the local or global sizes? If I
> calculate the local sizes, do I then need to do an MPI_Allreduce (with
> MPI_SUM) of the total bytes transferred across all processors, just like
> with total flops?
>
I would just do this analysis locally. Dealing with reductions adds another
level of complexity.
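(Should you later want a single global number anyway, a minimal sketch of the
reduction might look like the following; x and localBytes are hypothetical
names for your solution vector and your per-process byte total:)

  PetscInt n, N;
  double   localBytes = 0.0, globalBytes = 0.0;

  ierr = VecGetLocalSize(x, &n);CHKERRQ(ierr);  /* local size: use this for the per-process count */
  ierr = VecGetSize(x, &N);CHKERRQ(ierr);       /* global size, for reference                     */
  /* ... accumulate localBytes from the per-call byte estimates ... */
  ierr = MPI_Allreduce(&localBytes, &globalBytes, 1, MPI_DOUBLE, MPI_SUM, PETSC_COMM_WORLD);CHKERRQ(ierr);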
> Do these vec/mat ops account for Dirichlet constraints? That is, should
> the global/local size include those constraints?
>
Hopefully the Dirichlet constraints are not significant. Certainly in the
limit of large N they drop out. The solver operations
are in the global space, and the assembly operations are in the local space.
> Also, is there a way to extract the event count for these operations
> besides dumping -log_summary each time?
>
Yes, using the PetscStageLog and PetscEventLog interfaces. Here is an
example from dtfe.c:
  PetscStageLog     stageLog;
  PetscEventPerfLog eventLog = NULL;
  PetscInt          stage;
  PetscErrorCode    ierr;

  PetscFunctionBegin;
  /* Grab the performance log for the currently active stage */
  ierr = PetscLogGetStageLog(&stageLog);CHKERRQ(ierr);
  ierr = PetscStageLogGetCurrent(stageLog, &stage);CHKERRQ(ierr);
  ierr = PetscStageLogGetEventPerfLog(stageLog, stage, &eventLog);CHKERRQ(ierr);
  /* Log performance info; residualEvent, time, and flops come from the surrounding dtfe.c code */
  eventLog->eventInfo[ocl->residualEvent].count++;
  eventLog->eventInfo[ocl->residualEvent].time  += time;
  eventLog->eventInfo[ocl->residualEvent].flops += flops;
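Going the other way, if you just want to read back the accumulated
count/time/flops for one of the built-in events rather than logging your own,
a minimal sketch using the same interfaces follows; MAT_Mult is (I believe)
the predefined MatMult logging event from the PETSc logging headers, and the
snippet assumes logging is active:

  PetscStageLog      stageLog;
  PetscEventPerfLog  eventLog = NULL;
  PetscEventPerfInfo matInfo;
  PetscInt           stage;
  PetscErrorCode     ierr;

  ierr    = PetscLogGetStageLog(&stageLog);CHKERRQ(ierr);
  ierr    = PetscStageLogGetCurrent(stageLog, &stage);CHKERRQ(ierr);
  ierr    = PetscStageLogGetEventPerfLog(stageLog, stage, &eventLog);CHKERRQ(ierr);
  matInfo = eventLog->eventInfo[MAT_Mult];  /* accumulated count, time, flops for MatMult */
  ierr    = PetscPrintf(PETSC_COMM_WORLD, "MatMult: %d calls, %g s, %g flops\n",
                        matInfo.count, matInfo.time, matInfo.flops);CHKERRQ(ierr);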
Thanks,
Matt
> Thanks,
> Justin
>
> On Wed, May 6, 2015 at 1:38 PM, Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Wed, May 6, 2015 at 1:28 PM, Justin Chang <jychang48 at gmail.com> wrote:
>>
>>> Jed,
>>>
>>> I am working with anisotropic diffusion, and most standard numerical
>>> formulations (e.g., FEM, FVM) are "wrong" because they violate the
>>> discrete maximum principle; see Nakshatrala & Valocchi (JCP 2009) for more
>>> on this. What we have seen people do is simply "ignore" or chop off these
>>> values, but to us that is a complete and utter abomination. My goal here is
>>> to show that our proposed methodologies work by leveraging the
>>> capabilities within PETSc and TAO, and also to show how computationally
>>> expensive they are compared to solving the same problem using the standard
>>> Galerkin method.
>>>
>>> Matt,
>>>
>>> Okay, so then I guess I still have questions regarding how to obtain the
>>> byte counts. How exactly would I count all the Vecs and their respective
>>> sizes, given that the DMPlex-related functions seem to create many
>>> vectors? Or do I only count the DM-created vectors used for my solution
>>> vector, residual, lower/upper bounds, optimization routines, etc.?
>>>
>>
>> This is laborious. You would build up from the small stuff. So a Krylov
>> solver has MatMult, for which there is an analysis in the paper with
>> Dinesh/Bill/Barry/David, and Vec ops, which are easy. This is a lot of
>> counting, especially if you have a TAO solver in there. I would make sure
>> you really care.
>>
>>
>>> And when you say "forget about small stuff", does that include all the
>>> DMPlex creation routines, PetscMalloc'ed arrays, pointwise functions, and
>>> all the jazz that goes on within the FE/discretization routines?
>>>
>>
>> Yep.
>>
>>
>>> Lastly, for a Matrix, wouldn't I just get the number of bytes from the
>>> memory usage section in -log_summary?
>>>
>>
>> That is a good way. You can also ask MatInfo how many nonzeros the matrix
>> has.
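>> (A minimal sketch of that query, assuming an assembled matrix A:)
>>
>>   MatInfo info;
>>
>>   ierr = MatGetInfo(A, MAT_LOCAL, &info);CHKERRQ(ierr);  /* or MAT_GLOBAL_SUM across ranks */
>>   ierr = PetscPrintf(PETSC_COMM_SELF, "local nonzeros used: %g\n", info.nz_used);CHKERRQ(ierr);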
>>
>> Matt
>>
>>
>>> Thanks,
>>> Justin
>>>
>>> On Wed, May 6, 2015 at 11:48 AM, Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>>
>>>> On Wed, May 6, 2015 at 11:41 AM, Justin Chang <jychang48 at gmail.com>
>>>> wrote:
>>>>
>>>>> I suppose I have two objectives that I think are achievable within
>>>>> PETSc means:
>>>>>
>>>>> 1) How much wall-clock time can be reduced as you increase the number
>>>>> of processors. I have strong-scaling and parallel efficiency metrics that
>>>>> convey this.
>>>>>
>>>>> 2) What the "optimal" problem size for these two methods/solvers is. What
>>>>> I mean by this is: at what point do I achieve the maximum FLOPS/s?
>>>>> Starting from a really small problem, this metric should increase with
>>>>> problem size. My hypothesis is that as the problem size increases, the
>>>>> ratio of wall-clock time spent idle (e.g., waiting for cache to free up,
>>>>> accessing main memory, etc.) to time spent performing work also increases,
>>>>> and the reported FLOPS/s should start decreasing at some point.
>>>>> "Efficiency" in this context simply means the highest possible FLOPS/s.
>>>>>
>>>>> Does that make sense and/or is "interesting" enough?
>>>>>
>>>>
>>>> I think 2) is not really that interesting because
>>>>
>>>> a) it is so easily gamed. Just stick in high flop count operations,
>>>> like DGEMM.
>>>>
>>>> b) Time really matters to people who run the code, but flops never do.
>>>>
>>>> c) Floating point performance is not your limiting factor for time
>>>>
>>>> I think it would be much more interesting, and no more work to
>>>>
>>>> a) Model the flop/byte \beta ratio simply
>>>>
>>>> b) Report how close you get to the max performance given \beta on your
>>>> machine (a rough sketch of that bound follows below)
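>>>>
>>>> A rough sketch of that roofline-style bound, with placeholder machine
>>>> numbers (the peak flop rate and memory bandwidth are assumptions you would
>>>> measure for your own machine, e.g. with STREAM):
>>>>
>>>>   #include <stdio.h>
>>>>
>>>>   int main(void)
>>>>   {
>>>>     double peak = 10.0e9;   /* placeholder: per-core peak flop rate, flop/s     */
>>>>     double bw   = 5.0e9;    /* placeholder: measured memory bandwidth, bytes/s  */
>>>>     double beta = 2.0/12.0; /* flop/byte ratio of the kernel, e.g. roughly
>>>>                                2 flops per 12 bytes for an AIJ MatMult          */
>>>>
>>>>     /* You cannot exceed either the flop peak or beta times the bandwidth */
>>>>     double attainable = (peak < beta * bw) ? peak : beta * bw;
>>>>
>>>>     printf("attainable flop rate: %g flop/s\n", attainable);
>>>>     return 0;
>>>>   }
>>>>
>>>> The measured flop rate from -log_summary divided by this bound is the
>>>> "how close" number to report.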
>>>>
>>>> Thanks,
>>>>
>>>> Matt
>>>>
>>>>
>>>>> Thanks,
>>>>> Justin
>>>>>
>>>>> On Wed, May 6, 2015 at 11:28 AM, Jed Brown <jed at jedbrown.org> wrote:
>>>>>
>>>>>> Justin Chang <jychang48 at gmail.com> writes:
>>>>>> > I already have speedup/strong scaling results that essentially depict
>>>>>> > the difference between the KSPSolve() and TaoSolve(). However, I have
>>>>>> > been told by someone that strong-scaling isn't enough - that I should
>>>>>> > somehow include something to show the "efficiency" of these two
>>>>>> > methodologies.
>>>>>>
>>>>>> "Efficiency" is irrelevant if one is wrong. Can you set up a problem
>>>>>> where both get the right answer and vary a parameter to get to the
>>>>>> case
>>>>>> where one fails? Then you can look at efficiency for a given accuracy
>>>>>> (and you might have to refine the grid differently) as you vary the
>>>>>> parameter.
>>>>>>
>>>>>> It's really hard to demonstrate that an implicit solver is optimal in
>>>>>> terms of mathematical convergence rate. Improvements there can dwarf
>>>>>> any differences in implementation efficiency.
>>>>>>
>>>>>> > That is, how much of the wall-clock time reported by these two very
>>>>>> > different solvers is spent doing useful work.
>>>>>> >
>>>>>> > Is such an "efficiency" metric necessary to report in addition to
>>>>>> > strong-scaling results? The overall computational framework is the same
>>>>>> > for both problems, the only difference being one uses a linear solver
>>>>>> > and the other uses an optimization solver. My first thought was to use
>>>>>> > PAPI to include hardware counters, but these are notoriously
>>>>>> > inaccurate. Then I thought about simply reporting the manual FLOPS and
>>>>>> > FLOPS/s via PETSc, but these metrics ignore memory bandwidth. And so
>>>>>> > here I am looking at the idea of implementing the Roofline model, but
>>>>>> > now I am wondering if any of this is worth the trouble.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which their
>>>> experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>
>>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener