[petsc-users] Obtaining bytes per second
Matthew Knepley
knepley at gmail.com
Wed May 6 14:34:13 CDT 2015
On Wed, May 6, 2015 at 2:28 PM, Justin Chang <jychang48 at gmail.com> wrote:
> So basically I just need these types of operations:
>
> VecTDot            60 1.0 1.6928e-05 1.0 2.69e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0 14  0  0  0   0 14  0  0  0  1591
> VecNorm            31 1.0 9.0599e-06 1.0 1.39e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  7  0  0  0   0  7  0  0  0  1536
> VecCopy             3 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet             36 1.0 9.4175e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPY            60 1.0 1.7166e-05 1.0 2.70e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0 14  0  0  0   0 14  0  0  0  1573
> VecAYPX            29 1.0 1.9312e-05 1.0 1.30e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  7  0  0  0   0  7  0  0  0   676
> VecWAXPY            1 1.0 1.9073e-06 1.0 2.25e+02 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   118
> VecPointwiseMult   31 1.0 1.8358e-05 1.0 6.98e+03 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   380
> MatMult            30 1.0 7.5340e-05 1.0 8.07e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0 42  0  0  0   0 42  0  0  0  1071
>
> Given the matrix size, the number of nonzeros, the vector size of my
> solution, and the number of calls to each of these Vec ops, can I estimate
> the total bytes transferred (loads/stores)?
>
Yes, exactly.
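As a rough illustration of that counting, here is a sketch (not taken from your
run) that assumes 8-byte scalars, 4-byte column indices for an AIJ matrix, an
input vector that stays in cache during MatMult, and placeholder local sizes
(N, m, nnz) that you would replace with your own; the call counts are the ones
from the -log_summary output above:

  #include <stdio.h>

  int main(void)
  {
    const double bs = 8.0, bi = 4.0; /* assumed bytes per scalar and per index */
    double N   = 1000.0;             /* placeholder: local vector length       */
    double m   = 1000.0;             /* placeholder: local matrix rows         */
    double nnz = 5000.0;             /* placeholder: local nonzeros            */

    /* Streaming vector kernels: bytes = (vectors read + vectors written) * N * bs */
    double tdot = 60 * (2.0 * N * bs);  /* VecTDot: read x and y                   */
    double norm = 31 * (1.0 * N * bs);  /* VecNorm: read x                         */
    double axpy = 60 * (3.0 * N * bs);  /* VecAXPY: read x, read and write y       */
    double aypx = 29 * (3.0 * N * bs);  /* VecAYPX: same traffic as AXPY           */
    double pmul = 31 * (3.0 * N * bs);  /* VecPointwiseMult: read x and y, write w */

    /* AIJ MatMult: matrix values + column indices + row offsets, plus x in, y out */
    double mult = 30 * (nnz * (bs + bi) + (m + 1.0) * bi + (N + m) * bs);

    printf("estimated bytes transferred: %g\n", tdot + norm + axpy + aypx + pmul + mult);
    return 0;
  }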
> For multiple processors, do I calculate the local or global sizes? If I
> calculate the local sizes, do I then need to do an MPI_Allreduce (with
> MPI_SUM) of the total bytes transferred across all processors, just like
> with total flops?
>
I would just do this analysis locally. Dealing with reductions adds another
level of complexity.
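(Should you later want a single global number anyway, a minimal sketch of the
reduction might look like the following; x and localBytes are hypothetical
names for your solution vector and your per-process byte total:)

  PetscInt n, N;
  double   localBytes = 0.0, globalBytes = 0.0;

  ierr = VecGetLocalSize(x, &n);CHKERRQ(ierr);  /* local size: use this for the per-process count */
  ierr = VecGetSize(x, &N);CHKERRQ(ierr);       /* global size, for reference                     */
  /* ... accumulate localBytes from the per-call byte estimates ... */
  ierr = MPI_Allreduce(&localBytes, &globalBytes, 1, MPI_DOUBLE, MPI_SUM, PETSC_COMM_WORLD);CHKERRQ(ierr);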
> Do these vec/mat ops account for Dirichlet constraints? That is, should
> the global/local size include those constraints?
>
Hopefully the Dirichlet constraints are not significant. Certainly in the
limit of large N they drop out. The solver operations
are in the global space, and the assembly operations are in the local space.
> Also, is there a way to extract the event count for these operations
> besides dumping -log_summary each time?
>
Yes, using the PetscStageLog and PetscEventLog interfaces. Here is an
example from dtfe.c:
  PetscStageLog     stageLog;
  PetscEventPerfLog eventLog = NULL;
  PetscInt          stage;
  PetscErrorCode    ierr;

  PetscFunctionBegin;
  /* Grab the performance log for the currently active stage */
  ierr = PetscLogGetStageLog(&stageLog);CHKERRQ(ierr);
  ierr = PetscStageLogGetCurrent(stageLog, &stage);CHKERRQ(ierr);
  ierr = PetscStageLogGetEventPerfLog(stageLog, stage, &eventLog);CHKERRQ(ierr);
  /* Log performance info; residualEvent, time, and flops come from the surrounding dtfe.c code */
  eventLog->eventInfo[ocl->residualEvent].count++;
  eventLog->eventInfo[ocl->residualEvent].time  += time;
  eventLog->eventInfo[ocl->residualEvent].flops += flops;
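Going the other way, if you just want to read back the accumulated
count/time/flops for one of the built-in events rather than logging your own,
a minimal sketch using the same interfaces follows; MAT_Mult is (I believe)
the predefined MatMult logging event from the PETSc logging headers, and the
snippet assumes logging is active:

  PetscStageLog      stageLog;
  PetscEventPerfLog  eventLog = NULL;
  PetscEventPerfInfo matInfo;
  PetscInt           stage;
  PetscErrorCode     ierr;

  ierr    = PetscLogGetStageLog(&stageLog);CHKERRQ(ierr);
  ierr    = PetscStageLogGetCurrent(stageLog, &stage);CHKERRQ(ierr);
  ierr    = PetscStageLogGetEventPerfLog(stageLog, stage, &eventLog);CHKERRQ(ierr);
  matInfo = eventLog->eventInfo[MAT_Mult];  /* accumulated count, time, flops for MatMult */
  ierr    = PetscPrintf(PETSC_COMM_WORLD, "MatMult: %d calls, %g s, %g flops\n",
                        matInfo.count, matInfo.time, matInfo.flops);CHKERRQ(ierr);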
Thanks,
Matt
> Thanks,
> Justin
>
> On Wed, May 6, 2015 at 1:38 PM, Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Wed, May 6, 2015 at 1:28 PM, Justin Chang <jychang48 at gmail.com> wrote:
>>
>>> Jed,
>>>
>>> I am working with anisotropic diffusion, and most standard numerical
>>> formulations (e.g., FEM, FVM) are "wrong" because they violate the
>>> discrete maximum principle; see Nakshatrala & Valocchi (JCP 2009) for more
>>> on this. What we have seen people do is simply "ignore" or chop off these
>>> values, but to us that is a complete and utter abomination. My goal here is
>>> to show that our proposed methodologies work by leveraging the
>>> capabilities within PETSc and TAO, and also to show how computationally
>>> expensive they are compared to solving the same problem using the standard
>>> Galerkin method.
>>>
>>> Matt,
>>>
>>> Okay, so then I guess I still have questions regarding how to obtain the
>>> byte counts. How exactly would I count all the Vecs and their respective
>>> sizes, given that the DMPlex-related functions seem to create many
>>> vectors? Or do I only count the DM-created vectors used for my solution
>>> vector, residual, lower/upper bounds, optimization routines, etc.?
>>>
>>
>> This is laborious. You would build up from the small stuff. So a Krylov
>> solver has MatMult, for which there is an analysis in the paper with
>> Dinesh/Bill/Barry/David, and Vec ops, which are easy. This is a lot of
>> counting, especially if you have a TAO solver in there. I would make sure
>> you really care.
>>
>>
>>> And when you say "forget about small stuff", does that include all the
>>> DMPlex creation routines, PetscMalloc'ed arrays, pointwise functions, and
>>> all the jazz that goes on within the FE/discretization routines?
>>>
>>
>> Yep.
>>
>>
>>> Lastly, for a Matrix, wouldn't I just get the number of bytes from the
>>> memory usage section in -log_summary?
>>>
>>
>> That is a good way. You can also ask MatInfo how many nonzeros the matrix
>> has.
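>> (A minimal sketch of that query, assuming an assembled matrix A:)
>>
>>   MatInfo info;
>>
>>   ierr = MatGetInfo(A, MAT_LOCAL, &info);CHKERRQ(ierr);  /* or MAT_GLOBAL_SUM across ranks */
>>   ierr = PetscPrintf(PETSC_COMM_SELF, "local nonzeros used: %g\n", info.nz_used);CHKERRQ(ierr);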
>>
>> Matt
>>
>>
>>> Thanks,
>>> Justin
>>>
>>> On Wed, May 6, 2015 at 11:48 AM, Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>>
>>>> On Wed, May 6, 2015 at 11:41 AM, Justin Chang <jychang48 at gmail.com>
>>>> wrote:
>>>>
>>>>> I suppose I have two objectives that I think are achievable within
>>>>> PETSc means:
>>>>>
>>>>> 1) How much wall-clock time can be reduced as you increase the number
>>>>> of processors. I have strong-scaling and parallel efficiency metrics that
>>>>> convey this.
>>>>>
>>>>> 2) What the "optimal" problem size for these two methods/solvers is. What
>>>>> I mean by this is: at what point do I achieve the maximum FLOPS/s?
>>>>> Starting from a really small problem, this metric should increase with
>>>>> problem size. My hypothesis is that as the problem size increases, the
>>>>> ratio of wall-clock time spent idle (e.g., waiting for cache to free up,
>>>>> accessing main memory, etc.) to time spent performing work also increases,
>>>>> and the reported FLOPS/s should start decreasing at some point.
>>>>> "Efficiency" in this context simply means the highest possible FLOPS/s.
>>>>>
>>>>> Does that make sense and/or is "interesting" enough?
>>>>>
>>>>
>>>> I think 2) is not really that interesting because
>>>>
>>>> a) it is so easily gamed. Just stick in high flop count operations,
>>>> like DGEMM.
>>>>
>>>> b) Time really matters to people who run the code, but flops never do.
>>>>
>>>> c) Floating point performance is not your limiting factor for time
>>>>
>>>> I think it would be much more interesting, and no more work to
>>>>
>>>> a) Model the flop/byte \beta ratio simply
>>>>
>>>> b) Report how close you get to the max performance given \beta on your
>>>> machine (a rough sketch of that bound follows below)
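>>>>
>>>> A rough sketch of that roofline-style bound, with placeholder machine
>>>> numbers (the peak flop rate and memory bandwidth are assumptions you would
>>>> measure for your own machine, e.g. with STREAM):
>>>>
>>>>   #include <stdio.h>
>>>>
>>>>   int main(void)
>>>>   {
>>>>     double peak = 10.0e9;   /* placeholder: per-core peak flop rate, flop/s     */
>>>>     double bw   = 5.0e9;    /* placeholder: measured memory bandwidth, bytes/s  */
>>>>     double beta = 2.0/12.0; /* flop/byte ratio of the kernel, e.g. roughly
>>>>                                2 flops per 12 bytes for an AIJ MatMult          */
>>>>
>>>>     /* You cannot exceed either the flop peak or beta times the bandwidth */
>>>>     double attainable = (peak < beta * bw) ? peak : beta * bw;
>>>>
>>>>     printf("attainable flop rate: %g flop/s\n", attainable);
>>>>     return 0;
>>>>   }
>>>>
>>>> The measured flop rate from -log_summary divided by this bound is the
>>>> "how close" number to report.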
>>>>
>>>> Thanks,
>>>>
>>>> Matt
>>>>
>>>>
>>>>> Thanks,
>>>>> Justin
>>>>>
>>>>> On Wed, May 6, 2015 at 11:28 AM, Jed Brown <jed at jedbrown.org> wrote:
>>>>>
>>>>>> Justin Chang <jychang48 at gmail.com> writes:
>>>>>> > I already have speedup/strong scaling results that essentially depict
>>>>>> > the difference between the KSPSolve() and TaoSolve(). However, I have
>>>>>> > been told by someone that strong-scaling isn't enough - that I should
>>>>>> > somehow include something to show the "efficiency" of these two
>>>>>> > methodologies.
>>>>>>
>>>>>> "Efficiency" is irrelevant if one is wrong. Can you set up a problem
>>>>>> where both get the right answer and vary a parameter to get to the
>>>>>> case
>>>>>> where one fails? Then you can look at efficiency for a given accuracy
>>>>>> (and you might have to refine the grid differently) as you vary the
>>>>>> parameter.
>>>>>>
>>>>>> It's really hard to demonstrate that an implicit solver is optimal in
>>>>>> terms of mathematical convergence rate. Improvements there can dwarf
>>>>>> any differences in implementation efficiency.
>>>>>>
>>>>>> > That is, how much of the wall-clock time reported by these two very
>>>>>> > different solvers is spent doing useful work.
>>>>>> >
>>>>>> > Is such an "efficiency" metric necessary to report in addition to
>>>>>> > strong-scaling results? The overall computational framework is the same
>>>>>> > for both problems, the only difference being one uses a linear solver
>>>>>> > and the other uses an optimization solver. My first thought was to use
>>>>>> > PAPI to include hardware counters, but these are notoriously
>>>>>> > inaccurate. Then I thought about simply reporting the manual FLOPS and
>>>>>> > FLOPS/s via PETSc, but these metrics ignore memory bandwidth. And so
>>>>>> > here I am looking at the idea of implementing the Roofline model, but
>>>>>> > now I am wondering if any of this is worth the trouble.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which their
>>>> experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>
>>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener