[petsc-users] Obtaining bytes per second
Justin Chang
jychang48 at gmail.com
Thu May 7 08:26:39 CDT 2015
Matt again thank you very much for all your help, last question(s) though.
I am not quite sure I follow your second comment. Do you mean that this
whole analysis should be done on one processor (in sequential mode)? Or do
you mean that calculating the TBT for these vec and matmult ops should
assume the local size of the vector/matrix? When I said MPI_Allreduce, i
meant collecting everything after calculating the local TBT for each
processor. Most of these vec ops' calculations are local so it shouldn't be
too much trouble, but for things like matmult and vecnorm, would it be
quite convoluted because of interprocessor communication?
Thanks,
Justin
On Wed, May 6, 2015 at 2:34 PM, Matthew Knepley <knepley at gmail.com> wrote:
> On Wed, May 6, 2015 at 2:28 PM, Justin Chang <jychang48 at gmail.com> wrote:
>
>> So basically I just need these types of operations:
>>
>> VecTDot 60 1.0 1.6928e-05 1.0 2.69e+04 1.0 0.0e+00 0.0e+00
>> 0.0e+00 0 14 0 0 0 0 14 0 0 0 1591
>> VecNorm 31 1.0 9.0599e-06 1.0 1.39e+04 1.0 0.0e+00 0.0e+00
>> 0.0e+00 0 7 0 0 0 0 7 0 0 0 1536
>> VecCopy 3 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>> VecSet 36 1.0 9.4175e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>> VecAXPY 60 1.0 1.7166e-05 1.0 2.70e+04 1.0 0.0e+00 0.0e+00
>> 0.0e+00 0 14 0 0 0 0 14 0 0 0 1573
>> VecAYPX 29 1.0 1.9312e-05 1.0 1.30e+04 1.0 0.0e+00 0.0e+00
>> 0.0e+00 0 7 0 0 0 0 7 0 0 0 676
>> VecWAXPY 1 1.0 1.9073e-06 1.0 2.25e+02 1.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 118
>> VecPointwiseMult 31 1.0 1.8358e-05 1.0 6.98e+03 1.0 0.0e+00 0.0e+00
>> 0.0e+00 0 4 0 0 0 0 4 0 0 0 380
>> MatMult 30 1.0 7.5340e-05 1.0 8.07e+04 1.0 0.0e+00 0.0e+00
>> 0.0e+00 0 42 0 0 0 0 42 0 0 0 1071
>>
>> Given the matrix size and the number of nonzeros, vector size of my
>> solution, and the number of calls to each of these vec ops, I can estimate
>> the total bytes transferred (loads/stores)?
>>
>
> Yes, exactly.
>
>
>> For multiple processors, do I calculate the local or global sizes? If I
>> calculate the local sizes, then do I need to do an MPI_Allreduce (with
>> mpi_sum) of the TBT across all processors just like with total flops?
>>
>
> Just do this analysis locally I think. Dealing with reductions adds
> another level of complexity.
>
>
>> Do these vec/mat ops account for Dirichlet constraints? That is, should
>> the global/local size include those constraints?
>>
>
> Hopefully the Dirichlet constraints are not significant. Certainly in the
> limit of large N they drop out. The solver operations
> are in the global space, and the assembly operations are in the local
> space.
>
>
>> Also, is there a way to extract the event count for these operations
>> besides dumping -log_summary each time?
>>
>
> Yes, using the PetscStageLog and PetscEventLog interfaces. Here is an
> example from dtfe.c:
>
> PetscStageLog stageLog;
> PetscEventPerfLog eventLog = NULL;
> PetscInt stage;
> PetscErrorCode ierr;
>
> PetscFunctionBegin;
> ierr = PetscLogGetStageLog(&stageLog);CHKERRQ(ierr);
> ierr = PetscStageLogGetCurrent(stageLog, &stage);CHKERRQ(ierr);
> ierr = PetscStageLogGetEventPerfLog(stageLog, stage,
> &eventLog);CHKERRQ(ierr);
> /* Log performance info */
> eventLog->eventInfo[ocl->residualEvent].count++;
> eventLog->eventInfo[ocl->residualEvent].time += time;
> eventLog->eventInfo[ocl->residualEvent].flops += flops;
>
>
> Thanks,
>
> Matt
>
>
>> Thanks,
>> Justin
>>
>> On Wed, May 6, 2015 at 1:38 PM, Matthew Knepley <knepley at gmail.com>
>> wrote:
>>
>>> On Wed, May 6, 2015 at 1:28 PM, Justin Chang <jychang48 at gmail.com>
>>> wrote:
>>>
>>>> Jed,
>>>>
>>>> I am working with anisotropic diffusion and most standard numerical
>>>> formulations (e.g., FEM, FVM, etc.) are "wrong" because they violate the
>>>> discrete maximum principle, see Nakshatrala & Valocci (JCP 2009) for more
>>>> on this. What we have seen people do is simply "ignore" or chop off these
>>>> values but to us that is a complete and utter abomination. My goal here is
>>>> to show that our proposed methodologies work by leveraging on the
>>>> capabilities within PETSc and TAO and to also show how computationally
>>>> expensive it is compared to solving the same problem using the standard
>>>> Galerkin method.
>>>>
>>>> Matt,
>>>>
>>>> Okay, so then I guess I still have questions regarding how to obtain
>>>> the bytes. How exactly would I count all the number of Vecs and their
>>>> respective sizes, because it seems all the DMPlex related functions create
>>>> many vectors. Or do I only count the DM created vectors used for my
>>>> solution vector, residual, lower/upper bound, optimization routines, etc?
>>>>
>>>
>>> This is laborious. You would build up from the small stuff. So a Krylov
>>> solver have MatMult, for which there is an analysis in the paper with
>>> Dinesh/Bill/Barry/David, and
>>> Vec ops which are easy. This is a lot of counting, especially if you
>>> have a TAO solver in there. I would make sure you really care.
>>>
>>>
>>>> And when you say "forget about small stuff", does that include all the
>>>> DMPlex creation routines, PetscMalloc'ed arrays, pointwise functions, and
>>>> all the jazz that goes on within the FE/discretization routines?
>>>>
>>>
>>> Yep.
>>>
>>>
>>>> Lastly, for a Matrix, wouldn't I just get the number of bytes from the
>>>> memory usage section in -log_summary?
>>>>
>>>
>>> That is a good way. You can also ask MatInfo how many nonzeros the
>>> matrix has.
>>>
>>> Matt
>>>
>>>
>>>> Thanks,
>>>> Justin
>>>>
>>>> On Wed, May 6, 2015 at 11:48 AM, Matthew Knepley <knepley at gmail.com>
>>>> wrote:
>>>>
>>>>> On Wed, May 6, 2015 at 11:41 AM, Justin Chang <jychang48 at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I suppose I have two objectives that I think are achievable within
>>>>>> PETSc means:
>>>>>>
>>>>>> 1) How much wall-clock time can be reduced as you increase the number
>>>>>> of processors. I have strong-scaling and parallel efficiency metrics that
>>>>>> convey this.
>>>>>>
>>>>>> 2) The "optimal" problem size for these two methods/solvers are. What
>>>>>> I mean by this is, at what point do I achieve the maximum FLOPS/s. If
>>>>>> starting off with a really small problem then this metric should increase
>>>>>> with problem size. My hypothesis is that as problem size increases, the
>>>>>> ratio of wall-clock time spent in idle (e.g., waiting for cache to free up,
>>>>>> accessing main memory, etc) to performing work also increases, and the
>>>>>> reported FLOPS/s should start decreasing at some point. "Efficiency" in
>>>>>> this context simply means the highest possible FLOPS/s.
>>>>>>
>>>>>> Does that make sense and/or is "interesting" enough?
>>>>>>
>>>>>
>>>>> I think 2) is not really that interesting because
>>>>>
>>>>> a) it is so easily gamed. Just stick in high flop count operations,
>>>>> like DGEMM.
>>>>>
>>>>> b) Time really matters to people who run the code, but flops never
>>>>> do.
>>>>>
>>>>> c) Floating point performance is not your limiting factor for time
>>>>>
>>>>> I think it would be much more interesting, and no more work to
>>>>>
>>>>> a) Model the flop/byte \beta ratio simply
>>>>>
>>>>> b) Report how close you get to the max performance given \beta on
>>>>> your machine
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Matt
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Justin
>>>>>>
>>>>>> On Wed, May 6, 2015 at 11:28 AM, Jed Brown <jed at jedbrown.org> wrote:
>>>>>>
>>>>>>> Justin Chang <jychang48 at gmail.com> writes:
>>>>>>> > I already have speedup/strong scaling results that essentially
>>>>>>> depict the
>>>>>>> > difference between the KSPSolve() and TaoSolve(). However, I have
>>>>>>> been told
>>>>>>> > by someone that strong-scaling isn't enough - that I should
>>>>>>> somehow include
>>>>>>> > something to show the "efficiency" of these two methodologies.
>>>>>>>
>>>>>>> "Efficiency" is irrelevant if one is wrong. Can you set up a problem
>>>>>>> where both get the right answer and vary a parameter to get to the
>>>>>>> case
>>>>>>> where one fails? Then you can look at efficiency for a given
>>>>>>> accuracy
>>>>>>> (and you might have to refine the grid differently) as you vary the
>>>>>>> parameter.
>>>>>>>
>>>>>>> It's really hard to demonstrate that an implicit solver is optimal in
>>>>>>> terms of mathematical convergence rate. Improvements there can dwarf
>>>>>>> any differences in implementation efficiency.
>>>>>>>
>>>>>>> > That is, how much of the wall-clock time reported by these two very
>>>>>>> > different solvers is spent doing useful work.
>>>>>>> >
>>>>>>> > Is such an "efficiency" metric necessary to report in addition to
>>>>>>> > strong-scaling results? The overall computational framework is the
>>>>>>> same for
>>>>>>> > both problems, the only difference being one uses a linear solver
>>>>>>> and the
>>>>>>> > other uses an optimization solver. My first thought was to use
>>>>>>> PAPI to
>>>>>>> > include hardware counters, but these are notoriously inaccurate.
>>>>>>> Then I
>>>>>>> > thought about simply reporting the manual FLOPS and FLOPS/s via
>>>>>>> PETSc, but
>>>>>>> > these metrics ignore memory bandwidth. And so here I am looking at
>>>>>>> the idea
>>>>>>> > of implementing the Roofline model, but now I am wondering if any
>>>>>>> of this
>>>>>>> > is worth the trouble.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> What most experimenters take for granted before they begin their
>>>>> experiments is infinitely more interesting than any results to which their
>>>>> experiments lead.
>>>>> -- Norbert Wiener
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20150507/341b0a2f/attachment-0001.html>
More information about the petsc-users
mailing list