[petsc-users] Obtaining bytes per second

Matthew Knepley knepley at gmail.com
Thu May 7 08:37:57 CDT 2015


On Thu, May 7, 2015 at 8:26 AM, Justin Chang <jychang48 at gmail.com> wrote:

> Matt, again thank you very much for all your help; last question(s) though.
>
> I am not quite sure I follow your second comment. Do you mean that this
> whole analysis should be done on one processor (in sequential mode)? Or do
> you mean that calculating the TBT for these vec and matmult ops should
> assume the local size of the vector/matrix? When I said MPI_Allreduce, I
> meant collecting everything after calculating the local TBT for each
> processor. Most of these vec ops' calculations are local, so it shouldn't be
> too much trouble, but for things like matmult and vecnorm, would it be
> quite convoluted because of interprocessor communication?
>

I meant that I would do the analysis for one process and assume that it
generalizes smoothly.
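
For the streaming Vec kernels the per-call byte counts are easy to write
down (assuming 8-byte scalars and no cache reuse between calls). A minimal
sketch of the single-process tally, using the call counts from your
-log_summary output below (the local size n = 225 here is just a
placeholder), might look like this:

  #include <petscsys.h>

  /* Estimated bytes moved per call for a streaming Vec kernel that reads
     'reads' vectors and writes 'writes' vectors of local length n. These
     are back-of-the-envelope estimates, not numbers PETSc reports. */
  static PetscLogDouble VecBytes(PetscInt n, PetscInt reads, PetscInt writes)
  {
    return (PetscLogDouble)(reads + writes)*n*sizeof(PetscScalar);
  }

  int main(int argc, char **argv)
  {
    PetscInt       n     = 225;  /* local vector size (placeholder) */
    PetscLogDouble bytes = 0.0;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
    /* Call counts taken from the -log_summary output in this thread */
    bytes += 60*VecBytes(n, 2, 0);  /* VecTDot:          read x, y          */
    bytes += 31*VecBytes(n, 1, 0);  /* VecNorm:          read x             */
    bytes +=  3*VecBytes(n, 1, 1);  /* VecCopy:          read x, write y    */
    bytes += 36*VecBytes(n, 0, 1);  /* VecSet:           write x            */
    bytes += 60*VecBytes(n, 2, 1);  /* VecAXPY:          read x, y, write y */
    bytes += 29*VecBytes(n, 2, 1);  /* VecAYPX:          read x, y, write y */
    bytes +=  1*VecBytes(n, 2, 1);  /* VecWAXPY:         read x, y, write w */
    bytes += 31*VecBytes(n, 2, 1);  /* VecPointwiseMult: read x, y, write w */
    ierr = PetscPrintf(PETSC_COMM_SELF, "Estimated Vec traffic: %g bytes\n", bytes);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }

The MatMult term would be added on top of this from the local nonzero count
of the matrix.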

  Matt


> Thanks,
> Justin
>
> On Wed, May 6, 2015 at 2:34 PM, Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Wed, May 6, 2015 at 2:28 PM, Justin Chang <jychang48 at gmail.com> wrote:
>>
>>> So basically I just need these types of operations:
>>>
>>> VecTDot               60 1.0 1.6928e-05 1.0 2.69e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0 14  0  0  0   0 14  0  0  0  1591
>>> VecNorm               31 1.0 9.0599e-06 1.0 1.39e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  7  0  0  0   0  7  0  0  0  1536
>>> VecCopy                3 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecSet                36 1.0 9.4175e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> VecAXPY               60 1.0 1.7166e-05 1.0 2.70e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0 14  0  0  0   0 14  0  0  0  1573
>>> VecAYPX               29 1.0 1.9312e-05 1.0 1.30e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  7  0  0  0   0  7  0  0  0   676
>>> VecWAXPY               1 1.0 1.9073e-06 1.0 2.25e+02 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   118
>>> VecPointwiseMult      31 1.0 1.8358e-05 1.0 6.98e+03 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0   380
>>> MatMult               30 1.0 7.5340e-05 1.0 8.07e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0 42  0  0  0   0 42  0  0  0  1071
>>>
>>> Given the matrix size, the number of nonzeros, the vector size of my
>>> solution, and the number of calls to each of these vec ops, I can estimate
>>> the total bytes transferred (loads/stores)?
>>>
>>
>> Yes, exactly.
>>
>>
>>> For multiple processors, do I calculate the local or global sizes? If I
>>> calculate the local sizes, then do I need to do an MPI_Allreduce (with
>>> mpi_sum) of the TBT across all processors just like with total flops?
>>>
>>
>> Just do this analysis locally, I think. Dealing with reductions adds
>> another level of complexity.
>>
>>
>>> Do these vec/mat ops account for Dirichlet constraints? That is, should
>>> the global/local size include those constraints?
>>>
>>
>> Hopefully the Dirichlet constraints are not significant. Certainly in the
>> limit of large N they drop out. The solver operations
>> are in the global space, and the assembly operations are in the local
>> space.
>>
>>
>>> Also, is there a way to extract the event count for these operations
>>> besides dumping -log_summary each time?
>>>
>>
>> Yes, using the PetscStageLog and PetscEventLog interfaces. Here is an
>> example from dtfe.c:
>>
>>   PetscStageLog     stageLog;
>>   PetscEventPerfLog eventLog = NULL;
>>   PetscInt          stage;
>>   PetscErrorCode    ierr;
>>
>>   PetscFunctionBegin;
>>   ierr = PetscLogGetStageLog(&stageLog);CHKERRQ(ierr);
>>   ierr = PetscStageLogGetCurrent(stageLog, &stage);CHKERRQ(ierr);
>>   ierr = PetscStageLogGetEventPerfLog(stageLog, stage, &eventLog);CHKERRQ(ierr);
>>   /* Log performance info */
>>   eventLog->eventInfo[ocl->residualEvent].count++;
>>   eventLog->eventInfo[ocl->residualEvent].time  += time;
>>   eventLog->eventInfo[ocl->residualEvent].flops += flops;
>>
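>> Depending on your PETSc version there should also be a public accessor,
>> PetscLogEventGetPerfInfo(), which avoids reaching into the log internals.
>> A minimal sketch (the lookup of the built-in MatMult event by name is just
>> an example, and stage 0 is the main stage):
>>
>>   PetscLogEvent      matmultEvent;
>>   PetscEventPerfInfo perfInfo;
>>   PetscErrorCode     ierr;
>>
>>   /* Look up the built-in MatMult event by name, then read its counters */
>>   ierr = PetscLogEventGetId("MatMult", &matmultEvent);CHKERRQ(ierr);
>>   ierr = PetscLogEventGetPerfInfo(0, matmultEvent, &perfInfo);CHKERRQ(ierr);
>>   ierr = PetscPrintf(PETSC_COMM_SELF, "MatMult: count %d, flops %g, time %g\n",
>>                      perfInfo.count, (double)perfInfo.flops, (double)perfInfo.time);CHKERRQ(ierr);
>>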
>>
>>   Thanks,
>>
>>      Matt
>>
>>
>>> Thanks,
>>> Justin
>>>
>>> On Wed, May 6, 2015 at 1:38 PM, Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>>
>>>> On Wed, May 6, 2015 at 1:28 PM, Justin Chang <jychang48 at gmail.com>
>>>> wrote:
>>>>
>>>>> Jed,
>>>>>
>>>>> I am working with anisotropic diffusion, and most standard numerical
>>>>> formulations (e.g., FEM, FVM, etc.) are "wrong" because they violate the
>>>>> discrete maximum principle; see Nakshatrala & Valocchi (JCP 2009) for more
>>>>> on this. What we have seen people do is simply "ignore" or chop off these
>>>>> values, but to us that is a complete and utter abomination. My goal here is
>>>>> to show that our proposed methodologies work by leveraging the capabilities
>>>>> within PETSc and TAO, and also to show how computationally expensive it is
>>>>> compared to solving the same problem using the standard Galerkin method.
>>>>>
>>>>> Matt,
>>>>>
>>>>> Okay, so then I guess I still have questions regarding how to obtain the
>>>>> bytes. How exactly would I count the number of Vecs and their respective
>>>>> sizes, given that it seems all the DMPlex-related functions create many
>>>>> vectors? Or do I only count the DM-created vectors used for my solution
>>>>> vector, residual, lower/upper bound, optimization routines, etc.?
>>>>>
>>>>
>>>> This is laborious. You would build up from the small stuff. So a Krylov
>>>> solver has MatMult, for which there is an analysis in the paper with
>>>> Dinesh/Bill/Barry/David, and Vec ops, which are easy. This is a lot of
>>>> counting, especially if you have a TAO solver in there. I would make sure
>>>> you really care.
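>>>>
>>>> For MatMult with an AIJ (CSR) matrix, a rough per-call traffic estimate in
>>>> the spirit of that analysis (8-byte scalars, 4-byte PetscInt by default,
>>>> and assuming perfect cache reuse of the input vector) is sketched below;
>>>> m, n, and nz are the local row count, column count, and nonzero count:
>>>>
>>>>   /* Optimistic AIJ MatMult traffic per call: stream the values, column
>>>>      indices, and row offsets once, read x once, and write y once. */
>>>>   static PetscLogDouble MatMultBytesAIJ(PetscInt m, PetscInt n, PetscInt nz)
>>>>   {
>>>>     return (PetscLogDouble)nz*(sizeof(PetscScalar) + sizeof(PetscInt))
>>>>          + (PetscLogDouble)(m + 1)*sizeof(PetscInt)
>>>>          + (PetscLogDouble)(n + m)*sizeof(PetscScalar);
>>>>   }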
>>>>
>>>>
>>>>> And when you say "forget about small stuff", does that include all the
>>>>> DMPlex creation routines, PetscMalloc'ed arrays, pointwise functions, and
>>>>> all the jazz that goes on within the FE/discretization routines?
>>>>>
>>>>
>>>> Yep.
>>>>
>>>>
>>>>> Lastly, for a Matrix, wouldn't I just get the number of bytes from the
>>>>> memory usage section in -log_summary?
>>>>>
>>>>
>>>> That is a good way. You can also ask MatInfo how many nonzeros the
>>>> matrix has.
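>>>>
>>>> A minimal sketch of that query (A here stands for your assembled matrix):
>>>>
>>>>   MatInfo        info;
>>>>   PetscErrorCode ierr;
>>>>
>>>>   /* Local nonzero counts for the assembled matrix A */
>>>>   ierr = MatGetInfo(A, MAT_LOCAL, &info);CHKERRQ(ierr);
>>>>   ierr = PetscPrintf(PETSC_COMM_SELF, "nonzeros used %g (allocated %g)\n",
>>>>                      info.nz_used, info.nz_allocated);CHKERRQ(ierr);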
>>>>
>>>>    Matt
>>>>
>>>>
>>>>> Thanks,
>>>>> Justin
>>>>>
>>>>> On Wed, May 6, 2015 at 11:48 AM, Matthew Knepley <knepley at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On Wed, May 6, 2015 at 11:41 AM, Justin Chang <jychang48 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I suppose I have two objectives that I think are achievable within
>>>>>>> PETSc means:
>>>>>>>
>>>>>>> 1) How much wall-clock time can be reduced as you increase the
>>>>>>> number of processors. I have strong-scaling and parallel efficiency metrics
>>>>>>> that convey this.
>>>>>>>
>>>>>>> 2) What the "optimal" problem size for these two methods/solvers is.
>>>>>>> What I mean by this is: at what point do I achieve the maximum FLOPS/s?
>>>>>>> If I start off with a really small problem, then this metric should
>>>>>>> increase with problem size. My hypothesis is that as problem size increases,
>>>>>>> the ratio of wall-clock time spent idle (e.g., waiting for cache to free up,
>>>>>>> accessing main memory, etc.) to time spent performing work also increases,
>>>>>>> and the reported FLOPS/s should start decreasing at some point. "Efficiency"
>>>>>>> in this context simply means the highest possible FLOPS/s.
>>>>>>>
>>>>>>> Does that make sense and/or is "interesting" enough?
>>>>>>>
>>>>>>
>>>>>> I think 2) is not really that interesting because
>>>>>>
>>>>>>   a) it is so easily gamed. Just stick in high flop count operations,
>>>>>> like DGEMM.
>>>>>>
>>>>>>   b) Time really matters to people who run the code, but flops never
>>>>>> do.
>>>>>>
>>>>>>   c) Floating-point performance is not your limiting factor for time.
>>>>>>
>>>>>> I think it would be much more interesting, and no more work to
>>>>>>
>>>>>>   a) Model the flop/byte \beta ratio simply
>>>>>>
>>>>>>   b) Report how close you get to the max performance given \beta on
>>>>>> your machine
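>>>>>>
>>>>>> As a rough sketch (peak flop rate and STREAM-like bandwidth are machine
>>>>>> numbers you would measure or look up), the attainable rate for a kernel
>>>>>> with flop/byte ratio \beta is just:
>>>>>>
>>>>>>   /* Roofline-style bound: a kernel with arithmetic intensity beta
>>>>>>      (flops per byte) is limited by the smaller of the machine's peak
>>>>>>      flop rate and beta times its measured memory bandwidth. */
>>>>>>   static PetscReal AttainableFlops(PetscReal peakFlops, PetscReal bandwidth, PetscReal beta)
>>>>>>   {
>>>>>>     return PetscMin(peakFlops, beta*bandwidth);
>>>>>>   }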
>>>>>>
>>>>>>   Thanks,
>>>>>>
>>>>>>      Matt
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>> Justin
>>>>>>>
>>>>>>> On Wed, May 6, 2015 at 11:28 AM, Jed Brown <jed at jedbrown.org> wrote:
>>>>>>>
>>>>>>>> Justin Chang <jychang48 at gmail.com> writes:
>>>>>>>> > I already have speedup/strong scaling results that essentially depict the
>>>>>>>> > difference between the KSPSolve() and TaoSolve(). However, I have been told
>>>>>>>> > by someone that strong-scaling isn't enough - that I should somehow include
>>>>>>>> > something to show the "efficiency" of these two methodologies.
>>>>>>>>
>>>>>>>> "Efficiency" is irrelevant if one is wrong.  Can you set up a
>>>>>>>> problem
>>>>>>>> where both get the right answer and vary a parameter to get to the
>>>>>>>> case
>>>>>>>> where one fails?  Then you can look at efficiency for a given
>>>>>>>> accuracy
>>>>>>>> (and you might have to refine the grid differently) as you vary the
>>>>>>>> parameter.
>>>>>>>>
>>>>>>>> It's really hard to demonstrate that an implicit solver is optimal
>>>>>>>> in
>>>>>>>> terms of mathematical convergence rate.  Improvements there can
>>>>>>>> dwarf
>>>>>>>> any differences in implementation efficiency.
>>>>>>>>
>>>>>>>> > That is, how much of the wall-clock time reported by these two very
>>>>>>>> > different solvers is spent doing useful work.
>>>>>>>> >
>>>>>>>> > Is such an "efficiency" metric necessary to report in addition to
>>>>>>>> > strong-scaling results? The overall computational framework is the same for
>>>>>>>> > both problems, the only difference being one uses a linear solver and the
>>>>>>>> > other uses an optimization solver. My first thought was to use PAPI to
>>>>>>>> > include hardware counters, but these are notoriously inaccurate. Then I
>>>>>>>> > thought about simply reporting the manual FLOPS and FLOPS/s via PETSc, but
>>>>>>>> > these metrics ignore memory bandwidth. And so here I am looking at the idea
>>>>>>>> > of implementing the Roofline model, but now I am wondering if any of this
>>>>>>>> > is worth the trouble.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener