[petsc-dev] Hardware counter logging in PETSc (was Re: Where next with PETSc and KNL?)

Matthew Knepley knepley at gmail.com
Wed Sep 28 20:05:33 CDT 2016

On Wed, Sep 28, 2016 at 7:39 PM, Justin Chang <jychang48 at gmail.com> wrote:

> The so-called "perfect cache" model in my paper
> <http://link.springer.com/article/10.1007/s10915-016-0250-5>, which was
> counted by hand, only works for solvers and preconditioners which rely on
> sparse matrix-vector multiply. I reluctantly used ILU in the first part of
> the paper for that very reason. It can certainly be improved by looking at
> distance in memory between references, but I don't know of any such model
> for sparse matrix-matrix multiplication, which is important for the better
> preconditioners.
> I also like the idea of having a model which counts these metrics by hand
> since it's easily portable, but unless someone has some robust algorithm or
> technique for counting the total bytes transferred for MatMatMult()
> operations and its relatives, this performance model won't really be that
> useful.

I am not sure I buy that there is significant cache reuse in sparse MatMat,
but I would love to see something proving me wrong.
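
For concreteness, the kind of hand count being discussed for sparse
matrix-vector multiply can be sketched as below. This is only a sketch under
assumptions of my own, not necessarily the exact accounting used in the paper:
CSR storage, 8-byte scalars, 4-byte indices, and every datum moved from memory
exactly once. The names SpmvCost and spmv_perfect_cache_cost are made up for
illustration and are not PETSc API.

```c
#include <stddef.h>

/* Hypothetical helper type (not PETSc API): estimated cost of one y = A*x. */
typedef struct {
  double flops; /* floating-point operations */
  double bytes; /* bytes moved to/from memory under the perfect-cache assumption */
} SpmvCost;

/* Perfect-cache estimate for CSR SpMV on an m x n matrix with nnz nonzeros.
   Assumes 8-byte scalars and 4-byte indices; each array is counted once. */
static SpmvCost spmv_perfect_cache_cost(size_t m, size_t n, size_t nnz)
{
  const double scalar_bytes = 8.0; /* assumed sizeof(double) */
  const double index_bytes  = 4.0; /* assumed 32-bit indices */
  SpmvCost c;

  /* One multiply and one add per stored nonzero. */
  c.flops = 2.0 * (double)nnz;

  c.bytes = scalar_bytes * (double)nnz      /* matrix values        */
          + index_bytes  * (double)nnz      /* column indices       */
          + index_bytes  * (double)(m + 1)  /* row pointers         */
          + scalar_bytes * (double)n        /* read x exactly once  */
          + scalar_bytes * (double)m;       /* write y exactly once */
  return c;
}
```

For a 1000 x 1000 matrix with 5 nonzeros per row,
spmv_perfect_cache_cost(1000, 1000, 5000) gives 10000 flops against 80004
bytes, or about 0.125 flops/byte. As far as I can tell, nothing this simple
carries over to MatMatMult(), since the traffic there depends on the sparsity
of the product and on how much of the second matrix gets reused.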


> On Wed, Sep 28, 2016 at 7:16 PM, Matthew Knepley <knepley at gmail.com>
> wrote:
>> On Wed, Sep 28, 2016 at 7:02 PM, Richard Mills <richardtmills at gmail.com>
>> wrote:
>>> Hi Barry,
>>> Thanks for starting a thread about this on petsc-dev; I was planning to
>>> do so but still hadn't gotten to it.
>>> We can certainly get the performance data we need from various
>>> performance analysis tools, and for some kinds of data, those are the best
>>> way to try to get it.  The reasons I'd like to add some PETSc logging
>>> support for collecting hardware data are primarily
>>> 1) Many of the tools are rather "heavy weight" or otherwise cumbersome
>>> to use.  I've always loved how lightweight the PETSc logging framework is,
>>> and have always preferred to use that for performance tuning work until I
>>> get down to a level that requires the use of independent tools.  I also
>>> like working with the text reports that I get from PETSc.  Some performance
>>> tools do a decent job generating text reports, but many require me to fire
>>> up an annoying GUI to do even trivial tasks.  This is especially annoying
>>> when the data I want to work with are on some supercomputer to which I have
>>> a slow Internet connection.
>>> 2) External performance analysis tools know nothing of things like PETSc
>>> logging stages or events.  If I am using a tool like VTune to analyze
>>> something like a flow and reactive transport problem in PFLOTRAN, VTune
>>> doesn't know that I want to consider calls to SNESSolve() and children in
>>> the flow stage separately from those made in the transport stage.  Many
>>> tools provide ways to identify things like this, but it generally requires
>>> instrumenting the code by hand using a proprietary API.  Furthermore, most
>>> of these APIs don't have a sort of push/pop mechanism like we have for
>>> PETSc stages.  I really don't want to have to instrument my code for each
>>> tool that I might want to use, especially since I've already gone to the
>>> trouble of defining various stages/events with PETSc -- I'd like to just
>>> use those!
>> I have a 3rd. There are some things that perf tools are just shit at
>> counting, and they need to be counted by hand. There are excellent
>> blog posts by McCalpin on these caveats, which arise due to the
>> complexity of modern processors. Justin had all of these problems
>> in his paper about the performance of convex programming solvers for
>> enforcing maximum principles.
>> I am not against using VTune and its ilk. I just think we should still be
>> counting flops and bytes transferred from memory by hand because
>> you cannot trust the counting.
>>   Matt
>>> Both of the above are important motivations, but I think (2) is my
>>> primary driver.  I'd be happier with many of the tools if they were aware
>>> of PETSc stages and events.
>>> --Richard
>>> On Wed, Sep 28, 2016 at 2:33 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>    Moving to petsc-dev so everyone can see this discussion.
>>>>     To get more detailed "performance" information on runs we have two
>>>> (not necessarily orthogonal) choices.
>>>>   1) use an integrated system that is independent of PETSc. These
>>>> sometimes require compiling with additional options and then running a
>>>> post-processor after the run. These systems then display the results in
>>>> some kind of GUI. Intel has such a thing, as does Apple. Do they allow
>>>> logging/display of things we care about such as cache misses, ....? Depends
>>>> on each system, and some of the systems are improving over time.
>>>>   2) add additional logging of values into the PETSc logging and then
>>>> have PetscLogView() process the raw logged values into useful information.
>>>>    Both approaches have advantages and disadvantages but we do take on
>>>> a large development and maintenance burden if we try to incorporate more
>>>> logging directly into PETSc. So what does incorporating into PETSc buy us
>>>> that is worth the extra hassle? That is, can we do something with the "in
>>>> PETSc" approach we could not achieve otherwise? (I don't think arguments
>>>> about it being more portable and not requiring you to buy vtune etc from
>>>> Intel are enough reason to do the work internally.)
>>>>    In other words, if I am interested in finding out why my MatMult() is
>>>> slower than I think it should be, is it such a terrible thing to have to
>>>> crank up vtune (or similar beast) to get details about the computational
>>>> phase I am interested in?
>>>>    Barry
>>>> You should be able to guess that I am leaning towards 1) and want to
>>>> know why that is a fatal mistake, if it is?
>>>> > On Sep 24, 2016, at 12:00 PM, Richard Tran Mills <richard.t.mills at intel.com> wrote:
>>>> >
>>>> > Hi Folks,
>>>> >
>>>> > I'm breaking up replies to my long email message into smaller chunks
>>>> > to make it easier to keep track of the discussion.  I'll just address
>>>> > the perf counter issue here.
>>>> >
>>>> > On 9/24/16 6:54 AM, Jed Brown wrote:
>>>> >>> 7) I still think we should add some support for collecting hardware
>>>> >>> counter information in the PETSc logging framework.  I see that the
>>>> >>> latest PAPI release adds some KNL support, though I don't know if it
>>>> >>> supports the uncore counters.  Anyhow, I should start a thread on
>>>> >>> petsc-dev about this...
>>>> >> There was some PAPI support once upon a time (before my time), but I
>>>> >> think Barry stripped it out because it's crappy software.  I haven't
>>>> >> seriously looked at using the linux performance counter interface
>>>> >> directly, but it would be less to install and not streaked with
>>>> >> Dongarra poo.
>>>> > An alternative that I came across is something written by some Intel
>>>> > folks, with the terribly generic name of "Intel Performance Counter
>>>> > Monitor".  The webpage for it is at
>>>> >
>>>> > https://software.intel.com/en-us/articles/intel-performance-counter-monitor
>>>> >
>>>> > It provides a simple C++ API (I wish there was a C one; we'd need to
>>>> > wrap things to keep from polluting the PETSc code with C++ stuff) that
>>>> > lets you capture essentially any of the PMU events.  This looks a lot
>>>> > nicer than PAPI in several ways, but has the downside of being
>>>> > Intel-specific.  I also don't see any KNL-specific counter support yet.
>>>> >
>>>> > --Richard
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
