[petsc-dev] Hardware counter logging in PETSc (was Re: Where next with PETSc and KNL?)

Wed Sep 28 19:39:25 CDT 2016

The so-called "perfect cache" model in my paper
<http://link.springer.com/article/10.1007/s10915-016-0250-5>, which was
counted by hand, only works for solvers and preconditioners which rely on
sparse matrix-vector multiply. I reluctantly used ILU in the first part of
the paper for that very reason. It can certainly be improved by looking at
distance in memory between references, but I don't know of any such model
for sparse matrix-matrix multiplication, which is important for the better
preconditioners.

I also like the idea of having a model which counts these metrics by hand
since it's easily portable, but unless someone has some robust algorithm or
technique for counting the total bytes transferred for MatMatMult()
operations and its relatives, this performance model won't really be that
useful.
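
A minimal sketch of what such a hand count looks like for the one case where
it does work, sparse matrix-vector multiply. It assumes CSR storage, 8-byte
scalars, 4-byte indices, and that "perfect cache" means every matrix and
vector entry moves from memory exactly once. The function name and the
problem sizes are made up for illustration; they are not taken from the paper
or from PETSc.

```python
def spmv_perfect_cache(nrows, ncols, nnz, scalar_bytes=8, index_bytes=4):
    """Hand count of flops and memory traffic for y = A*x, A stored in CSR.

    Assumption (hypothetical model): each byte is transferred exactly once.
    """
    # One multiply and one add per stored nonzero.
    flops = 2 * nnz
    # Matrix traffic: values + column indices + row pointers.
    matrix_bytes = nnz * (scalar_bytes + index_bytes) + (nrows + 1) * index_bytes
    # Vector traffic: read x once, write y once.
    vector_bytes = ncols * scalar_bytes + nrows * scalar_bytes
    total_bytes = matrix_bytes + vector_bytes
    return flops, total_bytes, flops / total_bytes


# Illustrative sizes only: 1000 x 1000 matrix with 5 nonzeros per row.
flops, nbytes, intensity = spmv_perfect_cache(nrows=1000, ncols=1000, nnz=5000)
print(flops, nbytes, intensity)
```

The same closed form does not carry over to MatMatMult(), where the traffic
depends on the sparsity pattern of the product.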

On Wed, Sep 28, 2016 at 7:16 PM, Matthew Knepley <knepley at gmail.com> wrote:

> On Wed, Sep 28, 2016 at 7:02 PM, Richard Mills <richardtmills at gmail.com>
> wrote:
>
>> Hi Barry,
>>
>> Thanks for starting a thread about this on petsc-dev; I was planning to
>> do so but still hadn't gotten to it.
>>
>> We can certainly get the performance data we need from various
>> performance analysis tools, and for some kinds of data, those are the best
>> way to try to get it.  The reasons I'd like to add some PETSc logging
>> support for collecting hardware data are primarily
>>
>> 1) Many of the tools are rather "heavy weight" or otherwise cumbersome to
>> use.  I've always loved how lightweight the PETSc logging framework is, and
>> have always preferred to use that for performance tuning work until I get
>> down to a level that requires the use of independent tools.  I also like
>> working with the text reports that I get from PETSc.  Some performance
>> tools do a decent job generating text reports, but many require me to fire
>> up an annoying GUI to do even trivial tasks.  This is especially annoying
>> when the data I want to work with are on some supercomputer to which I have
>> a slow Internet connection.
>>
>> 2) External performance analysis tools know nothing of things like PETSc
>> logging stages or events.  If I am using a tool like VTune to analyze
>> something like a flow and reactive transport problem in PFLOTRAN, VTune
>> doesn't know that I want to consider calls to SNESSolve() and children in
>> the flow stage separately from those made in the transport stage.  Many
>> tools provide ways to identify things like this, but it generally requires
>> instrumenting the code by hand using a proprietary API.  Furthermore, most
>> of these APIs don't have a sort of push/pop mechanism like we have for
>> PETSc stages.  I really don't want to have to instrument my code for each
>> tool that I might want to use, especially since I've already gone to the
>> trouble of defining various stages/events with PETSc -- I'd like to just
>> use those!
>>
>
> I have a 3rd. There are some things that perf tools are just shit at
> counting, and they need to be counted by hand. There are excellent
> blog posts by McCalpin on these caveats, which arise due to the complexity
> of modern processors. Justin had all of these problems
> in his paper about the performance of convex programming solvers for
> enforcing maximum principles.
>
> I am not against using VTune and its ilk. I just think we should still be
> counting flops and bytes transferred from memory by hand because
> you cannot trust the counting.
>
>   Matt
>
>
>> Both of the above are important motivations, but I think (2) is my
>> primary driver.  I'd be happier with many of the tools if they were aware
>> of PETSc stages and events.
>>
>> --Richard
>>
>> On Wed, Sep 28, 2016 at 2:33 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>
>>>
>>>    Moving to petsc-dev so everyone can see this discussion.
>>>
>>>     To get more detailed "performance" information on runs we have two
>>> (not necessarily orthogonal) choices.
>>>
>>>   1) use an integrated system that is independent of PETSc. These
>>> sometimes require compiling with additional options and then running a
>>> post-processor after the run. These systems then display the results in
>>> some kind of GUI. Intel has such a thing, as does Apple. Do they allow
>>> logging/display of things we care about such as cache misses, ....? Depends
>>> on each system, and some of the systems are improving over time.
>>>
>>>   2) add additional logging of values into the PETSc logging and then
>>> have PetscLogView() process the raw logged values into useful information.
>>>
>>>
>>>    Both approaches have advantages and disadvantages but we do take on a
>>> large development and maintenance burden if we try to incorporate more
>>> logging directly into PETSc. So what does incorporating into PETSc buy us
>>> that is worth the extra hassle? That is, can we do something with the "in
>>> PETSc" approach we could not achieve otherwise? (I don't think arguments
>>> about it being more portable and not requiring you to buy vtune etc from
>>> Intel are enough reason to do the work internally.)
>>>
>>>    In other words, if I am interested in finding out why my MatMult() is
>>> slower than I think it should be, is it such a terrible thing to have to
>>> crank up vtune (or similar beast) to get details about the computational
>>> phase I am interested in?
>>>
>>>    Barry
>>>
>>> You should be able to guess that I am leaning towards 1) and want to
>>> know why that is a fatal mistake, if it is?
>>>
>>>
>>>
>>> > On Sep 24, 2016, at 12:00 PM, Richard Tran Mills <richard.t.mills at intel.com> wrote:
>>> >
>>> > Hi Folks,
>>> >
>>> > I'm breaking up replies to my long email message into smaller chunks
>>> > to make it easier to keep track of the discussion.  Just addressing the
>>> > perf counter issue here.
>>> >
>>> > On 9/24/16 6:54 AM, Jed Brown wrote:
>>> >> 7) I still think we should add some support for collecting hardware
>>> >>> counter information in the PETSc logging framework.  I see that the
>>> >>> latest PAPI release adds some KNL support, though I don't know if it
>>> >>> supports the uncore counters.  Anyhow, I should start a thread on
>>> >>> petsc-dev about this...
>>> >> There was some PAPI support once upon a time (before my time), but I
>>> >> think Barry stripped it out because it's crappy software.  I haven't
>>> >> seriously looked at using the linux performance counter interface
>>> >> directly, but it would be less to install and not streaked with Dongarra
>>> >> poo.
>>> > An alternative that I came across is something written by some Intel
>>> > folks, with the terribly generic name of "Intel Performance Counter
>>> > Monitor".  The webpage for it is at
>>> >
>>> > https://software.intel.com/en-us/articles/intel-performance-counter-monitor
>>> >
>>> > It provides a simple C++ API (I wish there was a C one; we'd need to
>>> > wrap things to keep from polluting the PETSc code with C++ stuff) that lets
>>> > you capture essentially any of the PMU events.  This looks a lot nicer than
>>> > PAPI in several ways, but has the downside of being Intel-specific.  I also
>>> > don't see any KNL-specific counter support yet.
>>> >
>>> > --Richard
>>>
>>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>