[petsc-dev] Hardware counter logging in PETSc (was Re: Where next with PETSc and KNL?)

Richard Mills richardtmills at gmail.com
Thu Sep 29 01:18:38 CDT 2016


On Wed, Sep 28, 2016 at 6:10 PM, Matthew Knepley <knepley at gmail.com> wrote:

> On Wed, Sep 28, 2016 at 8:01 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>>
>> > On Sep 28, 2016, at 7:16 PM, Matthew Knepley <knepley at gmail.com> wrote:
>> >
>> > On Wed, Sep 28, 2016 at 7:02 PM, Richard Mills <richardtmills at gmail.com>
>> wrote:
>> > Hi Barry,
>> >
>> > Thanks for starting a thread about this on petsc-dev; I was planning to
>> do so but still hadn't gotten to it.
>> >
>> > We can certainly get the performance data we need from various
>> performance analysis tools, and for some kinds of data, those are the best
>> way to try to get it.  The reasons I'd like to add some PETSc logging
>> support for collecting hardware data are primarily
>> >
>> > 1) Many of the tools are rather "heavy weight" or otherwise cumbersome
>> to use.  I've always loved how lightweight the PETSc logging framework is,
>> and have always preferred to use that for performance tuning work until I
>> get down to a level that requires the use of independent tools.  I also
>> like working with the text reports that I get from PETSc.  Some performance
>> tools do a decent job generating text reports, but many require me to fire
>> up an annoying GUI to do even trivial tasks.  This is especially annoying
>> when the data I want to work with are on some supercomputer to which I have
>> a slow Internet connection.
>> >
>> > 2) External performance analysis tools know nothing of things like
>> PETSc logging stages or events.  If I am using a tool like VTune to analyze
>> something like a flow and reactive transport problem in PFLOTRAN, VTune
>> doesn't know that I want to consider calls to SNESSolve() and children in
>> the flow stage separately from those made in the transport stage.  Many
>> tools provide ways to identify things like this, but it generally requires
>> instrumenting the code by hand using a proprietary API.  Furthermore, most
>> of these APIs don't have a sort of push/pop mechanism like we have for
>> PETSc stages.  I really don't want to have to instrument my code for each
>> tool that I might want to use, especially since I've already gone to the
>> trouble of defining various stages/events with PETSc -- I'd like to just
>> use those!
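>> >
>> > For concreteness, the instrumentation I already have in mind is nothing
>> > more than the usual stage push/pop, as in the sketch below (the
>> > flow/transport solver and vector names are just stand-ins for what a
>> > PFLOTRAN-like code sets up):
>> >
>> > #include <petsc.h>
>> >
>> > PetscErrorCode SolveFlowAndTransport(SNES snesFlow, Vec uFlow,
>> >                                      SNES snesTransport, Vec uTransport)
>> > {
>> >   PetscErrorCode ierr;
>> >   PetscLogStage  flowStage, transportStage;
>> >
>> >   PetscFunctionBegin;
>> >   ierr = PetscLogStageRegister("Flow", &flowStage);CHKERRQ(ierr);
>> >   ierr = PetscLogStageRegister("Transport", &transportStage);CHKERRQ(ierr);
>> >
>> >   ierr = PetscLogStagePush(flowStage);CHKERRQ(ierr);
>> >   ierr = SNESSolve(snesFlow, NULL, uFlow);CHKERRQ(ierr);
>> >   ierr = PetscLogStagePop();CHKERRQ(ierr);
>> >
>> >   ierr = PetscLogStagePush(transportStage);CHKERRQ(ierr);
>> >   ierr = SNESSolve(snesTransport, NULL, uTransport);CHKERRQ(ierr);
>> >   ierr = PetscLogStagePop();CHKERRQ(ierr);
>> >   PetscFunctionReturn(0);
>> > }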
>> >
>> > I have a third reason. There are some things that perf tools are just
>> > shit at counting, and those need to be counted by hand. There are
>> > excellent blog posts by McCalpin on these caveats, which arise from the
>> > complexity of modern processors; Justin ran into all of these problems
>> > in his paper on the performance of convex programming solvers for
>> > enforcing maximum principles.
>> >
>> > I am not against using VTune and its ilk. I just think we should still
>> be counting flops and bytes transferred from memory by hand because
>> > you cannot trust the counting.
>>
>>   Ok, but this is somewhat orthogonal: you are proposing something like
>> PetscLogBytes()? Which of course we should have put in initially, right
>> alongside PetscLogFlops(), twenty years ago?  I don't object to such a thing.
>
>
> Then I will go a bit farther. I think manual counting is more important
> and should be done before integrating other dubious performance
> measurements like PAPI counters.
>

I agree that having the manual counts (where feasible) would be great, and
far preferable to relying on hardware counters.
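
To be concrete about what I mean, here is a minimal sketch: PetscLogFlops()
is the existing call, while PetscLogBytes() is hypothetical (it does not
exist today; it is just Barry's suggestion spelled out with the same calling
convention).

#include <petscsys.h>

/* Hand-counted flops (and, hypothetically, bytes) for y <- y + alpha*x. */
PetscErrorCode MyAXPY(PetscInt n, PetscScalar alpha, const PetscScalar *x, PetscScalar *y)
{
  PetscErrorCode ierr;
  PetscInt       i;

  PetscFunctionBegin;
  for (i = 0; i < n; i++) y[i] += alpha*x[i];
  ierr = PetscLogFlops(2.0*n);CHKERRQ(ierr);  /* one multiply and one add per entry */
  /* Hypothetical, not in PETSc today:
     ierr = PetscLogBytes(3.0*n*sizeof(PetscScalar));CHKERRQ(ierr);  read x, read and write y */
  PetscFunctionReturn(0);
}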


>
> It would help me if we could list what we expect to learn from external
> tools. So far I have heard mentioned
>
>   - Memory bandwidth, which I find dubious at best
>

The latest Xeon and Xeon Phi processors actually have pretty good counters
for this, and they seem fairly reliable.  (Don't take it from me: John
McCalpin of STREAM fame says they are pretty good, and I'll believe him.)
This is one of the things I'd like to have counter data for: for some code
(I'm mostly thinking of code outside of PETSc itself, in the "user" portion
of a PETSc-based application) it isn't feasible to get manual counts, but it
is still useful to know the bandwidth utilization.


>   - Cache misses, somewhat more reliable
>

Yes, I want cache misses.  Cache counters and memory bandwidth utilization
are mostly what I want.

>
>   - Vectorization measure, though I do not understand what this does
>

We want some sort of measure of the percentage of operations that are able
to use the vector hardware.  Unfortunately, this is really hard to get.
The only machine I've used that let us get a good picture of this was the
Cray X1.  KNL has counters that you can use to tell the mix of vector vs.
scalar instructions executed, but this actually tells you almost nothing.
Why?  Because a vector instruction that masks all elements of a vector
except one still gets counted as a vector instruction, even though really
it is doing a scalar operation.  We need hardware counters that account for
the masks being used, but we don't have those on KNL.
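
To illustrate the masking issue with a made-up fragment (AVX-512 intrinsics,
not taken from any real code): the masked add below retires as a single
512-bit vector instruction even though only one of the eight lanes does
useful work, so a vector/scalar instruction-mix counter still calls it
"vector".

#include <immintrin.h>

/* Requires AVX-512F; x and y must each point to at least 8 doubles. */
void masked_add_one_lane(double *y, const double *x)
{
  __m512d  xv = _mm512_loadu_pd(x);
  __m512d  yv = _mm512_loadu_pd(y);
  __mmask8 k  = 0x1;                      /* only lane 0 active: one useful flop   */
  yv = _mm512_mask_add_pd(yv, k, yv, xv); /* still counted as a vector instruction */
  _mm512_storeu_pd(y, yv);
}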


>   - Flops, which seem useful
>

Flops would be useful, though we prefer the manual counts, of course, and
the hardware flop counts have problems along the lines of what I was
talking about above.  On KNL, you won't get real flop counts because there
is no awareness of the vector masks.  On some older processors I've used,
the problem was that the flop counters would count instructions *issued*,
but not instructions *retired*.  Since all modern processors are pipelined
and do speculation, counting what is issued is useless, or maybe worse than
useless.


>
> What else?
>

It gets into more detail about counters than I'm familiar with, but
hardware counters can be useful for identifying problems such as bad
speculation.  (Though, with PETSc apps, I think we are usually worried
mostly about "back end" problems with the memory hierarchy.)  There are
proponents at Intel of the so-called "top-down" microarchitectural analysis
method, which uses a collection of counters (that I have trouble keeping
straight) to broadly categorize performance problems.  There is a bit about
this in the VTune manual here:


https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-lin

There is also the original paper describing this method (I can find the
reference if anyone is interested).  I'm not sure how useful this might be
for PETSc applications, and if one wants to do this kind of analysis, it is
probably best to say that a tool like VTune is required.  What I mostly want
to see are memory bandwidth, data-cache, and i-cache miss information in the
performance logs to supplement the timings and hand flop counts (and having
hand byte counts would be great).
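
To be explicit about what a hand byte count buys, here is a rough sketch
(serial, a single call, and the 3*n*sizeof(PetscScalar) figure assumes x is
read once and y is read and written, ignoring cache effects):

#include <petscvec.h>

/* Rough effective bandwidth of VecAXPY from a hand byte count. */
PetscErrorCode EstimateAXPYBandwidth(Vec x, Vec y, PetscScalar alpha)
{
  PetscErrorCode ierr;
  PetscInt       n;
  PetscLogDouble t0, t1, bytes;

  PetscFunctionBegin;
  ierr = VecGetSize(x, &n);CHKERRQ(ierr);
  ierr = PetscTime(&t0);CHKERRQ(ierr);
  ierr = VecAXPY(y, alpha, x);CHKERRQ(ierr);   /* y <- y + alpha*x        */
  ierr = PetscTime(&t1);CHKERRQ(ierr);
  bytes = 3.0*n*sizeof(PetscScalar);           /* read x, read y, write y */
  ierr = PetscPrintf(PETSC_COMM_WORLD, "approx. %g GB/s\n", 1.0e-9*bytes/(t1 - t0));CHKERRQ(ierr);
  PetscFunctionReturn(0);
}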

--Richard


>
>    Matt
>
>
>>
>>   Barry
>>
>> >
>> >   Matt
>> >
>> > Both of the above are important motivations, but I think (2) is my
>> primary driver.  I'd be happier with many of the tools if they were aware
>> of PETSc stages and events.
>> >
>> > --Richard
>> >
>> > On Wed, Sep 28, 2016 at 2:33 PM, Barry Smith <bsmith at mcs.anl.gov>
>> wrote:
>> >
>> >    Moving to petsc-dev so everyone can see this discussion.
>> >
>> >     To get more detailed "performance" information on runs we have two
>> (not necessarily orthogonal) choices.
>> >
>> >   1) use an integrated system that is independent of PETSc. These
>> sometimes require compiling with additional options and then running a
>> post-processor after the run. These systems then display the results in
>> some kind of GUI. Intel has such a thing, as does Apple. Do they allow
>> logging/display of things we care about such as cache misses, ....? Depends
>> on each system, and some of the systems are improving over time.
>> >
>> >   2) add additional logging of values into the PETSc logging and then
>> have PetscLogView() process the raw logged values into useful information.
>> >
>> >
>> >    Both approaches have advantages and disadvantages, but we do take on
>> > a large development and maintenance burden if we try to incorporate more
>> > logging directly into PETSc. So what does incorporating it into PETSc buy
>> > us that is worth the extra hassle? That is, can we do something with the
>> > "in PETSc" approach that we could not achieve otherwise? (I don't think
>> > arguments about it being more portable and not requiring you to buy VTune
>> > etc. from Intel are enough reason to do the work internally.)
>> >
>> >    In other words, if I am interested in finding out why my MatMult() is
>> > slower than I think it should be, is it such a terrible thing to crank
>> > up VTune (or a similar beast) to get details about the computational
>> > phase I am interested in?
>> >
>> >    Barry
>> >
>> > You should be able to guess that I am leaning towards 1), and I want to
>> > know why that is a fatal mistake, if it is.
>> >
>> >
>> >
>> > > On Sep 24, 2016, at 12:00 PM, Richard Tran Mills <
>> richard.t.mills at intel.com> wrote:
>> > >
>> > > Hi Folks,
>> > >
>> > > I'm breaking up replies to my long email message into smaller chunks
>> > > to make it easier to keep track of the discussion; this one just
>> > > addresses the perf counter issue.
>> > >
>> > > On 9/24/16 6:54 AM, Jed Brown wrote:
>> > >>> 7) I still think we should add some support for collecting hardware
>> > >>> counter information in the PETSc logging framework.  I see that the
>> > >>> latest PAPI release adds some KNL support, though I don't know if it
>> > >>> supports the uncore counters.  Anyhow, I should start a thread on
>> > >>> petsc-dev about this...
>> > >> There was some PAPI support once upon a time (before my time), but I
>> > >> think Barry stripped it out because it's crappy software.  I haven't
>> > >> seriously looked at using the linux performance counter interface
>> > >> directly, but it would be less to install and not streaked with
>> Dongarra
>> > >> poo.
>> > > An alternative that I came across is something written by some Intel
>> folks, with the terribly generic name of "Intel Performance Counter
>> Monitor".  The webpage for it is at
>> > >
>> > > https://software.intel.com/en-us/articles/intel-performance-counter-monitor
>> > >
>> > > It provides a simple C++ API (I wish there was a C one; we'd need to
>> wrap things to keep from polluting the PETSc code with C++ stuff) that lets
>> you capture essentially any of the PMU events.  This looks a lot nicer than
>> PAPI in several ways, but has the downside of being Intel-specific.  I also
>> don't see any KNL-specific counter support yet.
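>> > >
>> > > A minimal sketch of the kind of C-facing shim I mean (the names and
>> > > struct below are hypothetical, not the actual PCM API; the bodies would
>> > > live in a small C++ file that calls the library and is compiled
>> > > separately):
>> > >
>> > > #include <petscsys.h>
>> > >
>> > > /* Hypothetical counter deltas handed back to C/PETSc code. */
>> > > typedef struct {
>> > >   unsigned long long bytesReadDRAM;
>> > >   unsigned long long bytesWrittenDRAM;
>> > >   unsigned long long l2Misses;
>> > > } PetscPCMCounts;
>> > >
>> > > PetscErrorCode PetscPCMBegin(void);                /* start a counting interval  */
>> > > PetscErrorCode PetscPCMEnd(PetscPCMCounts *delta); /* stop and return the deltas */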
>> > >
>> > > --Richard
>> >
>> >
>> >
>> >
>> >
>> > --
>> > What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> > -- Norbert Wiener
>>
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>