<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, Sep 28, 2016 at 6:10 PM, Matthew Knepley <span dir="ltr"><<a target="_blank" href="mailto:knepley@gmail.com">knepley@gmail.com</a>></span> wrote:<br><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="gmail-">On Wed, Sep 28, 2016 at 8:01 PM, Barry Smith <span dir="ltr"><<a target="_blank" href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>></span> wrote:<br><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><span><br>

> On Sep 28, 2016, at 7:16 PM, Matthew Knepley <<a target="_blank" href="mailto:knepley@gmail.com">knepley@gmail.com</a>> wrote:<br>

><br>

> On Wed, Sep 28, 2016 at 7:02 PM, Richard Mills <<a target="_blank" href="mailto:richardtmills@gmail.com">richardtmills@gmail.com</a>> wrote:<br>

> Hi Barry,<br>

><br>

> Thanks for starting a thread about this on petsc-dev; I was planning to do so but still hadn't gotten to it.<br>

><br>

> We can certainly get the performance data we need from various performance analysis tools, and for some kinds of data, those are the best way to try to get it.  The reasons I'd like to add some PETSc logging support for collecting hardware data are primarily<br>

><br>

> 1) Many of the tools are rather "heavy weight" or otherwise cumbersome to use.  I've always loved how lightweight the PETSc logging framework is, and have always preferred to use that for performance tuning work until I get down to a level that requires the use of independent tools.  I also like working with the text reports that I get from PETSc.  Some performance tools do a decent job generating text reports, but many require me to fire up an annoying GUI to do even trivial tasks.  This is especially annoying when the data I want to work with are on some supercomputer to which I have a slow Internet connection.<br>

><br>

> 2) External performance analysis tools know nothing of things like PETSc logging stages or events.  If I am using a tool like VTune to analyze something like a flow and reactive transport problem in PFLOTRAN, VTune doesn't know that I want to consider calls to SNESSolve() and children in the flow stage separately from those made in the transport stage.  Many tools provide ways to identify things like this, but it generally requires instrumenting the code by hand using a proprietary API.  Furthermore, most of these APIs don't have a sort of push/pop mechanism like we have for PETSc stages.  I really don't want to have to instrument my code for each tool that I might want to use, especially since I've already gone to the trouble of defining various stages/events with PETSc -- I'd like to just use those!<br>

><br>

> I have a 3rd. There are some things that perf tools are just shit at counting, and they need to be counted by hand. There are excellent<br>

> blog posts by McCalpin on these caveats, which arise due to the complexity of modern processors. Justin had all of these problems<br>

> in his paper about the performance of convex programming solvers for enforcing maximum principles.<br>

><br>

> I am not against using VTune and its ilk. I just think we should still be counting flops and bytes transferred from memory by hand because<br>

> you cannot trust the counting.<br>

<br>

</span>  Ok, but this is somewhat orthogonal, you are proposing something like PetscLogBytes() ? Which of course we should have put in initially with PetscLogFlops() twenty years ago?  I don't object to such a thing.</blockquote><div><br></div></span><div>Then I will go a bit farther. I think manual counting is more important and should be done before integrating other dubious performance</div><div>measurements like PAPI counters.</div></div></div></div></blockquote><div><br></div><div>I agree that having the manual counts (where feasible) would be great, and far preferable to relying on hardware counters.<br> <br></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><br></div><div>It would help me if we could list what we expect to learn from external tools. So far I have heard mentioned</div><div><br></div><div>  - Mem. band., which I find dubious at best</div></div></div></div></blockquote><div><br></div><div>The latest Xeon and Xeon Phi processors actually have pretty good counters for this, which seem fairly reliable.  (Don't take it from me: John McCalpin of STREAM fame says they are pretty good, and I'll believe him.)  This is one of the things I'd like to have the counter data for, since for some things (I'm mostly thinking of stuff outside of PETSc, but that happens in "user code" inside a PETSc-based application) it isn't feasible to get manual counts but it's useful to know the bandwidth utilization.<br> <br></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div></div><div>  - Cache misses, somewhat more reliable</div></div></div></div></blockquote><div><br></div><div>Yes, I want cache misses.  Cache counters and memory bandwidth utilization are mostly what I want. 
<br></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><br></div><div>  - Vectorization measure, I do not understand what this does</div></div></div></div></blockquote><div><br></div><div>We want some sort of measure of the percentage of operations that are able to use the vector hardware.  Unfortunately, this is really hard to get.  The only machine I've used that let us get a good picture of this was the Cray X1.  KNL has counters that you can use to tell the mix of vector vs. scalar instructions executed, but this actually tells you almost nothing.  Why?  Because a vector instruction that masks all elements of a vector except one still gets counted as a vector instruction, even though really it is doing a scalar operation.  We need hardware counters that account for the masks being used, but we don't have those on KNL.<br> <br></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div></div><div>  - Flops, This seems useful</div></div></div></div></blockquote><div><br></div><div>Flops would be useful, though we prefer the manual counts, of course, and the hardware flop counts have problems along the lines of what I was talking about above.  On KNL, you won't get real flop counts because there is no awareness of the vector masks.  On some older processors I've used, the problem was that the flop counters would count instructions *issued*, but not instructions *retired*.  
Since all modern processors are pipelined and do speculation, counting what is issued is useless, or maybe worse than useless.<br> <br></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><br></div><div>What else?</div></div></div></div></blockquote><div><br></div><div>It gets into way more details of counters than I'm familiar with, but hardware counters can be useful for identifying problems such as bad speculation.  (Though, with PETSc apps, I think we are usually worried mostly about "back end" problems with the memory hierarchy.)  There are proponents at Intel of the so-called "top down" microarchitectural analysis method that uses a bunch of counters that I have trouble keeping straight to broadly categorize performance problems.  There is a bit about this in the VTune manual here:<br><br>  <a href="https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-lin">https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-lin</a><br><br></div><div>There is also the original paper describing this method (which I can find a reference for if anyone is interested).  I'm not sure how useful this might be for PETSc applications, and I think if one wants to do this, it may be best to say they need to use a tool like VTune.  
What I mostly want to see are memory bandwidth, data cache, and i-cache miss information in the performance logs to supplement the timings and hand flop counts (and having hand byte counts would be great).<br><br></div><div>--Richard<br></div><div> <br></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><br></div><div>   Matt</div><div><div class="gmail-h5"><div> </div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><span><font color="#888888"><br>

  Barry<br>

</font></span><div><div><br>

><br>

>   Matt<br>

><br>

> Both of the above are important motivations, but I think (2) is my primary driver.  I'd be happier with many of the tools if they were aware of PETSc stages and events.<br>

><br>

> --Richard<br>

><br>

> On Wed, Sep 28, 2016 at 2:33 PM, Barry Smith <<a target="_blank" href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>> wrote:<br>

><br>

>    Moving to petsc-dev so everyone can see this discussion.<br>

><br>

>     To get more detailed "performance" information on runs we have two (not necessarily orthogonal) choices.<br>

><br>

>   1) use an integrated system that is independent of PETSc. These sometimes require compiling with additional options and then running a post-processor after the run. These systems then display the results in some kind of GUI. Intel has such a thing, as does Apple. Do they allow logging/display of things we care about such as cache misses, ....? Depends on each system, and some of the systems are improving over time.<br>

><br>

>   2) add additional logging of values into the PETSc logging and then have PetscLogView() process the raw logged values into useful information.<br>

><br>

><br>

>    Both approaches have advantages and disadvantages, but we do take on a large development and maintenance burden if we try to incorporate more logging directly into PETSc. So what does incorporating into PETSc buy us that is worth the extra hassle? That is, can we do something with the "in PETSc" approach that we could not achieve otherwise? (I don't think arguments about it being more portable and not requiring you to buy vtune etc from Intel are enough reason to do the work internally.)<br>

><br>

>    In other words, if I am interested in finding out why my MatMult() is slower than I think it should be, is it such a terrible thing to have to crank up vtune (or similar beast) to get details about the computational phase I am interested in?<br>

><br>

>    Barry<br>

><br>

> You should be able to guess that I am leaning towards 1) and want to know why that is a fatal mistake, if it is?<br>

><br>

><br>

><br>

> > On Sep 24, 2016, at 12:00 PM, Richard Tran Mills <<a target="_blank" href="mailto:richard.t.mills@intel.com">richard.t.mills@intel.com</a>> wrote:<br>

> ><br>

> > Hi Folks,<br>

> ><br>

> > I'm breaking up replies to my long email message into smaller chunks to make it easier to keep track of the discussion.  Just addressing the perf counter issue here.<br>

> ><br>

> > On 9/24/16 6:54 AM, Jed Brown wrote:<br>

> >> 7) I still think we should add some support for collecting hardware<br>

> >>> counter information in the PETSc logging framework.  I see that the<br>

> >>> latest PAPI release adds some KNL support, though I don't know if it<br>

> >>> supports the uncore counters.  Anyhow, I should start a thread on<br>

> >>> petsc-dev about this...<br>

> >> There was some PAPI support once upon a time (before my time), but I<br>

> >> think Barry stripped it out because it's crappy software.  I haven't<br>

> >> seriously looked at using the linux performance counter interface<br>

> >> directly, but it would be less to install and not streaked with Dongarra<br>

> >> poo.<br>

> > An alternative that I came across is something written by some Intel folks, with the terribly generic name of "Intel Performance Counter Monitor".  The webpage for it is at<br>

> ><br>

> > <a target="_blank" rel="noreferrer" href="https://software.intel.com/en-us/articles/intel-performance-counter-monitor">https://software.intel.com/en-<wbr>us/articles/intel-performance-<wbr>counter-monitor</a><br>

> ><br>

> > It provides a simple C++ API (I wish there was a C one; we'd need to wrap things to keep from polluting the PETSc code with C++ stuff) that lets you capture essentially any of the PMU events.  This looks a lot nicer than PAPI in several ways, but has the downside of being Intel-specific.  I also don't see any KNL-specific counter support yet.<br>

> ><br>

> > --Richard<br>

><br>

><br>

><br>

><br>

><br>

> --<br>

> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>

> -- Norbert Wiener<br>

<br>

</div></div></blockquote></div></div></div><div><div class="gmail-h5"><br><br clear="all"><div><br></div>-- <br><div>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>-- Norbert Wiener</div>

</div></div></div></div>

</blockquote></div><br></div></div>
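
For readers following the thread, here is a minimal sketch of the hand counting discussed above: flops and bytes are logged manually and attributed to whichever stage is currently pushed. This is a toy stand-in, not the PETSc API. The names ManualLog, log_flops, log_bytes, and stage are hypothetical, and the byte count assumes 8-byte reals with no cache reuse.

```python
from collections import defaultdict
from contextlib import contextmanager


class ManualLog:
    """Toy stage-aware log (hypothetical; only mimics push/pop stages)."""

    def __init__(self):
        self._stack = ["Main Stage"]
        self.flops = defaultdict(float)
        self.bytes = defaultdict(float)

    @contextmanager
    def stage(self, name):
        self._stack.append(name)  # push the stage
        try:
            yield
        finally:
            self._stack.pop()  # pop it even if the body raises

    def log_flops(self, n):
        # Attribute the count to the innermost active stage.
        self.flops[self._stack[-1]] += n

    def log_bytes(self, n):
        self.bytes[self._stack[-1]] += n


def axpy(log, a, x, y):
    """y <- a*x + y, with hand counts of 2 flops and 24 bytes per entry."""
    n = len(x)
    for i in range(n):
        y[i] += a * x[i]
    log.log_flops(2 * n)      # one multiply and one add per entry
    log.log_bytes(8 * 3 * n)  # read x, read y, write y (assumed 8-byte reals)


log = ManualLog()
x = [1.0] * 1000
y = [2.0] * 1000

with log.stage("flow"):
    axpy(log, 0.5, x, y)
with log.stage("transport"):
    axpy(log, 0.5, x, y)
    axpy(log, 0.5, x, y)

# Arithmetic intensity (flops per byte) per stage, from hand counts only.
intensity = {s: log.flops[s] / log.bytes[s] for s in log.flops}
for s in sorted(intensity):
    print(s, log.flops[s], log.bytes[s], intensity[s])
```

Because the counts are attached to the stage stack rather than to a tool-specific API, the same instrumentation separates the flow and transport phases without any per-tool annotation.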