[petsc-dev] Hardware counter logging in PETSc (was Re: Where next with PETSc and KNL?)

Thu Sep 29 19:46:23 CDT 2016

> On Sep 29, 2016, at 6:35 PM, Jed Brown <jed at jedbrown.org> wrote:
> 
> Barry Smith <bsmith at mcs.anl.gov> writes:
>>  Ok, but this is somewhat orthogonal, you are proposing something
>>  like PetscLogBytes() ? Which of course we should have put in
>>  initially with PetscLogFlops() twenty years ago?  I don't object to
>>  such a thing.
> 
> What does it mean?  Depending on vector sizes and whatever happened
> last, that could already be in cache.  If you log those bytes, you can
> get a "bandwidth" that is basically the instruction rate.  Maybe we can
> still interpret it.
> 
> With sparse or irregular operations, it's very common that you don't use
> a whole cache line every time you fetch it.  The performance counters
> don't know that, so they say you got 64 bytes even though you might have
> only used 4-8 bytes.  You could easily conclude that you are
> bandwidth-bound and have saturated DRAM bandwidth so there is little
> opportunity for improvement.  Then you restructure the code and get a
> huge performance gain despite lower claimed memory bandwidth.
> 
> Anyway, I think it will be somewhat hard to precisely define the
> analytic counting and that you will absolutely need both analytic and
> perf counters (like cache misses) to make any sense of bandwidth-limited
> performance.

   Agreed. It is not trivial. I was thinking to use a "model formula" for each kernel. For something like VecDot it is two PetscScalar loads for each entry, so 2*n. For a sparse matrix-vector product (AIJ format) it is 1 PetscScalar load for each nonzero, 1 PetscInt load (for the column index) for each nonzero, plus 1 PetscScalar store for each row and 1 PetscScalar load per column (assuming perfect cache reuse of the input vector). Yes, the results have to be interpreted carefully, but it is something that is relatively easy to add for these operations. Models for sparse-sparse products and sparse factorizations are more iffy but presumably possible. The models have to be close enough that these numbers can serve as a "sanity test", in the same way that the flop counts/rates serve as a "sanity test" that something isn't completely out of whack.
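
   To make the two formulas above concrete, here is a rough sketch in C. The function names are made up for illustration and are not PETSc API, and the sizes of PetscScalar and PetscInt are passed in as plain byte counts rather than taken from a PETSc build. The AIJ formula counts only what is listed above, so the row-pointer array is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration only; these are NOT PETSc functions. */

/* VecDot model: two scalar loads per entry, i.e. 2*n scalars. */
double model_bytes_vecdot(size_t n, size_t scalar_size)
{
  assert(scalar_size > 0);
  return 2.0 * (double)n * (double)scalar_size;
}

/* AIJ matrix-vector product model for an m-by-n matrix with nnz nonzeros:
 *   nnz scalar loads  (matrix values)
 *   nnz integer loads (column indices)
 *   m   scalar stores (one per row of the output vector)
 *   n   scalar loads  (input vector, assuming perfect cache reuse)
 * Row pointers are not counted, matching the formula in the text. */
double model_bytes_matmult_aij(size_t m, size_t n, size_t nnz,
                               size_t scalar_size, size_t int_size)
{
  assert(scalar_size > 0 && int_size > 0);
  return (double)nnz * (double)(scalar_size + int_size)
       + (double)m * (double)scalar_size
       + (double)n * (double)scalar_size;
}
```

   Dividing such a byte count by the measured time of the event would give the "model bandwidth", to be read with the same caution as a flop rate.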

  Barry

Now a more ambitious person than I would suggest providing the information to the counters in a "symbolic" form, so that one can plug in different model formulas and see the "bandwidth" achieved under different model assumptions. But I am happy with just simple-minded formulas.
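
One possible reading of the "symbolic" idea, sketched in C under my own assumptions: the kernel logs only its shape (rows, columns, nonzeros), and the byte model is a function that is plugged in afterwards. All names here (KernelShape, ByteModel, the two aij_* models, model_bandwidth) are hypothetical and not part of PETSc. The "no reuse" variant assumes every nonzero triggers its own load of an input-vector entry, which is just one alternative assumption among many.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only; NOT PETSc API. */
typedef struct {
  double m;   /* number of rows */
  double n;   /* number of columns */
  double nnz; /* number of nonzeros */
} KernelShape;

/* A byte model maps a logged kernel shape to a modeled byte count. */
typedef double (*ByteModel)(const KernelShape *s, size_t scalar_size, size_t int_size);

/* Input vector loaded once per column (perfect cache reuse). */
double aij_perfect_reuse(const KernelShape *s, size_t scalar_size, size_t int_size)
{
  return s->nnz * (double)(scalar_size + int_size)
       + s->m * (double)scalar_size
       + s->n * (double)scalar_size;
}

/* Input vector entry loaded once per nonzero (no cache reuse at all). */
double aij_no_reuse(const KernelShape *s, size_t scalar_size, size_t int_size)
{
  return s->nnz * (double)(scalar_size + int_size)
       + s->m * (double)scalar_size
       + s->nnz * (double)scalar_size;
}

/* Modeled bytes per second for a measured time, under the chosen model. */
double model_bandwidth(ByteModel model, const KernelShape *s,
                       size_t scalar_size, size_t int_size, double seconds)
{
  assert(seconds > 0.0);
  return model(s, scalar_size, int_size) / seconds;
}
```

The same logged shape and timing can then be evaluated under either model, and the two resulting "bandwidths" bracket how much the cache-reuse assumption matters.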