[petsc-users] Obtaining bytes per second

Jed Brown jed at jedbrown.org
Fri May 8 22:55:27 CDT 2015


Justin Chang <jychang48 at gmail.com> writes:

> For clarification purposes:
>
> 1) What is the definition of "performance model" and "cache model"? I see
> those two terms used in this thread but want to know the exact difference
> if any.

A cache model is more specific, in this case telling you what needs to
move from DRAM versus what can be reused.  A cache model is often part
of a performance model (allowing you to estimate bandwidth).  It is
occasionally irrelevant, as with STREAM at sufficiently large sizes,
because there is no opportunity for reuse.
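
To make the STREAM case concrete, here is a minimal sketch (mine, not
from Dinesh's paper) of the triad kernel.  The array length and timing
are illustrative choices; the point is that for large n each element of
a, b, and c is touched exactly once, so no cache model is needed and
sustained bandwidth is just bytes moved (3*n*sizeof(double), or 4*n if
you count write-allocate traffic on a) divided by time.

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  int main(void)
  {
    size_t n = 50u*1000*1000;   /* illustrative size, large enough to spill any cache */
    double *a = malloc(n*sizeof(*a));
    double *b = malloc(n*sizeof(*b));
    double *c = malloc(n*sizeof(*c));
    if (!a || !b || !c) return 1;
    for (size_t i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i++) a[i] = b[i] + 3.0*c[i];   /* triad: no reuse */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + 1e-9*(t1.tv_nsec - t0.tv_nsec);

    /* 3 arrays, 8 bytes per element referenced by the kernel */
    printf("triad: %.2f GB/s (a[0]=%g)\n", 3.0*n*sizeof(double)/sec/1e9, a[0]);
    free(a); free(b); free(c);
    return 0;
  }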

> 2) Is what's described in Dinesh's paper a "cache model"?
> What exactly is the caveat or what are the major assumptions that it makes?

The common first assumption for SpMV is that matrix data is loaded only
once, that vector data is also loaded only once, and that latency to
access vector data is negligible.  These assumptions can be utterly
destroyed if you permute the matrix by a randomized ordering.  The
matrix data will still be accessed only once and contiguously, but (for
large enough matrices) the vector data will almost always generate a
cache miss (high latency) and the vector entry will be evicted before it
is used again.  Moreover, typically nothing else on the cache line will
be used.  For a typical 8-byte scalar and 64-byte cache lines, this
means that for an m×m sparse matrix with N nonzeros, the performance
model predicts

  m*sizeof(int) + N*(sizeof(int) + sizeof(scalar)) + 2*m*sizeof(scalar)

where the first two terms are matrix data (row offsets plus column
indices and values) and the last term covers the source and destination
vectors.  With a randomized ordering, you instead need

  m*sizeof(int) + N*(sizeof(int) + sizeof(scalar)) + m*sizeof(scalar) + 8*N*sizeof(scalar)

where the last term reflects that each matrix nonzero triggers a new
cache miss on the source vector, pulling in a full 64-byte line (8
scalars) of which only one entry is used.  This term clearly swamps
everything else.
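
For a rough sense of the gap, here is a minimal sketch that evaluates
the two estimates above.  The matrix size m, the nonzero count N (about
7 per row), and the 4-byte int / 8-byte scalar / 64-byte line constants
are my own illustrative choices, not anything from the thread.

  #include <stdio.h>

  int main(void)
  {
    long m = 1000*1000;          /* hypothetical matrix dimension */
    long N = 7*m;                /* hypothetical nonzero count (~7 per row) */
    long int_bytes = 4, scalar_bytes = 8, line_bytes = 64;

    /* natural ordering: matrix streamed once, each vector loaded once */
    long contiguous = m*int_bytes + N*(int_bytes + scalar_bytes) + 2*m*scalar_bytes;

    /* randomized ordering: every nonzero misses on the source vector and
       pulls a full cache line (line_bytes/scalar_bytes = 8 scalars, one used) */
    long randomized = m*int_bytes + N*(int_bytes + scalar_bytes) + m*scalar_bytes
                    + N*(line_bytes/scalar_bytes)*scalar_bytes;

    printf("contiguous ordering: %ld MB\n", contiguous/1000000);
    printf("randomized ordering: %ld MB\n", randomized/1000000);
    return 0;
  }

With these made-up numbers the randomized ordering moves roughly five
times as much data; dividing either figure by your measured STREAM
bandwidth gives a lower bound on the SpMV time.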

Furthermore, in the limit of a small cache, you get very poor temporal
locality from vector entries when using any matrix ordering, so that
last term once again dominates.

> 3) Is quantifying the "useful bandwidth sustained for some level of cache"
> analogous/related to cache/register reuse and/or vectorization (e.g., how
> well one can maximize SIMD on the machine, if that makes any sense)?

SIMD and register reuse are orthogonal to cache reuse, and you typically
have less control over them (especially if you don't write assembly).
For irregular sparse operations, even with custom implementations, it
may not be worth the overhead of detecting and arranging vectorizable
steps.