[petsc-users] Obtaining bytes per second

Justin Chang jychang48 at gmail.com
Sat May 9 06:01:30 CDT 2015


Jed, thanks, that helps a lot. I do have more questions, but they are not
necessarily related to PETSc or to the original topic of this thread, so I
shall take them to the Computational Science Stack Exchange :)

Thanks,
Justin

On Fri, May 8, 2015 at 10:55 PM, Jed Brown <jed at jedbrown.org> wrote:

> Justin Chang <jychang48 at gmail.com> writes:
>
> > For clarification purposes:
> >
> > 1) What is the definition of "performance model" and "cache model"? I see
> > those two terms used in this thread but want to know the exact difference
> > if any.
>
> Cache model is more specific, in this case telling you what needs to
> move from DRAM versus what can be reused.  A cache model is often part
> of a performance model (allowing you to estimate bandwidth).  It is
> occasionally irrelevant, like with STREAM and sufficiently large sizes,
> because there is no opportunity for reuse.
>
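
For illustration, a minimal sketch (not part of the original message) of the
STREAM triad, where a cache model really is irrelevant at large sizes because
no element is ever reused:

    #include <stddef.h>

    /* STREAM triad: a[i] = b[i] + s*c[i].  Once n is large enough that
     * the three arrays cannot fit in cache, every element moves to or
     * from DRAM exactly once, so the counted traffic is simply
     * 3*n*sizeof(double) -- there is no reuse for a cache model to
     * capture. */
    void triad(double *a, const double *b, const double *c, double s,
               size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }
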
> > 2) Is what's described in Dinesh's paper a "cache model"?
> > What exactly is the caveat, or what are the major assumptions that it
> > makes?
>
> The common first assumption for SpMV is that matrix data is loaded only
> once, that vector data is also loaded only once, and that latency to
> access vector data is negligible.  These assumptions can be utterly
> destroyed if you permute the matrix by a randomized ordering.  The
> matrix data will still be accessed only once and contiguously, but (for
> large enough matrices) the vector data will almost always generate a
> cache miss (high latency) and the vector entry will be evicted before it
> is used again.  Moreover, typically nothing else on the cache line will
> be used.  For a typical 8-byte scalar and 64-byte cache lines, this
> means that for an m×m sparse matrix with N nonzeros, the performance
> model predicts
>
>   m*sizeof(int) + N*(sizeof(int) + sizeof(scalar)) + 2*m*sizeof(scalar)
>
> where the first two terms are matrix data and the last are the source
> and destination vectors.  In a randomized ordering, you'll need
>
>   m*sizeof(int) + N*(sizeof(int) + sizeof(scalar)) + m*sizeof(scalar)
>     + 8*N*sizeof(scalar)
>
> where the last term now represents that each matrix nonzero triggers a
> new cache miss accessing the vector.  This last term clearly swamps
> everything else.
>
> Furthermore, in the limit of a small cache, you get very poor temporal
> locality from vector entries when using any matrix ordering, so that
> last term once again dominates.
>
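
To make the two estimates above concrete, here is a small C sketch (not from
the original mail; the matrix sizes and timing are illustrative assumptions)
that evaluates both byte counts for a CSR matrix with 4-byte ints, 8-byte
scalars, and 64-byte cache lines, and converts them to bytes per second given
a measured SpMV time:

    #include <stdio.h>

    int main(void)
    {
        long m = 1000000, N = 5000000;  /* illustrative matrix sizes */
        long si = (long)sizeof(int), ss = (long)sizeof(double);

        /* Natural ordering: matrix streamed once, vectors touched once. */
        long natural = m*si + N*(si + ss) + 2*m*ss;

        /* Randomized ordering: the destination vector is still written
         * once (m*ss), but each nonzero misses on the source vector and
         * drags in a full 64-byte line (8 scalars), only one of which is
         * used. */
        long randomized = m*si + N*(si + ss) + m*ss + 8*N*ss;

        double t = 0.01;  /* hypothetical measured SpMV time in seconds */
        printf("natural   : %ld bytes -> %.1f GB/s at t = %.3f s\n",
               natural, 1e-9 * (double)natural / t, t);
        printf("randomized: %ld bytes -> %.1f GB/s at t = %.3f s\n",
               randomized, 1e-9 * (double)randomized / t, t);
        return 0;
    }

With these illustrative sizes the randomized estimate is roughly five times
the natural one, and dividing either predicted byte count by a measured SpMV
time gives the bytes per second this thread set out to obtain.
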
> > 3) Is quantifying the "useful bandwidth sustained for some level of
> > cache" analogous/related to cache/register reuse and/or vectorization
> > (e.g., how well one can maximize SIMD on the machine, if that makes any
> > sense)?
>
> SIMD and register reuse are orthogonal to cache reuse and you typically
> have less control over them (especially if you don't write assembly).
> For irregular sparse operations, even with custom implementations, it
> may not be worth the overhead of detecting and arranging vectorizable
> steps.
>