<div dir="ltr"><div>Jed thanks that helps a lot. I do have more related questions but they are not necessarily related to PETSc or the original topic of this thread, so I shall refer to the computational science stackexchange :)<br><br></div><div>Thanks,<br></div><div>Justin<br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, May 8, 2015 at 10:55 PM, Jed Brown <span dir="ltr"><<a href="mailto:jed@jedbrown.org" target="_blank">jed@jedbrown.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">Justin Chang <<a href="mailto:jychang48@gmail.com">jychang48@gmail.com</a>> writes:<br>

<br>

> For clarification purposes:<br>

><br>

> 1) What is the definition of "performance model" and "cache model"? I see<br>

> those two terms used in this thread but want to know the exact difference<br>

> if any.<br>

<br>

</span>Cache model is more specific, in this case telling you what needs to<br>

move from DRAM versus what can be reused.  A cache model is often part<br>

of a performance model (allowing you to estimate bandwidth).  It is<br>

occasionally irrelevant, like with STREAM and sufficiently large sizes,<br>

because there is no opportunity for reuse.<br>

<span class=""><br>

> 2) Is what's described in Dinesh's paper a "cache model"?<br>

> What exactly is the caveat or what are the major assumptions that it makes?<br>

<br>

</span>The common first assumption for SpMV is that matrix data is loaded only<br>

once, that vector data is also loaded only once, and that latency to<br>

access vector data is negligible.  These assumptions can be utterly<br>

destroyed if you permute the matrix by a randomized ordering.  The<br>

matrix data will still be accessed only once and contiguously, but (for<br>

large enough matrices) the vector data will almost always generate a<br>

cache miss (high latency) and the vector entry will be evicted before it<br>

is used again.  Moreover, typically nothing else on the cache line will<br>

be used.  For a typical 8-byte scalar and 64-byte cache lines, this<br>

means that for an m×m sparse matrix with N nonzeros, the performance<br>

model predicts<br>

<br>

  m*sizeof(int) + N*(sizeof(int) + sizeof(scalar)) + 2*m*sizeof(scalar)<br>

<br>

where the first two terms are matrix data and the last are the source<br>

and destination vectors.  In a randomized ordering, you'll need<br>

<br>

  m*sizeof(int) + N*(sizeof(int) + sizeof(scalar)) + m*sizeof(scalar) + 8*N*sizeof(scalar)<br>

<br>

where the last term now represents that each matrix nonzero triggers a<br>

new cache miss accessing the vector.  This last term clearly swamps<br>

everything else.<br>

<br>

Furthermore, in the limit of a small cache, you get very poor temporal<br>

locality from vector entries when using any matrix ordering, so that<br>

last term once again dominates.<br>

<span class=""><br>

> 3) Is quantifying the "useful bandwidth sustained for some level of catch"<br>

> analogous/related to cache register reuse and/or vectorization (e.g., how<br>

> well one can maximize SIMD on the machine if that makes any sense)<br>

<br>

</span>SIMD and register reuse are orthogonal to cache reuse and you typically<br>

have less control over them (especially if you don't write assembly).<br>

For irregular sparse operations, even with custom implementations, it<br>

may not be worth the overhead of detecting and arranging vectorizable<br>

steps.<br>

</blockquote></div><br></div>