<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On 13 June 2014 22:22, Barry Smith <span dir="ltr">&lt;<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

  The main reason to “pull out” a single component is to work on that component a great deal, for example to solve a linear system for it alone. You wouldn’t pull out the individual components only to iterate on them all together.<br>


</blockquote><div><br></div><div>I needed to pull out a single component to perform a matrix-vector multiply for further processing, but I realised I could just rearrange the matrix instead.<br><br>Thank you, Barry and Jed. That was very helpful.<br>


 <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class="HOEnZb"><font color="#888888"><br>

  Barry<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

On Jun 13, 2014, at 9:09 PM, Jed Brown &lt;<a href="mailto:jed@jedbrown.org">jed@jedbrown.org</a>&gt; wrote:<br>

<br>

> Anush Krishnan &lt;<a href="mailto:anush@bu.edu">anush@bu.edu</a>&gt; writes:<br>

>> With regard to the interlaced memory performing better: If I used three<br>

>> vectors created from the same DMDA for each degree of freedom, how<br>

>> different would that be in performance compared to a fully interlaced<br>

>> vector? Wouldn't cache reuse be about the same for both cases?<br>

><br>

> No, when you traverse the grid accessing all three components, you will<br>

> have three times as many prefetch streams (typically reducing prefetch<br>

> capability, thus generating more cold cache misses) and will spill<br>

> irregularly over cache lines more frequently, thus reducing the<br>

> effective cache size.  This can result in an integer-factor slowdown as<br>

> compared to interlaced storage.  By all means, run the experiment, but<br>

> the expected result for memory bandwidth/cache-limited operations is<br>

> that interlaced delivers significantly better performance.<br>

<br>

</div></div></blockquote></div><br></div></div>
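<div dir="ltr"><br>For anyone reading this thread later, here is a minimal sketch of the two layouts being compared. It is plain Python, not PETSc code, and every name in it (NX, NY, DOF, interlaced_offset, separate_offset) is made up for illustration. The "separate" layout is modelled as one contiguous block per component laid end to end, which is only an assumption; three real vectors could sit anywhere in memory. The sketch only shows where the three components of one grid node land, which is what drives the prefetch-stream and cache-line argument above; it does not measure performance.<br><br></div>

```python
# Hypothetical 2-D grid with 3 degrees of freedom (dof) per node.
NX, NY, DOF = 4, 3, 3


def interlaced_offset(i, j, c):
    """Offset of component c at node (i, j) when all components of a
    node are stored next to each other (interlaced storage)."""
    return (j * NX + i) * DOF + c


def separate_offset(i, j, c):
    """Offset of component c at node (i, j) when each component has its
    own contiguous block (modelled as blocks laid end to end)."""
    return c * NX * NY + (j * NX + i)


# Where do the three components of a single node live?
i, j = 2, 1
interlaced = [interlaced_offset(i, j, c) for c in range(DOF)]
separate = [separate_offset(i, j, c) for c in range(DOF)]
print("interlaced offsets:", interlaced)  # adjacent entries, one stream
print("separate offsets:  ", separate)    # NX*NY apart, three streams

# Fill an interlaced array so each entry encodes (component, node),
# then "pull out" component 1 with a strided slice.
values = [0] * (NX * NY * DOF)
for jj in range(NY):
    for ii in range(NX):
        node = jj * NX + ii
        for c in range(DOF):
            values[interlaced_offset(ii, jj, c)] = 100 * c + node

component_1 = values[1::DOF]
print("component 1:", component_1)
```

<div dir="ltr"><br>With interlaced storage the three components of a node are adjacent, so a sweep over the grid touching all components reads one contiguous stream. With separate blocks the same sweep reads from three locations NX*NY entries apart. The strided slice at the end is the plain-Python analogue of pulling out a single component; in PETSc, VecStrideGather serves that purpose.<br></div>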