<div class="gmail_quote">On Sun, Jun 19, 2011 at 21:44, Barry Smith <span dir="ltr"><<a href="mailto:bsmith@mcs.anl.gov">bsmith@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div id=":oh">> here are the STREAM benchmark results that Hongzhang Shan collected on Hopper for Nick's COE studies. The red curve shows performance when all of the data ends up mapped to a single memory controller. The blue curve shows the case where the data is mapped correctly using first-touch, so that the STREAM benchmark accesses data on its local memory controller (the correct NUMA mapping).<br>

</div></blockquote><div><br></div><div>If I have this right, each node of this machine has two 12-core sockets, and each socket has 4 DDR3-1333 memory buses, for a theoretical peak of about 85 GB/s per node. Getting 50 GB/s, a little under 60% of peak, is "good" by current standards.</div>
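<div><br></div><div>The peak figure can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and it assumes a 64-bit (8-byte) bus per DDR3 channel and takes DDR3-1333 as 1333 million transfers per second.</div>

```c
#include <assert.h>

/* Peak bandwidth of one memory channel in GB/s.
 * Assumes 8 bytes per transfer (64-bit bus);
 * mega_transfers is e.g. 1333 for DDR3-1333. */
double channel_peak_gbs(double mega_transfers)
{
    return mega_transfers * 1e6 * 8.0 / 1e9;
}

/* Peak bandwidth of a whole node in GB/s: every channel on every socket
 * running at its own peak at the same time. */
double node_peak_gbs(int sockets, int channels_per_socket, double mega_transfers)
{
    assert(sockets > 0 && channels_per_socket > 0);
    return sockets * channels_per_socket * channel_peak_gbs(mega_transfers);
}

/* Fraction of the node peak that a measured bandwidth reaches. */
double fraction_of_peak(double measured_gbs, double peak_gbs)
{
    assert(peak_gbs > 0.0);
    return measured_gbs / peak_gbs;
}
```

<div>With 2 sockets, 4 channels per socket and DDR3-1333, node_peak_gbs gives roughly 85.3 GB/s, and a measured 50 GB/s comes out at a bit under 0.59 of that.</div>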

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div id=":oh">

><br>

> The bottom line is that it is essential that data is touched first on the memory controller that is nearest the OpenMP processes that will be accessing it (otherwise memory bandwidth will tank).  This should occur naturally if you configure as 4 NUMA nodes with 6 threads each, as per Nathan's suggestion.  If we want to be more aggressive and use 24-way threaded parallelism per node, then extra care must be taken to ensure the memory affinity is not screwed up.</div>

</blockquote></div><br><div>Note that what matters is when the memory is first touched, not when it is allocated. Allocation just sets a pointer; it doesn't find you memory or decide where it will come from. A memset() called from a single thread will not do this correctly at all, since that one thread touches every page first.</div>
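<div><br></div><div>A minimal sketch of first-touch initialization with OpenMP, to make the point concrete. The function names are hypothetical. It assumes the default Linux first-touch page placement, threads pinned to cores, and the same thread count in both loops. Without OpenMP enabled the pragmas are ignored and the code runs serially.</div>

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate n doubles and fault the pages in parallel.
 * malloc only hands back address space; a page gets its physical
 * location when a thread first writes to it. */
double *alloc_first_touch(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (!a)
        return NULL;
    /* schedule(static) gives each thread one contiguous chunk, so each
     * chunk is first written by the thread that will use it later. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;
    return a;
}

/* STREAM-style triad, a = b + s*c, using the same static schedule so each
 * thread works on the chunk it faulted in alloc_first_touch. */
void triad(double *a, const double *b, const double *c, double s, size_t n)
{
    assert(a && b && c);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = b[i] + s * c[i];
}
```

<div>The important part is that the initialization loop and the compute loop split the index range the same way. Replacing the parallel loop in alloc_first_touch with a single memset() would put every page near the calling thread.</div>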

<div><br></div><div>One tricky case is sparse matrix allocation if the number of nonzeros per row is not nearly constant (or random). Banded matrices are a worst-case scenario. In such cases, it is difficult to get both the matrix and the vector faulted in more or less the correct place.</div>
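<div><br></div><div>One way to approach the matrix side, sketched with a hypothetical minimal CSR struct (this is not PETSc's data structure): fault each row's entries from the thread that owns that row, then use the same row partition in the multiply. It assumes row_ptr is already built, and placement is only approximate because pages are shared by neighboring rows. It does nothing for the entries of x that a row reads through its column indices, which is where the difficulty described above comes from.</div>

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical minimal CSR container, for illustration only. */
typedef struct {
    int n;         /* number of rows */
    int *row_ptr;  /* n+1 offsets into col and val */
    int *col;      /* column index of each nonzero */
    double *val;   /* value of each nonzero */
} Csr;

/* Allocate col/val for a known row_ptr and first-touch each row's entries
 * from the thread that owns the row under schedule(static).
 * Returns 0 on success, -1 if allocation fails. */
int csr_alloc_first_touch(Csr *A, int n, int *row_ptr)
{
    assert(A && n > 0 && row_ptr && row_ptr[0] == 0 && row_ptr[n] > 0);
    int nnz = row_ptr[n];
    A->n = n;
    A->row_ptr = row_ptr;
    A->col = malloc((size_t)nnz * sizeof *A->col);
    A->val = malloc((size_t)nnz * sizeof *A->val);
    if (!A->col || !A->val) {
        free(A->col);
        free(A->val);
        return -1;
    }
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            A->col[k] = 0;
            A->val[k] = 0.0;
        }
    return 0;
}

/* y = A*x with the same static row partition as the first-touch loop. */
void csr_spmv(const Csr *A, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < A->n; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->val[k] * x[A->col[k]];
        y[i] = sum;
    }
}
```

<div>Because the split is by rows, threads that own long rows get more of the col and val arrays than threads that own short rows, so an even split of the vector and an even split of the matrix data generally do not line up.</div>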

<div><br></div><div>If you use MPI processes down to the die level (6 threads each), then you don't have to put extra effort into memory affinity. This is not a bad solution right now, but my prediction is that future machines will still have a hierarchy with similar granularity (a 2- to 8-way branching factor at each level). In that case, assuming that all threads of a given process see a flat memory model (aside from cache, which is more local) will be about as progressive as ignoring threads entirely is today.</div>