[petsc-dev] Fwd: Poisson step in GTS

Matthew Knepley knepley at gmail.com
Sun Jun 19 15:19:08 CDT 2011


On Sun, Jun 19, 2011 at 8:12 PM, Jed Brown <jed at 59a2.org> wrote:

> On Sun, Jun 19, 2011 at 21:44, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>> > Here are the STREAM benchmark results that Hongzhang Shan collected on
>> > Hopper for Nick's COE studies.  The red curve shows performance when you
>> > run STREAM with all of the data mapped to a single memory controller.
>> > The blue curve shows the case when you correctly map data using
>> > first-touch, so that the STREAM benchmark accesses data on its local
>> > memory controller (the correct NUMA mapping).
>>
>
> If I have it correct, each socket of this machine (2-socket 12-core) has 4
> DDR3-1333 memory buses, for a theoretical peak of 85 GB/s per node. That
> they get 50 GB/s is "good" by current standards.
>
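
(For reference, that peak works out to 1333 MT/s * 8 bytes per transfer ~=
10.7 GB/s per channel, times 4 channels per socket and 2 sockets ~= 85 GB/s
per node.)
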
>
>> >
>> > The bottom line is that it is essential that data be touched first on
>> > the memory controller nearest the OpenMP threads that will be accessing
>> > it (otherwise memory bandwidth will tank).  This should occur naturally
>> > if you configure as 4 NUMA nodes with 6 threads each, as per Nathan's
>> > suggestion.  If we want to be more aggressive and use 24-way threaded
>> > parallelism per node, then extra care must be taken to ensure the
>> > memory affinity is not screwed up.
>>
>
> Note that placement happens when the memory is first touched, not when it
> is first allocated (allocation just reserves address space; it doesn't
> decide where the physical pages will come from). A memset() from a single
> thread will not do this correctly at all.
>
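
To make that concrete, here is a rough sketch (my illustration, not PETSc
code; compile with OpenMP enabled) of faulting a vector with the same OpenMP
loop structure the compute loops will later use, as opposed to a serial
memset that puts every page behind one controller:

#include <stdlib.h>

/* Sketch: pages land on the NUMA node of the thread that first writes
   them, so fault the array with the same static schedule the compute
   loops will use. */
double *alloc_first_touch(size_t n)
{
  double *x = (double *)malloc(n * sizeof(double)); /* address space only */
  long i, N = (long)n;
#pragma omp parallel for schedule(static)
  for (i = 0; i < N; i++) x[i] = 0.0;  /* good: each thread faults its pages */
  /* memset(x, 0, n*sizeof(double));      bad: one thread faults every page */
  return x;
}
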
> One tricky case is sparse matrix allocation if the number of nonzeros per
> row is not nearly constant (or random). Banded matrices are a worst-case
> scenario. In such cases, it is difficult to get both the matrix and the
> vector faulted more-or-less in the correct place.
>
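
Along the same lines, when the row partition used for assembly matches the
row partition a threaded matvec will use, the matrix arrays can be faulted by
their owning threads too.  A sketch with hypothetical CSR arrays (ai, aj, aa
and a precomputed rowstart_src; none of this is PETSc code):

/* Fault CSR storage with the same static row partition the threaded
   matvec will use.  When nonzeros per row vary a lot, a static-by-rows
   schedule no longer balances the nonzeros across threads, which is the
   difficulty described above. */
void csr_first_touch(long m, const long *rowstart_src,
                     long *ai, long *aj, double *aa)
{
  long i;
#pragma omp parallel for schedule(static)
  for (i = 0; i < m; i++) {
    long k;
    ai[i] = rowstart_src[i];
    for (k = rowstart_src[i]; k < rowstart_src[i+1]; k++) {
      aj[k] = 0;    /* real column indices filled in during assembly */
      aa[k] = 0.0;
    }
  }
  ai[m] = rowstart_src[m];
}
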
> If you use MPI processes down to the die level (6 threads per process),
> then you don't have to put extra effort into memory affinity. This is not a
> bad solution right now, but my prediction is that in the future we'll still
> see a hierarchy of similar granularity (a 2- to 8-way branching factor at
> each level). In that case, saying that all threads of a given process see a
> flat memory model (aside from cache, which is more local) is about as
> progressive as ignoring threads entirely is today.
>

Isn't there an API for prescribing affinity, or is this too hard to do for
most things? It seems like we know what we want in matvec.
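
I'm imagining something along these lines (a rough Linux-only sketch using
libnuma and pthreads, not anything in PETSc today; link with -lpthread
-lnuma):

#define _GNU_SOURCE
#include <pthread.h>   /* pthread_setaffinity_np */
#include <sched.h>     /* cpu_set_t, CPU_ZERO, CPU_SET */
#include <numa.h>      /* numa_available, numa_node_of_cpu, numa_alloc_onnode */

/* Pin the calling thread to a core and put its slice of a vector on that
   core's NUMA node explicitly, instead of relying on first touch. */
double *pin_and_alloc(int core, size_t nlocal)
{
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

  if (numa_available() < 0) return NULL;      /* no NUMA support on this box */
  int node = numa_node_of_cpu(core);          /* memory controller for core */
  return (double *)numa_alloc_onnode(nlocal * sizeof(double), node);
  /* release later with numa_free(ptr, nlocal * sizeof(double)) */
}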

   Matt

-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener
