[petsc-dev] Fwd: Poisson step in GTS

Jed Brown jed at 59A2.org
Sun Jun 19 15:12:50 CDT 2011


On Sun, Jun 19, 2011 at 21:44, Barry Smith <bsmith at mcs.anl.gov> wrote:

> > here is the stream benchmark results that Hongzhang Shan collected on
> Hopper for Nick's COE studies.   The red curve shows performance when you
> run stream when all of the data ends up mapped to a single memory
> controller.  The blue curve shows the case when you correctly map data using
> first-touch so that the stream benchmark accesses data on its local memory
> controller (the correct NUMA mapping).
>

If I have it right, each socket of this machine (2-socket, 12 cores per
socket) has 4 DDR3-1333 memory buses. DDR3-1333 delivers about 10.67 GB/s per
channel, so 2 x 4 x 10.67 gives a theoretical peak of roughly 85 GB/s per
node. That they get 50 GB/s is "good" by current standards.


> >
> > The bottom line is that it is essential that data is touched first on the
> memory controller that is nearest the OpenMP processes that will be
> accessing it (otherwise memory bandwidth will tank).  This should occur
> naturally if you configure as 4 NUMA nodes with 6 threads each, as per
> Nathan's suggestion.  If we want to be more aggressive and use 24-way
> threaded parallelism per node, then extra care must be taken to ensure the
> memory affinity is not screwed up.
>

Note that placement happens when the memory is first touched, not when it is
first allocated (allocation just returns a pointer; it neither finds you
physical memory nor decides where it will come from). A memset() will not do
this correctly at all: run from a single thread, it faults every page onto
that one thread's memory controller.

One tricky case is sparse matrix allocation when the number of nonzeros per
row is not nearly constant (or random). Banded matrices are a worst-case
scenario. In such cases it is difficult to get both the matrix and the vector
faulted in more-or-less the right place.

If you use MPI processes down to the die level (one process per 6 threads),
then you don't have to put extra effort into memory affinity. This is not a
bad solution right now, but my prediction is that future machines will still
present a hierarchy of similar granularity (a 2- to 8-way branching factor at
each level). In that case, pretending that all threads of a given process see
a flat memory model (aside from cache, which is more local) will be about as
forward-looking as ignoring threads entirely is today.