[petsc-dev] Fwd: Poisson step in GTS

Barry Smith bsmith at mcs.anl.gov
Sun Jun 19 18:15:19 CDT 2011



Begin forwarded message:

> From: John Shalf <jshalf at lbl.gov>
> Date: June 19, 2011 5:15:49 PM CDT
> To: Barry Smith <bsmith at mcs.anl.gov>
> Cc: Nathan Wichmann <wichmann at cray.com>, Lois Curfman McInnes <curfman at mcs.anl.gov>, Satish Balay <balay at mcs.anl.gov>, Alice Koniges <aekoniges at lbl.gov>, Robert Preissl <rpreissl at lbl.gov>, Erich Strohmaier <EStrohmaier at lbl.gov>, Stephane Ethier <ethier at pppl.gov>
> Subject: Re: Poisson step in GTS
> 
> 
> On Jun 19, 2011, at 9:44 PM, Barry Smith wrote:
>> On Jun 19, 2011, at 5:34 AM, John Shalf wrote:
>>> Hi Barry,
>>> here are the stream benchmark results that Hongzhang Shan collected on Hopper for Nick's COE studies.   The red curve shows performance when you run stream with all of the data mapped to a single memory controller.  The blue curve shows the case when you correctly map data using first-touch so that the stream benchmark accesses data on its local memory controller (the correct NUMA mapping). 
>> 
>>  How does one "correctly map data using first-touch"? (Reference ok).
> 
> The AMD nodes (and even the Intel Nehalem nodes) have memory controllers integrated onto the processor chips.  The processor chips are linked together into a node using HyperTransport (for the AMD chips) or QPI (for Intel), and those inter-chip links are slower than the local memory controllers on each chip.  Consequently, accessing memory through the memory controller that is co-located on the same die as the CPUs has much lower latency and higher bandwidth than going through the memory controller on one of the other dies.  
> 
> So you need some way of identifying which memory controller should "own" a piece of memory, so that you can keep it close to the processors that will primarily be using it.  The "first touch" memory affinity policy says that a memory page gets mapped to the memory controller that is *closest* to the first processor core that writes a value to that page.  So you can malloc() data at any time, but the pages get assigned to memory controllers based on the first processor to "touch" each page.  If you touch the data (and thereby assign it to memory controllers) in a different layout than the one you actually use, then most accesses will be non-local, and therefore very slow.  If you touch the data with the processor that will primarily be accessing it later on, then it will get mapped to a local memory controller and will therefore be much faster.  So you have to be very careful about how you first touch data to ensure good memory/stream performance.
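> A minimal sketch (not from GTS; the names are illustrative) of first-touch-aware allocation with OpenMP, assuming the compute loops are partitioned with the same static schedule:
> 
> 	#include <stdlib.h>
> 	#include <omp.h>
> 
> 	double *alloc_first_touch(size_t n)
> 	{
> 	    double *x = malloc(n * sizeof(double)); /* malloc reserves address space; pages are not mapped yet */
> 	#pragma omp parallel for schedule(static)
> 	    for (size_t i = 0; i < n; i++)
> 	        x[i] = 0.0;                         /* first write maps each page near the touching thread's die */
> 	    return x;
> 	}
> 
> The point is that the initialization loop is split across threads exactly like the later compute loops, so each page lands on the memory controller local to the thread that will actually use it.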
> 
> See the NERSC FAQ on the "first touch" principle:
> http://www.nersc.gov/users/computational-systems/hopper/getting-started/multi-core-faq/
> 
>>> The bottom line is that it is essential that data is touched first on the memory controller that is nearest the OpenMP processes that will be accessing it (otherwise memory bandwidth will tank).  This should occur naturally if you configure as 4 NUMA nodes with 6 threads each, as per Nathan's suggestion.
>> 
>>  How does one "configure as 4 NUMA nodes with 6 threads each"? Do you mean 4 MPI processes (each with 6 threads) or something different?
> 
> That is correct.  The Cray XE6 node (the dual-socket AMD Magny-Cours) has a total of 4 dies (and hence 4 NUMA domains).  Cray refers to each of these dies (each of which has its own local memory controller) as a NUMA node, to distinguish it from the full node that contains four of these dies.  Within a NUMA node, there are no NUMA issues. 
> 
> So Cray refers to these dies (these sub-sections of a node where there are no NUMA issues) as numa_nodes.  You can use 'aprun' to launch tasks so that you get one task per numa_node and the threads within that numa_node will not have to worry about the first touch stuff we talked about above.  For Hopper, that is 4 numa_nodes per node, and 6 OpenMP threads per numa_node.
> 
> e.g. (4 MPI tasks per node, one per numa_node, 6 OpenMP threads per task):
> 	export OMP_NUM_THREADS=6
> 	aprun -n <ntasks> -N 4 -S 1 -d 6 ./your_app
> 
>>> If we want to be more aggressive and use 24-way threaded parallelism per node, then extra care must be taken to ensure the memory affinity is not screwed up.
>> 
>>  BTW: What is an "OpenMP thread"  mapped to on the Cray systems? A pthread? Some other kind of thread?
> 
> I'm not sure what you mean here.  OpenMP is a set of directives for threading.  So an OpenMP "thread" is just one of the threads you assign to each MPI task (with OpenMP operating within each MPI task).
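> For concreteness, a toy hybrid sketch (not GTS code; purely illustrative) showing how the OpenMP threads live inside each MPI task:
> 
> 	#include <mpi.h>
> 	#include <omp.h>
> 	#include <stdio.h>
> 
> 	int main(int argc, char **argv)
> 	{
> 	    int rank, provided;
> 	    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
> 	    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 	#pragma omp parallel                      /* forks OMP_NUM_THREADS threads inside this MPI task */
> 	    printf("rank %d: thread %d of %d\n",
> 	           rank, omp_get_thread_num(), omp_get_num_threads());
> 	    MPI_Finalize();
> 	    return 0;
> 	}
> 
> Launched with the aprun line above (OMP_NUM_THREADS=6), each of the 4 tasks per node would report 6 threads, all running within one numa_node.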
> 
> -john
> 



