[petsc-dev] Fwd: Poisson step in GTS

Gerard Gorman g.gorman at imperial.ac.uk
Tue Jun 21 19:58:46 CDT 2011


Hi

I have been experimenting with NUMA optimisations since coming across a
tutorial by Georg Hager et al from SC2010.
http://www.hpc.lsu.edu/training/tutorials/sc10/tutorials/SC10Tutorials/docs/M16/M16.pdf
Page 36 is really what got me motivated.

Assuming Linux with kernel version >= 2.4, the short version seems to be
that you have to ensure two things:
- exploit the first touch policy to ensure page faults are satisfied by
the nearest NUMA memory node
- ensure thread-core affinity

Dealing with the first requirement is relatively straightforward. You
just make sure the thread that's initialising the memory is the same one
using it. The standard example is repeated in the Intel doc

""
  ! initialisation                   ! subsequent use
  !$OMP parallel do                  !$OMP parallel do
  do i=1,n                           do i=1,n
    do j=1,m                           do j=1,m
      A(j,i) = 0.0                       dowork(A(j,i))
    enddo                              enddo
  enddo                              enddo
""
from
http://software.intel.com/en-us/articles/threading-fortran-applications-for-parallel-performance-on-multi-core-systems/
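The same idiom in C, which is closer to how a PETSc application would
look (a minimal sketch; the array, schedule and dowork() kernel are just
placeholders): malloc() only reserves address space, and each page is
bound to a memory controller by whichever thread writes it first, so the
initialising loop must use the same thread distribution as the compute
loop.

  /* First-touch initialisation in C (sketch).  dowork() is a hypothetical
   * per-entry kernel; the point is that both loops use the same static
   * schedule, so each thread touches the pages it will later use. */
  #include <stdlib.h>

  static void dowork(double *x) { *x += 1.0; }  /* placeholder kernel */

  void first_touch_example(long n)
  {
    double *a = malloc((size_t)n * sizeof(*a));

    #pragma omp parallel for schedule(static)   /* first touch: place pages */
    for (long i = 0; i < n; ++i) a[i] = 0.0;

    #pragma omp parallel for schedule(static)   /* same schedule: local access */
    for (long i = 0; i < n; ++i) dowork(&a[i]);

    free(a);
  }
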
In the context of PETSc: if you preallocate PETSc matrices then you can
make sure MatSetValues is called by the right thread. Taking a finite
element perspective on this: you can avoid race conditions when
assembling matrices by using coloring to define independent sets; you
also need to renumber so that data is traversed sensibly (i.e. when a
page is faulted by a write, you want the next few writes to that page to
be from the same thread). Most of these changes are clearly on the
application code side - although when the dust settles it would be nice
to add OpenMP guidelines to the PETSc docs. The problem we've run into
here (and I'm rather hoping other people start having the same problem
so we can get it fixed ;) is that stashing (required for non-local
assembly) is not itself thread safe.
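
For what it's worth, here is a minimal C sketch of the coloring idea
just described. It assumes a preallocated matrix, 4-dof elements, and
purely local insertions; element_dofs() and element_matrix() are
hypothetical application-side helpers, not PETSc routines, and since
stashing is not thread safe (as noted above), any off-process
contributions would still have to be inserted serially. Treat it as a
sketch of the data layout rather than a supported recipe.

  /* Sketch: assemble one color (independent set) at a time.  Within a
   * color no two elements share a matrix row, so threads do not race on
   * the same entries; off-process rows would still hit the stash, which
   * is not thread safe, and must be handled serially. */
  #include <petscmat.h>

  /* hypothetical application-side helpers (not part of PETSc) */
  extern void element_dofs(PetscInt e, PetscInt dofs[4]);
  extern void element_matrix(PetscInt e, PetscScalar Ke[16]);

  PetscErrorCode AssembleColored(Mat A, PetscInt ncolors,
                                 const PetscInt *color_start, /* length ncolors+1 */
                                 const PetscInt *elements)    /* element ids grouped by color */
  {
    for (PetscInt c = 0; c < ncolors; ++c) {
      #pragma omp parallel for
      for (PetscInt e = color_start[c]; e < color_start[c+1]; ++e) {
        PetscInt    dofs[4];
        PetscScalar Ke[16];
        element_dofs(elements[e], dofs);     /* global row/column indices  */
        element_matrix(elements[e], Ke);     /* local element contribution */
        MatSetValues(A, 4, dofs, 4, dofs, Ke, ADD_VALUES);
      }
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
    return 0;
  }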

The second issue is thread affinity. The 2.6 Linux kernel by default
applies soft CPU affinity - i.e. threads do not frequently migrate
between cores. You can also define hard affinity: gcc uses the
environment variable GOMP_CPU_AFFINITY, Intel uses KMP_AFFINITY (there
are additional options that can be compiled in). Playing around with
hard affinity I found that it certainly makes a significant difference -
but what's not yet clear to me is whether it makes much difference over
soft affinity when you're using exclusive compute nodes in a cluster.
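
Hard affinity can also be requested from code rather than through those
environment variables. A minimal sketch, assuming a Linux/glibc system
and a naive "thread i goes to core i" mapping (on a real NUMA node you
would map threads onto the cores of the local die):

  /* Pin each OpenMP thread to one core - roughly what GOMP_CPU_AFFINITY /
   * KMP_AFFINITY do for you.  The thread-to-core mapping is a placeholder. */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <pthread.h>
  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
    #pragma omp parallel
    {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(omp_get_thread_num(), &set);   /* assume thread i -> core i */
      pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
      printf("thread %d on core %d\n", omp_get_thread_num(), sched_getcpu());
    }
    return 0;
  }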

As a final comment - I really think we should be pushing for having one
MPI process per NUMA node and just getting to grips with NUMA
optimisations, as they don't appear to be overly burdensome. And there
is quite a lot of literature out there illustrating the benefit. For
many algorithms, simply having additional partitions retards scaling.

Cheers
Gerard

 
Barry Smith emailed the following on 20/06/11 00:15:
>
> Begin forwarded message:
>
>> From: John Shalf <jshalf at lbl.gov>
>> Date: June 19, 2011 5:15:49 PM CDT
>> To: Barry Smith <bsmith at mcs.anl.gov>
>> Cc: Nathan Wichmann <wichmann at cray.com>, Lois Curfman McInnes <curfman at mcs.anl.gov>, Satish Balay <balay at mcs.anl.gov>, Alice Koniges <aekoniges at lbl.gov>, Robert Preissl <rpreissl at lbl.gov>, Erich Strohmaier <EStrohmaier at lbl.gov>, Stephane Ethier <ethier at pppl.gov>
>> Subject: Re: Poisson step in GTS
>>
>>
>> On Jun 19, 2011, at 9:44 PM, Barry Smith wrote:
>>> On Jun 19, 2011, at 5:34 AM, John Shalf wrote:
>>>> Hi Barry,
>>>> here are the stream benchmark results that Hongzhang Shan collected on Hopper for Nick's COE studies.   The red curve shows performance when you run stream and all of the data ends up mapped to a single memory controller.  The blue curve shows the case when you correctly map data using first-touch so that the stream benchmark accesses data on its local memory controller (the correct NUMA mapping). 
>>>  How does one "correctly map data using first-touch"? (Reference ok).
>> The AMD nodes (and even the Intel Nehalem nodes) have memory controllers integrated onto the processor chips.  The processor chips are integrated together into a node using HyperTransport for the AMD chips (or QPI for Intel), which happens to be slower per link than the memory bandwidth of the local memory controller on each of these chips.  Consequently, accessing memory through the memory controller that is co-located on the die with the CPUs has much lower latency and higher bandwidth than accessing memory through the memory controller on one of the other dies.  
>>
>> So you need to have some way of identifying which memory controller should "own" a piece of memory so that you can keep it closer to the processors that will primarily be using it.  The "first touch" memory affinity policy says that a memory page gets mapped to the memory controller that is *closest* to the first processor core to write a value to that page.  So you can malloc() data at any time, but the pages get assigned to memory controllers based on the first processor to "touch" that memory page.  If you touch the data (and thereby assign it to memory controllers) in a different layout than you actually use it, then most accesses will be non-local, and therefore will be very slow.  If you touch data using the processor that will primarily be accessing the data later on, then it will get mapped to a local memory controller and will therefore be much faster.  So you have to be very careful about how you first touch data to ensure good memory/stream performance.
>>
>> You can ref. the NERSC FAQ on the "first touch" principle.
>> http://www.nersc.gov/users/computational-systems/hopper/getting-started/multi-core-faq/
>>
>>>> The bottom line is that it is essential that data is touched first on the memory controller that is nearest the OpenMP processes that will be accessing it (otherwise memory bandwidth will tank).  This should occur naturally if you configure as 4 NUMA nodes with 6 threads each, as per Nathan's suggestion.
>>>  How does one "configure as 4 NUMA nodes with 6 threads each"? Do you mean 4 MPI processes (each with 6 threads or something different?)
>> That is correct.  The Cray XE6 node (the dual-socket AMD Magny-Cours) has a total of 4 dies (and hence 4 NUMA domains).  Cray refers to these dies (each of which has its own local memory controller) as a NUMA-node to distinguish it from the full node that contains four of these dies.  Within a NUMA node, there are no NUMA issues. 
>>
>> So Cray refers to these dies (these sub-sections of a node where there are no NUMA issues) as numa_nodes.  You can use 'aprun' to launch tasks so that you get one task per numa_node and the threads within that numa_node will not have to worry about the first touch stuff we talked about above.  For Hopper, that is 4 numa_nodes per node, and 6 OpenMP threads per numa_node.
>>
>> e.g.
>> 	aprun -S threads_per_numa_node=6 -sn numa_nodes_per_node=4
>>
>>>> If we want to be more aggressive and use 24-way threaded parallelism per node, then extra care must be taken to ensure the memory affinity is not screwed up.
>>>  BTW: What is an "OpenMP thread"  mapped to on the Cray systems? A pthread? Some other kind of thread?
>> I'm not sure what you mean here.  OpenMP is directives for threading.  So an OpenMP "thread" is just how many threads you assign to each MPI task (with OpenMP operating within each MPI task).
>>
>> -john
>>