[petsc-dev] Fwd: Poisson step in GTS

Barry Smith bsmith at mcs.anl.gov
Sun Jun 19 18:15:09 CDT 2011



Begin forwarded message:

> From: Nathan Wichmann <wichmann at cray.com>
> Date: June 19, 2011 4:15:48 PM CDT
> To: Barry Smith <bsmith at mcs.anl.gov>, John Shalf <JShalf at lbl.gov>
> Cc: Lois Curfman McInnes <curfman at mcs.anl.gov>, Satish Balay <balay at mcs.anl.gov>, Alice Koniges <aekoniges at lbl.gov>, Robert Preissl <rpreissl at lbl.gov>, Erich Strohmaier <EStrohmaier at lbl.gov>, Stephane Ethier <ethier at pppl.gov>
> Subject: RE: Poisson step in GTS
> 
> Q:  How does one "correctly map data using first-touch"? (Reference ok).
> 
> A:  The default policy on the XE6 is that the first process/thread to access a page determines its placement: the page is physically allocated, if possible, on the memory closest to the core running that process/thread.  This basically means that if you are running with 4 or more MPI ranks per node and 6 or fewer OMP threads per rank, then you don't have to do anything.  If you want to run with more than 6 OMP threads then you have to worry about this a lot more.
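> 
> For illustration, here is a minimal first-touch sketch in C with OpenMP (the array name and size are made up; the point is that the initialization loop uses the same static schedule as the later compute loops, so each thread faults in, and thereby places, the pages it will use):
> 
>   #include <stdlib.h>
> 
>   int main(void)
>   {
>     int n = 100000000;                 /* hypothetical vector length */
>     double *x = malloc((size_t)n * sizeof(double));
> 
>     /* First touch: each thread initializes (and so physically
>        places) the pages it will later compute on. */
>     #pragma omp parallel for schedule(static)
>     for (int i = 0; i < n; i++)
>       x[i] = 0.0;
> 
>     /* Later loops using the same schedule(static) then access
>        mostly NUMA-local pages. */
>     free(x);
>     return 0;
>   }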
> 
> Q:  How does one "configure as 4 NUMA nodes with 6 threads each"? Do you mean 4 MPI processes (each with 6 threads or something different?)
> 
> A:  If/when you run on a Cray XE system you will find that you have to use something called aprun to launch jobs.  Various options to aprun tell it how to launch your job.  The options include, but are not limited to, "-n", which specifies the number of MPI ranks, and "-d", which specifies the number of cores allocated to each rank that are available to run OMP threads.  The actual number of OMP threads is set via the env var OMP_NUM_THREADS.  There are many more options that can affect placement, but these are the easiest to understand and the most important for this discussion, IMHO.  See the example below.
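> 
> For example, on a 24-core XE6 node (4 Magny-Cours dies of 6 cores each), a plausible way to get "4 NUMA nodes with 6 threads each" would be something like the following (the executable name is made up):
> 
>   export OMP_NUM_THREADS=6
>   aprun -n 4 -N 4 -d 6 ./gts.x
> 
> Here "-n 4" requests 4 MPI ranks in total, "-N 4" places all 4 on one node, and "-d 6" gives each rank 6 cores for its OMP threads.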
> 
> Q:  What is an "OpenMP thread"  mapped to on the Cray systems? A pthread? Some other kind of thread?
> 
> A:  I think it is mapped to pthreads, but I prefer to keep it abstract, as it is implementation dependent and one cannot mix user pthreads with OMP in the same app and hope to get decent results.  :-)
> 
> Nathan
> 
> 
> 
> 
> Nathan Wichmann                Cray Inc.
> wichmann at cray.com              380 Jackson St
> Applications Engineer          St Paul, MN 55101
> office:  1-800-284-2729  x605-9079          
>  cell:  651-428-1131
> 
> -----Original Message-----
> From: Barry Smith [mailto:bsmith at mcs.anl.gov] 
> Sent: Sunday, June 19, 2011 2:44 PM
> To: John Shalf
> Cc: Nathan Wichmann; Lois Curfman McInnes; Satish Balay; Alice Koniges; Robert Preissl; Erich Strohmaier; Stephane Ethier
> Subject: Re: Poisson step in GTS
> 
> 
> On Jun 19, 2011, at 5:34 AM, John Shalf wrote:
> 
>> Hi Barry,
>> here are the STREAM benchmark results that Hongzhang Shan collected on Hopper for Nick's COE studies.  The red curve shows performance when all of the data ends up mapped to a single memory controller.  The blue curve shows the case when you correctly map the data using first-touch, so that STREAM accesses data on its local memory controller (the correct NUMA mapping).
> 
>   How does one "correctly map data using first-touch"? (Reference ok).
>> 
>> 
>> The bottom line is that it is essential that data be touched first on the memory controller nearest the OpenMP threads that will be accessing it (otherwise memory bandwidth will tank).  This should occur naturally if you configure as 4 NUMA nodes with 6 threads each, per Nathan's suggestion.
> 
>   How does one "configure as 4 NUMA nodes with 6 threads each"? Do you mean 4 MPI processes (each with 6 threads or something different?)
> 
>> If we want to be more aggressive and use 24-way threaded parallelism per node, then extra care must be taken to ensure the memory affinity is not screwed up.
> 
>   BTW: What is an "OpenMP thread"  mapped to on the Cray systems? A pthread? Some other kind of thread?
> 
>   Barry
> 
>> 
>> -john
>> 
>> On Jun 18, 2011, at 10:13 AM, Barry Smith wrote:
>>> On Jun 18, 2011, at 9:35 AM, Nathan Wichmann wrote:
>>>> Hi Robert, Barry and all,
>>>> 
>>>> Is it our assumption that the Poisson version of GTS will normally be run with 1 MPI rank per die and 6 OMP threads (on AMD Magny-Cours)?
>>> 
>>> Our new vector and matrix classes will allow the flexibility of any number of MPI processes and any number of threads under that. So 1 MPI rank and 6 threads is supportable.
>>> 
>>>> In that case there should be sufficient bandwidth for decent scaling; I would expect something like Barry's Intel experience.  Barry is certainly correct that as one uses more cores one becomes more bandwidth-limited.
>>> 
>>> I would be interested in seeing the OpenMP STREAM numbers for this system.
>>>> 
>>>> I also like John's comment: "we have little faith that the compiler will do anything intelligent."  Which compiler are you using?  If you are using CCE then you should get a .lst file to see what it is doing.  Probably the only thing that can and should be done is to unroll the inner loop.
>>> 
>>> Do you folks provide thread-based BLAS 1 operations, for example ddot, dscal, daxpy?  If so, we can piggy-back on those to get the best possible performance for the vector operations.
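>>> 
>>> For concreteness, by a thread-based BLAS 1 operation I mean something like the following plain-OpenMP sketch (illustrative only, not Cray's interface):
>>> 
>>>   /* daxpy: y <- alpha*x + y, work split statically across threads */
>>>   void daxpy_omp(int n, double alpha, const double *x, double *y)
>>>   {
>>>     #pragma omp parallel for schedule(static)
>>>     for (int i = 0; i < n; i++)
>>>       y[i] += alpha * x[i];
>>>   }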
>>>> 
>>>> Another consideration is the typical size of "n".  Normally, the denser the matrix, the larger n is, no?  But still, it would be interesting to know.
>>> 
>>> In this application the matrix is extremely sparse, with likely between 7 and 27 nonzeros per row.  Matrices, of course, can get as big as you like.
>>> 
>>> Barry
>> 
>> <PastedGraphic-1.pdf>
> 



