[petsc-dev] Fwd: Poisson step in GTS

Barry Smith bsmith at mcs.anl.gov
Sun Jun 19 14:44:26 CDT 2011



Begin forwarded message:

> From: John Shalf <jshalf at lbl.gov>
> Date: June 19, 2011 5:34:59 AM CDT
> To: Barry Smith <bsmith at mcs.anl.gov>
> Cc: Nathan Wichmann <wichmann at cray.com>, Lois Curfman McInnes <curfman at mcs.anl.gov>, Satish Balay <balay at mcs.anl.gov>, Alice Koniges <aekoniges at lbl.gov>, Robert Preissl <rpreissl at lbl.gov>, Erich Strohmaier <EStrohmaier at lbl.gov>, Stephane Ethier <ethier at pppl.gov>
> Subject: Re: Poisson step in GTS
> 
> Hi Barry,
> here are the STREAM benchmark results that Hongzhang Shan collected on Hopper for Nick's COE studies.   The red curve shows performance when you run STREAM with all of the data mapped to a single memory controller.  The blue curve shows the case when you correctly map data using first-touch so that the STREAM benchmark accesses data on its local memory controller (the correct NUMA mapping). 
> 
> The bottom line is that it is essential that data is touched first on the memory controller that is nearest the OpenMP processes that will be accessing it (otherwise memory bandwidth will tank).  This should occur naturally if you configure as 4 NUMA nodes with 6 threads each, as per Nathan's suggestion.  If we want to be more aggressive and use 24-way threaded parallelism per node, then extra care must be taken to ensure the memory affinity is not screwed up.
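> To make that concrete, the first-touch initialization would look something like the sketch below (made-up names, not the actual GTS code): allocate the array, then write to it in an OpenMP loop with the same static schedule the compute loops will later use, so each page faults in next to the thread that will use it.
> 
>   #include <stdlib.h>
> 
>   /* Allocate and first-touch an array: with a static schedule each
>      thread writes its own chunk, so those pages are mapped to the
>      memory controller nearest that thread. */
>   double *alloc_first_touch(size_t n)
>   {
>     double *x = malloc(n * sizeof(*x));
>   #pragma omp parallel for schedule(static)
>     for (long i = 0; i < (long)n; i++)
>       x[i] = 0.0;        /* placeholder; real data written later */
>     return x;
>   }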
> 
> -john
> 
> On Jun 18, 2011, at 10:13 AM, Barry Smith wrote:
>> On Jun 18, 2011, at 9:35 AM, Nathan Wichmann wrote:
>>> Hi Robert, Barry and all,
>>> 
>>> Is it our assumption that the Poisson version of GTS will normally be run with 1 MPI rank per die and 6 OpenMP threads (on an AMD Magny-Cours)?
>> 
>>  Our new vector and matrix classes will allow the flexibility of any number of MPI processes and any number of threads under that. So 1 MPI rank and 6 threads is supportable.
>> 
>>> In that case there should be sufficient bandwidth for decent scaling; I would expect something like Barry's Intel experience.  Barry is certainly correct that as one uses more cores one will become more bandwidth limited.
>> 
>>  I would be interested in seeing the OpenMP STREAM results for this system.
>>> 
>>> I also like John's comment: "we have little faith that the compiler will do anything intelligent."  Which compiler are you using?  If you are using CCE then you should generate a listing (.lst) file to see what it is doing.  Probably the only thing that can and should be done is to unroll the inner loop.
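>>> For an inner loop like a sparse row dot product, unrolling by hand would look something along these lines (a hypothetical fragment to show the idea, not the GTS source):
>>> 
>>>   /* Row dot product manually unrolled by 4; the cleanup loop
>>>      handles the remaining entries. */
>>>   double row_dot(int start, int end, const double *a,
>>>                  const int *col, const double *x)
>>>   {
>>>     double sum = 0.0;
>>>     int j = start;
>>>     for (; j + 3 < end; j += 4)
>>>       sum += a[j]   * x[col[j]]   + a[j+1] * x[col[j+1]]
>>>            + a[j+2] * x[col[j+2]] + a[j+3] * x[col[j+3]];
>>>     for (; j < end; j++)
>>>       sum += a[j] * x[col[j]];
>>>     return sum;
>>>   }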
>> 
>> Do you folks provide threaded BLAS 1 operations, for example ddot, dscal, and daxpy? If so, we can piggy-back on those to get the best possible performance on the vector operations.
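>> If not, a hand-rolled OpenMP version is the obvious fallback; e.g. a ddot could be written as below (just a sketch, the name and interface are made up rather than an existing PETSc or vendor routine):
>> 
>>   /* Threaded dot product: each thread sums a static chunk of the
>>      vectors and OpenMP combines the partial sums via the reduction. */
>>   double ddot_omp(int n, const double *x, const double *y)
>>   {
>>     double sum = 0.0;
>>   #pragma omp parallel for reduction(+:sum) schedule(static)
>>     for (int i = 0; i < n; i++)
>>       sum += x[i] * y[i];
>>     return sum;
>>   }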
>>> 
>>> Another consideration is the typical size of "n".  Normally, the denser the matrix, the larger n is, no?  But still, it would be interesting to know.
>> 
>> In this application the matrix is extremely sparse, likely between 7 and 27 nonzeros per row. Matrices, of course, can get as big as you like.
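>> For a matrix that sparse, the matrix-vector product is essentially a streaming operation; threaded over rows it looks roughly like the sketch below (generic CSR array names, not PETSc's internal ones), and y and the matrix rows should be first-touched with the same static schedule.
>> 
>>   /* OpenMP CSR matrix-vector product y = A*x for a matrix with only
>>      a handful of nonzeros per row; performance is bandwidth bound. */
>>   void csr_matvec(int m, const int *rowptr, const int *col,
>>                   const double *val, const double *x, double *y)
>>   {
>>   #pragma omp parallel for schedule(static)
>>     for (int i = 0; i < m; i++) {
>>       double sum = 0.0;
>>       for (int j = rowptr[i]; j < rowptr[i+1]; j++)
>>         sum += val[j] * x[col[j]];
>>       y[i] = sum;
>>     }
>>   }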
>> 
>>  Barry
> 
[see attached file: PastedGraphic-1.pdf]
[non-text attachment scrubbed by the archive: PastedGraphic-1.pdf, application/pdf, 30010 bytes, <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20110619/7e4caa75/attachment.pdf>]

