[petsc-users] Understanding matmult memory performance

Lawrence Mitchell lawrence.mitchell at imperial.ac.uk
Fri Sep 29 09:24:23 CDT 2017


> On 29 Sep 2017, at 15:05, Tobin Isaac <tisaac at cc.gatech.edu> wrote:
> 
> On Fri, Sep 29, 2017 at 09:04:47AM -0400, Tobin Isaac wrote:
>> On Fri, Sep 29, 2017 at 12:19:54PM +0100, Lawrence Mitchell wrote:
>>> Dear all,
>>> 
>>> I'm attempting to understand some results I'm getting for matmult performance.  In particular, it looks like I'm obtaining timings that suggest that I'm getting more main memory bandwidth than I think is possible.
>>> 
>>> The run setup uses two 24-core (dual-socket) Ivy Bridge nodes (Xeon E5-2697 v2).  The specced main memory bandwidth is 85.3 GB/s per node, and I measure a STREAM triad bandwidth of 148.2 GB/s using 48 MPI processes (two nodes).  The last-level cache is 30 MB (shared between 12 cores).
>> 
>> One thought: triad has a 1:2 write:read ratio, but with your MatMult()
>> for P3 you would have about 1:50.  Unless triad uses nontemporal
>> stores, the reported bandwidth from triad will be about 3/4 of the
>> bandwidth available to pure streaming reads, so maybe you actually
>> have ~197 GB/s of read bandwidth available.  MatMult() would still be
>> doing suspiciously well, but it would be within the measurements.  How
>> confident are you in the specced bandwidth?
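
(Spelling that correction out: triad is a[i] = b[i] + s*c[i], so STREAM
counts three arrays' worth of traffic, but without non-temporal stores
every store to a[] first pulls the cache line in with a write-allocate
read, so four arrays actually cross the bus.  The reported rate is thus
3/4 of the achievable read bandwidth, and 148.2 * 4/3 ~= 197.6 GB/s.)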

I thought I was quite confident, but as you note:

> Are you running on archer?  I found one site [1] that lists the
> bandwidth you gave, which corresponds to DDR3-1333, but other sites
> [2] all say the nodes have DDR3-1833, in which case you would be
> getting about 80% of spec bandwidth.
> 
> [1]: https://www.archer.ac.uk/documentation/best-practice-guide/arch.php
> [2]: https://www.epcc.ed.ac.uk/blog/2013/11/20/archer-next-national-hpc-service-academic-research

I am, yes.  I will write to them to confirm, and get them to change their website!  Yeah, 85.3 was computed assuming DDR3-1333, whereas DDR3-1866 gives 119.4 GB/s per node.  Phew :).
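
For the record, the arithmetic behind those peak numbers is just (a
back-of-envelope sketch, assuming 4 memory channels per socket, 8 bytes
per transfer, and 2 sockets per node):

    peak per node = MT/s x 8 bytes x 4 channels x 2 sockets
    DDR3-1333: 1333 x 8 x 4 x 2 =  85.3 GB/s
    DDR3-1866: 1866 x 8 x 4 x 2 = 119.4 GB/s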

Karl wrote:

> according to
> https://ark.intel.com/products/75283/Intel-Xeon-Processor-E5-2697-v2-30M-Cache-2_70-GHz
> you get 59.7 GB/sec of peak memory bandwidth per CPU, so you should get about 240 GB/sec for your two-node system.

That figure already assumes DDR3-1866 RAM (59.7 GB/s per socket is exactly 1866 MT/s x 8 bytes x 4 channels), which, as noted above, is probably what the nodes have.  So ~239 GB/s across the two nodes looks like the right peak.

> If you use PETSc's `make streams`, then processor placement may - unfortunately - not be ideal and hence underestimate the achievable performance. Have a look at the new PETSc 3.8 manual [1], Chapter 14, where Richard and I nailed down some of these performance aspects.

Thanks, I am using make streams (or rather, just running the MPIVersion by hand).  I have long been led to believe that Cray's aprun does a reasonable job of process placement (and pinning) for pure MPI jobs, so I have just trusted that.
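
For completeness, the sort of aprun invocation I mean is (flags as in
the standard aprun man page; site defaults may differ):

    aprun -n 48 -N 24 -cc cpu ./MPIVersion

i.e. 48 ranks, 24 per node, each pinned to a single core.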

On another note, I modified the MPIVersion to report all four STREAM kernels rather than just the Triad number, and got:

Copy:         91467.3281    Rate (MB/s)
Scale:        63774.9615    Rate (MB/s)
Add:          73994.6106    Rate (MB/s)
Triad:        73564.8991    Rate (MB/s)

Inspecting the assembly, none of these use non-temporal stores, though the copy has been turned into a call to memcpy (which might use them internally).
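
As a sanity check on the memcpy theory: if the copy streams its stores
it reports close to the raw bus bandwidth, while triad (paying a
write-allocate on a[]) reports 3/4 of it.  Indeed 73564.9 * 4/3 ~= 98.1
GB/s is in the same ballpark as the 91.5 GB/s copy number, and, if this
run was a single node, ~92-98 GB/s against a 119.4 GB/s peak is about
the 80% of spec Tobin estimated.  A minimal sketch of a non-temporal
copy one could time against memcpy (SSE2 intrinsics; assumes
16-byte-aligned arrays and even n -- not something I actually ran):

    #include <emmintrin.h>
    #include <stddef.h>

    /* Copy a -> c with streaming (non-temporal) stores, avoiding the
       write-allocate read that a plain assignment loop incurs. */
    static void copy_nt(double *c, const double *a, size_t n)
    {
      for (size_t j = 0; j < n; j += 2)
        _mm_stream_pd(&c[j], _mm_load_pd(&a[j]));  /* 16-byte NT store */
      _mm_sfence();  /* make NT stores globally visible before timing stops */
    }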

In any case, those numbers would have led me to question the spec numbers stated on the website sooner.

Thanks all,

Lawrence