[petsc-users] Understanding matmult memory performance
Lawrence Mitchell
lawrence.mitchell at imperial.ac.uk
Fri Sep 29 09:24:23 CDT 2017
> On 29 Sep 2017, at 15:05, Tobin Isaac <tisaac at cc.gatech.edu> wrote:
>
> On Fri, Sep 29, 2017 at 09:04:47AM -0400, Tobin Isaac wrote:
>> On Fri, Sep 29, 2017 at 12:19:54PM +0100, Lawrence Mitchell wrote:
>>> Dear all,
>>>
>>> I'm attempting to understand some results I'm getting for MatMult performance. In particular, my timings suggest that I'm getting more main-memory bandwidth than I think is possible.
>>>
>>> The run setup uses two 24-core (dual-socket) Ivy Bridge nodes (Xeon E5-2697 v2). The specced main-memory bandwidth is 85.3 GB/s per node, and I measure a STREAM triad bandwidth of 148.2 GB/s using 48 MPI processes (two nodes). The last-level cache is 30 MB (shared between the 12 cores of a socket).
>>
>> One thought: triad has a 1:2 write:read ratio, but with your MatMult()
>> for P3 you would have about 1:50. Unless triad uses non-temporal
>> stores, the bandwidth it reports will be about 3/4 of the bandwidth
>> available to pure streaming reads, so maybe you actually have
>> ~197 GB/s of read bandwidth available. MatMult() would still be doing
>> suspiciously well, but it would be within the measurements. How
>> confident are you in the specced bandwidth?
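(To spell out that 3/4 factor: the triad kernel a[i] = b[i] + s*c[i] is counted by STREAM as 24 bytes of traffic per iteration, two 8-byte reads plus one 8-byte write; but an ordinary store first pulls the destination cache line in from memory, so the memory system actually moves 32 bytes per iteration. STREAM therefore reports 24/32 = 3/4 of the bandwidth the hardware is sustaining; non-temporal stores avoid the extra read.)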
I thought I was quite confident, but as you note:
> Are you running on archer? I found one site [1] that lists the
> bandwidth you gave, which corresponds to DDR3-1333, but other sites
> [2] all say the nodes have DDR3-1866, in which case you would be
> getting about 80% of spec bandwidth.
>
> [1]: https://www.archer.ac.uk/documentation/best-practice-guide/arch.php
> [2]: https://www.epcc.ed.ac.uk/blog/2013/11/20/archer-next-national-hpc-service-academic-research
I am, yes. I will write to them to confirm, and get them to change their website! Yeah, 85.3 GB/s was computed assuming DDR3-1333, whereas DDR3-1866 gives 119.5 GB/s per node. Phew :).
Karl wrote:
> according to
> https://ark.intel.com/products/75283/Intel-Xeon-Processor-E5-2697-v2-30M-Cache-2_70-GHz
> you get 59.7 GB/sec of peak memory bandwidth per CPU, so you should get about 240 GB/sec for your two-node system.
That 59.7 GB/s figure corresponds to four channels of DDR3-1866 per socket (4 x 14.9 GB/s), so it is consistent with the 1866 RAM the nodes appear to have: about 239 GB/s across the two nodes.
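Spelling out the arithmetic (8 bytes per transfer, four channels per socket, two sockets per node):

    DDR3-1333: 1333 MT/s * 8 B = 10.7 GB/s/channel -> 42.7 GB/s/socket ->  85.3 GB/s/node
    DDR3-1866: 1866 MT/s * 8 B = 14.9 GB/s/channel -> 59.7 GB/s/socket -> 119.5 GB/s/node

so two DDR3-1866 nodes give 4 x 59.7 = 238.9 GB/s, in line with Karl's ~240.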
> If you use PETSc's `make streams`, then processor placement may, unfortunately, not be ideal, and hence may underestimate the achievable performance. Have a look at the new PETSc 3.8 manual [1], Chapter 14, where Richard and I nailed down some of these performance aspects.
Thanks, I am using make streams (or rather, just running the MPIVersion by hand). I have long been led to believe that Cray's aprun does a reasonable job of process placement (and pinning) for pure MPI jobs, so I have just trusted that.
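(For what it's worth, the placement I am trusting aprun to do amounts to something like

    aprun -n 48 -N 24 -S 12 -cc cpu ./MPIVersion

i.e. 48 ranks, 24 per node, 12 per NUMA domain, each bound to a core; the exact flags in our job scripts may differ.)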
On another note, I modified the MPIVersion to report more than just the TRIAD number, and got:
Copy:  91467.3281 Rate (MB/s)
Scale: 63774.9615 Rate (MB/s)
Add:   73994.6106 Rate (MB/s)
Triad: 73564.8991 Rate (MB/s)
Inspecting the assembly, none of these uses non-temporal stores, though the copy has been turned into a call to memcpy (which might use them).
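For comparison, a triad that does use non-temporal stores would look something like the following (an untested sketch using SSE2 intrinsics; it assumes 16-byte-aligned arrays and even n):

    #include <emmintrin.h>  /* SSE2: _mm_load_pd, _mm_stream_pd, _mm_sfence */

    /* Triad with non-temporal (streaming) stores: a[i] = b[i] + s*c[i].
     * The stores bypass the cache, so the lines of a are never read in
     * (no write-allocate) and traffic drops from 32 to 24 bytes per
     * iteration. */
    void triad_nt(double *a, const double *b, const double *c, double s, long n)
    {
      const __m128d vs = _mm_set1_pd(s);
      long i;
      for (i = 0; i < n; i += 2)
        _mm_stream_pd(&a[i], _mm_add_pd(_mm_load_pd(&b[i]),
                                        _mm_mul_pd(vs, _mm_load_pd(&c[i]))));
      _mm_sfence();  /* flush write-combining buffers before timing stops */
    }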
In any case, those numbers would have led me to question the spec figures stated on the website much sooner.
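For anyone wanting to poke at this outside the PETSc harness, the measurement amounts to something like the following (a minimal sketch, not the actual MPIVersion source; real STREAM repeats the measurement several times, takes the best time, and checks the results):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 4000000  /* per-process array length; large enough to spill the LLC */

    int main(int argc, char **argv)
    {
      double *a = malloc(N * sizeof(double));
      double *b = malloc(N * sizeof(double));
      double *c = malloc(N * sizeof(double));
      double scalar = 3.0, t, rate;
      int i, rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.5; }

      MPI_Barrier(MPI_COMM_WORLD);
      t = MPI_Wtime();
      for (i = 0; i < N; i++) a[i] = b[i] + scalar * c[i];
      t = MPI_Wtime() - t;

      /* The slowest process determines the aggregate rate. */
      MPI_Allreduce(MPI_IN_PLACE, &t, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
      rate = 3.0 * N * sizeof(double) * size / t / 1.0e6;  /* 24 bytes/iteration */
      if (rank == 0) printf("Triad: %11.4f Rate (MB/s)\n", rate);

      free(a); free(b); free(c);
      MPI_Finalize();
      return 0;
    }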
Thanks all,
Lawrence