[petsc-users] Understanding the memory bandwidth
Jed Brown
jed at jedbrown.org
Thu Aug 13 13:04:27 CDT 2015
Justin Chang <jychang48 at gmail.com> writes:
> Hi all,
>
> According to our University's HPC cluster (Intel Xeon E5-2680v2
> <http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20E5-2680%20v2.html>), the
> online specifications say I should have a maximum BW of 59.7 GB/s. I am
> guessing this number is computed by 1866 MHz * 8 Bytes * 4 memory channels.
Yup, per socket.
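Spelled out, and noting that "1866 MHz" for DDR3 is really 1866 MT/s
(both clock edges are already counted, so there is no further doubling):

  1866e6 transfers/s * 8 bytes/transfer * 4 channels = 59.7 GB/s per socket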
> Now, when I run the STREAM Triad benchmark on a single compute node (two
> sockets, 10 cores each, 64 GB total memory) with up to 20 MPICH
> processes, I get the following:
>
> $ mpiexec -n 1 ./MPIVersion
> Triad: 13448.6701 Rate (MB/s)
>
> $ mpiexec -n 2 ./MPIVersion
> Triad: 24409.1406 Rate (MB/s)
>
> $ mpiexec -n 4 ./MPIVersion
> Triad: 31914.8087 Rate (MB/s)
>
> $ mpiexec -n 6 ./MPIVersion
> Triad: 33290.2676 Rate (MB/s)
>
> $ mpiexec -n 8 ./MPIVersion
> Triad: 33618.2542 Rate (MB/s)
>
> $ mpiexec -n 10 ./MPIVersion
> Triad: 33730.1662 Rate (MB/s)
>
> $ mpiexec -n 12 ./MPIVersion
> Triad: 40835.9440 Rate (MB/s)
>
> $ mpiexec -n 14 ./MPIVersion
> Triad: 44396.0042 Rate (MB/s)
>
> $ mpiexec -n 16 ./MPIVersion
> Triad: 54647.5214 Rate (MB/s) *
>
> $ mpiexec -n 18 ./MPIVersion
> Triad: 57530.8125 Rate (MB/s) *
>
> $ mpiexec -n 20 ./MPIVersion
> Triad: 42388.0739 Rate (MB/s) *
>
> The * numbers fluctuate greatly each time I run this. However, if I use
> hydra's processor binding options:
>
> $ mpiexec.hydra -n 2 -bind-to socket ./MPIVersion
> Triad: 26879.3853 Rate (MB/s)
>
> $ mpiexec.hydra -n 4 -bind-to socket ./MPIVersion
> Triad: 48363.8441 Rate (MB/s)
>
> $ mpiexec.hydra -n 8 -bind-to socket ./MPIVersion
> Triad: 63479.9284 Rate (MB/s)
It looks like with one core per socket, all your memory traffic goes
over one channel. You can play page-placement tricks to avoid that, or
use at least 4 cores per socket so that all the memory channels are
kept busy.
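One such trick, assuming numactl is installed on the node, is to
interleave the pages of a single process across both sockets' memory:

  $ numactl --interleave=all ./MPIVersion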
> $ mpiexec.hydra -n 10 -bind-to socket ./MPIVersion
> Triad: 66160.5627 Rate (MB/s)
So this is a pretty low fraction (55%) of 59.7*2 = 119.4 GB/s. I
suspect your memory or motherboard is limited to 1600 MHz, in which
case the peak is 1600e6 T/s * 8 bytes * 4 channels * 2 sockets =
102.4 GB/s, and you are getting about 65% of that.
You can check this as root using "dmidecode --type 17", which should
give one entry per DIMM slot, looking something like this:
Handle 0x002B, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x002A
Error Information Handle: 0x002F
Total Width: Unknown
Data Width: Unknown
Size: 4096 MB
Form Factor: DIMM
Set: None
Locator: DIMM0
Bank Locator: BANK 0
Type: <OUT OF SPEC>
Type Detail: None
Speed: Unknown
Manufacturer: Not Specified
Serial Number: Not Specified
Asset Tag: Unknown
Part Number: Not Specified
Rank: Unknown
Configured Clock Speed: 1600 MHz
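If the node has a lot of DIMMs, something like

  $ dmidecode --type 17 | grep -E 'Locator|Clock Speed'

(again as root) trims the output down to the slot names and configured
speeds.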
> Now my question is, is 13.5 GB/s on one processor "good"?
One memory channel is 1.866 * 8 = 14.9 GB/s, so your 13.5 GB/s
single-process number is already about 90% of a channel. You can get
some bonus overlap when adjacent pages land on different channels, but
the prefetcher only looks so far ahead, so with one thread you are
usually pulling from only one channel at a time.
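For concreteness, the Triad kernel is just the following (a sketch, not
the exact source of the MPIVersion benchmark you are running):

  #include <stddef.h>

  /* STREAM Triad: per iteration, 2 loads + 1 store of doubles,
     i.e. 24 bytes of array traffic (more if you count the
     write-allocate on the store) against only 2 flops, so the
     reported rate measures memory bandwidth, not compute. */
  void triad(double *a, const double *b, const double *c,
             double scalar, size_t n)
  {
    for (size_t i = 0; i < n; i++)
      a[i] = b[i] + scalar * c[i];
  }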
> Because when I compare this to the 59.7 GB/s, it seems really
> inefficient. Is there a way to browse through my system files to
> confirm this?
>
> Also, when I use multiple cores with proper binding, the STREAM BW
> exceeds the reported max BW. Is this expected?
You're using two sockets.
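That is, the number to compare against is the two-socket aggregate:

  2 sockets * 4 channels * 8 bytes * 1866e6 T/s = 119.4 GB/s
  (102.4 GB/s if the DIMMs are running at 1600 MT/s)

so 66 GB/s is above the single-socket figure but well under the
hardware peak.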