[petsc-users] Understanding the memory bandwidth

Thu Aug 13 15:50:34 CDT 2015

Justin Chang <jychang48 at gmail.com> writes:

> On Thu, Aug 13, 2015 at 1:04 PM, Jed Brown <jed at jedbrown.org> wrote:
>> It looks like with one core/socket, all your memory sits over one
>> channel.  You can play tricks to avoid that or use 4 cores/socket in
>> order to use all memory channels.
>
> How do I play these tricks?

They generally aren't practical outside of simple benchmarks.  Read
through this blog series if you want to dive into memory performance.

http://sites.utexas.edu/jdm4372/2010/11/11/optimizing-amd-opteron-memory-bandwidth-part-5-single-thread-read-only/

> I have no root access. Is there another way to confirm the clock speed?

I don't recall a way to access that information without root.  You can
benchmark, obviously, but you're looking for an independent information
source.  You can ask a sysadmin to run this on a compute node.
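
As a rough illustration of the benchmarking option mentioned above, here
is a minimal single-threaded copy test in Python.  It is a sketch, not a
replacement for STREAM: the function name and buffer sizes are made up
for this example, and it counts one read and one write per byte, which
ignores any write-allocate traffic.  It measures the bandwidth one core
achieves, which is not the same thing as reading the memory clock.

```python
import time


def copy_bandwidth_gbs(nbytes=64 * 1024 * 1024, repeats=5):
    """Return the best observed copy bandwidth in GB/s (1 GB = 1e9 bytes).

    Hypothetical helper: copies one large buffer into another and counts
    nbytes read plus nbytes written per copy.
    """
    src = bytearray(nbytes)
    dst = bytearray(nbytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-size slice assignment is one bulk memory copy
        best = min(best, time.perf_counter() - start)
    if best <= 0.0:
        raise RuntimeError("timer resolution too coarse for this buffer size")
    return 2 * nbytes / best / 1e9


if __name__ == "__main__":
    print(f"single-thread copy: {copy_bandwidth_gbs():.1f} GB/s")
```

The buffers should be much larger than the last-level cache, otherwise
the result reflects cache bandwidth rather than memory bandwidth.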

>
> ---
>
> So if I have two sockets per node, then the theoretical peak bandwidth
> is actually double what I thought (whether it be 119.4 GB/s or
> 102.4 GB/s). And if 8 cores really is the optimal number to use for a
> single compute node, why are there 20 totals to begin with? Or would
> this depend on the particular application?

"20 totals"?  Note that you might have hyperthreading, in which case
there are twice as many logical cores as physical cores.
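
For reference, theoretical peak bandwidth is usually computed as sockets
x channels per socket x transfer rate x bytes per transfer.  The sketch
below assumes 4 memory channels per socket, a 64-bit (8-byte) bus per
channel, and DDR3-1600 or DDR3-1866 memory.  None of these values is
stated in this thread; they are assumptions that happen to reproduce the
102.4 GB/s and 119.4 GB/s figures for a two-socket node.

```python
def peak_bandwidth_gbs(sockets, channels_per_socket, mega_transfers_per_s,
                       bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_s = (sockets * channels_per_socket
                   * mega_transfers_per_s * 1e6 * bus_bytes)
    return bytes_per_s / 1e9


# Assumed configurations: 4 channels per socket, DDR3-1600 or DDR3-1866.
for rate in (1600, 1866):
    per_socket = peak_bandwidth_gbs(1, 4, rate)
    per_node = peak_bandwidth_gbs(2, 4, rate)
    print(f"DDR3-{rate}: {per_socket:.1f} GB/s per socket, "
          f"{per_node:.1f} GB/s per two-socket node")
```

Whether a given figure is per socket or per node depends on which
socket count went into the product, so it is worth checking that before
doubling.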

> Also, can someone elaborate on the difference between the words
> "core", "processor", and "thread"?

Processor - typically a unit of manufacturing and sale that goes into a
socket.  Sometimes all of its cores share a last-level cache; other
times it consists of independent parts packaged together.  Sometimes
different parts of the processor are connected to different memory
channels (implying multiple "NUMA nodes" on a single socket) and
sometimes the channels are multiplexed (so all cores see the same speed
to any memory channel on that socket).

Core - the physical unit that processes ("integer") instructions.  There
can be multiple floating point units per core (e.g., anything with
dual-issue FMA) or multiple cores per floating point unit (e.g., the AMD
processors on Titan).

Logical core/hardware thread - the logical unit exposed to the operating
system.  Often there are 2, 4, or more hardware threads per core.  These
have their own registers (as far as you can tell; it can be complicated
by "register renaming") and are used to cover high-latency operations
including waiting on memory and some arithmetic.  Usually only one
hardware thread issues instructions in any given cycle, so if a single
thread has sufficient ILP (instruction-level parallelism) to keep
issuing every cycle, there may be no benefit to using multiple hardware
threads.  On some architectures a single thread cannot issue every
cycle, so multiple hardware threads per core are needed to reach peak
flops, integer instruction throughput, and/or bandwidth.
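
To see how these terms map onto a particular machine, one option on
Linux is to count sockets, physical cores, and logical cores from
/proc/cpuinfo, which is readable without root.  The sketch below assumes
the x86 layout of that file (blank-line-separated blocks with
"processor", "physical id", and "core id" fields); other architectures
may omit some fields, in which case it falls back to treating each
logical CPU as its own core on socket 0.  The function name is made up
for this example.

```python
def count_topology(cpuinfo_text):
    """Return (sockets, physical_cores, logical_cores) from cpuinfo text."""
    sockets = set()
    cores = set()
    logical = 0
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue  # not a per-CPU block
        logical += 1
        socket_id = fields.get("physical id", "0")
        core_id = fields.get("core id", fields["processor"])
        sockets.add(socket_id)
        cores.add((socket_id, core_id))  # core ids repeat across sockets
    return len(sockets), len(cores), logical


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        n_sockets, n_cores, n_logical = count_topology(f.read())
    print(f"{n_sockets} sockets, {n_cores} cores, {n_logical} hardware threads")
```

If the hardware-thread count is twice the core count, hyperthreading (or
an equivalent) is enabled.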