[petsc-users] Understanding the memory bandwidth

Barry Smith bsmith at mcs.anl.gov
Mon Aug 17 13:35:31 CDT 2015


> On Aug 17, 2015, at 1:21 PM, Justin Chang <jychang48 at gmail.com> wrote:
> 
> Thanks everyone for your valuable input, a few follow up questions:
> 
> 1) The specs for my machine says there are 10 cores and 20 threads.
> Does that mean for each socket, I have 10 cores where each core has 2
> threads? Or does it mean that each core can use up to 20 threads? Or
> something else entirely?

   Sometimes a single core has support for multiple (often 2) "hardware threads". This means the core has "extra" hardware, mostly registers, that allows it to switch between two threads without having to save all the registers of one thread and load all the registers of the other (essentially it has more registers than it would without support for "hardware threads"). So if the core has two hardware threads, it can switch back and forth between them very rapidly. Hardware designers add this so that when one thread is stalled waiting on memory loads, the core can switch to the other thread and get some work done in the meantime. This allows memory latency hiding.

  The term "hardware threads" is not a very accurate one, IMHO. You can run as many or as few threads on this system as you want; the hardware is simply optimized to run 20 threads for latency hiding.
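
  For example (a sketch; this assumes a Linux node with the lscpu utility installed), you can read the core/hardware-thread layout off directly:

  $ lscpu | grep -E '^(Socket|Core|Thread)'
  Thread(s) per core:    2
  Core(s) per socket:    10
  Socket(s):             2

  The values shown are illustrative, for a two-socket node like the head node discussed below; "Thread(s) per core: 2" is exactly the two-hardware-threads-per-core case.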

  Note that PETSc codes are generally limited by memory bandwidth, not memory latency, so with PETSc it often makes sense to use even fewer threads than cores: the memory bandwidth saturates before all the cores are busy. (You are not utilizing all the "extra" hardware then, but it is faster.)
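
  Since Åsmund's reply below points to John McCalpin's blog, note that his STREAM benchmark is the standard way to see this bandwidth limit directly. A minimal sketch (stream.c's usual download location and typical gcc/OpenMP flags; pick an array size several times larger than the combined caches):

  $ wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c
  $ gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream
  $ for t in 2 5 10 20; do OMP_NUM_THREADS=$t ./stream | grep Triad; done

  If the Triad rate stops improving well before all 20 cores are in use, the node's memory bandwidth is saturated, which is the regime PETSc codes usually operate in.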

  Barry



> 
> 2a) When I do an hwloc-info on a single compute node:
> 
> $ hwloc-info
> depth 0: 1 Machine (type #1)
> depth 1: 2 NUMANode (type #2)
>  depth 2: 2 Socket (type #3)
>   depth 3: 2 L3Cache (type #4)
>    depth 4: 20 L2Cache (type #4)
>     depth 5: 20 L1dCache (type #4)
>      depth 6: 20 L1iCache (type #4)
>       depth 7: 20 Core (type #5)
>        depth 8: 20 PU (type #6)
> Special depth -3: 5 Bridge (type #9)
> Special depth -4: 6 PCI Device (type #10)
> Special depth -5: 6 OS Device (type #11)
> 
> With this setup, does it mean that if I invoke mpiexec.hydra -np
> <number> -bind-to hwthread ... the MPI program will bind to the cores?
> 
> 2b) Our headnode has 40 PU at depth 8, so if I -bind-to hwthread on
> this node (and get yelled at by the system admins) it's possible that
> two MPI processes can run on the same core?
> 
> 3) When I invoke an MPI process via mpiexec.hydra -np <number> ...
> without any bindings, do we know exactly what is going on?
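
   Regarding 2a-3, one way to check empirically (a sketch; assumes MPICH's Hydra launcher on a Linux cluster) is to launch a command that prints its own CPU binding, e.g.

   $ mpiexec.hydra -np 4 -bind-to hwthread grep Cpus_allowed_list /proc/self/status

   Each rank then reports the hardware threads it is allowed to run on. Note that the hwloc output above shows 20 Cores and 20 PUs, i.e., the compute node exposes one hardware thread per core, so -bind-to hwthread and -bind-to core should coincide there; on the 40-PU head node, -bind-to hwthread could indeed place two ranks on the same core. Without a -bind-to option, Hydra typically applies no binding at all, leaving placement and migration to the OS scheduler.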
> 
> Thanks,
> Justin
> 
> On Fri, Aug 14, 2015 at 2:29 AM, Åsmund Ervik <asmund.ervik at ntnu.no> wrote:
>>>> So this is a pretty low fraction (55%) of 59.7*2 = 119.4.  I suspect
>>>> your memory or motherboard is at most 1600 MHz, so your peak would be
>>>> 102.4 GB/s.
>>> 
>>>> You can check this as root using "dmidecode --type 17", which should
>>>> give one entry per channel, looking something like this:
>>>> 
>>>> Handle 0x002B, DMI type 17, 34 bytes
>>>> Memory Device
>>>>        Array Handle: 0x002A
>>>>        Error Information Handle: 0x002F
>>>>        Total Width: Unknown
>>>>        Data Width: Unknown
>>>>        Size: 4096 MB
>>>>        Form Factor: DIMM
>>>>        Set: None
>>>>        Locator: DIMM0
>>>>        Bank Locator: BANK 0
>>>>        Type: <OUT OF SPEC>
>>>>        Type Detail: None
>>>>        Speed: Unknown
>>>>        Manufacturer: Not Specified
>>>>        Serial Number: Not Specified
>>>>        Asset Tag: Unknown
>>>>        Part Number: Not Specified
>>>>        Rank: Unknown
>>>>        Configured Clock Speed: 1600 MHz
>>> 
>>> I have no root access. Is there another way to confirm the clock speed?
>> 
>> Also note: even in the case where your motherboard, RAM and CPU all say
>> 1866 on the label, if there are more memory DIMMs (chips) per node than
>> channels, say 16 DIMMs on your 8 channels, you will see a performance
>> reduction on the order of 20-30%. This is more likely if you are using
>> nodes in a "high-memory queue" or similar where there's >= 128 GB memory
>> per node. (This will change in the future when/if people start using
>> DDR4 LRDIMMs.) There's a series of in-depth discussions here:
>> http://frankdenneman.nl/2015/02/20/memory-deep-dive/ and there's also
>> lots of interesting memory-stuff on John McCalpin's blog:
>> https://sites.utexas.edu/jdm4372/
>> 
>> Regards,
>> Åsmund
>> 
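
  For reference, the peak figures quoted in this thread follow from a simple formula (assuming a two-socket node with four DDR3 channels per socket, each channel 64 bits wide, i.e., 8 bytes per transfer):

     peak bandwidth = sockets x channels/socket x 8 bytes x transfer rate
                    = 2 x 4 x 8 B x 1.866 GT/s ≈ 119.4 GB/s   (DDR3-1866)
                    = 2 x 4 x 8 B x 1.600 GT/s ≈ 102.4 GB/s   (DDR3-1600)

  So, lacking root access for dmidecode, comparing a measured STREAM rate against these two numbers is one indirect way to infer which memory clock is actually configured.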


