[mpich-discuss] Hydra process affinity

Guillaume Mercier mercierg at mcs.anl.gov
Fri May 25 03:20:10 CDT 2012


Hello,

Pavan will probably confirm this point, but my guess is that the meaning 
of "binding to sockets"
differs from one implementation to another.
If you take a look at the Hydra documentation, you will see that the 
behaviour you're experiencing
seems to be correct.  I agree, though, that it's not the behaviour you want.
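
As a quick sanity check (nothing Hydra-specific, just standard shell
arithmetic), you can count the set bits in the masks that hwloc-bind
reports — each set bit in a hwloc cpuset mask corresponds to one
processing unit (PU):

```shell
# Count how many processing units (PUs) each hwloc cpuset mask covers;
# each set bit in the mask corresponds to one PU.
for mask in 0x00000001 0x00aaaaaa; do
  v=$((mask)); n=0
  while [ "$v" -gt 0 ]; do
    n=$((n + (v & 1)))   # add the lowest bit
    v=$((v >> 1))        # shift to the next bit
  done
  echo "$mask covers $n PU(s)"
done
```

This prints 1 PU for Hydra's 0x00000001 mask versus 12 PUs for the
0x00aaaaaa mask below, which makes the difference between the two
bindings easy to see at a glance.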

Regards
Guillaume


On 05/25/2012 01:14 AM, Martin Cuma wrote:
> Hello,
>
> I am trying to get consistent performance on dual-socket multi-core 
> nodes, which requires binding each process to a socket. The code generally 
> runs one process per socket and launches multiple OpenMP threads to 
> fill up the socket's cores.
>
> I find a problem with the -binding cpu:sockets option in Hydra's mpirun 
> implementation - it binds the process to the first core of the socket 
> rather than giving it access to all of the socket's cores.
>
> For example, on a dual socket, 6 core CPU (12 cores total) node, I get:
> /uufs/chpc.utah.edu/sys/pkg/mpich2/1.5b1/bin/mpirun -binding 
> cpu:sockets -np 2 
> /uufs/chpc.utah.edu/sys/pkg/hwloc/1.4.2/bin/hwloc-bind --get
> 0x00000001
> 0x00000002
> /uufs/chpc.utah.edu/sys/pkg/hwloc/1.4.2/bin/hwloc-calc -p 
> --hierarchical socket.core 0x00000001
> Socket:0.Core:0
> /uufs/chpc.utah.edu/sys/pkg/hwloc/1.4.2/bin/hwloc-calc -p 
> --hierarchical socket.core 0x00000002
> Socket:1.Core:0
>
> I am using hwloc to report the binding and then calculating the 
> "physical" location. Notice that only a single core is reported for each 
> socket, rather than all 6 cores. This is confirmed by running 
> with 6 OpenMP threads per process and seeing only 100% CPU load 
> rather than 600%.
>
> Now, OpenMPI 1.6.1a1 (installed just today) handles the affinity 
> correctly:
> /uufs/ember.arches/sys/pkg/openmpi/1.6.1a1i/bin/mpirun --bysocket 
> --bind-to-socket -np 2 
> /uufs/chpc.utah.edu/sys/pkg/hwloc/1.4.2/bin/hwloc-bind --get
> 0x00aaaaaa
> 0x00555555
> /uufs/chpc.utah.edu/sys/pkg/hwloc/1.4.2/bin/hwloc-calc -p 
> --hierarchical socket.core 0x00aaaaaa
> Socket:1.Core:0 Socket:1.Core:1 Socket:1.Core:2 Socket:1.Core:8 
> Socket:1.Core:9 Socket:1.Core:10
> /uufs/chpc.utah.edu/sys/pkg/hwloc/1.4.2/bin/hwloc-calc -p 
> --hierarchical socket.core 0x00555555
> Socket:0.Core:0 Socket:0.Core:1 Socket:0.Core:2 Socket:0.Core:8 
> Socket:0.Core:9 Socket:0.Core:10
>
> I have a suspicion that other binding options may be broken as well. 
> For example, I tried -binding cache:l3, which should have the same 
> effect as cpu:sockets on this machine, and it also ran on only one 
> core (100% load).
>
> I would appreciate it if someone could comment on this, and if this is 
> a bug, I'd be happy to work with the developers to get it fixed. This 
> shows up in both 1.4.1p1 and 1.5b1.
>
> Thanks,
> MC
>
>



More information about the mpich-discuss mailing list