[mpich-discuss] on how processes are distributed to processors

Dave Goodell goodell at mcs.anl.gov
Fri Jan 23 09:49:54 CST 2009


On Jan 23, 2009, at 3:08 AM, Nicolas Rosner wrote:
> Instead, to my surprise, the actual result is not that symmetrical at
> all. For some reason, the policy governing how ranks are "dealt" to
> hosts seems to change during said dealing. For instance, on a recent
> test with 3 quad-core nodes, I observed the following behavior:
>
> - ranks 0, 1, 2, 3 were launched on host 1
> - ranks 4, 5, 6, 7 were launched on host 2
> - ranks 8, 9, 10, 11 were launched on host 3
> - rank 12 was launched on host 1
> - rank 13 was launched on host 2
> - rank 14 was launched on host 3
> - rank 15 was launched on host 1
> - rank 16 was launched on host 2
>     ... and so on.
>
> I must be missing the reason for this apparent lack of consistency.
> The policy during the 1st turn seemed reasonable (considering
> ncpus=4) -- why does it suddenly change on the 2nd turn? Can this be
> normalized somehow? Where could I find more information about these
> policies? I tried reading up on both mpiexec and mpd, but didn't find
> anything that explains this in detail.

Hi Nicolas,

Given what you've explained and my recollection of how mpd assigns  
processes to nodes, you probably have an mpd.hosts file like this:

host1:4
host2:4
host3:4

and you started your mpd ring using something like:

% mpdboot -n 3 --ncpus=4 -f mpd.hosts

mpd fills each node up to its stated cpu count before moving on to the
next one.  Once all nodes are full, it hands out one process per node
in round-robin fashion.  This is generally desirable behavior:
oversubscribing one node much more heavily than the others is a very
bad scenario for many MPI programs, so this strategy spreads the
"oversubscription factor" evenly across the nodes.

Some options you have:
1) Remove the ":4" suffixes from your mpd.hosts file (sketched below).
I believe this will give the behavior that you are looking for.  Of
course, if you need to run other jobs with the same mpd ring that do
need to know how many cores are on each node, then this solution is
suboptimal.
2) Use the -machinefile option to mpiexec (also sketched below).  The
format is similar to the mpd.hosts format and is described in the
MPICH2 User's Guide [1].  This should let you specify the exact
mapping that you want to achieve.
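
To make option 1 concrete (my_prog and the -n 16 count are just
placeholders; adjust them for your job), the mpd.hosts file would hold
only the bare hostnames:

host1
host2
host3

and you would restart the ring and run with something like:

% mpdboot -n 3 -f mpd.hosts
% mpiexec -n 16 ./my_prog

Without the ":4" suffixes (and without --ncpus, so no host advertises
extra cpus), mpd should deal ranks out round-robin from the start:
rank 0 on host1, rank 1 on host2, rank 2 on host3, rank 3 back on
host1, and so on.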
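
For option 2, here is a rough sketch of a machinefile (again with
placeholder names); please double-check the exact syntax against the
User's Guide [1], since I'm writing this from memory:

host1
host2
host3

% mpiexec -machinefile mfile -n 16 ./my_prog

I believe mpiexec assigns ranks by walking the file from top to
bottom, honoring any ":count" suffixes, and wrapping around when it
reaches the end, so bare hostnames should again give a straight
round-robin placement, while explicit counts let you spell out
whatever mapping you want.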

Hope that helps,
-Dave

[1] http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-userguide.pdf