[mpich-discuss] Specifying hosts

Scott Atchley atchley at myri.com
Wed May 6 09:38:44 CDT 2009


On May 5, 2009, at 5:21 PM, Dave Goodell wrote:

> Hi Scott,
>
> I suspect that this is due to a long-standing, extremely
> user-unfriendly gotcha in mpdboot's usage.  The core counts in the
> machinefile are used for all hosts except for the current host.  So
> you need to also specify --ncpus=8 on your mpdboot command line.

Hi Dave,

I saw that in the guide, but I assumed that a core count given in the
machinefile would be used for every host. I shouldn't have assumed. ;-)

> You were probably getting some of your nodes oversubscribed and the  
> node where you ran mpdboot was probably undersubscribed.
>
> You can usually debug these sorts of problems with a little shell  
> pipeline like:
>
> % mpiexec -n 4 hostname | sort | uniq -c | sort -n
>      4 anlextwls098-007.wl.anl-external
>
> On a cluster larger than just my laptop you would get a list of  
> (process_count,hostname) tuples.  For very large systems where you  
> expect every host to have exactly the same number of processes you  
> can go a bit further:
>
> % mpiexec -n 4 hostname | sort | uniq -c | sort -n | awk '{print $1}' | uniq
> 4
>
> If you see more than one line or if the number displayed on that one  
> line is anything other than the number of processes that you desire  
> to be run per node, then you have a problem.
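
The per-host check above could be wrapped in a small shell function for use in a submit script. This is only a sketch: the name check_ppn is made up here, and it assumes one hostname per line on stdin, as produced by "mpiexec -n N hostname".

```shell
# check_ppn EXPECTED
# Hypothetical helper: reads one hostname per line on stdin, counts the
# processes per host, and prints every host whose count differs from
# EXPECTED. Returns non-zero if any host deviates.
check_ppn() {
    expected=$1
    sort | uniq -c | awk -v want="$expected" '
        BEGIN { bad = 0 }
        $1 != want {
            printf "%s: %d processes (expected %d)\n", $2, $1, want
            bad = 1
        }
        END { exit bad }'
}
```

It would be used as "mpiexec -n 16 hostname | check_ppn 8"; no output and a zero exit status mean every host received the expected number of processes.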

I do not see any switches that would provide a mapping of ranks to
hosts. Have I missed one? If not, has there been any discussion about
providing one? I can imagine that it would be very helpful in
combination with "-l": if a job aborts, the rank in the labeled output
would pinpoint the node for further investigation.

I guess I could just add something like this to my submit scripts:

mpiexec -l -n <num_cores> hostname

before running my actual application.
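
A minimal sketch of turning that output into a rank-to-host table, assuming that "-l" prefixes each line with the rank followed by a colon (for example "3: node07"). The function name rank_map is hypothetical.

```shell
# rank_map
# Hypothetical helper: reads "rank: hostname" lines on stdin (the assumed
# format of "mpiexec -l ... hostname" output) and prints "rank hostname"
# pairs sorted numerically by rank.
rank_map() {
    sed 's/^\([0-9][0-9]*\): */\1 /' | sort -n
}
```

In a submit script this could be run as "mpiexec -l -n <num_cores> hostname | rank_map > rank_map.txt" so that the table is saved alongside the job output.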

> Sorry for the very surprising behavior.  I believe that this gotcha  
> is not present in our new process manager, Hydra.  If this doesn't  
> solve your problem, let us know and we can dig in a bit deeper.
>
> -Dave

Is Hydra in the 1.1b1 tarball? If so, I will give it a try.

Scott

