[mpich-discuss] Specifying hosts
    Scott Atchley
    atchley at myri.com
    Wed May  6 09:38:44 CDT 2009

On May 5, 2009, at 5:21 PM, Dave Goodell wrote:
> Hi Scott,
>
> I suspect that this is due to a long-standing, extremely
> user-unfriendly gotcha in mpdboot's usage.  The core counts in the
> machinefile are used for all hosts except for the current host.  So
> you need to also specify a --ncpus=8 in your mpdboot command line.

Hi Dave,

I saw that in the guide, but I assumed that if the core count was in
the machinefile, mpdboot would use that value. I shouldn't have
assumed. ;-)

> You were probably getting some of your nodes oversubscribed and the  
> node where you ran mpdboot was probably undersubscribed.
>
> You can usually debug these sorts of problems with a little shell  
> pipeline like:
>
> % mpiexec -n 4 hostname | sort | uniq -c | sort -n
>      4 anlextwls098-007.wl.anl-external
>
> On a cluster larger than just my laptop you would get a list of  
> (process_count,hostname) tuples.  For very large systems where you  
> expect every host to have exactly the same number of processes you  
> can go a bit further:
>
> % mpiexec -n 4 hostname | sort | uniq -c | sort -n | awk '{print $1}' | uniq
> 4
>
> If you see more than one line or if the number displayed on that one  
> line is anything other than the number of processes that you desire  
> to be run per node, then you have a problem.

I do not see any switches that would provide a mapping of ranks to
hosts. Have I missed one? If not, has there been any discussion about
providing one? I can imagine that it would be very helpful in
combination with "-l": if a job aborts, the rank label in the output
could be used to pinpoint the node for further investigation.

I guess I could just add something like this to my submit scripts:

    mpiexec -l -n <num_cores> hostname

before running my actual application.

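
As a rough sketch of how that output could be turned into a rank-to-host
table: the function below assumes that "-l" labels each line as
"<rank>: <text>", which is how I understand MPD's mpiexec to format it;
other process managers may use a different prefix. rank_host_map and
rank_map.txt are made-up names.

```shell
# rank_host_map
# Reads "mpiexec -l ... hostname" output on stdin and prints one
# "<rank> <hostname>" pair per line, sorted by rank.
# Assumption: each input line has the form "<rank>: <hostname>".
# Lines that do not match that form are dropped.
rank_host_map() {
    sed -n 's/^\([0-9][0-9]*\): *\(.*\)$/\1 \2/p' | sort -n
}

# Possible use in a submit script (NUM_CORES is a placeholder):
#   mpiexec -l -n "$NUM_CORES" hostname | rank_host_map > rank_map.txt
```

If the application later aborts and the failing rank is visible in the
labelled output, the saved table gives the node to look at.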
> Sorry for the very surprising behavior.  I believe that this gotcha  
> is not present in our new process manager, Hydra.  If this doesn't  
> solve your problem, let us know and we can dig in a bit deeper.
>
> -Dave
Is Hydra in the 1.1b1 tarball? If so, I will give it a try.
Scott
    
    