[mpich-discuss] Specifying hosts
Scott Atchley
atchley at myri.com
Wed May 6 09:38:44 CDT 2009
On May 5, 2009, at 5:21 PM, Dave Goodell wrote:
> Hi Scott,
>
> I suspect that this is due to a long-standing, extremely user-
> unfriendly gotcha in mpdboot's usage. The core counts in the
> machinefile are used for all hosts except for the current host. So
> you need to also specify a --ncpus=8 in your mpdboot command line.
Hi Dave,
I saw that in the guide, but I assumed that if the value was in the
machinefile, mpdboot would use it. I shouldn't have assumed. ;-)
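So, if I follow, even with a machinefile that lists each host as
host:8, I still need --ncpus=8 on the node where I launch the ring.
Something like this, presumably (host names and file name are just
for illustration):

% cat mpd.hosts
node01:8
node02:8
node03:8
% mpdboot -n <num_hosts> -f mpd.hosts --ncpus=8

I'll rerun with that.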
> You were probably getting some of your nodes oversubscribed and the
> node where you ran mpdboot was probably undersubscribed.
>
> You can usually debug these sorts of problems with a little shell
> pipeline like:
>
> % mpiexec -n 4 hostname | sort | uniq -c | sort -n
> 4 anlextwls098-007.wl.anl-external
>
> On a cluster larger than just my laptop you would get a list of
> (process_count,hostname) tuples. For very large systems where you
> expect every host to have exactly the same number of processes you
> can go a bit further:
>
> % mpiexec -n 4 hostname | sort | uniq -c | sort -n | awk '{print $1}' | uniq
> 4
>
> If you see more than one line, or if the number on that one line is
> anything other than the number of processes you intend to run per
> node, then you have a problem.
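That's a handy trick. I'd guess an oversubscribed run on my cluster
would show up as uneven counts, something like this (node names are
hypothetical):

% mpiexec -n 16 hostname | sort | uniq -c | sort -n
   4 node03
   4 node04
   8 node01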
I do not see any switch that would provide a mapping of ranks to
hosts. Have I missed one? If not, has there been any discussion about
providing one? It would be very helpful in combination with "-l": if
a job aborts, the failing rank could be traced back to its node for
further investigation.
I guess I could just add something like this to my submit scripts,
before running my actual application:

mpiexec -l -n <num_cores> hostname
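Or, to keep a record around, I could redirect it to a file (the file
name is just a placeholder):

% mpiexec -l -n <num_cores> hostname | sort -n > rank_to_host.txt

Then if the real run aborts, I can look up which node each rank
landed on.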
> Sorry for the very surprising behavior. I believe that this gotcha
> is not present in our new process manager, Hydra. If this doesn't
> solve your problem, let us know and we can dig in a bit deeper.
>
> -Dave
Is Hydra in the 1.1b1 tarball? If so, I will give it a try.
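If it is, I assume I can select it at configure time with something
like:

% ./configure --with-pm=hydra

(That option is my guess from the installer's guide; correct me if it
is spelled differently.)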
Scott