[MPICH] MPI2 bails on mpdboot but works if I add the ring manually?

Shaun Q shaun at qualheim.org
Fri Mar 31 13:08:25 CST 2006


Hi there guys:

I'm trying to bring up an mpd ring on a new 64-bit diskless cluster, and 
I'm having problems getting the mpds to connect.

I run mpdboot to start the ring on 4 identical machines via rsh:

%mpdboot -n 4 --rsh=rsh &

and it reports the following:

mpdboot_ct105 (handle_mpd_output 368): failed to connect to mpd on ct107

The machine name at the end of that message (ct107 here) rotates among 
the four machines each time I try to boot the ring.

This is the output from my /var/log/messages:

Mar 31 13:01:34 ct105 mpd: mpd starting; no mpdid yet
Mar 31 13:01:34 ct105 mpd: mpd has mpdid=ct105_37225 (port=37225)
Mar 31 13:01:34 ct105 python2.4: mpdboot_ct105 (handle_mpd_output 368): failed to connect to mpd on ct107
Mar 31 13:01:35 ct105 mpd: mpd ending mpdid=ct105_37225 (inside cleanup)
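
Since the failing host changes from run to run, it may help to tally which 
hosts show up in these messages over several attempts. The following is only 
a rough sketch: the function name count_failed_hosts is made up for 
illustration, and it assumes the message text always looks like the 
"failed to connect to mpd on <host>" line shown above.

```python
import re
from collections import Counter

# Matches the mpdboot error text shown in the log excerpt above.
FAILED = re.compile(r"failed to connect to mpd on (\S+)")


def count_failed_hosts(log_text):
    """Return a Counter mapping host name -> number of failure messages."""
    return Counter(FAILED.findall(log_text))
```

Feeding it the contents of /var/log/messages after a handful of mpdboot 
attempts would show whether one node fails more often than the others.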

I am, however, able to start a ring by issuing the mpd commands manually 
(mpd and mpdtrace -l on the first node, then mpd -h <host> -p <port> & on 
the other nodes).
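
To make the manual procedure concrete, here is a small sketch that turns an 
mpdid of the form host_port (as in mpdid=ct105_37225 in the log above) into 
the mpd command to run on each of the other nodes. The helper name 
join_command is hypothetical, and it assumes the port is whatever follows 
the last underscore in the mpdid.

```python
def join_command(mpdid):
    """Build the 'mpd -h <host> -p <port> &' command for joining a ring.

    mpdid is expected to look like 'ct105_37225' (host, underscore, port).
    """
    # Split on the last underscore so host names containing '_' still work.
    host, _, port = mpdid.rpartition("_")
    if not host or not port.isdigit():
        raise ValueError("expected an mpdid like host_port, got %r" % mpdid)
    return "mpd -h %s -p %s &" % (host, port)
```

For the mpdid in the log above, join_command("ct105_37225") gives 
"mpd -h ct105 -p 37225 &", which matches what I type by hand on the other 
nodes.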

So what are we thinking here?  Might this be an RSH issue or a Python 
issue?

Any ideas?

Thanks!
Shaun Qualheim
Convergent Thinking



