[MPICH] mpdboot problem
Tom Crick
tc at cs.bath.ac.uk
Sun May 21 13:46:28 CDT 2006
Apologies, it seems that YaST modifies /etc/hosts after you make
changes, which is where the 127.0.0.2 entry comes from.
mpdboot works fine now.
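
For anyone who hits the same "failed to connect to rhs at 127.0.0.2"
message, here is a minimal sketch of a check for this condition. It
scans hosts-file text for lines that map a given hostname to a loopback
address. The function name, the example.org domain and the 10.0.0.13
address are made up for illustration; only grendel13 and 127.0.0.2 come
from the log above.

```python
import ipaddress


def loopback_entries(hosts_text, hostname):
    """Return the loopback addresses that hosts_text maps hostname to."""
    found = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if hostname not in names:
            continue
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            continue  # skip lines whose first field is not a plain IP address
        if ip.is_loopback:  # true for 127.0.0.0/8 and ::1
            found.append(address)
    return found


# Hypothetical hosts-file content, not copied from the real cluster.
SAMPLE = """\
127.0.0.1   localhost
127.0.0.2   grendel13.example.org grendel13
10.0.0.13   grendel13.example.org grendel13
"""

if __name__ == "__main__":
    print(loopback_entries(SAMPLE, "grendel13"))  # prints ['127.0.0.2']
```

To check a real node, pass in the contents of /etc/hosts and the value
of socket.gethostname() instead of the sample data. A non-empty result
suggests the node may advertise a loopback address to the other mpds.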
Cheers,
Tom
On Sun, 2006-05-21 at 18:43 +0100, Tom Crick wrote:
> Hello,
>
> I've been having an issue using mpdboot (MPICH2 1.0.2) on a Beowulf
> cluster of 20 nodes running SuSE 9.2. After following the mpd
> troubleshooting guide in the MPICH2 install doc, I still cannot find
> an obvious reason why I am unable to start mpds on the 20 nodes using
> mpdboot.
>
> mpdboot gives a message like:
>
> mpdboot_grendel13_11 (err_exit 415): mpd failed to start correctly on grendel13
> reason: 11: unable to ping local mpd;
> invalid msg from mpd :{}:
> ** mpd may have disappeared, perhaps due to mismatched secretwords
> ** see msgs logged in syslog and /tmp/mpd2.logfile* on grendel13
> last printed output from mpd before becoming a daemon: 32838
>
> mpdboot_grendel13_11 (err_exit 421): contents of mpd logfile in /tmp:
> logfile for mpd with pid 3828
> grendel13_32838: conn error in connect_rhs: Connection refused
> grendel13_32838 (connect_rhs 602): failed to connect to rhs at 127.0.0.2 32849
> grendel13_32838 (enter_ring 513): rhs connect failed
> grendel13_32838 (run 215): failed to enter ring
>
>
> Even if you start an mpd manually on the head node and then on each
> work node, e.g. "mpd -h <host> -p <port> &", it fails as above. Is it
> something to do with the "failed to connect to rhs at 127.0.0.2 32849"?
>
> It is possible to ssh from every machine to every other, and running
> "mpdcheck -v -f /etc/mpd.hosts -ssh" from the head node gives no errors
> or problems. Checking the log files on the failing machines gives no
> more info than above, and the secretwords on all machines are the same.
>
> Any ideas for the next step of debugging? Should mpd be run as root?
>
> Thanks and regards,
>
> Tom
>
>
>