[MPICH] mpdboot problem
Tom Crick
tc at cs.bath.ac.uk
Sun May 21 12:43:11 CDT 2006
Hello,
I've been having trouble using mpdboot (MPICH2 1.0.2) on a Beowulf
cluster of 20 nodes running SuSE 9.2. After following the mpd
troubleshooting guide in the MPICH2 install doc, I still can't find an
obvious reason why I am unable to start mpds on the 20 nodes using
mpdboot.
mpdboot gives a message like:
mpdboot_grendel13_11 (err_exit 415): mpd failed to start correctly on grendel13
reason: 11: unable to ping local mpd;
invalid msg from mpd :{}:
** mpd may have disappeared, perhaps due to mismatched secretwords
** see msgs logged in syslog and /tmp/mpd2.logfile* on grendel13
last printed output from mpd before becoming a daemon: 32838
mpdboot_grendel13_11 (err_exit 421): contents of mpd logfile in /tmp:
logfile for mpd with pid 3828
grendel13_32838: conn error in connect_rhs: Connection refused
grendel13_32838 (connect_rhs 602): failed to connect to rhs at 127.0.0.2 32849
grendel13_32838 (enter_ring 513): rhs connect failed
grendel13_32838 (run 215): failed to enter ring
Starting an mpd manually on the head node and then on each worker
node, e.g. "mpd -h <host> -p <port> &", fails in the same way. Is it
something to do with the "failed to connect to rhs at 127.0.0.2 32849"?
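One way to check where the 127.0.0.2 address comes from is to see what each
node's own hostname resolves to. The sketch below is only a diagnostic aid,
not part of MPICH2; the helper names is_loopback and check_local_resolution
are made up for illustration. It rests on the unconfirmed assumption that mpd
picks up its address from local hostname resolution, so a loopback result here
would be a hint rather than a diagnosis.

```python
import ipaddress
import socket


def is_loopback(addr):
    """Return True if addr (an IP address string) is a loopback address."""
    return ipaddress.ip_address(addr).is_loopback


def check_local_resolution():
    """Resolve this machine's own hostname; report whether it is loopback."""
    name = socket.gethostname()
    addr = socket.gethostbyname(name)
    return name, addr, is_loopback(addr)


if __name__ == "__main__":
    try:
        name, addr, loopback = check_local_resolution()
    except socket.gaierror as exc:
        # The hostname could not be resolved at all.
        print("could not resolve local hostname: %s" % exc)
    else:
        print("%s resolves to %s" % (name, addr))
        if loopback:
            # Assumption: a loopback address is unreachable from other nodes.
            print("hostname maps to a loopback address")
```

Running this on each node (for example over ssh) and comparing the output
with the entries in /etc/hosts should show whether the hostname is mapped to
a loopback address on the failing machines.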
It is possible to ssh from every machine to every other, and running
"mpdcheck -v -f /etc/mpd.hosts -ssh" from the head node reports no errors
or problems. The log files on the failing machines give no more
information than the above, and the secretwords on all machines are the same.
Any ideas for the next step of debugging? Should mpd be run as root?
Thanks and regards,
Tom
--
Tom Crick
Mathematical Logic & Symbolic Computation Group
Department of Computer Science
University of Bath
tc at cs.bath.ac.uk
http://www.cs.bath.ac.uk/tom