[MPICH] mpdboot problem

Tom Crick tc at cs.bath.ac.uk
Sun May 21 12:43:11 CDT 2006


Hello,

I've been having an issue using mpdboot (MPICH2 1.0.2) on a Beowulf
cluster of 20 nodes running SuSE 9.2. After following the mpd
troubleshooting guide in the MPICH2 install doc, I still cannot find an
obvious reason why mpdboot is unable to start mpds on the 20 nodes.

mpdboot gives a message like:

mpdboot_grendel13_11 (err_exit 415): mpd failed to start correctly on grendel13
  reason: 11: unable to ping local mpd;
  invalid msg from mpd :{}:
  ** mpd may have disappeared, perhaps due to mismatched secretwords
  ** see msgs logged in syslog and /tmp/mpd2.logfile* on grendel13
  last printed output from mpd before becoming a daemon: 32838

mpdboot_grendel13_11 (err_exit 421): contents of mpd logfile in /tmp:
logfile for mpd with pid 3828
  grendel13_32838: conn error in connect_rhs: Connection refused
  grendel13_32838 (connect_rhs 602): failed to connect to rhs at 127.0.0.2 32849
  grendel13_32838 (enter_ring 513): rhs connect failed
  grendel13_32838 (run 215): failed to enter ring


Even if I start an mpd manually on the head node and then on each worker
node, e.g. "mpd -h <host> -p <port> &", it fails as above. Could it have
something to do with the "failed to connect to rhs at 127.0.0.2 32849"
message?
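
Since the log shows the mpd trying to reach its rhs at 127.0.0.2, one
thing that could be checked on each node is which address the node's own
hostname resolves to. What follows is only a sketch, written in Python
because mpd itself is a Python program. The helper names is_loopback and
check_hostname are made up for illustration, and it is an assumption on
my part that a loopback mapping for the hostname is related to the
failure.

```python
import ipaddress
import socket


def is_loopback(addr):
    """Return True if addr (an IPv4 or IPv6 string) is a loopback address."""
    return ipaddress.ip_address(addr).is_loopback


def check_hostname(name=None):
    """Resolve name (default: this machine's hostname).

    Returns (name, addr, loopback_flag). This only reports what the
    local resolver says; it does not talk to mpd in any way.
    """
    name = name or socket.gethostname()
    addr = socket.gethostbyname(name)
    return name, addr, is_loopback(addr)


if __name__ == "__main__":
    try:
        name, addr, loop = check_hostname()
    except OSError as exc:
        print("could not resolve local hostname:", exc)
    else:
        note = " (loopback)" if loop else ""
        print(name, "resolves to", addr + note)
```

Run on a node, this prints the hostname and the address it resolves to,
marking it when the address falls in the loopback range (any 127.x.x.x
address, which includes 127.0.0.2).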

I can ssh from every machine to every other, and running
"mpdcheck -v -f /etc/mpd.hosts -ssh" from the head node reports no
errors or problems. The log files on the failing machines give no more
information than the above, and the secretwords on all machines are the
same.

Any ideas for the next step of debugging? Should mpd be run as root?

Thanks and regards,

Tom



-- 
Tom Crick
Mathematical Logic & Symbolic Computation Group
Department of Computer Science
University of Bath
tc at cs.bath.ac.uk
http://www.cs.bath.ac.uk/tom




