[MPICH] mpd hangs

Wed Feb 14 10:13:07 CST 2007

Hi,

I am regularly experiencing problems with mpd.
Our cluster is getting somewhat old (4 years), and every now and then one
of the machines has to be rebooted for maintenance.

At this point, one of the mpds leaves the ring in an unfriendly
way.
After that, in many cases (not always), the ring breaks down: I am
unable to start new jobs, and I cannot contact the ring at all
(mpdtrace, mpdexit, ... all fail).
The error always says something about "end of console"...
The only solution seems to be to kill mpd on all nodes and then restart
it. But this is not a good option, because there are generally jobs
running, which are killed by this.

So:
Is there any other way to set up mpd so that it is more tolerant of the
failure of one node?
Or is there any alternative to mpd, e.g. as in the "old" MPI-1, where
there was no mpd and jobs were started via rsh?

I start mpd as root by executing

/usr/local/encap/mpich2-1.0.4p1-intel/bin/mpd.py -d -e --ncpus=2 --ifhn=192.168.103.25

on node24 (192.168.103.25),
and then join this ring from all other machines by executing

#!/bin/sh
# Join the mpd ring hosted on $head from this node.
head=node24
echo "Joining mpd ring hosted by $head: "
# Address of this node on the 192.168.103.x cluster network
ifhn=`/sbin/ifconfig | grep 'addr:192\.168\.103\.' | awk '{print $2}' | sed 's/addr://'`
# Port of the mpd on $head, taken from its "<host>_<port>" entry in mpdtrace -l
port=`rsh $head mpdtrace -l | grep "${head}_" | awk '{print $1}' | sed "s/${head}_//"`
if [ "$port" = "" ]
then
   echo "$head is not running the ring, cannot join"
   exit 1
else
   python2 /usr/local/encap/mpich2-1.0.4p1-intel/bin/mpd.py --host=$head \
      --port=$port -d -e --ncpus=2 --ifhn=$ifhn
fi
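
The two lookups in the join script could also be pulled out into small functions, so they can be checked against sample output without a live ring. This is only a sketch: the names mpd_port_for_host and ifhn_from_ifconfig are made up here, and it assumes that mpdtrace -l prints one "hostname_port (address)" entry per line and that ifconfig prints "inet addr:a.b.c.d" lines, which is what the grep/awk/sed pipelines in the script rely on. Unlike the original pipelines, each function prints only the first match.

```shell
#!/bin/sh
# Hypothetical helper: read `mpdtrace -l` output on stdin and print the
# port of the mpd running on host $1.
# Assumes each entry starts with "<host>_<port>".
mpd_port_for_host() {
    awk -v prefix="${1}_" '
        index($1, prefix) == 1 {
            # Keep everything after "<host>_"
            print substr($1, length(prefix) + 1)
            exit
        }'
}

# Hypothetical helper: read `ifconfig` output on stdin and print the first
# address that starts with the network prefix $1 (e.g. "192.168.103.").
# Assumes the old "inet addr:a.b.c.d" output format.
ifhn_from_ifconfig() {
    awk -v prefix="addr:${1}" '
        index($2, prefix) == 1 {
            # Drop the leading "addr:" (5 characters)
            print substr($2, 6)
            exit
        }'
}
```

In the join script they would be used roughly like this:

   ifhn=`/sbin/ifconfig | ifhn_from_ifconfig 192.168.103.`
   port=`rsh $head mpdtrace -l | mpd_port_for_host $head`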

   ...martin