[MPICH] mpd hangs
Ralph Butler
rbutler at mtsu.edu
Wed Feb 14 12:46:16 CST 2007
On Wed Feb 14, at 10:13 AM, Martin Kleinschmidt wrote:
> Hi,
>
> I am regularly experiencing problems with mpd:
> our cluster is now somewhat older (4 years) and every now and then, one
> of the machines has to do a reboot for maintenance.
>
> At this point, one of the mpd's will leave the ring in an unfriendly way.
> After that, in many cases (not always), the ring will break down. I am
> unable to start new jobs, and I'm unable to contact the ring at all
> (no mpdtrace, mpdexit, ...).
> It'll always say something about "end of console"...
> The only solution seems to be to kill mpd on all nodes and then restart
> it. But this is not a good option, because generally there will be jobs
> running, which are killed by this.
>
> So:
> is there any other way to set up mpd so that it will be more tolerant to
> failure of one node?
> Or is there any alternative to mpd? e.g. like in the "old" mpi-1, where
> there was no mpd and jobs were started via rsh?
mpd tries to be tolerant of single-node failures. Many times it is
successful, as you point out. However, its resilience is somewhat
dependent on the flow of jobs in and out of the system at the time.
There are a couple of other process managers that may be more to your
liking, though I don't know much about them: smpd and remshell (which
uses rsh/ssh, as you mentioned above). I believe they are discussed in
the manual. Other folks may be able to offer additional insights.
>
>
> I start mpd as root by executing
>
> /usr/local/encap/mpich2-1.0.4p1-intel/bin/mpd.py -d -e --ncpus=2 --ifhn=192.168.103.25
>
> on node24=192.168.103.25
> and then joining this ring from all other machines by executing
>
> #!/bin/sh
> head=node24
> echo $"Joining mpd ring hosted by $head: "
> ifhn=`/sbin/ifconfig |grep addr:192.168.103.|awk '{print $2}'|sed s/addr://`
> port=`rsh $head mpdtrace -l | grep "$head"_ | awk '{print $1}' |sed 's/'$head'_//'`
> if [ "$port" = "" ]
> then
> echo "$head is not running the ring, cannot join"
> exit 1
> else
> python2 /usr/local/encap/mpich2-1.0.4p1-intel/bin/mpd.py --host=$head --port=$port -d -e --ncpus=2 --ifhn=$ifhn
> fi
>
>
> ...martin
>
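The port lookup in the join script quoted above could be pulled out into a small function, so it can be checked without a running ring. This is only a sketch: it assumes `mpdtrace -l` prints one `host_port (address)` entry per line, as the original grep/sed pipeline implies, and the function name `ring_port` is made up for illustration.

```shell
#!/bin/sh
# ring_port HOST
# Reads `mpdtrace -l`-style lines on stdin and prints the port recorded
# for HOST. Assumed line format: "node24_41234 (192.168.103.25)".
ring_port() {
    awk -v h="$1" '
        {
            # The first field is "host_port"; find the trailing "_<digits>".
            i = match($1, /_[0-9]+$/)
            # Require an exact host match, then print only the digits.
            if (i > 0 && substr($1, 1, i - 1) == h) {
                print substr($1, i + 1)
                exit
            }
        }'
}
```

In the join script this might be used as `port=$(rsh "$head" mpdtrace -l | ring_port "$head")`; an empty result still means the head node is not running the ring.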