[MPICH] mpd hangs
Martin Kleinschmidt
mk at theochem.uni-duesseldorf.de
Thu Feb 15 01:20:34 CST 2007
On Wed, 14 Feb 2007, Ralph Butler wrote:
>
>On Wed Feb 14, at 10:13AM, Martin Kleinschmidt wrote:
>
>>So:
>>is there any other way to set up mpd so that it will be more tolerant
>>to failure of one node?
>>Or is there any alternative to mpd? e.g. like in the "old" mpi-1,
>>where there was no mpd and jobs were started via rsh?
>
>mpd tries to be tolerant of single-node failures. Many times it is
>successful, as you point out. However, its resilience is somewhat
>dependent on the flow of jobs in and out of the system at the time.
OK. But my problem is that fault-tolerant operation is more the
exception than the normal behaviour.
As we are only starting with all this parallel stuff, parallel job load
is usually low (1-3 jobs of 2-8 CPUs out of 48) at the time of node
failure.
>There are a couple of other process managers that may be more to your
>liking. I don't know much about them. But, they are smpd and remshell
>(which uses rsh/ssh as you mentioned above). I believe they are
>discussed in the manual. Other folks may be able to offer additional
>insights.
Thanks for the hint, I'll have a look at them!
...martin
--
Opposing defenders stopped trembling at the knees before a Giovane
Elber long ago; after all, nobody is afraid of an Uwe Seeler any more
either. (Horst Koeppel, explaining why the Brazilian was not
selected)