[MPICH] mpd hangs

Martin Kleinschmidt mk at theochem.uni-duesseldorf.de
Thu Feb 15 01:20:34 CST 2007


On Wed, 14 Feb 2007, Ralph Butler wrote:

>
>On Wed Feb 14, at 10:13AM, Martin Kleinschmidt wrote:
>
>>So:
>>Is there any other way to set up mpd so that it will be more tolerant
>>of the failure of one node?
>>Or is there any alternative to mpd, e.g. like in the "old" mpi-1,
>>where there was no mpd and jobs were started via rsh?
>
>mpd tries to be tolerant of single-node failures.  Many times it is
>successful, as you point out.  However, its resilience is somewhat
>dependent on the flow of jobs in and out of the system at the time.

OK. But my problem is that fault-tolerant operation is more the
exception than the normal behaviour.
As we are only just starting with all this parallel stuff, the parallel
job load is usually low (1-3 jobs of 2-8 CPUs out of 48) at the time of
a node failure.
 
>There are a couple of other process managers that may be more to your
>liking, though I don't know much about them.  They are smpd and remshell
>(which uses rsh/ssh as you mentioned above).  I believe they are
>discussed in the manual.  Other folks may be able to offer additional
>insights.

Thanks for the hint, I'll have a look at them!

   ...martin

-- 
Opposing defenders' knees stopped shaking at the sight of a Giovane
Elber long ago; after all, nobody is afraid of an Uwe Seeler any more
either. (Horst Koeppel, explaining why the Brazilian was left out)



