[MPICH] process failure on a node

Rajeev Thakur thakur at mcs.anl.gov
Thu Mar 9 22:32:08 CST 2006


Not currently. The current version of MPICH2 (or MPICH-1) doesn't handle
failure of a process gracefully. If a process dies, the whole job may abort.
We intend to add better support for fault tolerance in the future.

Rajeev 
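
For the redirect part of the question, here is a minimal sketch of the bookkeeping a master process could do at the application level. It is not an MPICH feature. The names send_fn, demo_send and send_with_fallback are hypothetical, made up for illustration. The sketch assumes the send call reports failure through a return code, which in MPI requires installing MPI_ERRORS_RETURN on the communicator with MPI_Comm_set_errhandler. As noted above, with the current MPICH the job may still abort before any code is returned, and a node that hangs or is switched off may simply leave a blocking send waiting. The sketch also does nothing about failure of process 0 itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: returns 0 on success, nonzero on failure.
 * In an MPI program it could wrap MPI_Send after MPI_ERRORS_RETURN
 * has been installed on the communicator (an assumption, see above). */
typedef int (*send_fn)(int dest_rank, const void *buf, size_t len);

/* Stand-in for a real send, for demonstration only:
 * pretends that rank 2 is down and every other rank is fine. */
int demo_send(int dest_rank, const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    return dest_rank == 2 ? 1 : 0;
}

/* Try to deliver buf to `preferred`. If that send fails, mark the rank
 * as dead and try the remaining live worker ranks 1..nranks-1 in order.
 * alive[r] != 0 means rank r is still believed to be up.
 * Returns the rank that accepted the message, or -1 if none did. */
int send_with_fallback(send_fn send, int preferred, int nranks,
                       int *alive, const void *buf, size_t len)
{
    assert(send != NULL && alive != NULL && nranks > 0);

    if (preferred > 0 && preferred < nranks && alive[preferred]) {
        if (send(preferred, buf, len) == 0)
            return preferred;
        alive[preferred] = 0;   /* remember the failure */
    }
    for (int r = 1; r < nranks; r++) {
        if (!alive[r])
            continue;           /* skip ranks already known to be down */
        if (send(r, buf, len) == 0)
            return r;
        alive[r] = 0;
    }
    return -1;                  /* no live worker accepted the message */
}
```

With four ranks all marked alive, asking for rank 2 through demo_send marks rank 2 as dead and hands the message to rank 1 instead.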

> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of SUNDAR J
> Sent: Thursday, March 09, 2006 9:55 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [MPICH] process failure on a node
> 
> Suppose I am running a program in which process 0 sends a message to all
> other processes. If one of the nodes fails to respond properly (say it is
> switched off, or it hangs at runtime), is there any way to get around the
> problem? Is there a way for the main process to detect that the message it
> has sent to one of the processes has failed, and to redirect it to some
> other process? What if the main process itself fails? Will that bring the
> whole program down?
> 
> 



