[mpich-discuss] will my cluster go tumbling down?

Darius Buntinas buntinas at mcs.anl.gov
Mon Jan 31 15:33:18 CST 2011


Hi Eugene,

With the latest version of MPICH2, if you pass --disable-auto-cleanup to mpiexec, it will not kill the entire job if a process terminates unexpectedly (i.e., terminates without calling MPI_Finalize).

The latest version of MPICH2 can tolerate process and communication failures. Any communication operation (send or receive) involving the failed process will return an error code. Note that collective operations may not give the correct result if performed on a communicator with a failed process, though they should return an error code to the processes that received an incorrect result. The fault tolerance code is still experimental, so you might run into some bugs (if you do, please let us know).

-d

On Jan 29, 2011, at 1:21 PM, Eugene N wrote:

> 
> I don't want my node to make every rank abort if it segfaults. I mean, if the cause of the error has nothing to do with MPI, how can I safeguard myself from such trouble?
> 


