[mpich-discuss] survived nodes

Darius Buntinas buntinas at mcs.anl.gov
Wed Oct 27 14:19:51 CDT 2010


As Bill mentioned, the only way to implement a timeout is to post an MPI_Irecv and then keep polling the request (e.g. with MPI_Test) until it either completes or your timeout expires, in which case you cancel the request with MPI_Cancel.

MPICH2 1.3 has features to allow the application to survive process and communication failures.  Use the -disable-auto-cleanup flag for mpiexec to prevent it from killing your entire job when a process fails.  Then set the error handler to MPI_ERRORS_RETURN, so that the MPI functions will return an error code rather than aborting when a fault happens.  If you do this, you'll be able to continue communicating with non-failed processes.  The only catch is that you can't use collective operations on communicators that contain failed processes.

FWIW, the MPI Forum is working on defining the behavior of the MPI library when faults occur.

I hope this helps.

-d

On Oct 27, 2010, at 8:17 AM, Harun Raşit ER wrote:

> I have 2 nodes. One of them sends a message to the other and waits for a reply. But the other node may not be alive (maybe the network has crashed). So I want to wait for the reply for just 3 seconds. After that, the sender should report that the other node has crashed and go on with its task. But I cannot achieve this simple task :)
> 
> please help!


