[mpich-discuss] failed processes

Darius Buntinas buntinas at mcs.anl.gov
Wed Nov 3 11:29:08 CDT 2010


Hi Harun,

If you use MPICH2 1.3 and pass the -disable-auto-cleanup option to mpiexec, your app will not be killed automatically when a process dies before calling MPI_Finalize.  You'll then need to change the error handler on your communicator (the default on MPI_COMM_WORLD is MPI_ERRORS_ARE_FATAL) to MPI_ERRORS_RETURN, so that the application won't abort when an error is detected.

The MPICH2 library should allow you to continue communicating with other processes if a process dies.  However, collective operations on a communicator that includes a dead process will most likely hang some processes.

I hope this helps.

-d

On Nov 3, 2010, at 4:17 AM, Harun Raşit ER wrote:

> When one of the processes fails, my whole job is aborted. There must be a solution, but I cannot find it! I would like to continue without the failed process and finish the job with the remaining processes. Is there any idea or solution?
> 
> Thanks for your help.
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


