[mpich-discuss] Can MPICH2 handle the fault that some processes die irregularly

Darius Buntinas buntinas at mcs.anl.gov
Wed Jan 5 13:40:32 CST 2011


The latest release has some support for tolerating communication failures, such as those caused by a failed process. However, it doesn't do a good job of detecting failed processes, so a process can hang in a receive, waiting for a message from a process that has already failed.  We are working on improving detection of and tolerance to failed processes.  The next release should include many improvements.

In addition to setting an error handler, you'll need to tell the process manager not to terminate the job when a process fails.  If you're using the hydra process manager (which is the default in the latest release), you can give the -disable-auto-cleanup option to mpiexec.

-d

On Jan 4, 2011, at 7:41 PM, ejoywx wrote:

> Dear Sir,
> 
> Sorry to trouble you!
> 
> Maybe I should not ask this question, but for me the question "Can MPICH2 handle the fault that some processes die irregularly?" is very important. In our computer cluster, I find that if a process dies on some node, or a node is shut down, all processes in the cluster die. We attempted to register an error handler to deal with such faults, but unfortunately we failed.
> 
> I admit that I do not know MPICH2, but I hope I am able to get help from you!  "Can MPICH2 handle the fault that some processes die irregularly?"
> 
> I look forward to receiving your e-mail. Thanks.
> 
> ejoywx
> 2011-01-05
> 
> 
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


