[mpich-discuss] Fault Tolerant MPICH2 >= 1.3.2

Darius Buntinas buntinas at mcs.anl.gov
Thu Sep 8 12:40:37 CDT 2011


Hi Rob,

See Section 7 of the README file for how to use MPICH in a way that tolerates failures.

I don't have a test program, but you can write your own that, e.g., does a ping-pong with another process.  When you kill one of the processes, you should get an error from the send or receive function call on the other process.  

By default MPICH will kill the job if any process fails, so make sure you read the README to see how to do it correctly.

-d

On Sep 8, 2011, at 11:17 AM, Rob Stewart wrote:

> Thanks Jayesh,
> 
> On 08/09/11 17:13, Jayesh Krishna wrote:
>> Darius,
>>  Can you help him out?
> 
> Jayesh described here:
> http://lists.mcs.anl.gov/pipermail/mpich-discuss/2011-January/008791.html
> 
> The notion is that, from 1.3.2 onward on Linux systems, node failure would not result in the termination of an MPI job.
> 
> I have just compiled MPICH2 with the intention of running a simple MPI program on our 32-node cluster to test the behaviour I think Jayesh is describing.
> 
> Do you have a simple unit test C file, or a simple example that I can use to test the continuation of jobs in the face of node failure?
> 
> Regards,
> 
> -- 
> Rob Stewart
> Computer Science
> Heriot Watt University
> Edinburgh
> T: 0131 4514196
> E: rs46 at hw.ac.uk


