[mpich-discuss] Fault Tolerant MPICH2 >= 1.3.2

Rob Stewart R.Stewart at hw.ac.uk
Thu Sep 8 11:17:20 CDT 2011


Thanks Jayesh,

On 08/09/11 17:13, Jayesh Krishna wrote:
> Darius,
>   Can you help him out ?

In the post below, Jayesh described the notion that, from 1.3.2 onwards 
on Linux systems, a node failure would not result in the termination of 
an MPI job:
http://lists.mcs.anl.gov/pipermail/mpich-discuss/2011-January/008791.html

I have just compiled MPICH2 with the intention of running a simple MPI 
program on our 32-node cluster to test the behaviour I think Jayesh is 
describing.

Do you have a simple unit test C file, or a simple example that I can 
use to test the continuation of jobs in the face of node failure?

Regards,

-- 
Rob Stewart
Computer Science
Heriot Watt University
Edinburgh
T: 0131 4514196
E: rs46 at hw.ac.uk

