[mpich-discuss] Fault Tolerance

Darius Buntinas buntinas at mcs.anl.gov
Thu Sep 16 07:21:14 CDT 2010


This feature will be available in 1.3.  You can try it out from the nightly snapshots.  It has been tested with the tcp netmod on Unix; I'm not sure whether it's been tested on Windows (Jayesh?).

With this feature, when a process dies, the other processes can continue to run.  However, sends to the dead process will result in an error (eventually, depending on network timeouts, etc.).  Collectives on communicators containing the dead process will not work reliably (they may succeed on some processes and fail on others), so you really shouldn't use them.  Unfortunately, a comm split (and, I believe, every other communicator creation operation) is collective, so for now there's no way to create a new communicator from a communicator with a dead process.  This is something we're still working on.

To use this, you'll need to set the error handler to "errors return" (MPI_ERRORS_RETURN) and check the return codes of all MPI functions.  You'll also need to pass the --disable-auto-cleanup option to hydra so it doesn't kill the entire job when a process exits before calling MPI_Finalize.

I hope that helps.

-d

On Sep 15, 2010, at 10:43 PM, Hiatt, Dave M wrote:

> I see Ticket #1089, so I leap to the conclusion that it will be available soon.  Is there anything I could use in the interim?  The wolves are at the door, so to speak.
>  
> “People get held back by the voice inside em” – K’naan – In the Beginning
>  
> Dave M. Hiatt
> Director, Risk Analytics
> CitiMortgage
> 1000 Technology Drive
> O'Fallon, MO 63368-2240
>  
> Telephone:  636-261-1408
>  


