[mpich2-commits] r7733 - mpich2/trunk
buntinas at mcs.anl.gov
Mon Jan 17 18:43:58 CST 2011
Author: buntinas
Date: 2011-01-17 18:43:58 -0600 (Mon, 17 Jan 2011)
New Revision: 7733
Modified:
mpich2/trunk/README.vin
Log:
Added documentation on fault tolerance features in README. This should resolve ticket #1142
Modified: mpich2/trunk/README.vin
===================================================================
--- mpich2/trunk/README.vin 2011-01-14 23:10:43 UTC (rev 7732)
+++ mpich2/trunk/README.vin 2011-01-18 00:43:58 UTC (rev 7733)
@@ -32,7 +32,7 @@
4. Alternate Process Managers
5. Alternate Configure Options
6. Testing the MPICH2 installation
-7. Checkpoint/Restart
+7. Fault Tolerance
8. Environment Variables
9. Developer Builds
10. Installing MPICH2 on windows
@@ -735,13 +735,72 @@
-------------------------------------------------------------------------
-7. Checkpoint/Restart
-=====================
+7. Fault Tolerance
+==================
+MPICH2 has some tolerance to process failures, and supports
+checkpointing and restart.
+
+Tolerance to Process Failures
+-----------------------------
+
+The features described in this section should be considered
+experimental, which means that they have not been fully tested and
+the behavior may change in future releases.
+
+Communication failures in MPICH2 are not fatal errors. This means
+that if the user sets the error handler to MPI_ERRORS_RETURN, MPICH2
+will return an appropriate error code in the event of a communication
+failure. When a process detects a failure while communicating with
+another process, it will consider that other process to have failed
+and will no longer attempt to communicate with it. The user
+can, however, continue making communication calls to other processes.
+Any outstanding send or receive operations to a failed process, or
+wildcard receives (i.e., with MPI_ANY_SOURCE) posted to communicators
+with a failed process, will be immediately completed with an
+appropriate error code.
+
+Collectives performed on a communicator with a failed process will not
+hang; however, the results of the operation are undefined. Some, but
+not necessarily all, processes may return an error. Specifically,
+the collective on some process might not return an error, but the
+result may still be invalid.
+
+When MPICH2 runs under the hydra process manager, hydra detects failed
+processes and notifies the MPICH2 library. Users can query the list of
+failed processes using the MPICH_ATTR_FAILED_PROCESSES predefined
+attribute on MPI_COMM_WORLD. The attribute value is an integer array
+containing the ranks of the failed processes. The array is terminated
+by MPI_PROC_NULL. The user needs to declare the following extern in
+order to use the attribute:
+ extern int MPICH_ATTR_FAILED_PROCESSES;
+
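As a rough illustration of how the MPI_PROC_NULL-terminated rank array described above could be consumed, the following C sketch walks such an array. The helper names (count_failed_ranks, rank_has_failed) are hypothetical and not part of MPICH2. The sentinel is passed as a parameter so that the helpers do not depend on mpi.h; a real caller would pass MPI_PROC_NULL and would presumably obtain the array from the attribute with MPI_Comm_get_attr on MPI_COMM_WORLD, which is an assumption this README does not spell out.

```c
#include <stddef.h>

/* Hypothetical helper: count the entries in a sentinel-terminated rank
 * array. A NULL pointer is treated as "no failed processes". */
size_t count_failed_ranks(const int *ranks, int sentinel)
{
    size_t n = 0;

    if (ranks == NULL)
        return 0;
    while (ranks[n] != sentinel)
        n++;
    return n;
}

/* Hypothetical helper: return 1 if `rank` appears in the array before
 * the sentinel, 0 otherwise. */
int rank_has_failed(const int *ranks, int sentinel, int rank)
{
    size_t i;

    if (ranks == NULL)
        return 0;
    for (i = 0; ranks[i] != sentinel; i++) {
        if (ranks[i] == rank)
            return 1;
    }
    return 0;
}
```

An application could call rank_has_failed() before posting a send to a peer, so that it skips ranks already known to have failed instead of waiting for an error code.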
+Note that hydra by default will abort the entire application when any
+process terminates before calling MPI_Finalize. In order to allow
+an application to continue running despite failed processes, you'll
+need to pass the -disable-auto-cleanup option to mpiexec.
+
+Unsupported feature: In the current implementation (version 1.3.2),
+hydra notifies the MPICH2 library of failed processes by sending a
+SIGUSR1 signal. The application can catch this signal to be notified
+of failed processes. If the application replaces the library's signal
+handler with its own, the application must be sure to call the
+library's handler from its own handler. The
+MPICH_ATTR_FAILED_PROCESSES attribute is not updated in the signal
+handler immediately, so the application must call a function like
+MPI_Iprobe() in order to allow the library to process the
+notification. Note that you cannot call MPI functions from inside a
+signal handler (which admittedly makes this difficult to use). This
+feature may not be (probably won't be) supported in future releases,
+so portable applications should not use it.
+
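The handler-chaining requirement for SIGUSR1 can be sketched in plain C as follows. This is only one possible arrangement, not the library's own code: the function and variable names are hypothetical, and it assumes the library's handler is already installed when install_failure_notifier() is called (the README does not say when the library installs it). The handler only sets a flag and forwards the signal, since MPI functions cannot be called from inside it.

```c
#include <signal.h>

/* Set by the handler, cleared by the main loop. */
static volatile sig_atomic_t failure_pending = 0;

/* Handler that was installed before ours; assumed to be the library's. */
static void (*previous_handler)(int) = SIG_DFL;

/* Record the notification and forward it to the previous handler.
 * No MPI calls are made here. */
static void on_failure_signal(int signo)
{
    failure_pending = 1;
    if (previous_handler != SIG_DFL && previous_handler != SIG_IGN &&
        previous_handler != SIG_ERR)
        previous_handler(signo);
}

/* Install our handler and remember the one it replaces.
 * Returns 0 on success, -1 on failure. */
int install_failure_notifier(void)
{
    void (*old)(int) = signal(SIGUSR1, on_failure_signal);

    if (old == SIG_ERR)
        return -1;
    previous_handler = old;
    return 0;
}

/* Called from the main loop: returns 1 if a notification arrived since
 * the last call, 0 otherwise. When it returns 1, the application would
 * then call something like MPI_Iprobe() before reading the attribute. */
int consume_failure_notice(void)
{
    if (!failure_pending)
        return 0;
    failure_pending = 0;
    return 1;
}
```

The sketch uses signal() to stay within standard C. On systems where signal() resets the handler after each delivery, sigaction() is the more robust choice.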
+Checkpoint and Restart
+----------------------
+
MPICH2 supports checkpoint/restart fault tolerance using BLCR.
-Configuration
--------------
+CONFIGURATION
+
First, you need to have BLCR version 0.8.2 or later installed on your
machine. If it's installed in the default system location, add the
following two options to your configure command:
@@ -769,8 +828,8 @@
Note, checkpointing is only supported with the Hydra process manager.
-Verifying Checkpointing Support
--------------------------------
+VERIFYING CHECKPOINTING SUPPORT
+
Make sure MPICH2 is correctly configured with BLCR. You can do this
using:
@@ -779,8 +838,8 @@
This should display 'BLCR' under 'Checkpointing libraries available'.
-Checkpointing the Application
------------------------------
+CHECKPOINTING THE APPLICATION
+
There are two ways to cause the application to checkpoint. You can ask
mpiexec to periodically checkpoint the application using the mpiexec
option -ckpoint-interval (seconds):