[mpich2-commits] r7758 - mpich2/trunk
balaji at mcs.anl.gov
Wed Jan 19 16:50:18 CST 2011
Author: balaji
Date: 2011-01-19 16:50:18 -0600 (Wed, 19 Jan 2011)
New Revision: 7758
Modified:
mpich2/trunk/README.vin
Log:
Updates to the README to clarify parts of fault tolerance.
Reviewed by buntinas.
Modified: mpich2/trunk/README.vin
===================================================================
--- mpich2/trunk/README.vin 2011-01-19 19:04:10 UTC (rev 7757)
+++ mpich2/trunk/README.vin 2011-01-19 22:50:18 UTC (rev 7758)
@@ -746,54 +746,71 @@
The features described in this section should be considered
experimental. Which means that they have not been fully tested, and
-the behavior may change in future releases.
+the behavior may change in future releases. The below notes are some
+guidelines on what can be expected in this feature:
-Communication failures in MPICH2 are not fatal errors. This means
-that if the user sets the error handler to MPI_ERRORS_RETURN, MPICH2
-will return an appropriate error code in the event of a communication
-failure. When a process detects a failure when communicating with
-another process, it will consider that other process as having failed
-and will no longer attempt to communicate with that process. The user
-can, however, continue making communication calls to other processes.
-Any outstanding send or receive operations to a failed process, or
-wildcard receives (i.e., with MPI_ANY_SOURCE) posted to communicators
-with a failed process, will be immediately completed with an
-appropriate error code.
+ - ERROR RETURNS: Communication failures in MPICH2 are not fatal
+ errors. This means that if the user sets the error handler to
+ MPI_ERRORS_RETURN, MPICH2 will return an appropriate error code in
+ the event of a communication failure. When a process detects a
+ failure when communicating with another process, it will consider
+ the other process as having failed and will no longer attempt to
+ communicate with that process. The user can, however, continue
+ making communication calls to other processes. Any outstanding
+ send or receive operations to a failed process, or wildcard
+ receives (i.e., with MPI_ANY_SOURCE) posted to communicators with a
+ failed process, will be immediately completed with an appropriate
+ error code.
-Collectives performed on a communicator with a failed process will not
-hang, however, the results of the operation are undefined. Some, but
-not necessarily all, processes may return an error. Specifically,
-the collective on some process might not return an error, but the
-result may still be invalid.
+ - COLLECTIVES: For collective operations performed on communicators
+ with a failed process, the collective would return an error on
+ some, but not necessarily all processes. A collective call
+ returning MPI_SUCCESS on a given process means that the part of the
+ collective performed by that process has been successful.
-If used with the hydra process manager, hydra will detect failed
-processes and notify the MPICH2 library. Users can query the list of
-failed processes using the MPICH_ATTR_FAILED_PROCESSES predefined
-attribute on MPI_COMM_WORLD. The attribute value is an integer array
-containing the ranks of the failed processes. The array is terminated
-by MPI_PROC_NULL. The user needs to declare the following extern in
-order to use the attribute:
- extern int MPICH_ATTR_FAILED_PROCESSES;
+ MPICH2 release specific note: There is currently a bug in MPICH2
+ that might cause the collective operation to return
+ MPI_SUCCESS on a process even if data is corrupted on that
+ process.
-Note that hydra by default will abort the entire application when any
-process terminates before calling MPI_Finalize. In order to allow
-an application to continue running despite failed processes, you'll
-need to pass the -disable-auto-cleanup option to mpiexec.
+ - PROCESS MANAGER: If used with the hydra process manager, hydra will
+ detect failed processes and notify the MPICH2 library. Users can
+ query the list of failed processes using the
+ MPICH_ATTR_FAILED_PROCESSES predefined attribute on MPI_COMM_WORLD.
+ The attribute value is an integer array containing the ranks of the
+ failed processes. The array is terminated by MPI_PROC_NULL.
-Unsupported feature: In the current implementation (version 1.3.2),
-hydra notifies the MPICH2 library of failed processes by sending a
-SIGUSR1 signal. The application can catch this signal to be notified
-of failed processes. If the application replaces the library's signal
-handler with its own, the application must be sure to call the
-library's handler from its own handler. The MPICH_ATTR_FAILED_
-PROCESSES attribute is not updated in the signal handler immediately,
-so the application must call a function like MPI_Iprobe() in order to
-allow the library to process the notification. Note that you cannot
-call MPI function from inside a signal handler (which admittedly makes
-this difficult to use). This feature may not be (probably won't be)
-supported in future releases, so portable applications should not use
-it.
+ MPICH2 release specific note: The user needs to declare the
+ following extern within the application in order to use the
+ attribute (this ideally should be added to mpi.h, but has not
+ been done so, to preserve ABI compatibility in the 1.3.x
+ release series):
+ extern int MPICH_ATTR_FAILED_PROCESSES;
+
+ Note that hydra by default will abort the entire application when
+ any process terminates before calling MPI_Finalize. In order to
+ allow an application to continue running despite failed processes,
+ you will need to pass the -disable-auto-cleanup option to mpiexec.
+
+ - FAILURE NOTIFICATION: THIS IS AN UNSUPPORTED FEATURE AND WILL
+ ALMOST CERTAINLY CHANGE IN THE FUTURE!
+
+ In the current release, hydra notifies the MPICH2 library of failed
+ processes by sending a SIGUSR1 signal. The application can catch
+ this signal to be notified of failed processes. If the application
+ replaces the library's signal handler with its own, the application
+ must be sure to call the library's handler from its own
+ handler. Note that you cannot call any MPI function from inside a
+ signal handler.
+
+ In future releases, the plan is to provide a call such as
+ MPIX_Failure_notification that will allow the user to register a
+ callback function that will be called on process failures. This
+ mechanism has not been added yet to preserve ABI compatibility in
+ the 1.3.x release series.
+
+
Checkpoint and Restart
----------------------
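
Note on the PROCESS MANAGER item in the diff above: the README text says the
attribute value is an integer array of failed ranks terminated by
MPI_PROC_NULL. A minimal sketch of walking such a list is given below. The
helper names count_failed_ranks and rank_has_failed are hypothetical and not
part of MPICH2, and the terminator is passed as a parameter so that the
helpers compile without mpi.h. Fetching the array with MPI_Comm_get_attr, as
shown in the trailing comment, is an assumption about how the attribute is
read, not something this commit states.

```c
#include <stddef.h>

/*
 * Count the entries in a rank list that ends with a sentinel value.
 * In an MPI program the sentinel would be MPI_PROC_NULL; here it is a
 * parameter so that the helper does not depend on mpi.h.
 * (Hypothetical helper, not an MPICH2 function.)
 */
size_t count_failed_ranks(const int *ranks, int terminator)
{
    size_t n = 0;

    if (ranks == NULL)
        return 0;               /* no list available: report zero failures */
    while (ranks[n] != terminator)
        n++;
    return n;
}

/*
 * Return 1 if `rank` appears in the sentinel-terminated list, else 0.
 * (Hypothetical helper, not an MPICH2 function.)
 */
int rank_has_failed(const int *ranks, int terminator, int rank)
{
    size_t i;

    if (ranks == NULL)
        return 0;
    for (i = 0; ranks[i] != terminator; i++) {
        if (ranks[i] == rank)
            return 1;
    }
    return 0;
}

/*
 * Assumed usage inside an MPI program (not compiled here):
 *
 *   extern int MPICH_ATTR_FAILED_PROCESSES;
 *   int *failed = NULL, flag = 0;
 *
 *   MPI_Comm_get_attr(MPI_COMM_WORLD, MPICH_ATTR_FAILED_PROCESSES,
 *                     &failed, &flag);
 *   if (flag)
 *       nfailed = count_failed_ranks(failed, MPI_PROC_NULL);
 */
```

Because the features are described as experimental, an application would
probably want to check the flag on every query and treat a missing attribute
as "no failure information" rather than as "no failures".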