[mpich2-commits] r7758 - mpich2/trunk

balaji at mcs.anl.gov balaji at mcs.anl.gov
Wed Jan 19 16:50:18 CST 2011


Author: balaji
Date: 2011-01-19 16:50:18 -0600 (Wed, 19 Jan 2011)
New Revision: 7758

Modified:
   mpich2/trunk/README.vin
Log:
Updates to the README to clarify parts of fault tolerance.

Reviewed by buntinas.

Modified: mpich2/trunk/README.vin
===================================================================
--- mpich2/trunk/README.vin	2011-01-19 19:04:10 UTC (rev 7757)
+++ mpich2/trunk/README.vin	2011-01-19 22:50:18 UTC (rev 7758)
@@ -746,54 +746,71 @@
 
 The features described in this section should be considered
 experimental.  Which means that they have not been fully tested, and
-the behavior may change in future releases.
+the behavior may change in future releases.  The notes below are
+guidelines on what to expect from these features:
 
-Communication failures in MPICH2 are not fatal errors.  This means
-that if the user sets the error handler to MPI_ERRORS_RETURN, MPICH2
-will return an appropriate error code in the event of a communication
-failure.  When a process detects a failure when communicating with
-another process, it will consider that other process as having failed
-and will no longer attempt to communicate with that process.  The user
-can, however, continue making communication calls to other processes.
-Any outstanding send or receive operations to a failed process, or
-wildcard receives (i.e., with MPI_ANY_SOURCE) posted to communicators
-with a failed process, will be immediately completed with an
-appropriate error code.
+ - ERROR RETURNS: Communication failures in MPICH2 are not fatal
+   errors.  This means that if the user sets the error handler to
+   MPI_ERRORS_RETURN, MPICH2 will return an appropriate error code in
+   the event of a communication failure.  When a process detects a
+   failure when communicating with another process, it will consider
+   the other process as having failed and will no longer attempt to
+   communicate with that process.  The user can, however, continue
+   making communication calls to other processes.  Any outstanding
+   send or receive operations to a failed process, or wildcard
+   receives (i.e., with MPI_ANY_SOURCE) posted to communicators with a
+   failed process, will be immediately completed with an appropriate
+   error code.
 
-Collectives performed on a communicator with a failed process will not
-hang, however, the results of the operation are undefined.  Some, but
-not necessarily all, processes may return an error.  Specifically,
-the collective on some process might not return an error, but the
-result may still be invalid.
+ - COLLECTIVES: For collective operations performed on communicators
+   with a failed process, the collective returns an error on some,
+   but not necessarily all, processes.  A collective call returning
+   MPI_SUCCESS on a given process means that the part of the
+   collective performed by that process was successful.
 
-If used with the hydra process manager, hydra will detect failed
-processes and notify the MPICH2 library.  Users can query the list of
-failed processes using the MPICH_ATTR_FAILED_PROCESSES predefined
-attribute on MPI_COMM_WORLD.  The attribute value is an integer array
-containing the ranks of the failed processes.  The array is terminated
-by MPI_PROC_NULL.  The user needs to declare the following extern in
-order to use the attribute:
-    extern int MPICH_ATTR_FAILED_PROCESSES;
+      MPICH2 release specific note: There is currently a bug in MPICH2
+         that might cause the collective operation to return
+         MPI_SUCCESS on a process even if data is corrupted on that
+         process.
 
-Note that hydra by default will abort the entire application when any
-process terminates before calling MPI_Finalize.  In order to allow
-an application to continue running despite failed processes, you'll
-need to pass the -disable-auto-cleanup option to mpiexec.
+ - PROCESS MANAGER: If used with the hydra process manager, hydra will
+   detect failed processes and notify the MPICH2 library.  Users can
+   query the list of failed processes using the
+   MPICH_ATTR_FAILED_PROCESSES predefined attribute on MPI_COMM_WORLD.
+   The attribute value is an integer array containing the ranks of the
+   failed processes.  The array is terminated by MPI_PROC_NULL.
 
-Unsupported feature: In the current implementation (version 1.3.2),
-hydra notifies the MPICH2 library of failed processes by sending a
-SIGUSR1 signal.  The application can catch this signal to be notified
-of failed processes.  If the application replaces the library's signal
-handler with its own, the application must be sure to call the
-library's handler from it's own handler.  The MPICH_ATTR_FAILED_
-PROCESSES attribute is not updated in the signal handler immediately,
-so the application must call a function like MPI_Iprobe() in order to
-allow the library to process the notification.  Note that you cannot
-call MPI function from inside a signal handler (which admittedly makes
-this difficult to use).  This feature may not be (probably won't be)
-supported in future releases, so portable applications should not use
-it.
+      MPICH2 release specific note: The user needs to declare the
+         following extern within the application in order to use the
+         attribute (ideally this would be declared in mpi.h, but it
+         has not been, in order to preserve ABI compatibility in the
+         1.3.x release series):
 
+             extern int MPICH_ATTR_FAILED_PROCESSES;
+
+   Note that hydra by default will abort the entire application when
+   any process terminates before calling MPI_Finalize.  In order to
+   allow an application to continue running despite failed processes,
+   you will need to pass the -disable-auto-cleanup option to mpiexec.
+
+ - FAILURE NOTIFICATION: THIS IS AN UNSUPPORTED FEATURE AND WILL
+   ALMOST CERTAINLY CHANGE IN THE FUTURE!
+
+   In the current release, hydra notifies the MPICH2 library of failed
+   processes by sending a SIGUSR1 signal.  The application can catch
+   this signal to be notified of failed processes.  If the application
+   replaces the library's signal handler with its own, the application
+   must be sure to call the library's handler from its own
+   handler.  Note that you cannot call any MPI function from inside a
+   signal handler.
+
+   In future releases, the plan is to provide a call such as
+   MPIX_Failure_notification that will allow the user to register a
+   callback function to be called on process failures.  This
+   mechanism has not been added yet, in order to preserve ABI
+   compatibility in the 1.3.x release series.
+
+
 Checkpoint and Restart
 ----------------------
 

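Illustration for the FAILURE NOTIFICATION note above (not part of the
commit): a minimal sketch of how an application might replace the
SIGUSR1 handler while still calling the previously installed one.  It
assumes the library's handler is already installed when the
application installs its own (for example, after MPI_Init), which the
README does not state.  The names install_chained_sigusr1_handler and
consume_failure_pending are hypothetical, and only POSIX calls are
used; no MPI function is called from the handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Set inside the handler, read from normal (non-signal) code. */
static volatile sig_atomic_t failure_pending = 0;

/* SIGUSR1 disposition that was in place before ours.  ASSUMPTION:
 * this is the library's handler when installed after MPI_Init. */
static struct sigaction previous_action;
static int handler_installed = 0;

/* Application handler: record the event, then chain to the previous
 * handler.  No MPI function is called here. */
static void app_sigusr1_handler(int signo, siginfo_t *info, void *context)
{
    failure_pending = 1;

    if (previous_action.sa_flags & SA_SIGINFO) {
        if (previous_action.sa_sigaction != NULL)
            previous_action.sa_sigaction(signo, info, context);
    } else if (previous_action.sa_handler != SIG_DFL &&
               previous_action.sa_handler != SIG_IGN) {
        previous_action.sa_handler(signo);
    }
}

/* Hypothetical helper: install the chaining handler once.
 * Returns 0 on success, -1 if sigaction() fails. */
int install_chained_sigusr1_handler(void)
{
    struct sigaction sa;

    if (handler_installed)
        return 0;   /* a second install would make the handler chain to itself */

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = app_sigusr1_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO | SA_RESTART;

    if (sigaction(SIGUSR1, &sa, &previous_action) != 0)
        return -1;
    handler_installed = 1;
    return 0;
}

/* Hypothetical helper: returns 1 if a notification arrived since the
 * last call and clears the flag, otherwise returns 0. */
int consume_failure_pending(void)
{
    if (!failure_pending)
        return 0;
    failure_pending = 0;
    return 1;
}
```

An application would poll consume_failure_pending() from its main
loop and react to failures there, outside the signal handler.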


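Illustration for the PROCESS MANAGER note above (not part of the
commit): a small sketch of walking a rank list that ends with a
sentinel value, as the README describes for the
MPICH_ATTR_FAILED_PROCESSES attribute value.  To stay free of mpi.h
the sentinel is passed as a parameter; in an MPI program it would be
MPI_PROC_NULL.  How the array is obtained is an assumption here:
presumably through MPI_Comm_get_attr on MPI_COMM_WORLD with the
MPICH_ATTR_FAILED_PROCESSES keyval.  The helper names are
hypothetical.

```c
#include <stddef.h>

/* Count the ranks that precede the sentinel.  A NULL list is treated
 * as "no failed processes". */
size_t count_failed_ranks(const int *ranks, int sentinel)
{
    size_t n = 0;

    if (ranks == NULL)
        return 0;
    while (ranks[n] != sentinel)
        n++;
    return n;
}

/* Return 1 if `rank` appears in the list before the sentinel,
 * otherwise 0. */
int rank_has_failed(const int *ranks, int sentinel, int rank)
{
    size_t i;

    if (ranks == NULL)
        return 0;
    for (i = 0; ranks[i] != sentinel; i++) {
        if (ranks[i] == rank)
            return 1;
    }
    return 0;
}
```

A caller could use rank_has_failed() to skip communication with ranks
already reported as failed.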