[MPICH] Error handler

Rajeev Thakur thakur at mcs.anl.gov
Tue Apr 3 15:45:53 CDT 2007


The current version of MPICH2 cannot recover from a catastrophic error such
as the death of a process because of a segmentation fault. Simpler errors
such as incorrect parameters to functions can be caught. We plan to support
fault tolerance sometime in the future. 
 
Rajeev


  _____  

From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Blankenship, David
Sent: Tuesday, April 03, 2007 2:24 PM
To: mpich-discuss at mcs.anl.gov
Subject: [MPICH] Error handler



I am new to MPICH, and I have a lot of questions about error handling, but I
will start with just one easy one. 

I am up and running with MPICH and C++ on Red Hat Enterprise 4. I have a
fairly simple application where the master process divides the work and
sends it out to each of the workers. The workers do their part of the work
independently, and then the master assembles the results into a report.

Eventually, I will want to be able to handle failures in the worker processes
by resubmitting the work to another worker to try to get my job complete.
For now, I would like to just catch the error and report the problem in my
application output.
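
The resubmission plan described above could be sketched as master-side
bookkeeping along the following lines. This is only a minimal sketch, not
anything provided by MPICH: WorkTracker and all of its members are
hypothetical names, and it assumes the master learns of a failure through an
ordinary message from the worker. It does not detect a worker process that
has died.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>
#include <vector>

// Hypothetical master-side bookkeeping; none of these names come from MPICH.
class WorkTracker {
public:
    WorkTracker(int num_tasks, int max_attempts)
        : max_attempts_(max_attempts), attempts_(num_tasks, 0) {
        assert(num_tasks >= 0 && max_attempts > 0);
        for (int t = 0; t < num_tasks; ++t) pending_.push_back(t);
    }

    // Hand the next pending task to `worker`; returns -1 if nothing is pending.
    int assign(int worker) {
        if (pending_.empty()) return -1;
        int task = pending_.front();
        pending_.pop_front();
        ++attempts_[task];
        in_flight_[task] = worker;
        return task;
    }

    // A worker returned a result for `task`.
    void complete(int task) {
        in_flight_.erase(task);
        ++done_;
    }

    // A worker reported a failure for `task`. Requeue it if attempts remain
    // (returns true); otherwise record the error for the report (returns false).
    bool fail(int task, const std::string& reason) {
        in_flight_.erase(task);
        if (attempts_[task] < max_attempts_) {
            pending_.push_back(task);
            return true;
        }
        errors_.push_back("task " + std::to_string(task) + ": " + reason);
        return false;
    }

    // True once no task is waiting or outstanding.
    bool finished() const { return pending_.empty() && in_flight_.empty(); }
    int done() const { return done_; }
    const std::vector<std::string>& errors() const { return errors_; }

private:
    int max_attempts_;
    std::vector<int> attempts_;    // attempts made so far, indexed by task id
    std::deque<int> pending_;      // tasks waiting to be handed out
    std::map<int, int> in_flight_; // task id -> worker currently holding it
    int done_ = 0;
    std::vector<std::string> errors_;
};
```

In such a design the master would call assign() before sending a task,
complete() or fail() after receiving the worker's reply, and print errors()
in the final report once finished() returns true.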

When I run the application and make one of my workers exit, the job fails
with the message "caused collective abort of all ranks." At this point, I
replaced the default error handler with the ERRORS_THROW_EXCEPTIONS error
handler, but I still get the same results. My MPICH initialization looks like:

MPI::Init( argC, argV ); 
MPI::COMM_WORLD.Set_errhandler( MPI::ERRORS_THROW_EXCEPTIONS ); 

I have also tried: 

MPI_Errhandler_set( MPI_COMM_WORLD, MPI::ERRORS_THROW_EXCEPTIONS ); 

with the same results. 

All I want to do right now is to catch the error, add the error to my
results and exit cleanly. 

What might I be doing wrong here? (I suppose that I could be testing this
incorrectly.) 
Is there a way to force MPICH to generate errors for testing? 

Is there some documentation or articles about error handling with MPICH that
might answer some of my other questions? 

Thanks, 

David 

