[MPICH] Error handler

Blankenship, David David.Blankenship at kla-tencor.com
Tue Apr 3 16:13:10 CDT 2007

Would I see the same behavior if a worker became unreachable because of
a network error?


From: Rajeev Thakur [mailto:thakur at mcs.anl.gov] 
Sent: Tuesday, April 03, 2007 3:46 PM
To: Blankenship, David; mpich-discuss at mcs.anl.gov
Subject: RE: [MPICH] Error handler

The current version of MPICH2 cannot recover from a catastrophic error
such as the death of a process because of a segmentation fault. Simpler
errors such as incorrect parameters to functions can be caught. We plan
to support fault tolerance sometime in the future. 


	From: owner-mpich-discuss at mcs.anl.gov [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Blankenship, David
	Sent: Tuesday, April 03, 2007 2:24 PM
	To: mpich-discuss at mcs.anl.gov
	Subject: [MPICH] Error handler

	I am new to MPICH, and I have a lot of questions about error
handling, but I will start with just one easy one. 

	I am up and running with MPICH and C++ on Red Hat Enterprise 4.
I have a fairly simple application where the master process divides the
work and sends it out to each of the workers. The workers do their part
of the work independently, and then the master assembles the results
into a report.

	Eventually, I will want to be able to handle failures in the worker
processes by resubmitting the work to another worker to try to get my
job complete. For now, I would like to just catch the error and report
the problem in my application output.
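
The resubmission idea described above can be sketched in plain C++ without any MPI calls. This is only an illustration of the bookkeeping on the master side, not a working fault-tolerant MPICH program: `run_on_worker` is a hypothetical callback standing in for "send the work to a worker and wait for its result", and `ItemReport` and `run_with_resubmit` are names invented for this sketch. As the reply in this thread notes, the MPICH2 version discussed here cannot recover from the death of a process, so detecting the failure is assumed to happen by some other means.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Outcome of one work item after all attempts (hypothetical type).
struct ItemReport {
    int item;                         // work item id
    int worker;                       // worker that completed it, or -1 if none did
    std::vector<std::string> errors;  // one message per failed attempt
};

// Hypothetical stand-in for "send work to a worker and wait for the result".
// Returns true on success, false if the worker failed.
using RunFn = std::function<bool(int worker, int item)>;

// Try each item on its round-robin worker first; on failure, mark that
// worker as dead, record the error, and resubmit to the next live worker.
std::vector<ItemReport> run_with_resubmit(int num_items, int num_workers,
                                          const RunFn& run_on_worker) {
    std::vector<bool> alive(num_workers, true);
    std::vector<ItemReport> reports;
    for (int item = 0; item < num_items; ++item) {
        ItemReport report{item, -1, {}};
        for (int k = 0; k < num_workers && report.worker < 0; ++k) {
            int w = (item + k) % num_workers;
            if (!alive[w]) continue;  // skip workers that already failed
            if (run_on_worker(w, item)) {
                report.worker = w;
            } else {
                alive[w] = false;
                report.errors.push_back("worker " + std::to_string(w) +
                                        " failed on item " + std::to_string(item));
            }
        }
        reports.push_back(std::move(report));
    }
    return reports;
}
```

An item whose `worker` is still -1 after the loop could not be completed by any live worker. Its `errors` list is what the master would add to the report before exiting cleanly.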

	When I run the application and make one of my workers exit, the
job "caused collective abort of all ranks." I then replaced the
default error handler with the ERRORS_THROW_EXCEPTIONS error handler, but I
still get the same results. My MPICH initialization looks like:

	MPI::Init( argC, argV ); 

	I have also tried: 


	with the same results. 

	All I want to do right now is to catch the error, add the error
to my results and exit cleanly. 

	What might I be doing wrong here? (I suppose that I could be
testing this incorrectly.) 
	Is there a way to force MPICH to generate errors for testing? 

	Is there some documentation or articles about error handling
with MPICH that might answer some of my other questions? 


