[MPICH2-dev] handling fatal errors

Rajeev Thakur thakur at mcs.anl.gov
Mon Aug 8 14:55:29 CDT 2005


I should add that "**fail" is not a good example in the call to
MPIR_Err_create_code below. We usually indicate the specific error that has
occurred, for example "**nomem". That gets expanded to a real error message
provided in the file src/mpi/errhan/errnames.txt. Having all the error
messages in a separate file also allows for internationalization. Someone
could translate that file into French, for example.

Rajeev
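
A minimal sketch of the key-to-message idea described above, written as
stand-alone C rather than MPICH2 code. Everything in it is hypothetical: the
struct err_entry type, the catalog_en and catalog_fr tables, the message
texts, and the expand_key and format_error helpers are illustrations only and
are not taken from errnames.txt or from the MPICH2 sources. It only shows why
passing a short key such as "**nomem" lets the message text live in one
place and be swapped for a translation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical catalog entry: a short error key and the text it expands to. */
struct err_entry {
    const char *key;
    const char *text;
};

/* Illustrative English messages; NOT copied from MPICH2's errnames.txt. */
static const struct err_entry catalog_en[] = {
    { "**fail",  "failure" },
    { "**nomem", "out of memory" },
};

/* Illustrative French translation of the same keys (accents omitted). */
static const struct err_entry catalog_fr[] = {
    { "**fail",  "echec" },
    { "**nomem", "memoire insuffisante" },
};

#define CATALOG_LEN (sizeof catalog_en / sizeof catalog_en[0])

/* Look up key in a catalog of n entries. Unknown keys are returned
 * unchanged so the caller still has something to print. */
static const char *expand_key(const struct err_entry *catalog, size_t n,
                              const char *key)
{
    assert(catalog != NULL && key != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(catalog[i].key, key) == 0)
            return catalog[i].text;
    }
    return key;
}

/* Format "function(line): message" into buf, expanding the key through the
 * chosen catalog. Returns the value of snprintf. */
static int format_error(char *buf, size_t len,
                        const struct err_entry *catalog, size_t n,
                        const char *fcname, int line, const char *key)
{
    return snprintf(buf, len, "%s(%d): %s",
                    fcname, line, expand_key(catalog, n, key));
}
```

With these helpers, the call site names only the key; choosing catalog_fr
instead of catalog_en changes the message text without touching the code
that reports the error.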


> -----Original Message-----
> From: Rajeev Thakur [mailto:thakur at mcs.anl.gov] 
> Sent: Monday, August 08, 2005 1:23 PM
> To: 'David Gingold'; 'mpich2-dev at mcs.anl.gov'
> Subject: RE: [MPICH2-dev] handling fatal errors
> 
> David,
>       If you look, for example, in 
> src/mpid/ch3/channels/sock/src, we call MPIR_Err_create_code 
> and return 
>         mpi_errno = MPIR_Err_create_code(mpi_errno, MPIR_ERR_FATAL,
>                         FCNAME, __LINE__, MPI_ERR_OTHER, "**fail", NULL);
> 
> For the default error handler (errors are fatal), 
> MPIR_Err_create_code will print an error message that 
> includes the function call stack and abort.
> 
> However, we have not been as consistent with our error 
> handling as we should have been, so you may find examples 
> where we haven't done the right thing.
> 
> Rajeev 
> 
> 
> > -----Original Message-----
> > From: owner-mpich2-dev at mcs.anl.gov 
> > [mailto:owner-mpich2-dev at mcs.anl.gov] On Behalf Of David Gingold
> > Sent: Friday, August 05, 2005 3:52 PM
> > To: mpich2-dev at mcs.anl.gov
> > Subject: [MPICH2-dev] handling fatal errors
> > 
> > In an MPICH2 device implementation, what is the right way to handle
> > fatal errors that cannot easily be attributed to a calling function?
> > 
> > Possible examples of this:
> > 
> >      - An asynchronous progress thread attempts to allocate memory
> >        but fails.
> > 
> >      - Resource allocation fails in code that was triggered by a user
> >        MPI call, but that is not particularly related to that call.
> > 
> >      - A similar failure happens in a place where it would be too
> >        awkward or costly to include code to pass the error back to
> >        the user.
> > 
> > I spotted a few examples of this sort of thing in the MPICH2 code:
> > 
> >      MPID_Abort(MPIR_Process.comm_world, MPIR_Err_create_code(...), ...);
> > 
> > but I'm not sure whether doing this crosses into the realm of  
> > undesirability.
> > 
> > -dg
> > 
> > --
> > David Gingold
> > Principal Software Engineer
> > SiCortex
> > One Clock Tower Place, Suite 100
> > Maynard MA 01754
> > (978) 897-0214 x224
> > 
> > 
> > 
> 



