[MPICH2-dev] handling fatal errors
Rajeev Thakur
thakur at mcs.anl.gov
Mon Aug 8 14:55:29 CDT 2005
I should add that "**fail" is not a good example in the call to
MPIR_Err_create_code below. We usually indicate the specific error that has
occurred, for example "**nomem". That gets expanded to a real error message
provided in the file src/mpi/errhan/errnames.txt. Having all the error
messages in a separate file also allows for internationalization. Someone
could translate that file into French, for example.
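
As a rough illustration of the short-name expansion described above (this is not the actual MPICH2 implementation), here is a minimal C sketch of a message catalog lookup. The names err_entry, catalog, and expand_error_key are hypothetical, and the message texts are placeholders rather than the contents of errnames.txt:

```c
#include <string.h>

/* Hypothetical catalog entry: a short key and the message it expands to.
 * The real messages live in src/mpi/errhan/errnames.txt; the texts below
 * are placeholders, not copied from that file. */
struct err_entry {
    const char *key;
    const char *message;
};

/* Illustrative catalog. Translating these strings would change what the
 * user sees without touching any call site that passes a key. */
static const struct err_entry catalog[] = {
    { "**nomem", "Out of memory" },
    { "**fail",  "failure" },
};

/* Return the message for key, or the key itself if it is not in the
 * catalog. */
const char *expand_error_key(const char *key)
{
    size_t n = sizeof(catalog) / sizeof(catalog[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(catalog[i].key, key) == 0)
            return catalog[i].message;
    }
    return key;
}
```

With this sketch, expand_error_key("**nomem") returns the placeholder text for that key, while a key missing from the catalog comes back unchanged, so a specific key such as "**nomem" tells the user more than a generic "**fail".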
Rajeev
> -----Original Message-----
> From: Rajeev Thakur [mailto:thakur at mcs.anl.gov]
> Sent: Monday, August 08, 2005 1:23 PM
> To: 'David Gingold'; 'mpich2-dev at mcs.anl.gov'
> Subject: RE: [MPICH2-dev] handling fatal errors
>
> David,
> If you look, for example, in src/mpid/ch3/channels/sock/src, we call
> MPIR_Err_create_code and return its result:
>
>     mpi_errno = MPIR_Err_create_code(mpi_errno, MPIR_ERR_FATAL, FCNAME,
>                                      __LINE__, MPI_ERR_OTHER, "**fail", NULL);
>
> For the default error handler (errors are fatal),
> MPIR_Err_create_code will print an error message that
> includes the function call stack and abort.
>
> However, we have not been as consistent with our error
> handling as we should have been, so you may find examples
> where we haven't done the right thing.
>
> Rajeev
>
>
> > -----Original Message-----
> > From: owner-mpich2-dev at mcs.anl.gov
> > [mailto:owner-mpich2-dev at mcs.anl.gov] On Behalf Of David Gingold
> > Sent: Friday, August 05, 2005 3:52 PM
> > To: mpich2-dev at mcs.anl.gov
> > Subject: [MPICH2-dev] handling fatal errors
> >
> > In an MPICH2 device implementation, what is the right way to handle
> > fatal errors that cannot easily be attributed to a calling function?
> >
> > Possible examples of this:
> >
> > - An asynchronous progress thread attempts to allocate memory
> > but fails.
> >
> > - Resource allocation fails in code that was triggered by a user
> > MPI call, but that is not particularly related to that call.
> >
> > - A similar failure happens in a place where it would be too
> > awkward or costly to include code to pass the error back to the
> > user.
> >
> > I spotted a few examples of this sort of thing in the MPICH2 code:
> >
> >     MPID_Abort(MPIR_Process.comm_world, MPIR_Err_create_code(...), ...);
> >
> > but I'm not sure whether doing this crosses into the realm of
> > undesirability.
> >
> > -dg
> >
> > --
> > David Gingold
> > Principal Software Engineer
> > SiCortex
> > One Clock Tower Place, Suite 100
> > Maynard MA 01754
> > (978) 897-0214 x224
> >