[MPICH2-dev] handling fatal errors

Rajeev Thakur thakur at mcs.anl.gov
Mon Aug 8 13:22:54 CDT 2005


David,
      If you look, for example, in src/mpid/ch3/channels/sock/src, we call
MPIR_Err_create_code and return the resulting error code:
        mpi_errno = MPIR_Err_create_code(mpi_errno, MPIR_ERR_FATAL, FCNAME,
                                         __LINE__, MPI_ERR_OTHER, "**fail", NULL);

For the default error handler (errors are fatal), MPIR_Err_create_code will
print an error message that includes the function call stack and abort.
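
To make the pattern concrete, here is a toy model of the behavior described
above: each call pushes a frame onto an error stack, and a fatal severity hands
the code to the current handler, which by default prints the stack and aborts.
Every name in it (err_create_code, errors_are_fatal, errors_return,
ERR_CLASS_OTHER, and the two example functions) is hypothetical and chosen for
illustration only; it is a sketch of the idea, not MPICH2's actual
implementation or signatures.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* All names below are hypothetical; this is a toy model, not MPICH2 code. */
#define ERR_STACK_MAX   8
#define ERR_CLASS_OTHER 1   /* stand-in error class, value chosen arbitrarily */

enum err_severity { ERR_RECOVERABLE, ERR_FATAL };

struct err_frame { const char *fcname; int line; const char *msg; };

static struct err_frame err_stack[ERR_STACK_MAX];
static int err_depth = 0;

typedef void (*err_handler_fn)(int err_class);

/* Default handler: print the recorded stack, newest frame first, then abort. */
static void errors_are_fatal(int err_class)
{
    fprintf(stderr, "Fatal error (class %d), error stack:\n", err_class);
    for (int i = err_depth - 1; i >= 0; i--)
        fprintf(stderr, "  %s(%d): %s\n",
                err_stack[i].fcname, err_stack[i].line, err_stack[i].msg);
    abort();
}

/* Alternative handler: do nothing, so the caller sees the returned code. */
static void errors_return(int err_class) { (void)err_class; }

static err_handler_fn current_handler = errors_are_fatal;

/* Record one frame; if the severity is fatal, invoke the current handler. */
static int err_create_code(int prev, enum err_severity sev, const char *fcname,
                           int line, int err_class, const char *msg)
{
    assert(fcname != NULL && msg != NULL);
    if (prev == 0)
        err_depth = 0;                    /* no earlier error: new stack */
    if (err_depth < ERR_STACK_MAX) {      /* frames beyond the limit are dropped */
        err_stack[err_depth].fcname = fcname;
        err_stack[err_depth].line = line;
        err_stack[err_depth].msg = msg;
        err_depth++;
    }
    if (sev == ERR_FATAL)
        current_handler(err_class);
    return err_class;
}

/* Low-level routine that detects a failure and starts the error stack. */
static int sock_read_sketch(void)
{
    return err_create_code(0, ERR_RECOVERABLE, "sock_read_sketch", __LINE__,
                           ERR_CLASS_OTHER, "read failed");
}

/* Caller that adds its own frame and marks the error as fatal. */
static int progress_sketch(void)
{
    int err = sock_read_sketch();
    if (err != 0)
        err = err_create_code(err, ERR_FATAL, "progress_sketch", __LINE__,
                              ERR_CLASS_OTHER, "**fail");
    return err;
}
```

With the default handler left in place, calling progress_sketch() prints both
frames to stderr and aborts. Setting current_handler to errors_return before
the call lets the error code propagate back to the caller instead.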

However, we have not been as consistent with our error handling as we should
have been, so you may find examples where we haven't done the right thing.

Rajeev 


> -----Original Message-----
> From: owner-mpich2-dev at mcs.anl.gov 
> [mailto:owner-mpich2-dev at mcs.anl.gov] On Behalf Of David Gingold
> Sent: Friday, August 05, 2005 3:52 PM
> To: mpich2-dev at mcs.anl.gov
> Subject: [MPICH2-dev] handling fatal errors
> 
> In an MPICH2 device implementation, what is the right way to handle  
> fatal errors that cannot easily be attributed to a calling function?
> 
> Possible examples of this:
> 
>      - An asynchronous progress thread attempts to allocate memory  
> but fails.
> 
>      - Resource allocation fails in code that was triggered by a user  
> MPI call, but that is not particularly related to that call.
> 
>      - A similar failure happens in a place where it would be too  
> awkward or costly to include code to pass the error back to the user.
> 
> I spotted a few examples of this sort of thing in the MPICH2 code:
> 
>      MPID_Abort(MPIR_Process.comm_world, MPIR_Err_create_code(...), ...);
> 
> but I'm not sure whether doing this crosses into the realm of  
> undesirability.
> 
> -dg
> 
> --
> David Gingold
> Principal Software Engineer
> SiCortex
> One Clock Tower Place, Suite 100
> Maynard MA 01754
> (978) 897-0214 x224
> 
> 
> 



