[MPICH] Fatal error in MPI_Test

Darius Buntinas buntinas at mcs.anl.gov
Thu Sep 20 10:42:13 CDT 2007


These types of error messages happen when one process exits before 
calling MPI_Finalize, usually because it is killed by an error (e.g., 
segfault).  As other processes try to communicate with that process, 
they'll notice that the connection has been closed, print an error 
message and abort.  The process manager will kill all processes when it 
detects that one process aborts or is killed, but sometimes it can't do 
that before the processes themselves detect the closed connections.
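
To make the mechanism concrete, here is a minimal sketch in Python (not MPICH code) of what a surviving process sees when its peer goes away. It assumes a local socket pair is a fair stand-in for the connection between two ranks; the name probe_closed_peer is made up for this illustration. A read on the dead connection returns end-of-file, and a write fails with EPIPE, which is the "errno=32:Broken pipe" in the quoted MPI_Isend error on Linux.

```python
import errno
import os
import socket


def probe_closed_peer():
    """Return (bytes_read, send_errno) as seen by a process whose peer is gone."""
    # Hypothetical stand-in: a connected socket pair plays the role of the
    # connection between two ranks.
    survivor, victim = socket.socketpair()
    # Closing one end models the peer dying without a clean shutdown.
    victim.close()
    try:
        # Reading from a connection the peer has closed returns zero bytes (EOF).
        data = survivor.recv(16)
        try:
            survivor.send(b"payload")
            send_errno = 0
        except OSError as exc:
            # CPython ignores SIGPIPE by default, so the failed write shows up
            # as an exception carrying the errno instead of killing the process.
            send_errno = exc.errno
    finally:
        survivor.close()
    return data, send_errno


if __name__ == "__main__":
    data, send_errno = probe_closed_peer()
    print("read returned", len(data), "bytes")
    print("send failed with", errno.errorcode.get(send_errno), "-", os.strerror(send_errno))
```

The read side corresponds to the "connection closed by peer" reported under MPI_Test, and the write side to the one reported under MPI_Isend.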

So basically, one initial error caused a chain reaction, and the 
messages you quoted are symptoms of it rather than its cause.

The real question is what triggered the chain reaction.  I would bet 
that one process segfaulted.  Look for a message that a process received 
a segfault signal (on Linux, that's signal number 11, SIGSEGV), or that 
it otherwise exited before calling MPI_Finalize.
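
For reference, here is a small sketch of how a parent process can tell that a child was killed by signal 11. It is only an illustration in Python, not how the MPICH2 process manager or OSC's mpiexec is implemented; CHILD, describe_exit and run_crashing_child are hypothetical names. The child sends itself SIGSEGV, and the parent reads the signal number back from the exit status.

```python
import signal
import subprocess
import sys

# Child program: disable core dumps, then deliver SIGSEGV to itself to
# imitate a process that crashes before it can shut down cleanly.
CHILD = (
    "import os, resource, signal\n"
    "resource.setrlimit(resource.RLIMIT_CORE, (0, 0))\n"
    "os.kill(os.getpid(), signal.SIGSEGV)\n"
)


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # subprocess reports "killed by signal N" as a return code of -N.
        number = -returncode
        return "killed by " + signal.Signals(number).name + " (signal " + str(number) + ")"
    return "exited with status " + str(returncode)


def run_crashing_child():
    """Run the crashing child and return its return code."""
    return subprocess.run([sys.executable, "-c", CHILD]).returncode


if __name__ == "__main__":
    print(describe_exit(run_crashing_child()))
```

A launcher that reports this kind of status for each rank makes it much easier to spot the one process that died first.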

-d


On 09/20/2007 08:52 AM, Kevin Van Workum wrote:
> I have a user getting the following error messages, apparently from MPICH2.
> 
> One node gets this message:
> 
> [cli_2]: aborting job:
> Fatal error in MPI_Isend: Other MPI error, error stack:
> MPI_Isend(145).............: MPI_Isend(buf=0x24d9f0, count=939, MPI_DOUBLE_PRECISION, dest=6, tag=887, MPI_COMM_WORLD, request=0x8612e58) failed
> MPIDI_EagerContigIsend(468): failure occurred while attempting to send an eager message
> MPIDU_Sock_writev(625).....: connection closed by peer (set=0,sock=5,errno=32:Broken pipe)
> 
> All the others get this message:
> 
> [cli_0]: aborting job:
> Fatal error in MPI_Test: Other MPI error, error stack:
> MPI_Test(145).............................: MPI_Test(request=0x85e5518, flag=0xbf86f0b4, status=0x85e5520) failed
> MPIDI_CH3I_Progress(144)..................: handle_sock_op failed
> MPIDI_CH3I_Progress_handle_sock_event(175):
> MPIDU_Socki_handle_read(607)..............: connection closed by peer (set=0,sock=4)
> 
> I'm using mpich2-1.0.5p4, ssm, on a system running Torque and OSC's
> mpiexec. If anyone has a clue as to the cause of these errors, please
> let me know.
> 
> Kevin
> 



