[mpich-discuss] Unknown error code in 1.1.1p1

Scott Atchley atchley at myri.com
Mon Nov 30 12:48:39 CST 2009


Hi all,

A customer is using MPICH2-MX 1.1.1p1 (MPICH2 with the ch_mx device).  
They are running NAMD2 and getting the following error message:

> MXMPI:FATAL-ERROR:0:Fatal error in MPI_Test: Unknown error.  Please file a
> bug report., error stack:
> MPI_Test(152): MPI_Test(request=0x3e57650, flag=0x7fffa244b0ac,
> status=0x7fffa244b070) failed
> MPI_Test(136):
> (unknown)(): Unknown error.  Please file a bug report.

<snip similar message from 3 other ranks>

> rank 4 in job 1  rycl02007_43890   caused collective abort of all ranks
>  exit status of rank 4: killed by signal 9

<snip similar message from 3 other ranks>

Looking at $MPICH/src/mpi/pt2pt/test.c, line 136 calls MPIU_ERR_POP()
if mpi_errno is not 0. I would expect it to then fall through to fn_exit,
which returns mpi_errno.

Instead, it seems to go to fn_fail which calls MPIR_Err_create_code()  
and then MPIR_Err_return_comm() which eventually calls MPID_Abort().  
The latter prints the MXMPI message and exits.
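
To make the control flow in question concrete, here is a minimal,
self-contained sketch of the fn_exit/fn_fail idiom. It is not the MPICH2
source: every sketch_* function and SKETCH_* macro is a hypothetical
stand-in, and it assumes the error-pop macro expands to a jump to fn_fail,
which would be consistent with the behavior described above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins; these are NOT the real MPICH2 definitions. */
#define SKETCH_SUCCESS   0
#define SKETCH_ERR_OTHER 15

/* Assumption: on a nonzero code the macro jumps to the fn_fail label. */
#define SKETCH_ERR_POP(err_) do { if (err_) goto fn_fail; } while (0)

/* Counts how often the error handler below has been reached. */
static int handler_calls = 0;

/* Stand-in for the error handler invoked from fn_fail. With a fatal
 * handler installed, this is the point where the job would abort. */
static int sketch_err_return(int err)
{
    assert(err != SKETCH_SUCCESS);
    handler_calls++;
    fprintf(stderr, "error handler reached with code %d\n", err);
    return err;
}

/* Stand-in for the progress call whose result is checked. */
static int sketch_progress(int fail)
{
    return fail ? SKETCH_ERR_OTHER : SKETCH_SUCCESS;
}

int sketch_test(int fail)
{
    int mpi_errno = SKETCH_SUCCESS;

    mpi_errno = sketch_progress(fail);
    if (mpi_errno != SKETCH_SUCCESS) SKETCH_ERR_POP(mpi_errno);

fn_exit:
    return mpi_errno;

fn_fail:
    /* Reached only through the jump in SKETCH_ERR_POP. */
    mpi_errno = sketch_err_return(mpi_errno);
    goto fn_exit;
}
```

Under that assumption, a nonzero code never falls straight through to
fn_exit; it reaches fn_exit only after the handler in fn_fail has run.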

Am I missing something? Has anyone seen a similar failure before?

Scott


--
Scott Atchley
Myricom Inc.
http://www.myri.com



