[mpich-discuss] Unknown error code in 1.1.1p1
Scott Atchley
atchley at myri.com
Mon Nov 30 12:48:39 CST 2009
Hi all,
A customer is using MPICH2-MX 1.1.1p1 (MPICH2 with the ch_mx device).
They are running NAMD2 and getting the following error message:
> MXMPI:FATAL-ERROR:0:Fatal error in MPI_Test: Unknown error. Please file a
> bug report., error stack:
> MPI_Test(152): MPI_Test(request=0x3e57650, flag=0x7fffa244b0ac,
> status=0x7fffa244b070) failed
> MPI_Test(136):
> (unknown)(): Unknown error. Please file a bug report.
<snip similar message from 3 other ranks>
> rank 4 in job 1 rycl02007_43890 caused collective abort of all ranks
> exit status of rank 4: killed by signal 9
<snip similar message from 3 other ranks>
Looking at $MPICH/src/mpi/pt2pt/test.c, line 136 calls MPIU_ERR_POP()
if mpi_errno is not 0. It should then fall through to fn_exit, which
will return mpi_errno.
Instead, it seems to go to fn_fail which calls MPIR_Err_create_code()
and then MPIR_Err_return_comm() which eventually calls MPID_Abort().
The latter prints the MXMPI message and exits.
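
For reference, the control flow in question can be reduced to a small stand-alone sketch of the fn_exit/fn_fail idiom. This is not the MPICH2 source: every name below (SKETCH_ERR_POP, sketch_err_return, sketch_device_test, sketch_test) is a hypothetical stand-in, and the sketch assumes that the error macro jumps to fn_fail rather than falling through, which would be consistent with the behavior observed.

```c
#include <stdio.h>

/* Hypothetical stand-ins; these are NOT the real MPICH2 definitions. */
#define SKETCH_SUCCESS   0
#define SKETCH_ERR_OTHER 15

/* Records whether the failure handler ran, so the path taken is visible. */
static int abort_called = 0;

/* Stand-in for the error handler reached from fn_fail
 * (the role MPIR_Err_return_comm() plays in the post). */
static int sketch_err_return(int err)
{
    abort_called = 1;
    fprintf(stderr, "fatal error %d\n", err);
    return err;
}

/* Assumption: the "pop" macro jumps to fn_fail instead of falling through. */
#define SKETCH_ERR_POP(err) do { goto fn_fail; } while (0)

/* Stand-in for the lower-level call whose result is checked. */
static int sketch_device_test(int fail)
{
    return fail ? SKETCH_ERR_OTHER : SKETCH_SUCCESS;
}

int sketch_test(int fail)
{
    int err = sketch_device_test(fail);

    if (err != SKETCH_SUCCESS)
        SKETCH_ERR_POP(err);    /* on error, control leaves for fn_fail */

fn_exit:
    return err;                 /* success path returns from here */

fn_fail:
    err = sketch_err_return(err);   /* handler runs before returning */
    goto fn_exit;
}
```

Under that assumption, sketch_test(0) returns through fn_exit without touching the handler, while sketch_test(1) reaches fn_fail and runs the handler first. If the real macro behaves the same way, reaching fn_fail from line 136 would be the expected path rather than a surprise.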
Am I missing something? Has anyone seen a similar failure before?
Scott
--
Scott Atchley
Myricom Inc.
http://www.myri.com