[mpich-discuss] Unknown error code in 1.1.1p1

Darius Buntinas buntinas at mcs.anl.gov
Mon Nov 30 13:04:33 CST 2009


Hi Scott,

The code in MPI_Test looks ok.  What's happening is that
MPIR_Request_complete() is passing back an invalid error code.  Is there
a test program you can send us that reproduces this?
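
For reference, the control flow being discussed can be sketched roughly as
below. This is not the MPICH source, only a stand-alone illustration of the
fn_exit/fn_fail idiom. Every sketch_* name and the SKETCH_ERR_POP macro are
hypothetical, and the sketch assumes that the error-pop macro jumps to the
failure label whenever the code is nonzero (worth checking against the macro
definition in your tree). Under that assumption, any nonzero code coming back
from the completion routine, valid or not, ends up in the fn_fail path.

```c
#define SKETCH_SUCCESS 0

/* Hypothetical stand-in for an error-pop macro: on a nonzero code,
 * jump to the failure label of the enclosing function. */
#define SKETCH_ERR_POP(err) \
    do { if ((err) != SKETCH_SUCCESS) goto fn_fail; } while (0)

/* Counts how many times the failure path was taken (illustration only). */
static int sketch_wrap_calls = 0;

/* Hypothetical stand-in for the request-completion routine; it simply
 * hands back whatever code the caller injects. */
static int sketch_request_complete(int code_to_return)
{
    return code_to_return;
}

/* Hypothetical stand-in for the "create error code and report it" step.
 * Here it only tags the code so the caller can see it was wrapped. */
static int sketch_wrap_error(int err)
{
    sketch_wrap_calls++;
    return err | 0x1000;
}

/* Skeleton of a test-style entry point using the two-label idiom. */
int sketch_test(int injected_code)
{
    int err = SKETCH_SUCCESS;

    err = sketch_request_complete(injected_code);
    SKETCH_ERR_POP(err);        /* nonzero code: control jumps to fn_fail */

fn_exit:
    return err;                 /* single exit point for both paths */

fn_fail:
    err = sketch_wrap_error(err);
    goto fn_exit;
}
```

With this sketch, sketch_test(0) returns 0 without touching the failure path,
while sketch_test(7) passes through fn_fail once and returns the wrapped value.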

-d

On 11/30/2009 12:48 PM, Scott Atchley wrote:
> Hi all,
> 
> A customer is using MPICH2-MX 1.1.1p1 (MPICH2 with the ch_mx device).
> They are running NAMD2 and getting the following error message:
> 
>> MXMPI:FATAL-ERROR:0:Fatal error in MPI_Test: Unknown error.  Please file a bug report., error stack:
>> MPI_Test(152): MPI_Test(request=0x3e57650, flag=0x7fffa244b0ac, status=0x7fffa244b070) failed
>> MPI_Test(136):
>> (unknown)(): Unknown error.  Please file a bug report.
> 
> <snip similar message from 3 other ranks>
> 
>> rank 4 in job 1  rycl02007_43890   caused collective abort of all ranks
>>  exit status of rank 4: killed by signal 9
> 
> <snip similar message from 3 other ranks>
> 
> Looking at $MPICH/src/mpi/pt2pt/test.c, line 136 calls MPIU_ERR_POP() if
> mpi_errno is not 0. It should then fall through to fn_exit, which will
> return mpi_errno.
> 
> Instead, it seems to go to fn_fail which calls MPIR_Err_create_code()
> and then MPIR_Err_return_comm() which eventually calls MPID_Abort(). The
> latter prints the MXMPI message and exits.
> 
> Am I missing something? Has anyone seen a similar failure before?
> 
> Scott
> 
> 
> -- 
> Scott Atchley
> Myricom Inc.
> http://www.myri.com

