[mpich-discuss] Unknown error code in 1.1.1p1
Darius Buntinas
buntinas at mcs.anl.gov
Mon Nov 30 13:04:33 CST 2009
Hi Scott,
The code in MPI_Test looks ok. What's happening is that
MPIR_Request_complete() is passing back an invalid error code. Is there
a test program you can send us that reproduces this?
-d
On 11/30/2009 12:48 PM, Scott Atchley wrote:
> Hi all,
>
> A customer is using MPICH2-MX 1.1.1p1 (MPICH2 with the ch_mx device).
> They are running NAMD2 and getting the following error message:
>
>> MXMPI:FATAL-ERROR:0:Fatal error in MPI_Test: Unknown error. Please file a bug report., error stack:
>> MPI_Test(152): MPI_Test(request=0x3e57650, flag=0x7fffa244b0ac, status=0x7fffa244b070) failed
>> MPI_Test(136):
>> (unknown)(): Unknown error. Please file a bug report.
>
> <snip similar message from 3 other ranks>
>
>> rank 4 in job 1 rycl02007_43890 caused collective abort of all ranks
>> exit status of rank 4: killed by signal 9
>
> <snip similar message from 3 other ranks>
>
> Looking at $MPICH/src/mpi/pt2pt/test.c, line 136 calls MPIU_ERR_POP() if
> mpi_errno is not 0. It should then fall through to fn_exit, which will
> return mpi_errno.
>
> Instead, it seems to go to fn_fail, which calls MPIR_Err_create_code()
> and then MPIR_Err_return_comm(), which eventually calls MPID_Abort(). The
> latter prints the MXMPI message and exits.
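
The control flow described above can be sketched in standalone C. This is
only an illustration of the fn_exit/fn_fail goto pattern, not the MPICH2
source: every sketch_* function, the SKETCH_* constants, and the range of
valid codes are made up for the example. It assumes that a nonzero error
code sends control to fn_fail, which builds a message and then jumps back
to fn_exit, and that a code outside the known range can only be reported
as an unknown error.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* All names and values below are hypothetical, chosen for illustration. */
#define SKETCH_SUCCESS      0
#define SKETCH_ERR_LASTCODE 64   /* assumed upper bound on valid codes */

/* Stand-in for an error-text lookup: only codes in the assumed valid
 * range get a real message; anything else is reported as unknown. */
static const char *sketch_error_text(int code)
{
    if (code == SKETCH_SUCCESS)
        return "No error";
    if (code > 0 && code < SKETCH_ERR_LASTCODE)
        return "Recognized error class";
    return "Unknown error. Please file a bug report.";
}

/* Stand-in for the lower-level completion routine; it simply passes
 * back whatever code the caller asks it to simulate. */
static int sketch_request_complete(int simulated_code)
{
    return simulated_code;
}

/* Stand-in for the top-level call, using the fn_exit/fn_fail layout. */
static int sketch_test(int simulated_code, char *msg, size_t msglen)
{
    int err = SKETCH_SUCCESS;

    assert(msg != NULL && msglen > 0);
    msg[0] = '\0';

    err = sketch_request_complete(simulated_code);
    if (err != SKETCH_SUCCESS)
        goto fn_fail;            /* any nonzero code takes the failure path */

fn_exit:
    return err;                  /* single exit point for both paths */

fn_fail:
    /* The failure path wraps the code in a message before returning. */
    snprintf(msg, msglen, "sketch_test failed: %s", sketch_error_text(err));
    goto fn_exit;
}
```

In this sketch, sketch_test(-7, msg, sizeof msg) returns -7 and leaves an
"Unknown error" message in msg, which is roughly what an out-of-range code
handed up from a lower layer would look like to the caller.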
>
> Am I missing something? Has anyone seen a similar failure before?
>
> Scott
>
>
> --
> Scott Atchley
> Myricom Inc.
> http://www.myri.com