[mpich-discuss] Isend Irecv error

Kenneth Leiter kenneth.leiter at gmail.com
Fri Jun 15 14:20:05 CDT 2012


Hi Dave.

Thanks for the response.

The MPI_ERROR values from the MPI_Status array are below, along with the
result of passing each error code to MPI_Error_string():

Task ID | MPI_ERROR | String

233     | 0         | No MPI error
234     | 0         | No MPI error
235     | 0         | No MPI error
236     | 604005647 | Other MPI error, error stack: MPID_nem_tcp_connpoll(1826): Communication error with rank 236: Connection timed out
237     | 18        | Pending request (no error)
238     | 18        | Pending request (no error)
239     | 18        | Pending request (no error)
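
For reference, the strings above come from passing each status's MPI_ERROR
field to MPI_Error_string() after the failed MPI_Waitall, roughly like this
(a minimal sketch, not the exact test code; "statuses" and "nreq" stand in
for whatever MPI_Waitall was given):

/* Print the MPI_ERROR field of every status, translated to a string.
 * Assumes MPI_ERRORS_RETURN is set on the communicator so MPI_Waitall
 * returns and fills in the per-request statuses instead of aborting. */
#include <mpi.h>
#include <stdio.h>

static void print_status_errors(const MPI_Status *statuses, int nreq)
{
    for (int i = 0; i < nreq; ++i) {
        char msg[MPI_MAX_ERROR_STRING];
        int msglen = 0;
        MPI_Error_string(statuses[i].MPI_ERROR, msg, &msglen);
        printf("request %d: MPI_ERROR = %d (%s)\n",
               i, statuses[i].MPI_ERROR, msg);
    }
}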

I did a bit of googling and couldn't come up with anything off the bat
for this error.  Do you have any suggestions for how to get around the
timeout?
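
In case it helps, the communication pattern in the test is roughly the
following (a sketch, not the actual mpichTest.cxx I attached earlier; the
buffer size, datatype, and tag here are just placeholders):

/* Every rank posts an Irecv and an Isend to every other rank, then
 * waits on all of the requests.  COUNT and TAG are arbitrary values
 * chosen for illustration. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT 1024
#define TAG   0

int main(int argc, char **argv)
{
    int rank, size, nreq = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Return errors instead of aborting so the MPI_ERROR fields in
     * the statuses can be examined after a failure. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    double *sendbuf = malloc(COUNT * sizeof(double));
    double *recvbuf = malloc((size_t)size * COUNT * sizeof(double));
    MPI_Request *reqs = malloc(2 * (size_t)size * sizeof(MPI_Request));
    MPI_Status *stats = malloc(2 * (size_t)size * sizeof(MPI_Status));

    /* Contents of the send buffer don't matter for the test. */
    for (int i = 0; i < COUNT; ++i)
        sendbuf[i] = rank;

    for (int peer = 0; peer < size; ++peer) {
        if (peer == rank)
            continue;
        MPI_Irecv(recvbuf + (size_t)peer * COUNT, COUNT, MPI_DOUBLE,
                  peer, TAG, MPI_COMM_WORLD, &reqs[nreq++]);
        MPI_Isend(sendbuf, COUNT, MPI_DOUBLE,
                  peer, TAG, MPI_COMM_WORLD, &reqs[nreq++]);
    }

    if (MPI_Waitall(nreq, reqs, stats) != MPI_SUCCESS) {
        /* Walk stats[i].MPI_ERROR here, e.g. with the
         * MPI_Error_string() loop from the snippet above. */
        fprintf(stderr, "rank %d: MPI_Waitall reported an error\n", rank);
    }

    free(sendbuf);
    free(recvbuf);
    free(reqs);
    free(stats);
    MPI_Finalize();
    return 0;
}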

Thanks again,
Ken Leiter

On Fri, Jun 15, 2012 at 2:52 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> You are probably either hitting a resource limit (likely) or one of the nodes in your system is not configured correctly and is causing larger jobs to fail (less likely).
>
> Have you tried *not* setting MPI_ERRORS_RETURN (i.e., leaving MPI_ERRORS_ARE_FATAL as the default) to get a clearer error message?
>
> Alternatively, have you tried using MPI_ERROR_STRING() to find out what message is associated with the specific error code present in the status?
>
> http://www.mpi-forum.org/docs/mpi22-report/node193.htm#Node193
>
> -Dave
>
> On Jun 15, 2012, at 9:35 AM CDT, Kenneth Leiter wrote:
>
>> Hello,
>>
>> I am stumped by a problem I am having with my code failing when I use
>> a large number of processors.  I have produced a standalone code to
>> demonstrate the error.  I don't see the error with other MPI
>> implementations that are available to me (Intel MPI and Open MPI).  I
>> am using mpich-1.4.1p1.
>>
>> The test code sends and receives a buffer from all other tasks.  I
>> realize that I should write this as a collective operation (like
>> Bcast), but in my real code I only communicate to a few neighbor tasks
>> and must use point-to-point operations.  This test code demonstrates
>> the same problem I see in my real code.
>>
>> On my machine, everything works fine up to 128 processors (I have 24
>> cores per node on the machine), but fails at 256 processors.  Using
>> other mpi implementations I can get to 1500 processors with no
>> problem.  I have seen the same behavior on two different machines.
>>
>> I get an error in MPI_Waitall:
>>
>> Fatal error in PMPI_Waitall: See the MPI_ERROR field in MPI_Status for
>> the error code
>>
>> When I examine the MPI_Status I get:
>>
>> Task ID | Error code
>>
>> 230     | 0
>> 231     | 0
>> 232     | 0
>> 233     | 0
>> 234     | 0
>> 235     | 0
>> 236     | 604005647
>> 237     | 18
>> 238     | 18
>> 239     | 18
>> 240     | 18
>> 241     | 18
>> 242     | 18
>> 243     | 18
>>
>> I have attached the test code to this message.
>>
>> Thanks,
>> Ken Leiter
>> <mpichTest.cxx>

