[mpich-discuss] Isend Irecv error

Darius Buntinas buntinas at mcs.anl.gov
Fri Jun 15 14:26:51 CDT 2012


It looks like your program has every process send to every other process.  A node may be getting overloaded with too many simultaneous connection requests.  Currently MPICH doesn't retry after a timeout, but this is something we have on a branch.  Just to test this hypothesis, try putting a sleep (or usleep) in before the sends are posted to slow things down, so that the processes aren't all hitting the same node at the same time.
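
Roughly what I mean, as a sketch only: the helper names and the per-rank step below are made up for illustration and are not MPICH settings. The idea is to delay each process by an amount proportional to its rank (the value from MPI_Comm_rank) right before the loop that posts the MPI_Isend/MPI_Irecv calls, so the connection requests arrive spread out in time.

```cpp
#include <chrono>
#include <thread>

// Delay for a given rank: rank 0 starts immediately, rank r waits r * step.
// `step` is a hypothetical tuning knob chosen by the caller.
std::chrono::microseconds stagger_delay(int rank, std::chrono::microseconds step) {
    return step * rank;
}

// Intended to be called once, just before the loop that posts the
// MPI_Isend/MPI_Irecv calls, with the rank obtained from MPI_Comm_rank.
void stagger_start(int rank, std::chrono::microseconds step) {
    std::this_thread::sleep_for(stagger_delay(rank, step));
}
```

For example, stagger_start(rank, std::chrono::microseconds(1000)) on 256 processes makes the last rank wait about 255 ms.  If the failure goes away with the delay in place, that supports the overload hypothesis; it is a diagnostic, not a fix.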

I'll try to get a patch for you to try.

-d

On Jun 15, 2012, at 9:35 AM, Kenneth Leiter wrote:

> Hello,
> 
> I am stumped by a problem I am having with my code failing when I use
> a large number of processors.  I have produced a standalone code to
> demonstrate the error.  I don't see the error with other MPI
> implementations that are available to me (Intel MPI and Open MPI).  I
> am using mpich-1.4.1p1.
> 
> The test code sends and receives a buffer from all other tasks.  I
> realize that I should write this as a collective operation (like
> Bcast), but in my real code I only communicate to a few neighbor tasks
> and must use point-to-point operations.  This test code demonstrates
> the same problem I see in my real code.
> 
> On my machine, everything works fine up to 128 processors (I have 24
> cores per node on the machine), but fails at 256 processors.  Using
> other MPI implementations I can get to 1500 processors with no
> problem.  I have seen the same behavior on two different machines.
> 
> I get an error in MPI_Waitall:
> 
> Fatal error in PMPI_Waitall: See the MPI_ERROR field in MPI_Status for
> the error code
> 
> When I examine the MPI_Status I get:
> 
> Task ID | Error code
> --------+-----------
>     230 | 0
>     231 | 0
>     232 | 0
>     233 | 0
>     234 | 0
>     235 | 0
>     236 | 604005647
>     237 | 18
>     238 | 18
>     239 | 18
>     240 | 18
>     241 | 18
>     242 | 18
>     243 | 18
> 
> I have attached the test code to this message.
> 
> Thanks,
> Ken Leiter
> <mpichTest.cxx>


