[mpich-discuss] Isend Irecv error
Dave Goodell
goodell at mcs.anl.gov
Fri Jun 15 14:37:45 CDT 2012
You could also try staggering the starting indices of the for-loops. Instead of every process starting at rank 0, have each process send first to (rank+1)%wsize, then (rank+2)%wsize, and so on, finishing with (rank+wsize-1)%wsize. This should spread the connections around a little bit.
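For example, something along these lines (a minimal sketch of the staggered ordering only; buffer names and the message size are placeholders, not taken from your attached test code):

  /* Each rank starts its send/recv loop at its own rank + 1 instead
   * of at 0, so connection requests are spread across the nodes
   * rather than all hitting rank 0 first. */
  #include <mpi.h>
  #include <vector>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank, wsize;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &wsize);

      const int count = 1024;  /* arbitrary message size */
      std::vector<double> sendbuf(count, (double)rank);
      std::vector<std::vector<double> > recvbuf(wsize, std::vector<double>(count));
      std::vector<MPI_Request> reqs;

      for (int i = 1; i < wsize; ++i) {
          int peer = (rank + i) % wsize;   /* rank+1, rank+2, ..., rank+wsize-1 */
          MPI_Request r;
          MPI_Irecv(&recvbuf[peer][0], count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &r);
          reqs.push_back(r);
          MPI_Isend(&sendbuf[0], count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &r);
          reqs.push_back(r);
      }
      if (!reqs.empty())
          MPI_Waitall((int)reqs.size(), &reqs[0], MPI_STATUSES_IGNORE);

      MPI_Finalize();
      return 0;
  }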
-Dave
On Jun 15, 2012, at 2:32 PM CDT, Kenneth Leiter wrote:
> Hi Darius,
>
> I agree that this is very likely the issue. In my real code, I put in
> sleep statements, with some success, so that a single node was not
> bombarded with many requests at the same time, and I was able to run
> with a larger number of processors than I could previously.
>
> I'll definitely be interested in trying out a patch.
>
> Thanks,
> Ken Leiter
>
> On Fri, Jun 15, 2012 at 3:26 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>> It looks like your program is having every process send to every other process. There may be an issue with a node being overloaded with too many simultaneous connection requests. Currently MPICH doesn't retry after a timeout, but this is something we have on a branch. Just to test this hypothesis, try putting in a sleep (or usleep) in the right place to slow things down, so that not all of the nodes are hitting the same node at the same time.
>>
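A rough sketch of the kind of throttle Darius means, with an arbitrary rank-proportional delay (the rank * 1000 microseconds is just a guess for testing, not a recommended value, and the loop body stands in for the existing send/recv code):

  /* Hypothesis test only: delay each rank in proportion to its rank
   * before it starts posting its sends, so the connection requests
   * arriving at any one node are spread out in time. */
  #include <unistd.h>   /* for usleep() */

  usleep((useconds_t)rank * 1000);   /* arbitrary stagger before the send loop */
  for (int i = 0; i < wsize; ++i) {
      if (i == rank)
          continue;
      /* ... post MPI_Isend/MPI_Irecv to rank i as before ... */
  }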
>> I'll try to get a patch for you to try.
>>
>> -d
>>
>> On Jun 15, 2012, at 9:35 AM, Kenneth Leiter wrote:
>>
>>> Hello,
>>>
>>> I am stumped by a problem in which my code fails when I use a large
>>> number of processors. I have produced a standalone code that
>>> demonstrates the error. I don't see the error with the other MPI
>>> implementations available to me (Intel MPI and Open MPI). I am
>>> using mpich-1.4.1p1.
>>>
>>> The test code sends and receives a buffer from all other tasks. I
>>> realize that I should write this as a collective operation (like
>>> Bcast), but in my real code I only communicate to a few neighbor tasks
>>> and must use point-to-point operations. This test code demonstrates
>>> the same problem I see in my real code.
>>>
>>> On my machine, everything works fine up to 128 processors (the
>>> machine has 24 cores per node), but fails at 256 processors. Using
>>> other MPI implementations I can get to 1500 processors with no
>>> problem. I have seen the same behavior on two different machines.
>>>
>>> I get an error in MPI_Waitall:
>>>
>>> Fatal error in PMPI_Waitall: See the MPI_ERROR field in MPI_Status for
>>> the error code
>>>
>>> When I examine the MPI_ERROR field of each MPI_Status, I get:
>>>
>>> Task ID | Error code
>>>
>>> 230 0
>>> 231 0
>>> 232 0
>>> 233 0
>>> 234 0
>>> 235 0
>>> 236 604005647
>>> 237 18
>>> 238 18
>>> 239 18
>>> 240 18
>>> 241 18
>>> 242 18
>>> 243 18
>>>
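For reference, a sketch of how per-request error codes like the table above can be pulled out of the statuses (placeholder names such as reqs, not the attached mpichTest.cxx; it assumes the error handler on MPI_COMM_WORLD has been set to MPI_ERRORS_RETURN so that MPI_Waitall returns instead of aborting):

  /* With MPI_ERRORS_RETURN set, MPI_Waitall reports per-request
   * failures by returning MPI_ERR_IN_STATUS and filling in the
   * MPI_ERROR field of each status. */
  MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

  std::vector<MPI_Status> stats(reqs.size());
  int rc = MPI_Waitall((int)reqs.size(), &reqs[0], &stats[0]);
  if (rc == MPI_ERR_IN_STATUS) {
      for (size_t i = 0; i < stats.size(); ++i) {
          if (stats[i].MPI_ERROR != MPI_SUCCESS) {
              char msg[MPI_MAX_ERROR_STRING];
              int len = 0;
              MPI_Error_string(stats[i].MPI_ERROR, msg, &len);
              printf("request %d: error %d (%s)\n",
                     (int)i, stats[i].MPI_ERROR, msg);
          }
      }
  }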
>>> I have attached the test code to this message.
>>>
>>> Thanks,
>>> Ken Leiter
>>> <mpichTest.cxx>