[mpich-discuss] Isend Irecv error

Kenneth Leiter kenneth.leiter at gmail.com
Fri Jun 15 16:19:11 CDT 2012


Increasing the MPIDI_NEM_TCP_MAX_CONNECT_RETRIES to 10000 made no
difference - I get the same timeout errors.
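For reference, that change is just a compile-time bump of the retry limit
followed by a rebuild of MPICH (assuming the constant is the #define in the
nemesis TCP netmod sources), roughly:

    /* Raise the nemesis TCP connect retry limit; this is a compile-time
       constant, so MPICH has to be rebuilt afterwards. */
    #define MPIDI_NEM_TCP_MAX_CONNECT_RETRIES 10000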

Reordering the for loops gets me past 256 processors, but I fail on 512.

- Ken

On Fri, Jun 15, 2012 at 3:37 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> You could also try staggering the starting indices for the for-loops.  Instead of every process starting with rank 0, send to (rank+1)%wsize, then (rank+2)%wsize, finishing with (rank+wsize-1)%wsize.  This should spread the connections around a little bit.
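
A minimal sketch of that staggered ordering, with placeholder buffer, count,
and request names (not the attached test code verbatim):

    /* Stagger the send targets: rank r starts at r+1 and wraps around, so
       every process does not open its first connection to rank 0. */
    for (int i = 1; i < wsize; ++i) {
        int peer = (rank + i) % wsize;   /* never equals rank */
        MPI_Isend(&sendBuf[0], count, MPI_DOUBLE, peer, 0,
                  MPI_COMM_WORLD, &requests[i - 1]);
    }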
>
> -Dave
>
> On Jun 15, 2012, at 2:32 PM CDT, Kenneth Leiter wrote:
>
>> Hi Darius,
>>
>> I agree that this is very likely the issue.  In my real code, I put in
>> sleep statements, with some success, so that a single node was not
>> bombarded with many requests at the same time, and I was able to run
>> with a larger number of processors than I could previously.
>>
>> I'll definitely be interested in trying out a patch.
>>
>> Thanks,
>> Ken Leiter
>>
>> On Fri, Jun 15, 2012 at 3:26 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>>> It looks like your program is having every process send to every other process.  There may be an issue with a node being overloaded with too many simultaneous connection requests.  Currently MPICH doesn't retry after a timeout, but this is something we have on a branch.  Just to test this hypothesis, try putting in a sleep (or usleep) in the right place to slow things down, so that not all of the nodes are hitting the same node at the same time.
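
A crude way to test that hypothesis, assuming all the sends are posted in one
loop (the delay value is arbitrary):

    #include <unistd.h>   /* usleep */

    /* Stagger when each rank starts posting its sends so a single node is
       not hit by connection requests from every peer at once. */
    usleep(1000u * (unsigned)rank);   /* arbitrary per-rank spacing */
    /* ... then post the Isend/Irecv loop as before ... */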
>>>
>>> I'll try to get a patch for you to try.
>>>
>>> -d
>>>
>>> On Jun 15, 2012, at 9:35 AM, Kenneth Leiter wrote:
>>>
>>>> Hello,
>>>>
>>>> I am stumped by a problem in which my code fails when I use a large
>>>> number of processors.  I have produced a standalone code to
>>>> demonstrate the error.  I don't see the error with other MPI
>>>> implementations that are available to me (Intel MPI and Open MPI).
>>>> I am using mpich-1.4.1p1.
>>>>
>>>> The test code sends and receives a buffer from all other tasks.  I
>>>> realize that I should write this as a collective operation (like
>>>> Bcast), but in my real code I only communicate with a few neighbor tasks
>>>> and must use point-to-point operations.  This test code demonstrates
>>>> the same problem I see in my real code.
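
Condensed, the pattern in the test code is roughly the following sketch
(buffer sizes and names here are illustrative, not the attached
mpichTest.cxx verbatim):

    #include <mpi.h>
    #include <vector>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, wsize;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &wsize);

        const int count = 1024;
        std::vector<double> sendBuf(count, rank);
        std::vector<std::vector<double> > recvBufs(
            wsize, std::vector<double>(count));
        std::vector<MPI_Request> requests;

        /* Post a nonblocking receive from and send to every other rank. */
        for (int peer = 0; peer < wsize; ++peer) {
            if (peer == rank) continue;
            MPI_Request r;
            MPI_Irecv(&recvBufs[peer][0], count, MPI_DOUBLE, peer, 0,
                      MPI_COMM_WORLD, &r);
            requests.push_back(r);
            MPI_Isend(&sendBuf[0], count, MPI_DOUBLE, peer, 0,
                      MPI_COMM_WORLD, &r);
            requests.push_back(r);
        }

        std::vector<MPI_Status> statuses(requests.size());
        MPI_Waitall((int)requests.size(), &requests[0], &statuses[0]);

        MPI_Finalize();
        return 0;
    }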
>>>>
>>>> On my machine, everything works fine up to 128 processors (I have 24
>>>> cores per node on the machine), but fails at 256 processors.  Using
>>>> other MPI implementations I can get to 1500 processors with no
>>>> problem.  I have seen the same behavior on two different machines.
>>>>
>>>> I get an error in MPI_Waitall:
>>>>
>>>> Fatal error in PMPI_Waitall: See the MPI_ERROR field in MPI_Status for
>>>> the error code
>>>>
>>>> When I examine the MPI_Status I get:
>>>>
>>>> Task ID | Error code
>>>>
>>>> 230 0
>>>> 231 0
>>>> 232 0
>>>> 233 0
>>>> 234 0
>>>> 235 0
>>>> 236 604005647
>>>> 237 18
>>>> 238 18
>>>> 239 18
>>>> 240 18
>>>> 241 18
>>>> 242 18
>>>> 243 18
>>>>
>>>> I have attached the test code to this message.
>>>>
>>>> Thanks,
>>>> Ken Leiter
>>>> <mpichTest.cxx>