[mpich-discuss] Assertion failure from too many MPI_Gets between fences
Jeremiah Willcock
jewillco at osl.iu.edu
Fri Jan 7 17:26:30 CST 2011
On Fri, 7 Jan 2011, Dave Goodell wrote:
> On Jan 7, 2011, at 4:03 PM CST, Jeremiah Willcock wrote:
>
>> On Fri, 7 Jan 2011, Dave Goodell wrote:
>>
>>> On Jan 7, 2011, at 3:44 PM CST, Jeremiah Willcock wrote:
>>>
>>>> On Fri, 7 Jan 2011, Dave Goodell wrote:
>>>>
>>>>> On Jan 7, 2011, at 12:56 PM CST, Jeremiah Willcock wrote:
>>> Not off the top of my head. It seems like you are using a small-ish test program to check this. Can you send that to me so that I can reproduce this for myself and play with the bug?
>>
>> I don't really have a small program for this, just part of a larger one
>> that I was putting print statements in and hacking on. The relevant
>> part of the code just creates a window for an array (with different
>> data on different ranks) then does MPI_Gets on random parts of it to
>> implement a large gather operation. A code such as row-distributed
>> sparse matrix-vector multiplication would have this kind of access
>> pattern as well.
>
> I won't be able to dig into this today, so I've created a ticket to
> track this: https://trac.mcs.anl.gov/projects/mpich2/ticket/1156
I added myself to the cc list for this ticket.
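In case it helps with the ticket, here is a stripped-down sketch of the
access pattern described above (not the actual program; the array length,
the number of gets, and the random targets are just placeholders):

/* Sketch only: one window per rank over a local array, then a large
 * number of MPI_Gets at random offsets into other ranks' arrays, all
 * completed by a single closing fence. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int local_len = 1 << 20;     /* placeholder array length */
  const long num_gets = 1L << 22;    /* placeholder: a large number of gets */
  double *local = malloc(local_len * sizeof(double));
  double *results = malloc(num_gets * sizeof(double));
  for (int i = 0; i < local_len; ++i) local[i] = rank;  /* dummy data */

  MPI_Win win;
  MPI_Win_create(local, local_len * sizeof(double), sizeof(double),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  MPI_Win_fence(0, win);
  for (long i = 0; i < num_gets; ++i) {
    int target_rank = rand() % size;
    int target_disp = rand() % local_len;
    MPI_Get(&results[i], 1, MPI_DOUBLE,
            target_rank, target_disp, 1, MPI_DOUBLE, win);
  }
  MPI_Win_fence(0, win);   /* results are only valid after this fence */

  MPI_Win_free(&win);
  free(results);
  free(local);
  MPI_Finalize();
  return 0;
}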
>>>> Inserting fences periodically in my code fixes the problem, but the
>>>> fence frequency needed is proportional to the number of ranks in the
>>>> job. I think the MPI implementation should automatically do whatever
>>>> flow control it needs to avoid running out of memory, no matter how
>>>> many requests the application feeds in.
>>>
>>> I agree, the MPI implementation should take care of that, within
>>> reason. But forcing an MPI_Get to sometimes block will require a
>>> careful reading of the MPI standard to ensure that's valid behavior.
>>> I think I can construct scenarios where an MPI_Get that blocks would
>>> cause a deadlock...
>>
>> I believe (like everything else in MPI) they can be nonblocking or
>> blocking, at the implementation's choice, but I don't know for sure.
>> Remember that you can't assume that a get has completed until the next
>> fence, so you can't depend on seeing the answer (which I think was what
>> you were worried about causing deadlocks). I treat the MPI_Gets as
>> nonblocking in my application; that's why I start a large number of
>> them then use a fence to complete them all before using the results.
>
> MPI-2.2, page 339, line 13-14: "These operations are nonblocking: the
> call initiates the transfer, but the transfer may continue after the
> call returns."
>
> This language is weaker than I would like, because the clarifying
> statements after the colon don't say that the call cannot block,
> implicitly watering down the natural MPI meaning of "nonblocking".
> But I think that the intent is clear: the call should not block the
> user waiting on the action of another process.
> After further thought I can't come up with any realistic example where a
> blocking-for-flow-control MPI_Get causes a deadlock, but I think the
> behavior is still intended to be disallowed by the standard.
I think that the progress clarification at the top of page 371 of MPI 2.2
(end of section 11.7.2) would cover the case in which some one-sided
operations blocked for flow control. Or could there be deadlocks even
with MPI progress semantics?
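For completeness, the workaround I mentioned above looks roughly like this
(again just a sketch; the function name and its parameters are made up, and
the real code picks the batch size based on the number of ranks, which is
the part I would like to avoid):

#include <mpi.h>

/* Workaround sketch: complete the outstanding gets with an extra fence
 * every 'batch' operations so the implementation never accumulates too
 * many pending requests.  MPI_Win_fence is collective over the window's
 * communicator, so this assumes num_gets and batch are the same on every
 * rank; otherwise the ranks make different numbers of fence calls. */
static void batched_gather(double *results, const int *target_ranks,
                           const MPI_Aint *target_disps, long num_gets,
                           long batch, MPI_Win win)
{
  MPI_Win_fence(0, win);                 /* open the access epoch */
  for (long i = 0; i < num_gets; ++i) {
    MPI_Get(&results[i], 1, MPI_DOUBLE,
            target_ranks[i], target_disps[i], 1, MPI_DOUBLE, win);
    if ((i + 1) % batch == 0)
      MPI_Win_fence(0, win);             /* flush the gets issued so far */
  }
  MPI_Win_fence(0, win);                 /* complete any remaining gets */
}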
-- Jeremiah Willcock