[mpich-discuss] Assertion failure from too many MPI_Gets between fences

Dave Goodell goodell at mcs.anl.gov
Fri Jan 7 15:56:28 CST 2011


On Jan 7, 2011, at 3:44 PM CST, Jeremiah Willcock wrote:

> On Fri, 7 Jan 2011, Dave Goodell wrote:
> 
>> On Jan 7, 2011, at 12:56 PM CST, Jeremiah Willcock wrote:
>>> on some or all ranks.  I am using the SVN head version currently, but the same error (and same line number) occurred with 1.3.1.  I am running two processes on one machine using "mpiexec -n 2 app"; the platform is x86-64 Linux (RHEL 5.5, gcc 4.1.2).  The number of MPI_Get operations required to trigger the error seems to be about 260k; fewer appear to work fine, but the exact number varies.  The kind of code I am using is:
>> 
>> 260k is a large number of requests if one request is being allocated for each Get.  Requests are unfortunately large, somewhere on the order of 1 KiB each, so 260k requests is in the neighborhood of 260 MiB of memory, possibly double that if I'm lowballing the request size.  The handle allocator has a theoretical capacity of at least 2^26 handles (~67 million), so I don't think we're hitting an intrinsic addressing limit.
>> 
>> Is your application memory-constrained?
> 
> Not really, at least at the sizes I'm testing so far.  Is there a good way to test how much memory it is actually using, or how much memory MPICH is using?

Not off the top of my head.  It seems like you are using a small-ish test program to check this.  Can you send that to me so that I can reproduce this for myself and play with the bug?
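
In the meantime, here's the sort of thing I'm picturing -- this is just my own sketch of a reproducer based on your description (the window setup, buffer names, and the 300k count are all guesses on my part), not your actual test program:

    /* sketch of a reproducer: many MPI_Gets between two fences */
    #include <mpi.h>
    #include <stdlib.h>

    #define NUM_GETS 300000  /* guessed count, roughly where the failure was reported */

    int main(int argc, char **argv)
    {
        int rank, nprocs, i;
        double *winbuf, *result;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        winbuf = malloc(sizeof(double));
        result = malloc(NUM_GETS * sizeof(double));
        *winbuf = (double) rank;

        MPI_Win_create(winbuf, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        /* all of the gets are issued in a single fence epoch, so one
         * request per get can pile up inside the implementation */
        for (i = 0; i < NUM_GETS; i++) {
            MPI_Get(&result[i], 1, MPI_DOUBLE, (rank + 1) % nprocs,
                    0, 1, MPI_DOUBLE, win);
        }
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        free(winbuf);
        free(result);
        MPI_Finalize();
        return 0;
    }

If your code differs from this in some important way, that would be useful to know.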

> Inserting fences periodically in my code fixes the problem, but the fence frequency needed is proportional to the number of ranks in the job.  I think the MPI implementation should automatically do whatever flow control it needs to avoid running out of memory, no matter how many requests the application feeds in.

I agree that the MPI implementation should take care of that, within reason.  But allowing an MPI_Get to block for flow control will require a careful reading of the MPI standard to ensure that's valid behavior.  I think I can construct scenarios where an MPI_Get that blocks would cause a deadlock...
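
(For the record, I assume the periodic-fence workaround you describe looks roughly like the following, continuing from the sketch above; FENCE_INTERVAL is a made-up placeholder for whatever interval you found you needed:)

    /* workaround sketch: fence every FENCE_INTERVAL gets so requests
     * are completed and freed before too many accumulate */
    #define FENCE_INTERVAL 10000  /* placeholder value */

    MPI_Win_fence(0, win);
    for (i = 0; i < NUM_GETS; i++) {
        MPI_Get(&result[i], 1, MPI_DOUBLE, (rank + 1) % nprocs,
                0, 1, MPI_DOUBLE, win);
        /* a fence both completes the outstanding gets and opens the
         * next epoch, so this is legal in the middle of the loop */
        if ((i + 1) % FENCE_INTERVAL == 0)
            MPI_Win_fence(0, win);
    }
    MPI_Win_fence(0, win);

That keeps the number of outstanding requests bounded, but as you say it pushes a flow-control problem onto the application.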

Without knowing exactly why it's failing, other than some sort of request allocation problem, it's hard to say what the right fix is.

-Dave


