[mpich-discuss] apparent hydra problem

Martin Pokorny mpokorny at nrao.edu
Fri Mar 2 14:26:59 CST 2012


Dave Goodell wrote:
> On Mar 2, 2012, at 11:17 AM CST, Martin Pokorny wrote:
> 
>> Following up a thread that's nearly one year old...
>> 
>> Martin Pokorny wrote:
>>> This application does not use mpiexec for process management,
>>> which is handled by custom code; however, mpiexec is used to
>>> create MPI groups dynamically via MPI_Comm_connect/accept and
>>> MPI_Intercomm_merge. On the data collection nodes the code is
>>> multi-threaded, and each MPI communicator is referenced by only
>>> one thread in every process. What I have seen is that, after
>>> MPI_Comm_connect/accept has completed, the call to
>>> MPI_Intercomm_merge occasionally fails to return. If I add an
>>> MPI_Barrier on the inter-communicator before the call to
>>> MPI_Intercomm_merge, then occasionally the barrier call fails to
>>> return instead. There are no crashes or other observable
>>> failures; the calling threads simply hang.
> 
> This may or may not be related to your problem.  Our implementation
> of MPI_Intercomm_merge has a real problem with the way it uses
> context IDs.  If you look at line 103, we are arbitrarily making up
> a context ID based on the existing context ID for one half of the
> intercomm:
> 
> http://trac.mcs.anl.gov/projects/mpich2/browser/mpich2/trunk/src/mpi/comm/intercomm_merge.c#L103
> 
> 
> This causes the temporary context ID to collide with a context ID
> used by an internal subcommunicator on one half of the intercomm,
> and potentially with an arbitrary communicator on the other half.
> So it's possible to get some "cross talk" between two otherwise
> unrelated communicators.

That could well apply in my case: the processes involved can be 
long-running, and threads with distinct communicators are used to 
write multiple files concurrently with MPI-IO. Is there some way I 
could modify the MPIR_Intercomm_merge_impl code to test for a context 
ID collision and report that condition?
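
To make the question concrete, here is the kind of check I have in 
mind, written as a standalone toy rather than a real patch, since I 
don't know the MPICH2 internals. The table of live context IDs below 
stands in for whatever mapping MPICH2 actually keeps from context IDs 
to communicators, and names like tmp_context_id_collides are mine, 
not MPICH2's:

#include <stdio.h>

#define MAX_LIVE_COMMS 64

/* Toy stand-in for this process's set of live context IDs; a real
   check would consult MPICH2's own context ID bookkeeping instead. */
static int live_context_ids[MAX_LIVE_COMMS] = { 0, 4 };
static int num_live = 2;

/* Return 1 if the provisional context ID chosen for the merge
   already belongs to a live communicator on this process. */
static int tmp_context_id_collides(int tmp_context_id)
{
    int i;
    for (i = 0; i < num_live; i++)
        if (live_context_ids[i] == tmp_context_id)
            return 1;
    return 0;
}

int main(void)
{
    int tmp = 4; /* e.g., derived from one half of the intercomm */
    if (tmp_context_id_collides(tmp))
        fprintf(stderr, "intercomm_merge: provisional context ID %d "
                "collides with a live communicator\n", tmp);
    return 0;
}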

> Actually causing a failure from such a bug is very hard to do,
> though.  Given that your code behaves differently between MPD and
> Hydra, something else in the process manager could be what's
> causing the problem.
> 
> I've never fixed this bug because the fix is fairly labor-intensive
> and I could never come up with a good test that would trigger the
> bug and prove that we didn't regress.  Furthermore, this
> functionality is rarely used.  I'll spend a little more time
> thinking about how to test and fix it, but I can't promise anything
> in the near future.

If there's anything I can do to help, let me know.
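
For instance, here is a stripped-down, single-threaded sketch of the 
connect/accept/merge pattern my application uses, in case it is 
useful as a starting point for a test. The threading and MPI-IO parts 
are omitted, and the port name is passed on the command line rather 
than through our custom code, so it may well not reproduce the hang 
as-is:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int provided, is_server;
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter, merged;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (argc < 2) {
        fprintf(stderr, "usage: %s server | <port-name>\n", argv[0]);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    is_server = (strcmp(argv[1], "server") == 0);

    if (is_server) {
        /* Run as: mpiexec -n 1 ./merge_test server */
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("port: %s\n", port); /* hand this string to the client */
        fflush(stdout);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    } else {
        /* Run as: mpiexec -n 1 ./merge_test '<port-name>' */
        MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                         &inter);
    }

    /* These two calls are where my threads occasionally hang. */
    MPI_Barrier(inter);
    MPI_Intercomm_merge(inter, is_server ? 0 : 1, &merged);

    MPI_Comm_free(&merged);
    MPI_Comm_free(&inter);
    if (is_server)
        MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}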

-- 
Martin Pokorny
Software Engineer - Expanded Very Large Array
National Radio Astronomy Observatory - New Mexico Operations

