[mpich-discuss] apparent hydra problem

Dave Goodell goodell at mcs.anl.gov
Fri Mar 2 12:43:06 CST 2012


On Mar 2, 2012, at 11:17 AM CST, Martin Pokorny wrote:

> Following up a thread that's nearly one year old...
> 
> Martin Pokorny wrote:
>> 
> This application does not use mpiexec for process management, which is
> all handled by custom code; mpiexec is used only to enable dynamic
> creation of MPI groups using MPI_Comm_connect/accept and
> MPI_Intercomm_merge. On the data-collection nodes the code is
>> multi-threaded, and each MPI communicator is referenced by only one
>> thread in every process. What I have seen is that (after
>> MPI_Comm_connect/accept has completed) the call to MPI_Intercomm_merge
>> occasionally fails to return. If I put an MPI_Barrier in place on the
>> inter-communicator before the call to MPI_Intercomm_merge, then
>> occasionally that call fails to return. There are no crashes or other
>> observable failures; the calling threads simply hang.

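For reference, the pattern described above reduces to something like the
following client-side sketch (the port-name exchange, error handling, and
all names here are illustrative assumptions, not the actual code):

    #include <mpi.h>

    /* Client side: connect to a previously published port, then merge the
     * resulting intercommunicator into a single intracommunicator.  The
     * port name is assumed to have been obtained out of band. */
    void join_group(const char *port_name, MPI_Comm *merged)
    {
        MPI_Comm intercomm;

        MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                         &intercomm);

        /* Optional barrier on the intercommunicator, as in the test
         * described above. */
        MPI_Barrier(intercomm);

        /* high = 1: order this side's ranks after the accepting side's. */
        MPI_Intercomm_merge(intercomm, 1, merged);
        MPI_Comm_free(&intercomm);
    }
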
This may or may not be related to your problem.  Our implementation of MPI_Intercomm_merge has a real bug in the way that it uses context IDs.  If you look at line 103, we just arbitrarily make up a context ID based on the existing context ID for one half of the intercomm:

http://trac.mcs.anl.gov/projects/mpich2/browser/mpich2/trunk/src/mpi/comm/intercomm_merge.c#L103

This causes the temporary context ID to collide with a context ID used by an internal subcommunicator on one half of the intercomm, and potentially with an unrelated communicator on the other half.  So it's possible to get some "cross talk" between two otherwise unrelated communicators.
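
To make the failure mode concrete, here is a deliberately simplified,
hypothetical sketch of why deriving a context ID arithmetically is unsafe
(this is not the actual MPICH code):

    /* Hypothetical sketch -- not the actual MPICH implementation.
     * Context IDs are what the matching engine uses to route messages to
     * the right communicator, so they must be unique within a process.
     * Deriving a temporary ID by a fixed offset from an existing one
     * gives no such guarantee: */
    int make_temp_context_id(int intercomm_context_id)
    {
        /* If this value is already in use by an internal subcommunicator
         * (or by any communicator on the remote side), messages on the
         * temporary communicator can match receives posted on an
         * unrelated one -- the "cross talk" described above. */
        return intercomm_context_id + 1;  /* arbitrary, collision-prone */
    }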

Actually triggering a failure from this bug is very hard to do, though.  Given that your code behaves differently under MPD and Hydra, something else in the process manager could be causing the problem.

I've never fixed this bug because the fix is fairly labor-intensive and I could never come up with a good test that would stimulate the bug and prove that we didn't regress.  Furthermore, this functionality is rarely used.  I'll spend a little more time thinking about how to test and fix it, but I can't promise anything in the near future.
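
As a rough starting point, a stress test for this path might repeatedly
create and merge intercommunicators between two halves of MPI_COMM_WORLD;
whether this sketch would actually provoke the collision is unproven, and
a fuller test would overlay unrelated traffic to detect cross talk:

    #include <mpi.h>

    /* Assumes at least two processes. */
    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Split the world into two halves; each half's local leader is
         * world rank 0 (lower half) or size/2 (upper half). */
        int lower = (rank < size / 2);
        int remote_leader = lower ? size / 2 : 0;
        MPI_Comm half, inter, merged;
        MPI_Comm_split(MPI_COMM_WORLD, lower, rank, &half);

        for (int i = 0; i < 1000; i++) {
            MPI_Intercomm_create(half, 0, MPI_COMM_WORLD, remote_leader,
                                 i /* tag */, &inter);
            MPI_Intercomm_merge(inter, lower, &merged);
            MPI_Comm_free(&merged);
            MPI_Comm_free(&inter);
        }

        MPI_Comm_free(&half);
        MPI_Finalize();
        return 0;
    }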

-Dave


