[mpich-discuss] apparent hydra problem

Martin Pokorny mpokorny at nrao.edu
Tue Apr 19 16:59:40 CDT 2011


My application consists of a soft real-time data collection and
processing system running on a dedicated cluster. All computers are
running Linux, the network is all 1 Gb Ethernet, and the version of
mpich2 that I'm using is mpich2-1.3.1.

This application does not use mpiexec for process management; that is all
handled by custom code. However, mpiexec is used so that MPI groups can be
created dynamically with MPI_Comm_connect/accept and MPI_Intercomm_merge.
On the data collection nodes the code is multi-threaded, and in every
process each MPI communicator is referenced by only one thread. What I have
seen is that, after MPI_Comm_connect/accept has completed, the call to
MPI_Intercomm_merge occasionally fails to return. If I put an MPI_Barrier
on the inter-communicator before the call to MPI_Intercomm_merge, then it
is occasionally that barrier call which fails to return instead. There are
no crashes or other observable failures; the calling threads simply hang.
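For reference, the sequence in question looks roughly like the sketch below.
The server/client split and the port exchange over stdin/stdout are
simplifications for illustration only; in the real application those steps
are carried out by our custom process-management code, and the processes
are multi-threaded.

    /* Minimal sketch of the connect/accept + merge sequence (compile with
     * mpicc). The port exchange shown here is an assumption made for the
     * sake of a self-contained example. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int provided;
        int is_server = (argc > 1);            /* any argument => act as server */
        char port[MPI_MAX_PORT_NAME];
        MPI_Comm inter, merged;

        /* The data collection code is multi-threaded, so request full
         * thread support. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (is_server) {
            MPI_Open_port(MPI_INFO_NULL, port);
            printf("%s\n", port);               /* hand the port name to the client out of band */
            fflush(stdout);
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        } else {
            fgets(port, MPI_MAX_PORT_NAME, stdin);   /* port name obtained out of band */
            port[strcspn(port, "\n")] = '\0';
            MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        }

        /* Optional barrier on the inter-communicator; when it is present,
         * it is this call that occasionally fails to return. */
        MPI_Barrier(inter);

        /* Without the barrier, this is the call that occasionally never
         * returns under Hydra. */
        MPI_Intercomm_merge(inter, is_server ? 0 : 1, &merged);

        MPI_Comm_free(&merged);
        MPI_Comm_free(&inter);
        if (is_server)
            MPI_Close_port(port);
        MPI_Finalize();
        return 0;
    }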

Recently I switched back to the MPD process manager instead of Hydra (which
I had started using with the move to mpich2-1.3.1), and these failures no
longer appear. It is rather difficult to get good debugging information
from this application: it processes data in real time (making it hard to
interrupt), and the process management is rather dynamic (making it hard to
attach to a running process). Still, I'd like to get to the bottom of this
problem so that I can use Hydra again.
Is there any further information that I can provide, or is there
something I could try that might help solve this problem?

-- 
Martin
