[mpich-discuss] apparent hydra problem

Pavan Balaji balaji at mcs.anl.gov
Tue Apr 19 21:29:42 CDT 2011


Hi Martin,

Can you try out the latest version of MPICH2 (1.4rc2) to see if this 
problem persists? We made several fixes for dynamic processes after 1.3.1, 
so we want to make sure you aren't running into some bug we already 
fixed. If it fails with 1.4rc2 as well, can you try running the 
application by passing the "-verbose" flag to mpiexec and send us the 
output?

  -- Pavan

On 04/19/2011 04:59 PM, Martin Pokorny wrote:
> My application consists of a soft real-time data collection and
> processing system running on a dedicated cluster. All computers are
> running Linux, the network is all 1 Gb Ethernet, and the version of
> mpich2 that I'm using is mpich2-1.3.1.
>
> This application does not use mpiexec for process management; that is
> all handled by custom code. However, mpiexec is used so that MPI
> groups can be created dynamically with MPI_Comm_connect/accept and
> MPI_Intercomm_merge. On the data collection nodes the code is
> multi-threaded, and each MPI communicator is referenced by only one
> thread in every process. What I have seen is that (after
> MPI_Comm_connect/accept has completed) the call to MPI_Intercomm_merge
> occasionally fails to return. If I put an MPI_Barrier in place on the
> inter-communicator before the call to MPI_Intercomm_merge, then
> occasionally that call fails to return. There are no crashes or other
> observable failures; the calling threads simply hang.
>
> Recently I have switched (back) to using the MPD process manager instead
> of Hydra (which I started using with the switch to mpich2-1.3.1), and
> these failures no longer appear. It is rather difficult to get good
> debugging information from this application as it processes data in real
> time (making it difficult to interrupt), and the process management is
> rather dynamic (making it hard to attach to a running process), but I'd
> like to get to the bottom of this problem so that I can use Hydra again.
> Is there any further information that I can provide, or is there
> something I could try that might help solve this problem?
>
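
For reference, the sequence Martin describes boils down to something like
the sketch below (accept side only; the port exchange, naming, and thread
setup are illustrative assumptions, not his actual code). The connecting
side would call MPI_Comm_connect with the same port name and then do the
same barrier/merge with high = 1.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int provided;
      char port[MPI_MAX_PORT_NAME];
      MPI_Comm inter, intra;

      /* the application is multi-threaded, so presumably something like */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

      /* open a port and hand the name to the connecting side out of
       * band (file, socket, MPI_Publish_name, ...) -- illustrative */
      MPI_Open_port(MPI_INFO_NULL, port);
      printf("port: %s\n", port);

      MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);

      /* optional barrier on the intercommunicator; per the report this
       * call sometimes hangs instead of the merge */
      MPI_Barrier(inter);

      /* the call that occasionally fails to return under Hydra */
      MPI_Intercomm_merge(inter, /* high = */ 0, &intra);

      MPI_Comm_free(&intra);
      MPI_Comm_disconnect(&inter);
      MPI_Close_port(port);
      MPI_Finalize();
      return 0;
  }

The output I asked for above comes from launching each side with
"mpiexec -verbose".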

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

