[mpich-discuss] apparent hydra problem

Martin Pokorny mpokorny at nrao.edu
Fri Mar 2 11:17:55 CST 2012


Following up a thread that's nearly one year old...

Martin Pokorny wrote:
> My application consists of a soft real-time data collection and
> processing system running on a dedicated cluster. All computers are
> running Linux, the network is all 1 Gb Ethernet, and the version of
> mpich2 that I'm using is mpich2-1.3.1.
> 
> This application does not use mpiexec for process management, as that
> all occurs using custom code; however, mpiexec is used for the ability
> to create MPI groups dynamically using MPI_Comm_connect/accept and
> MPI_Intercomm_merge. On the data collection nodes the code is
> multi-threaded, and each MPI communicator is referenced by only one
> thread in every process. What I have seen is that (after
> MPI_Comm_connect/accept has completed) the call to MPI_Intercomm_merge
> occasionally fails to return. If I put an MPI_Barrier in place on the
> inter-communicator before the call to MPI_Intercomm_merge, then
> occasionally that call fails to return. There are no crashes or other
> observable failures; the calling threads simply hang.
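
To make the failure mode concrete, here is a minimal sketch of the 
connecting side of that pattern. It is illustrative only: the real 
application uses its own process management and threading, and the 
port-name handling shown here (via argv) is just an assumption to keep 
the example self-contained.

#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;
    char *port_name = argv[1];   /* port string obtained out of band */
    MPI_Comm inter, merged;

    /* the data-collection processes are multi-threaded, so they ask
       for MPI_THREAD_MULTIPLE */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* connect to the accepting group; in my case this step completes */
    MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);

    /* optional barrier on the inter-communicator; under hydra either
       this call or the merge below occasionally never returns */
    MPI_Barrier(inter);

    /* merge the inter-communicator into a single intra-communicator */
    MPI_Intercomm_merge(inter, 1 /* high */, &merged);

    MPI_Comm_free(&merged);
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}

The high argument only fixes the rank ordering in the merged 
communicator (the low side's ranks come first), so one side passes 0 
and the other 1.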

I'm currently using mpich2-1.4.1 and still encountering the same error 
as I originally reported. The error appears infrequently, but it has not 
disappeared. I never see this error when using mpd; in fact, we 
generally have more success with mpd and still use it in our production 
setting. However, I've recently encountered other problems with mpd and 
would therefore like to abandon its use entirely.

Because my application effectively runs as a server, until recently I 
had been unable to correlate the output from mpiexec (with the 
--verbose flag set) with the appearance of this error. However, a 
recent change I made enables logging of mpiexec's messages to a log file 
with timestamps. Now I have captured one of these events in the output 
of mpiexec (at 18:20:12, below). Comparing it to the output captured 
from successful connect/accept/merge events immediately prior to the 
failed event, I don't see any difference. Based on logs produced by my 
application, the hang still appears to be occurring in the call to 
MPI_Intercomm_merge. Is there anything else I could do to produce more 
informative output or otherwise help debug this problem?
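
For completeness, the accepting side of the same connect/accept/merge 
sequence looks roughly like the sketch below. Again this is only an 
outline under the assumption that the port name is handed to the 
connecting processes out of band (the real application does this 
through its own control code).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Comm inter, merged;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* open a port and pass the string to the connecting processes
       (printed here; the real application uses its own control channel) */
    MPI_Open_port(MPI_INFO_NULL, port_name);
    printf("port: %s\n", port_name);

    /* accept the connection, then barrier and merge, mirroring the
       connecting side sketched earlier */
    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    MPI_Barrier(inter);
    MPI_Intercomm_merge(inter, 0 /* low */, &merged);

    MPI_Comm_free(&merged);
    MPI_Comm_disconnect(&inter);
    MPI_Close_port(port_name);
    MPI_Finalize();
    return 0;
}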

Here's the relevant log output from mpiexec --verbose (piped to the 
logger facility), ending with the failed event at 18:20:12:

> Mar  1 18:17:42 cbe-control logger: [proxy:0:0 at cbe-node-07] got pmi command (from 0): get
> Mar  1 18:17:42 cbe-control logger: kvsname=kvs_31762_0 key=P0-businesscard 
> Mar  1 18:17:42 cbe-control logger: [proxy:0:0 at cbe-node-07] forwarding command (cmd=get kvsname=kvs_31762_0 key=P0-businesscard) upstream
> Mar  1 18:17:42 cbe-control logger: [mpiexec at cbe-node-07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_31762_0 key=P0-businesscard
> Mar  1 18:17:42 cbe-control logger: [mpiexec at cbe-node-07] PMI response to fd 6 pid 0: cmd=get_result rc=0 msg=success value=description#10.80.200.107$port#36287$ifname#10.80.200.107$
> Mar  1 18:17:42 cbe-control logger: [proxy:0:0 at cbe-node-07] we don't understand the response get_result; forwarding downstream
> Mar  1 18:19:12 cbe-control logger: [proxy:0:0 at cbe-node-07] got pmi command (from 0): get
> Mar  1 18:19:12 cbe-control logger: kvsname=kvs_31762_0 key=P0-businesscard 
> Mar  1 18:19:12 cbe-control logger: [proxy:0:0 at cbe-node-07] forwarding command (cmd=get kvsname=kvs_31762_0 key=P0-businesscard) upstream
> Mar  1 18:19:12 cbe-control logger: [mpiexec at cbe-node-07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_31762_0 key=P0-businesscard
> Mar  1 18:19:12 cbe-control logger: [mpiexec at cbe-node-07] PMI response to fd 6 pid 0: cmd=get_result rc=0 msg=success value=description#10.80.200.107$port#36287$ifname#10.80.200.107$
> Mar  1 18:19:12 cbe-control logger: [proxy:0:0 at cbe-node-07] we don't understand the response get_result; forwarding downstream
> Mar  1 18:20:12 cbe-control logger: [proxy:0:0 at cbe-node-07] got pmi command (from 0): get
> Mar  1 18:20:12 cbe-control logger: kvsname=kvs_31762_0 key=P0-businesscard 
> Mar  1 18:20:12 cbe-control logger: [proxy:0:0 at cbe-node-07] forwarding command (cmd=get kvsname=kvs_31762_0 key=P0-businesscard) upstream
> Mar  1 18:20:12 cbe-control logger: [mpiexec at cbe-node-07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_31762_0 key=P0-businesscard
> Mar  1 18:20:12 cbe-control logger: [mpiexec at cbe-node-07] PMI response to fd 6 pid 0: cmd=get_result rc=0 msg=success value=description#10.80.200.107$port#36287$ifname#10.80.200.107$
> Mar  1 18:20:12 cbe-control logger: [proxy:0:0 at cbe-node-07] we don't understand the response get_result; forwarding downstream

-- 
Martin Pokorny
Software Engineer - Expanded Very Large Array
National Radio Astronomy Observatory - New Mexico Operations

