[mpich-discuss] apparent hydra problem
Martin Pokorny
mpokorny at nrao.edu
Fri Mar 2 11:17:55 CST 2012
Following up a thread that's nearly one year old...
Martin Pokorny wrote:
> My application consists of a soft real-time data collection and
> processing system running on a dedicated cluster. All computers are
> running Linux, the network is all 1 Gb Ethernet, and the version of
> mpich2 that I'm using is mpich2-1.3.1.
>
> This application does not use mpiexec for process management, as that
> all occurs using custom code; however, mpiexec is used for the ability
> to create MPI groups dynamically using MPI_Comm_connect/accept and
> MPI_Intercomm_merge. On the data collection nodes the code is
> multi-threaded, and each MPI communicator is referenced by only one
> thread in every process. What I have seen is that (after
> MPI_Comm_connect/accept has completed) the call to MPI_Intercomm_merge
> occasionally fails to return. If I put an MPI_Barrier in place on the
> inter-communicator before the call to MPI_Intercomm_merge, then
> occasionally that call fails to return. There are no crashes or other
> observable failures; the calling threads simply hang.
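For context, here is a minimal sketch (not my actual code) of the
connect/accept/merge sequence described above, including the optional
barrier on the inter-communicator. The server/client roles, the
command-line port exchange, and the use of MPI_THREAD_MULTIPLE are
placeholders and assumptions for this sketch; the real application
distributes port names and manages threads with its own custom code.

/* Minimal sketch (not the real application) of the connect/accept/merge
 * sequence described above.  Run one process group with "server" as the
 * first argument and another with "client <port-name>", passing the
 * port name printed by the server out of band. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int provided;
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter, merged;

    /* The real application is multi-threaded; requesting full thread
     * support here is an assumption of this sketch. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("port name: %s\n", port);   /* hand this to the client */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        MPI_Close_port(port);
    } else if (argc > 2 && strcmp(argv[1], "client") == 0) {
        strncpy(port, argv[2], MPI_MAX_PORT_NAME - 1);
        port[MPI_MAX_PORT_NAME - 1] = '\0';
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    } else {
        fprintf(stderr, "usage: %s server | client <port-name>\n", argv[0]);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Optional barrier on the inter-communicator; when this is present,
     * the occasional hang occurs here rather than in the merge below. */
    MPI_Barrier(inter);

    /* Merge the two groups into a single intra-communicator.  This is
     * the call that occasionally never returns under hydra. */
    MPI_Intercomm_merge(inter, 0, &merged);

    MPI_Comm_free(&merged);
    MPI_Comm_free(&inter);
    MPI_Finalize();
    return 0;
}

Each successful connect/accept/merge event of this kind appears in the
mpiexec log below as one timestamped clump of business-card lookups.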
I'm currently using mpich2-1.4.1 and still encountering the same error
that I originally reported. It appears infrequently, but it has not
disappeared. I never see this error when using mpd; until now we have
generally had better luck with mpd, and it is still what we run in our
production setting. However, I've recently encountered other problems
with mpd and would therefore like to abandon its use entirely.
Because my application effectively runs as a server, I have
until recently been unable to correlate the output from mpiexec (with
the --verbose flag set) with the appearance of this error. However, a
recent change I made enables logging of mpiexec's messages to a log file
with timestamps. Now I have captured one of these events in the output
of mpiexec (at 18:20:12, below). Comparing it to the output captured
from successful connect/accept/merge events immediately prior to the
failed event, I don't see any difference. Based on logs produced by my
application, the hang still appears to be occurring in the call to
MPI_Intercomm_merge. Is there anything else I could do to produce more
informative output or otherwise help debug this problem?
Here's the relevant log output from mpiexec --verbose (piped to the
logger facility), ending with the failed execution:
> Mar 1 18:17:42 cbe-control logger: [proxy:0:0 at cbe-node-07] got pmi command (from 0): get
> Mar 1 18:17:42 cbe-control logger: kvsname=kvs_31762_0 key=P0-businesscard
> Mar 1 18:17:42 cbe-control logger: [proxy:0:0 at cbe-node-07] forwarding command (cmd=get kvsname=kvs_31762_0 key=P0-businesscard) upstream
> Mar 1 18:17:42 cbe-control logger: [mpiexec at cbe-node-07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_31762_0 key=P0-businesscard
> Mar 1 18:17:42 cbe-control logger: [mpiexec at cbe-node-07] PMI response to fd 6 pid 0: cmd=get_result rc=0 msg=success value=description#10.80.200.107$port#36287$ifname#10.80.200.107$
> Mar 1 18:17:42 cbe-control logger: [proxy:0:0 at cbe-node-07] we don't understand the response get_result; forwarding downstream
> Mar 1 18:19:12 cbe-control logger: [proxy:0:0 at cbe-node-07] got pmi command (from 0): get
> Mar 1 18:19:12 cbe-control logger: kvsname=kvs_31762_0 key=P0-businesscard
> Mar 1 18:19:12 cbe-control logger: [proxy:0:0 at cbe-node-07] forwarding command (cmd=get kvsname=kvs_31762_0 key=P0-businesscard) upstream
> Mar 1 18:19:12 cbe-control logger: [mpiexec at cbe-node-07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_31762_0 key=P0-businesscard
> Mar 1 18:19:12 cbe-control logger: [mpiexec at cbe-node-07] PMI response to fd 6 pid 0: cmd=get_result rc=0 msg=success value=description#10.80.200.107$port#36287$ifname#10.80.200.107$
> Mar 1 18:19:12 cbe-control logger: [proxy:0:0 at cbe-node-07] we don't understand the response get_result; forwarding downstream
> Mar 1 18:20:12 cbe-control logger: [proxy:0:0 at cbe-node-07] got pmi command (from 0): get
> Mar 1 18:20:12 cbe-control logger: kvsname=kvs_31762_0 key=P0-businesscard
> Mar 1 18:20:12 cbe-control logger: [proxy:0:0 at cbe-node-07] forwarding command (cmd=get kvsname=kvs_31762_0 key=P0-businesscard) upstream
> Mar 1 18:20:12 cbe-control logger: [mpiexec at cbe-node-07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_31762_0 key=P0-businesscard
> Mar 1 18:20:12 cbe-control logger: [mpiexec at cbe-node-07] PMI response to fd 6 pid 0: cmd=get_result rc=0 msg=success value=description#10.80.200.107$port#36287$ifname#10.80.200.107$
> Mar 1 18:20:12 cbe-control logger: [proxy:0:0 at cbe-node-07] we don't understand the response get_result; forwarding downstream
--
Martin Pokorny
Software Engineer - Expanded Very Large Array
National Radio Astronomy Observatory - New Mexico Operations