[mpich-discuss] apparent hydra problem

Martin Pokorny mpokorny at nrao.edu
Fri Mar 2 13:04:11 CST 2012


Pavan Balaji wrote:
> Can you send us a sample code that's failing?

Unfortunately, the application is real-time and data-driven, so it's 
impossible to run outside of our observatory. Also, because the 
codebase is large, rather than send you the entire application, I'll 
include the relevant excerpts (with minor changes and some additional 
comments) below.

A group of processes I'll call the "data processors" already share a 
communicator. They connect to the "metadata processor" in the following 
function.

static unsigned
connect_as_group(subscan_t *subscan, const char *port_name)
{
     MPI_Comm intercomm;
     /* subscan->io_instance->pipelines is the communicator for
        the data processors */
     MPI_Barrier(subscan->io_instance->pipelines);
     MPI_Comm_connect((char *)port_name, MPI_INFO_NULL, 0,
                      subscan->io_instance->pipelines, &intercomm);
     MPI_Comm newcomm;
     /* The code sometimes hangs in the following call. Note that
        I have also seen a call to MPI_Barrier(intercomm) in this
        location hang. */
     MPI_Intercomm_merge(intercomm, TRUE, &newcomm);
     MPI_Comm_disconnect(&intercomm);
     subscan->mpi_comm = newcomm;

     unsigned signal[MDATA_SIGNAL_SIZE];
     MPI_Bcast(signal, MDATA_SIGNAL_SIZE, MPI_UNSIGNED, 0,
               subscan->mpi_comm);
     return signal[MDATA_SIGNAL_CMD_OFFSET];
}
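
For comparison, here is a minimal stand-alone sketch of just this 
connector side, with our application specifics stripped out (the port 
name comes from the command line instead of from the metadata 
processor, the received value is just printed, and all of our error 
handling is omitted). I don't know whether a reduced version like this 
would reproduce the hang, since the failure is intermittent:

/* connector.c -- minimal sketch of the data-processor side:
   mpiexec -n 4 ./connector "<port name printed by the acceptor>" */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     MPI_Init(&argc, &argv);
     if (argc < 2) {
          fprintf(stderr, "usage: %s <port-name>\n", argv[0]);
          MPI_Abort(MPI_COMM_WORLD, 1);
     }

     MPI_Comm intercomm, merged;
     /* all connectors enter collectively, as in connect_as_group() */
     MPI_Barrier(MPI_COMM_WORLD);
     MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                      &intercomm);
     /* nonzero "high" argument: connectors are ordered after the
        acceptor in the merged intracommunicator */
     MPI_Intercomm_merge(intercomm, 1, &merged);
     MPI_Comm_disconnect(&intercomm);

     unsigned cmd;
     MPI_Bcast(&cmd, 1, MPI_UNSIGNED, 0, merged);
     printf("received command %u\n", cmd);

     MPI_Comm_disconnect(&merged);
     MPI_Finalize();
     return 0;
}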

Here is the corresponding excerpt from the metadata processor:

if (need_comm) { /* need_comm will be "true" */
     MPI_Barrier(mpi_comm); /* this is cruft, mpi_comm here has size 1 */
     MPI_Comm intercomm;
     MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, mpi_comm, &intercomm);
     MPI_Comm newcomm;
     /* The following call sometimes hangs. */
     MPI_Intercomm_merge(intercomm, FALSE, &newcomm);
     MPI_Comm_disconnect(&intercomm);
     mpi_comm = newcomm;
}
/* send signal to open a new file */
unsigned signal[MDATA_SIGNAL_SIZE] = {
     [MDATA_SIGNAL_CMD_OFFSET] = MDATA_SIGNAL_OPEN_FILE
};
MPI_Bcast(signal, MDATA_SIGNAL_SIZE, MPI_UNSIGNED, 0, mpi_comm);
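
And the matching acceptor side, reduced the same way: a single process 
opens a port, accepts one group of connectors, merges, and broadcasts a 
command. In our application the port name is distributed out of band 
rather than printed, and the value 42 below merely stands in for 
MDATA_SIGNAL_OPEN_FILE:

/* acceptor.c -- minimal sketch of the metadata-processor side:
   mpiexec -n 1 ./acceptor */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     MPI_Init(&argc, &argv);

     char port_name[MPI_MAX_PORT_NAME];
     MPI_Open_port(MPI_INFO_NULL, port_name);
     printf("port name: %s\n", port_name); /* hand this to connectors */
     fflush(stdout);

     MPI_Comm intercomm, merged;
     MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                     &intercomm);
     /* zero "high" argument: the acceptor becomes rank 0 of the
        merged intracommunicator */
     MPI_Intercomm_merge(intercomm, 0, &merged);
     MPI_Comm_disconnect(&intercomm);

     unsigned cmd = 42;  /* placeholder for MDATA_SIGNAL_OPEN_FILE */
     MPI_Bcast(&cmd, 1, MPI_UNSIGNED, 0, merged);

     MPI_Comm_disconnect(&merged);
     MPI_Close_port(port_name);
     MPI_Finalize();
     return 0;
}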

> Also, are you saying the mpd works, but hydra doesn't, in the *same* 
> version of MPICH2?

Yes.

> You can't compare mpich2-1.0.8p1 with mpd against 
> mpich2-1.4.1p1 with hydra.  There are too many variables changing over 
> here, and the problem could be in some other place.
> 
>  -- Pavan
> 
> On 03/02/2012 11:17 AM, Martin Pokorny wrote:
>> Following up a thread that's nearly one year old...
>>
>> Martin Pokorny wrote:
>>> My application consists of a soft real-time data collection and
>>> processing system running on a dedicated cluster. All computers are
>>> running Linux, the network is all 1 Gb Ethernet, and the version of
>>> mpich2 that I'm using is mpich2-1.3.1.
>>>
>>> This application does not use mpiexec for process management, as that
>>> all occurs using custom code; however, mpiexec is used for the ability
>>> to create MPI groups dynamically using MPI_Comm_connect/accept and
>>> MPI_Intercomm_merge. On the data collection nodes the code is
>>> multi-threaded, and each MPI communicator is referenced by only one
>>> thread in every process. What I have seen is that (after
>>> MPI_Comm_connect/accept has completed) the call to MPI_Intercomm_merge
>>> occasionally fails to return. If I put an MPI_Barrier in place on the
>>> inter-communicator before the call to MPI_Intercomm_merge, then
>>> occasionally that call fails to return. There are no crashes or other
>>> observable failures; the calling threads simply hang.
>>
>> I'm currently using mpich2-1.4.1 and still encountering the same error I
>> originally reported. It appears infrequently, but it has not disappeared.
>> I never see this error when using mpd; generally speaking we have more
>> success with mpd, and we normally use it in our production setting.
>> However, I've recently encountered other problems with mpd and would
>> therefore like to abandon it entirely.
>>
>> Because my application effectively runs as a server, until recently I was
>> unable to correlate the output from mpiexec (run with the --verbose flag)
>> with the appearance of this error. However, a recent change I made logs
>> mpiexec's messages to a file with timestamps, and I have now captured one
>> of these events in the mpiexec output (at 18:20:12, below). Comparing it
>> to the output captured from the successful connect/accept/merge events
>> immediately preceding it, I don't see any difference. Based on the logs
>> produced by my application, the hang still appears to occur in the call
>> to MPI_Intercomm_merge. Is there anything else I could do to produce more
>> informative output or otherwise help debug this problem?
>>
>> Here's the relevant log output from mpiexec --verbose (piped to the
>> logger facility), ending with the failed execution:
>>
>>> Mar  1 18:17:42 cbe-control logger: [proxy:0:0 at cbe-node-07] got pmi 
>>> command (from 0): get
>>> Mar  1 18:17:42 cbe-control logger: kvsname=kvs_31762_0 
>>> key=P0-businesscard
>>> Mar  1 18:17:42 cbe-control logger: [proxy:0:0 at cbe-node-07] 
>>> forwarding command (cmd=get kvsname=kvs_31762_0 key=P0-businesscard) 
>>> upstream
>>> Mar  1 18:17:42 cbe-control logger: [mpiexec at cbe-node-07] [pgid: 0] 
>>> got PMI command: cmd=get kvsname=kvs_31762_0 key=P0-businesscard
>>> Mar  1 18:17:42 cbe-control logger: [mpiexec at cbe-node-07] PMI 
>>> response to fd 6 pid 0: cmd=get_result rc=0 msg=success 
>>> value=description#10.80.200.107$port#36287$ifname#10.80.200.107$
>>> Mar  1 18:17:42 cbe-control logger: [proxy:0:0 at cbe-node-07] we don't 
>>> understand the response get_result; forwarding downstream
>>> Mar  1 18:19:12 cbe-control logger: [proxy:0:0 at cbe-node-07] got pmi 
>>> command (from 0): get
>>> Mar  1 18:19:12 cbe-control logger: kvsname=kvs_31762_0 
>>> key=P0-businesscard
>>> Mar  1 18:19:12 cbe-control logger: [proxy:0:0 at cbe-node-07] 
>>> forwarding command (cmd=get kvsname=kvs_31762_0 key=P0-businesscard) 
>>> upstream
>>> Mar  1 18:19:12 cbe-control logger: [mpiexec at cbe-node-07] [pgid: 0] 
>>> got PMI command: cmd=get kvsname=kvs_31762_0 key=P0-businesscard
>>> Mar  1 18:19:12 cbe-control logger: [mpiexec at cbe-node-07] PMI 
>>> response to fd 6 pid 0: cmd=get_result rc=0 msg=success 
>>> value=description#10.80.200.107$port#36287$ifname#10.80.200.107$
>>> Mar  1 18:19:12 cbe-control logger: [proxy:0:0 at cbe-node-07] we don't 
>>> understand the response get_result; forwarding downstream
>>> Mar  1 18:20:12 cbe-control logger: [proxy:0:0 at cbe-node-07] got pmi 
>>> command (from 0): get
>>> Mar  1 18:20:12 cbe-control logger: kvsname=kvs_31762_0 
>>> key=P0-businesscard
>>> Mar  1 18:20:12 cbe-control logger: [proxy:0:0 at cbe-node-07] 
>>> forwarding command (cmd=get kvsname=kvs_31762_0 key=P0-businesscard) 
>>> upstream
>>> Mar  1 18:20:12 cbe-control logger: [mpiexec at cbe-node-07] [pgid: 0] 
>>> got PMI command: cmd=get kvsname=kvs_31762_0 key=P0-businesscard
>>> Mar  1 18:20:12 cbe-control logger: [mpiexec at cbe-node-07] PMI 
>>> response to fd 6 pid 0: cmd=get_result rc=0 msg=success 
>>> value=description#10.80.200.107$port#36287$ifname#10.80.200.107$
>>> Mar  1 18:20:12 cbe-control logger: [proxy:0:0 at cbe-node-07] we don't 
>>> understand the response get_result; forwarding downstream

-- 
Martin Pokorny
Software Engineer - Expanded Very Large Array
National Radio Astronomy Observatory - New Mexico Operations

