[mpich-discuss] apparent hydra problem

Pavan Balaji balaji at mcs.anl.gov
Fri Mar 2 11:52:01 CST 2012


Can you send us a sample code that's failing?
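Even something minimal along these lines would help.  This is only a
sketch of the pattern you describe (connect/accept on a dynamically
opened port, an optional barrier on the inter-communicator, then the
merge); the port-name exchange through a shared file, the one-process
server/client split, and launching each side under its own mpiexec are
assumptions made purely for illustration, not your actual code:

/*
 * Minimal sketch of the connect/accept/merge pattern from the report
 * below -- not the reporter's application code.  The port-name exchange
 * through a shared file and the server/client split are assumptions.
 *
 *   mpiexec -n 1 ./repro server     (start this side first)
 *   mpiexec -n 1 ./repro client
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int provided, is_server;
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter, merged;

    /* The application is multi-threaded, so ask for full thread support. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    is_server = (argc > 1 && strcmp(argv[1], "server") == 0);

    if (is_server) {
        /* Open a port and publish it through a file (no synchronization
         * of the file is attempted here; this is only a sketch). */
        FILE *f;
        MPI_Open_port(MPI_INFO_NULL, port);
        f = fopen("port.txt", "w");
        fprintf(f, "%s\n", port);
        fclose(f);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    } else {
        FILE *f = fopen("port.txt", "r");
        fgets(port, MPI_MAX_PORT_NAME, f);
        fclose(f);
        port[strcspn(port, "\n")] = '\0';
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    }

    /* Per the report, with this barrier in place the barrier is
     * occasionally the call that fails to return instead of the merge. */
    MPI_Barrier(inter);

    /* The call that occasionally hangs under hydra. */
    MPI_Intercomm_merge(inter, is_server ? 0 : 1, &merged);

    MPI_Comm_free(&merged);
    MPI_Comm_disconnect(&inter);
    if (is_server)
        MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}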

Also, are you saying that mpd works but hydra doesn't with the *same* 
version of MPICH2?  You can't compare mpich2-1.0.8p1 with mpd against 
mpich2-1.4.1p1 with hydra.  There are too many variables changing 
there, and the problem could lie somewhere else entirely.

  -- Pavan

On 03/02/2012 11:17 AM, Martin Pokorny wrote:
> Following up a thread that's nearly one year old...
>
> Martin Pokorny wrote:
>> My application consists of a soft real-time data collection and
>> processing system running on a dedicated cluster. All computers are
>> running Linux, the network is all 1 Gb Ethernet, and the version of
>> mpich2 that I'm using is mpich2-1.3.1.
>>
>> This application does not use mpiexec for process management, as that
>> all occurs using custom code; however, mpiexec is used for the ability
>> to create MPI groups dynamically using MPI_Comm_connect/accept and
>> MPI_Intercomm_merge. On the data collection nodes the code is
>> multi-threaded, and each MPI communicator is referenced by only one
>> thread in every process. What I have seen is that (after
>> MPI_Comm_connect/accept has completed) the call to MPI_Intercomm_merge
>> occasionally fails to return. If I put an MPI_Barrier in place on the
>> inter-communicator before the call to MPI_Intercomm_merge, then
>> occasionally that call fails to return. There are no crashes or other
>> observable failures; the calling threads simply hang.
>
> I'm currently using mpich2-1.4.1 and still encountering the same error
> that I originally reported. It appears infrequently, but it has not
> disappeared. I never see this error when using mpd; however, I've
> recently run into other problems with mpd and would therefore like to
> abandon it entirely. For now, though, we generally have more success
> with mpd, and we still use it in our production setting.
>
> Because my application effectively runs as a server application, I have
> until recently been unable to correlate the output from mpiexec (with
> the --verbose flag set) with the appearance of this error. However, a
> recent change I made enables logging of mpiexec's messages to a log file
> with timestamps. Now I have captured one of these events in the output
> of mpiexec (at 18:20:12, below). Comparing it to the output captured
> from successful connect/accept/merge events immediately prior to the
> failed event, I don't see any difference. Based on logs produced by my
> application, the hang still appears to be occurring in the call to
> MPI_Intercomm_merge. Is there anything else I could do to produce more
> informative output or otherwise help debug this problem?
>
> Here's the relevant log output from mpiexec --verbose (piped to the
> logger facility), ending with the failed execution:
>
>> Mar  1 18:17:42 cbe-control logger: [proxy:0:0 at cbe-node-07] got pmi command (from 0): get
>> Mar  1 18:17:42 cbe-control logger: kvsname=kvs_31762_0 key=P0-businesscard
>> Mar  1 18:17:42 cbe-control logger: [proxy:0:0 at cbe-node-07] forwarding command (cmd=get kvsname=kvs_31762_0 key=P0-businesscard) upstream
>> Mar  1 18:17:42 cbe-control logger: [mpiexec at cbe-node-07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_31762_0 key=P0-businesscard
>> Mar  1 18:17:42 cbe-control logger: [mpiexec at cbe-node-07] PMI response to fd 6 pid 0: cmd=get_result rc=0 msg=success value=description#10.80.200.107$port#36287$ifname#10.80.200.107$
>> Mar  1 18:17:42 cbe-control logger: [proxy:0:0 at cbe-node-07] we don't understand the response get_result; forwarding downstream
>> Mar  1 18:19:12 cbe-control logger: [proxy:0:0 at cbe-node-07] got pmi command (from 0): get
>> Mar  1 18:19:12 cbe-control logger: kvsname=kvs_31762_0 key=P0-businesscard
>> Mar  1 18:19:12 cbe-control logger: [proxy:0:0 at cbe-node-07] forwarding command (cmd=get kvsname=kvs_31762_0 key=P0-businesscard) upstream
>> Mar  1 18:19:12 cbe-control logger: [mpiexec at cbe-node-07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_31762_0 key=P0-businesscard
>> Mar  1 18:19:12 cbe-control logger: [mpiexec at cbe-node-07] PMI response to fd 6 pid 0: cmd=get_result rc=0 msg=success value=description#10.80.200.107$port#36287$ifname#10.80.200.107$
>> Mar  1 18:19:12 cbe-control logger: [proxy:0:0 at cbe-node-07] we don't understand the response get_result; forwarding downstream
>> Mar  1 18:20:12 cbe-control logger: [proxy:0:0 at cbe-node-07] got pmi command (from 0): get
>> Mar  1 18:20:12 cbe-control logger: kvsname=kvs_31762_0 key=P0-businesscard
>> Mar  1 18:20:12 cbe-control logger: [proxy:0:0 at cbe-node-07] forwarding command (cmd=get kvsname=kvs_31762_0 key=P0-businesscard) upstream
>> Mar  1 18:20:12 cbe-control logger: [mpiexec at cbe-node-07] [pgid: 0] got PMI command: cmd=get kvsname=kvs_31762_0 key=P0-businesscard
>> Mar  1 18:20:12 cbe-control logger: [mpiexec at cbe-node-07] PMI response to fd 6 pid 0: cmd=get_result rc=0 msg=success value=description#10.80.200.107$port#36287$ifname#10.80.200.107$
>> Mar  1 18:20:12 cbe-control logger: [proxy:0:0 at cbe-node-07] we don't understand the response get_result; forwarding downstream
>

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

