[mpich-discuss] MPD error
Rajeev Thakur
thakur at mcs.anl.gov
Tue Apr 28 12:57:44 CDT 2009
It is hard to say what the problem is. Make sure all MPD-related processes on
those machines have been killed, then reconnect the machines to the ring.
Rajeev
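
As a rough aid for the first step above (checking that no MPD-related processes are left on a machine), here is a minimal Python sketch. It only lists candidates and does not kill anything. It assumes that "MPD-related" means any process whose command line contains a program whose file name starts with "mpd" (for example mpd.py or mpdman.py). That matching rule and the function names are illustrative assumptions, not part of MPD, and the rule can also match unrelated programs that happen to share the prefix. Once the leftovers are gone and the machine has rejoined, mpdtrace can be used to confirm which hosts are in the ring.

```python
import os
import subprocess


def find_mpd_processes(ps_output):
    """Return (pid, command line) pairs that look MPD-related.

    Assumption: a process is "MPD-related" if any token of its command
    line has a file name starting with "mpd". This is a heuristic.
    """
    found = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # header line or something we cannot parse
        pid, args = int(parts[0]), parts[1]
        if any(os.path.basename(tok).startswith("mpd") for tok in args.split()):
            found.append((pid, args))
    return found


def list_local_mpd_processes():
    """Run ps on this machine and filter its output (hypothetical helper)."""
    out = subprocess.run(
        ["ps", "-eo", "pid,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_mpd_processes(out)
```

Calling list_local_mpd_processes() on each affected node and reviewing the result by hand before killing anything keeps the heuristic from taking down an unrelated process.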
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Vivek Gavane
> Sent: Monday, April 27, 2009 12:14 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] MPD error
>
> Hi,
> I spawned an MPD ring on a cluster of 36 nodes and it was
> working fine.
> After two days I found that a few hosts had dropped out of the ring for
> unknown reasons.
> The log file on one of these hosts shows the following:
> -----------------------------
> compute-0-11.local_32863 (runmainloop 305): no pulse_ack from
> rhs; re-entering ring
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after
> numTries=1
> compute-0-11.local_32863 (runmainloop 310): back in ring
> compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs;
> re-entering ring
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after
> numTries=1
> compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
> compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs;
> re-entering ring
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after
> numTries=1
> compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
> compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs;
> re-entering ring
> compute-0-11.local_32863 (connect_lhs 897): bad generation from lhs;
> lhsgen=3 mygen=4
> compute-0-11.local_32863 (enter_ring 836): lhs connect failed
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after
> numTries=2
> compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
> compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs;
> re-entering ring
> compute-0-11.local_32863 (connect_lhs 897): bad generation from lhs;
> lhsgen=4 mygen=5
> compute-0-11.local_32863 (enter_ring 836): lhs connect failed
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after
> numTries=2
> compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
> compute-0-11.local_32863 (runmainloop 305): no pulse_ack from
> rhs; re-entering ring
> compute-0-11.local_32863 (connect_lhs 897): bad generation from lhs;
> lhsgen=5 mygen=6
> compute-0-11.local_32863 (enter_ring 836): lhs connect failed
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after
> numTries=2
> compute-0-11.local_32863 (runmainloop 310): back in ring
> compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs;
> re-entering ring
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after
> numTries=1
> compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
> compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs;
> re-entering ring
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after
> numTries=1
> compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
> compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs;
> re-entering ring
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after
> numTries=1
> compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
> compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs;
> re-entering ring
> compute-0-11.local_32863 (connect_rhs 957): bad generation from rhs;
> lhsgen=9 mygen=10
> compute-0-11.local_32863 (enter_ring 849): rhs connect failed
>
> -----------------------------
>
> Can anyone tell me the probable reasons?
>
> Thanks.
> --
> Regards,
> Vivek Gavane.
>
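
To see how often each kind of event occurs in a log like the one quoted above, the entries can be tallied with a short script. This is a minimal Python sketch, assuming each entry in the on-disk log sits on a single line of the form "<mpd id> (<function> <line number>): <message>" (the excerpt above is wrapped by the mail client). The function name summarize and the SAMPLE list are illustrative, and the field lhsgen is copied as printed in the log.

```python
import re
from collections import Counter

# Assumed entry shape, inferred from the excerpt:
#   "<mpd id> (<function> <line number>): <message>"
LINE_RE = re.compile(r"^(?P<mpd>\S+) \((?P<func>\w+) (?P<lineno>\d+)\): (?P<msg>.*)$")
GEN_RE = re.compile(
    r"bad generation from (?P<side>lhs|rhs); lhsgen=(?P<peer>\d+) mygen=(?P<mine>\d+)"
)


def summarize(lines):
    """Count messages and collect (side, reported gen, own gen) mismatches."""
    events = Counter()
    mismatches = []
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # separators or anything not shaped like an entry
        msg = m.group("msg")
        g = GEN_RE.search(msg)
        if g:
            # Group all generation errors per side, whatever the numbers are.
            events["bad generation from " + g.group("side")] += 1
            mismatches.append(
                (g.group("side"), int(g.group("peer")), int(g.group("mine")))
            )
        else:
            events[msg] += 1
    return events, mismatches


# A few entries from the excerpt, unwrapped onto single lines.
SAMPLE = [
    "compute-0-11.local_32863 (runmainloop 305): no pulse_ack from rhs; re-entering ring",
    "compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=1",
    "compute-0-11.local_32863 (runmainloop 310): back in ring",
    "compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring",
    "compute-0-11.local_32863 (connect_lhs 897): bad generation from lhs; lhsgen=3 mygen=4",
    "compute-0-11.local_32863 (enter_ring 836): lhs connect failed",
    "compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=2",
    "compute-0-11.local_32863 (handle_rhs_input 1092): back in ring",
]

if __name__ == "__main__":
    events, mismatches = summarize(SAMPLE)
    for message, count in events.most_common():
        print(count, message)
    print("generation mismatches:", mismatches)
```

Running the same tally on the logs of several affected hosts and comparing the counts may help show whether the drop-outs cluster around particular neighbours in the ring.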