[mpich-discuss] MPD error

Rajeev Thakur thakur at mcs.anl.gov
Tue Apr 28 12:57:44 CDT 2009


Hard to say what the problem may be. Make sure all MPD-related processes on
those machines are dead, then reconnect the machines to the ring.
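
A minimal sketch of that cleanup-and-rejoin sequence, written as a dry run
that only prints the commands instead of executing them. The ring host
"frontend" and port 4268 are placeholders (take the real values from
"mpdtrace -l" on a node that is still in the ring), and using ssh plus
mpdcleanup as the cleanup step is an assumption; check "mpdcleanup --help"
on your installation and kill any leftover mpd processes by hand if needed.

```python
import shlex


def rejoin_commands(hosts, ring_host, ring_port, cleanup_cmd=("mpdcleanup",)):
    """Build the command lines to clean up and rejoin each dropped host.

    Nothing is executed here; the caller decides whether to run them.
    ring_host and ring_port identify an mpd that is still in the ring.
    cleanup_cmd is an assumption and can be replaced with a site-specific
    kill or cleanup command.
    """
    commands = []
    for host in hosts:
        # Step 1: clear leftover MPD state on the dropped host.
        commands.append(["ssh", host, *cleanup_cmd])
        # Step 2: start a fresh mpd there and point it at the existing ring.
        commands.append([
            "ssh", host, "mpd",
            f"--host={ring_host}", f"--port={ring_port}", "--daemon",
        ])
    # Step 3: list the ring members to confirm the hosts are back.
    commands.append(["mpdtrace", "-l"])
    return commands


if __name__ == "__main__":
    # "frontend" and 4268 are hypothetical; compute-0-11 is the host
    # from the quoted log.
    for argv in rejoin_commands(["compute-0-11"], "frontend", 4268):
        print(shlex.join(argv))
```

Review the printed commands before running any of them on the cluster.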

Rajeev 

> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov 
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Vivek Gavane
> Sent: Monday, April 27, 2009 12:14 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] MPD error
> 
> Hi,
>    I spawned an mpd ring on a cluster of 36 nodes and it was working fine.
> After two days I found that a few hosts had dropped out of the ring for
> unknown reasons.
> The logfile on one of those hosts shows the following:
> -----------------------------
> compute-0-11.local_32863 (runmainloop 305): no pulse_ack from rhs; re-entering ring
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=1
> compute-0-11.local_32863 (runmainloop 310): back in ring
> compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=1
> compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
> compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=1
> compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
> compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring
> compute-0-11.local_32863 (connect_lhs 897): bad generation from lhs; lhsgen=3 mygen=4
> compute-0-11.local_32863 (enter_ring 836): lhs connect failed
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=2
> compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
> compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring
> compute-0-11.local_32863 (connect_lhs 897): bad generation from lhs; lhsgen=4 mygen=5
> compute-0-11.local_32863 (enter_ring 836): lhs connect failed
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=2
> compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
> compute-0-11.local_32863 (runmainloop 305): no pulse_ack from rhs; re-entering ring
> compute-0-11.local_32863 (connect_lhs 897): bad generation from lhs; lhsgen=5 mygen=6
> compute-0-11.local_32863 (enter_ring 836): lhs connect failed
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=2
> compute-0-11.local_32863 (runmainloop 310): back in ring
> compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=1
> compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
> compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=1
> compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
> compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring
> compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=1
> compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
> compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring
> compute-0-11.local_32863 (connect_rhs 957): bad generation from rhs; lhsgen=9 mygen=10
> compute-0-11.local_32863 (enter_ring 849): rhs connect failed
> 
> -----------------------------
> 
> Can anyone tell me the probable reasons?
> 
> Thanks.
> --
> Regards,
> Vivek Gavane.
> 


