[mpich-discuss] MPD error

Vivek Gavane vivekg at cdac.in
Mon Apr 27 00:14:21 CDT 2009


Hi,
   I spawned an mpd ring on a cluster of 36 nodes and it was working fine.
After two days I found that a few hosts had dropped out of the ring for
unknown reasons.
The log file on one of those hosts shows the following.
-----------------------------
compute-0-11.local_32863 (runmainloop 305): no pulse_ack from rhs; re-entering ring
compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=1
compute-0-11.local_32863 (runmainloop 310): back in ring
compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring
compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=1
compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring
compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=1
compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring
compute-0-11.local_32863 (connect_lhs 897): bad generation from lhs; lhsgen=3 mygen=4
compute-0-11.local_32863 (enter_ring 836): lhs connect failed
compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=2
compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring
compute-0-11.local_32863 (connect_lhs 897): bad generation from lhs; lhsgen=4 mygen=5
compute-0-11.local_32863 (enter_ring 836): lhs connect failed
compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=2
compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
compute-0-11.local_32863 (runmainloop 305): no pulse_ack from rhs; re-entering ring
compute-0-11.local_32863 (connect_lhs 897): bad generation from lhs; lhsgen=5 mygen=6
compute-0-11.local_32863 (enter_ring 836): lhs connect failed
compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=2
compute-0-11.local_32863 (runmainloop 310): back in ring
compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring
compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=1
compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring
compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=1
compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring
compute-0-11.local_32863 (reenter_ring 806): reenter_ring rc=0 after numTries=1
compute-0-11.local_32863 (handle_rhs_input 1092): back in ring
compute-0-11.local_32863 (handle_rhs_input 1087): lost rhs; re-entering ring
compute-0-11.local_32863 (connect_rhs 957): bad generation from rhs; lhsgen=9 mygen=10
compute-0-11.local_32863 (enter_ring 849): rhs connect failed

-----------------------------
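
To get an overview of how often each event occurs in a log like this, a rough Python sketch along the following lines may help. It assumes every entry has the shape "<host>_<port> (<function> <line>): <message>" as in the excerpt, and that an entry may be wrapped onto a second line. The names ENTRY and tally are made up for illustration and are not part of mpd.

```python
import re
from collections import Counter

# Assumed entry shape: "<host>_<port> (<function> <lineno>): <message>"
ENTRY = re.compile(r"^(\S+)_(\d+) \((\w+) (\d+)\): (.*)$")


def tally(log_text):
    """Count log entries by (function name, message kind)."""
    # Rejoin entries that were wrapped onto a following line.
    entries = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ENTRY.match(line):
            entries.append(line)
        elif entries:
            entries[-1] += " " + line

    counts = Counter()
    for entry in entries:
        match = ENTRY.match(entry)
        function, message = match.group(3), match.group(5)
        # Keep the text before the first ';' and mask digits, so that
        # e.g. numTries=1 and numTries=2 fall into the same bucket.
        kind = re.sub(r"\d+", "N", message.split(";")[0])
        counts[(function, kind)] += 1
    return counts
```

Calling tally() on the contents of the log file and printing counts.most_common() lists the most frequent events first, which makes it easier to compare the hosts that left the ring with those that stayed.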

Can anyone tell me the probable reasons for this?

Thanks.
-- 
Regards,
Vivek Gavane.