[mpich-discuss] mpich2 1.0.8

Ralph Butler rbutler at mtsu.edu
Tue Feb 10 11:14:53 CST 2009


Log in to node12, kill the mpd that is dead/hung, and restart it via
the same command, e.g.:

         kill -9 2229
         python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
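
(If you are unsure of the PID or of the -h/-p values a node's mpd was
started with, something along these lines will show them; the grep
pattern here is just an example:)

         ps -efww | grep '[m]pd.py'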

There should be a new mpd in the ring now.
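
(To verify, you can re-run the trace on the master and check that
node12 now shows up:)

         mpdtrace -l
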
If this works, you can do likewise on other nodes.
If not, then this implies that the existing ring has some problems and
probably needs to be destroyed and restarted.
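
(A full restart would look roughly like this, run on the master; the
mpd.hosts name and the -n count are placeholders for your own hosts
file and number of mpds, and note that taking the ring down also takes
down anything currently running under it:)

         mpdallexit
         mpdboot -n <total_number_of_mpds> -f mpd.hosts
         mpdtrace -l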

On Mon, Feb 9, 2009, at 12:08 PM, Anne M. Hammond wrote:

> Our ring does not show all the nodes in the cluster:
> (missing 12, 13, 14, 15, 16, 17, 19, 20):
>
> [root@master]# mpdtrace -l
> master.corp.com_34571 (10.0.0.185)
> node11.cl.corp.com_57072 (10.0.0.11)
> node18.cl.corp.com_51834 (10.0.0.18)
> node21.cl.corp.com_36328 (10.0.0.21)
> node22.cl.corp.com_55311 (10.0.0.22)
>
> However, if I rsh to node12, PID 2229 is an mpd that is bound to
> the master (10.0.0.185), using the correct port:
>
>    1  2229  2228  2228 ?           -1 S        0   4:47 python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
> 2229 17009 17009  2228 ?           -1 S      654   0:01  \_ python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
> 17009 17013 17013  2228 ?           -1 R      654 110:32  |   \_ ./bbsim3d.x
> 2229 17010 17010  2228 ?           -1 S      654   0:00  \_ python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
> 17010 17012 17012  2228 ?           -1 R      654 111:45      \_ ./bbsim3d.x
>
> This is the same on the other nodes running this job.
>
> Is there a way to have the 8 nodes not currently in the ring
> reenter the ring without killing the job from the queue?
>
> Thanks in advance.
>
> Anne


