[mpich-discuss] mpich2 1.0.8
Ralph Butler
rbutler at mtsu.edu
Tue Feb 10 11:14:53 CST 2009
Log in to node12, kill the mpd that is dead/hung, and restart it with the
same command it was originally started with, e.g.:
kill -9 2229
python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
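To find the right PID and command line on each node, a forest listing such as
ps axjf | grep mpd.py | grep -v grep
will show the top-level mpd and its full arguments, in the same layout as the
ps output quoted below.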
There should be a new mpd in the ring now.
If this works, you can do likewise on other nodes.
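After each restart you can check from the master that the node has rejoined,
e.g.:
mpdtrace -l
mpdringtest 10    # optional: times a message sent 10 times around the ring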
If not, then the existing ring itself has problems and probably needs to be
destroyed and restarted.
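If it comes to that, the usual sequence is roughly as follows, assuming your
hosts are listed in an mpd.hosts file (note that this kills the running job):
mpdallexit                         # shut down the mpds still in the ring
mpdcleanup -f mpd.hosts            # remove stale mpd sockets on the listed hosts
mpdboot -n <nhosts> -f mpd.hosts   # rebuild the ring; <nhosts> = total mpds to start
mpdtrace -l                        # confirm all nodes now show up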
On Mon, Feb 9, 2009, at 12:08 PM, Anne M. Hammond wrote:
> Our ring does not show all the nodes in the cluster:
> (missing 12, 13, 14, 15, 16, 17, 19, 20):
>
> [root@master]# mpdtrace -l
> master.corp.com_34571 (10.0.0.185)
> node11.cl.corp.com_57072 (10.0.0.11)
> node18.cl.corp.com_51834 (10.0.0.18)
> node21.cl.corp.com_36328 (10.0.0.21)
> node22.cl.corp.com_55311 (10.0.0.22)
>
> However, if I rsh to node12, PID 2229 is an mpd that is bound to
> the master (10.0.0.185), using the correct port:
>
>     1  2229  2228  2228 ?  -1 S    0   4:47 python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
>  2229 17009 17009  2228 ?  -1 S  654   0:01  \_ python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
> 17009 17013 17013  2228 ?  -1 R  654 110:32  |   \_ ./bbsim3d.x
>  2229 17010 17010  2228 ?  -1 S  654   0:00  \_ python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
> 17010 17012 17012  2228 ?  -1 R  654 111:45      \_ ./bbsim3d.x
>
> This is the same on the other nodes running this job.
>
> Is there a way to have the 8 nodes not currently in the ring reenter
> the ring without killing the job from the queue?
>
> Thanks in advance.
>
> Anne