[mpich-discuss] mpich2 1.0.8
Anne M. Hammond
hammond at txcorp.com
Mon Feb 9 12:08:19 CST 2009
Our mpd ring does not show all the nodes in the cluster
(nodes 12, 13, 14, 15, 16, 17, 19, and 20 are missing):
[root@master]# mpdtrace -l
master.corp.com_34571 (10.0.0.185)
node11.cl.corp.com_57072 (10.0.0.11)
node18.cl.corp.com_51834 (10.0.0.18)
node21.cl.corp.com_36328 (10.0.0.21)
node22.cl.corp.com_55311 (10.0.0.22)
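
A comparison like the one above can be scripted. The following is only a
minimal sketch: it assumes the cluster's compute nodes are node11 through
node22 (inferred from the node numbers in this message, not confirmed), and
the helper names hosts_in_ring and missing_nodes are made up for
illustration. It parses text in the format that mpdtrace -l printed above and
reports which expected hosts are absent.

```python
import re

# Sample text copied from the mpdtrace -l output shown in this message.
SAMPLE = """\
master.corp.com_34571 (10.0.0.185)
node11.cl.corp.com_57072 (10.0.0.11)
node18.cl.corp.com_51834 (10.0.0.18)
node21.cl.corp.com_36328 (10.0.0.21)
node22.cl.corp.com_55311 (10.0.0.22)
"""

# Assumption: the compute nodes are named node11 .. node22.
EXPECTED = ["node%d" % n for n in range(11, 23)]


def hosts_in_ring(mpdtrace_output):
    """Return the short hostnames found in 'mpdtrace -l' style text."""
    hosts = set()
    for line in mpdtrace_output.splitlines():
        # Each entry looks like: node11.cl.corp.com_57072 (10.0.0.11)
        match = re.match(r"\s*(\S+)_(\d+)\s+\(([\d.]+)\)", line)
        if match:
            # Keep only the part before the first dot, e.g. "node11".
            hosts.add(match.group(1).split(".")[0])
    return hosts


def missing_nodes(mpdtrace_output, expected):
    """Return the expected hosts that do not appear in the ring, sorted."""
    return sorted(set(expected) - hosts_in_ring(mpdtrace_output))


if __name__ == "__main__":
    for host in missing_nodes(SAMPLE, EXPECTED):
        print(host)
```

In practice the SAMPLE text would be replaced by the captured output of
mpdtrace -l on the master, and EXPECTED by the site's real node list.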
However, if I rsh to node12, PID 2229 is an mpd that is bound to
the master (10.0.0.185), using the correct port:
    1  2229  2228  2228 ?  -1 S    0   4:47 python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
 2229 17009 17009  2228 ?  -1 S  654   0:01  \_ python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
17009 17013 17013  2228 ?  -1 R  654 110:32  |   \_ ./bbsim3d.x
 2229 17010 17010  2228 ?  -1 S  654   0:00  \_ python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
17010 17012 17012  2228 ?  -1 R  654 111:45      \_ ./bbsim3d.x
This is the same on the other nodes running this job.
Is there a way to have the 8 nodes that are not currently in the ring
rejoin it without killing the job in the queue?
Thanks in advance.
Anne