[mpich-discuss] mpich2 1.0.8

Anne M. Hammond hammond at txcorp.com
Mon Feb 9 12:08:19 CST 2009


Our mpd ring does not show all the nodes in the cluster.
Nodes 12, 13, 14, 15, 16, 17, 19, and 20 are missing:

[root at master]# mpdtrace -l
master.corp.com_34571 (10.0.0.185)
node11.cl.corp.com_57072 (10.0.0.11)
node18.cl.corp.com_51834 (10.0.0.18)
node21.cl.corp.com_36328 (10.0.0.21)
node22.cl.corp.com_55311 (10.0.0.22)
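
For reference, the list of missing nodes can be derived from the trace
rather than counted by hand. This is only a minimal sketch: it assumes
the compute nodes are 10.0.0.11 through 10.0.0.22 (a range inferred from
the node names, not confirmed anywhere), and the helper names
ring_ips and missing_nodes are made up for illustration.

```python
import re

# Output of `mpdtrace -l` as pasted above: one "host_port (ip)" entry per line.
TRACE = """\
master.corp.com_34571 (10.0.0.185)
node11.cl.corp.com_57072 (10.0.0.11)
node18.cl.corp.com_51834 (10.0.0.18)
node21.cl.corp.com_36328 (10.0.0.21)
node22.cl.corp.com_55311 (10.0.0.22)
"""

# ASSUMPTION: the cluster's compute nodes are 10.0.0.11 .. 10.0.0.22.
EXPECTED = ["10.0.0.%d" % n for n in range(11, 23)]

ENTRY = re.compile(r"^\S+_\d+ \((\d+\.\d+\.\d+\.\d+)\)$")


def ring_ips(trace_text):
    """Return the set of IP addresses that appear in the trace."""
    ips = set()
    for line in trace_text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            ips.add(match.group(1))
    return ips


def missing_nodes(trace_text, expected_ips):
    """Return the expected IPs that are absent from the ring, in numeric order."""
    absent = set(expected_ips) - ring_ips(trace_text)
    return sorted(absent, key=lambda ip: tuple(int(p) for p in ip.split(".")))


if __name__ == "__main__":
    for ip in missing_nodes(TRACE, EXPECTED):
        print(ip)
```

In practice TRACE would be replaced by the live output of mpdtrace -l on
the master.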

However, if I rsh to node12, I find an mpd (PID 2229) that was started
against the master (10.0.0.185) with the correct port (34571):

    1  2229  2228  2228 ?           -1 S        0   4:47 python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
 2229 17009 17009  2228 ?           -1 S      654   0:01  \_ python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
17009 17013 17013  2228 ?           -1 R      654 110:32  |   \_ ./bbsim3d.x
 2229 17010 17010  2228 ?           -1 S      654   0:00  \_ python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
17010 17012 17012  2228 ?           -1 R      654 111:45      \_ ./bbsim3d.x

This is the same on the other nodes running this job.
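
The check I did by eye (comparing the -h and -p options of the mpd on
the node with the master's entry in the trace) could be scripted along
these lines. This is a rough sketch only: mpd_target and ring_entry are
hypothetical helper names, and it assumes the number after the
underscore in a trace entry is the port that -p refers to, which is what
the output above suggests.

```python
import shlex


def mpd_target(cmdline):
    """Return (host, port) taken from the -h and -p options of an mpd.py command line."""
    args = shlex.split(cmdline)
    host = args[args.index("-h") + 1]
    port = int(args[args.index("-p") + 1])
    return host, port


def ring_entry(trace_line):
    """Split a trace line such as 'host_port (ip)' into (ip, port)."""
    name, ip = trace_line.split()
    port = int(name.rsplit("_", 1)[1])
    return ip.strip("()"), port


if __name__ == "__main__":
    # Command line as shown by ps on node12, and the master's entry from mpdtrace -l.
    cmd = ("python2.5 /usr/local/mpich/bin/mpd.py "
           "-h 10.0.0.185 -p 34571 --ncpus=1 -e -d")
    master = "master.corp.com_34571 (10.0.0.185)"
    if mpd_target(cmd) == ring_entry(master):
        print("mpd on this node points at the master's ring entry")
    else:
        print("mpd on this node points somewhere else")
```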

Is there a way to get the 8 nodes that are not currently in the ring
to rejoin it without killing the job in the queue?

Thanks in advance.

Anne

