[mpich-discuss] mpich2 1.0.8
Anne M. Hammond hammond at txcorp.com
Mon Feb  9 12:08:19 CST 2009

Our mpd ring does not show all the nodes in the cluster
(nodes 12, 13, 14, 15, 16, 17, 19, and 20 are missing):
[root at master]# mpdtrace -l
master.corp.com_34571 (10.0.0.185)
node11.cl.corp.com_57072 (10.0.0.11)
node18.cl.corp.com_51834 (10.0.0.18)
node21.cl.corp.com_36328 (10.0.0.21)
node22.cl.corp.com_55311 (10.0.0.22)
However, if I rsh to node12, PID 2229 is an mpd that is bound to
the master (10.0.0.185), using the correct port (34571):
    1  2229  2228  2228 ?           -1 S        0   4:47 python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
 2229 17009 17009  2228 ?           -1 S      654   0:01  \_ python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
17009 17013 17013  2228 ?           -1 R      654 110:32  |   \_ ./bbsim3d.x
 2229 17010 17010  2228 ?           -1 S      654   0:00  \_ python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
17010 17012 17012  2228 ?           -1 R      654 111:45      \_ ./bbsim3d.x
This is the same on the other nodes running this job.
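For reference, the list of missing nodes can be reproduced from the mpdtrace -l output above with a short script. This is only a sketch: it assumes the compute nodes are named node11 through node22 (inferred from the ring members and the eight missing nodes listed above, not confirmed), and the helper names parse_mpdtrace and missing_nodes are made up for illustration. It parses text pasted from mpdtrace -l rather than calling mpd itself.

```python
import re

# Output copied from the `mpdtrace -l` listing in this message.
TRACE = """\
master.corp.com_34571 (10.0.0.185)
node11.cl.corp.com_57072 (10.0.0.11)
node18.cl.corp.com_51834 (10.0.0.18)
node21.cl.corp.com_36328 (10.0.0.21)
node22.cl.corp.com_55311 (10.0.0.22)
"""

# Each line has the form: <hostname>_<port> (<ip address>)
LINE_RE = re.compile(r"^(?P<host>\S+)_(?P<port>\d+) \((?P<ip>[\d.]+)\)$")


def parse_mpdtrace(text):
    """Return {short_hostname: (port, ip)} for every mpd listed in the ring."""
    ring = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            short_name = match.group("host").split(".")[0]
            ring[short_name] = (int(match.group("port")), match.group("ip"))
    return ring


def missing_nodes(ring, expected):
    """Return the expected hosts that do not appear in the ring, in order."""
    return [name for name in expected if name not in ring]


# ASSUMPTION: the cluster's compute nodes are node11..node22.
EXPECTED = ["node%d" % i for i in range(11, 23)]

ring = parse_mpdtrace(TRACE)
missing = missing_nodes(ring, EXPECTED)
print("missing from ring:", ", ".join(missing))
```

The same check could be run against live output by saving mpdtrace -l to a file and passing its contents to parse_mpdtrace.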
Is there a way to have the 8 nodes that are not currently in the ring
reenter it without killing the job in the queue?
Thanks in advance.
Anne
    
    