[mpich-discuss] mpich2 1.0.8

Dave Goodell goodell at mcs.anl.gov
Mon Feb 9 15:37:47 CST 2009


Hi Anne,

Have you tried the troubleshooting steps listed in appendix A of the  
MPICH2 Installer's Guide[1]?
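
On your question about getting the missing eight nodes back into the
ring: an mpd can join an existing ring if you point it at the host and
listening port of a member that is already in the ring (the same -h/-p
pair that shows up in your ps output, and the host_port that
"mpdtrace -l" prints).  A minimal sketch, run on one of the missing
nodes as root to match the UID of PID 2229 in your listing, reusing
the 10.0.0.185/34571 values from your output, and assuming the stale
mpd there really has dropped out of the ring:

    # does the local mpd still think it is in a ring?
    mpdtrace -l

    # if not, start a fresh mpd pointed at the master's mpd
    # (host and port taken from your ps listing)
    /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d

One caveat: the existing mpd on node12 (PID 2229) is the ancestor of
your running bbsim3d.x processes, so don't kill it to make room.  If a
second mpd on the same host complains about the console, the
--noconsole option should let the two coexist, though I haven't
verified that against your exact setup.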

-Dave

[1] http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf

On Feb 9, 2009, at 10:08 AM, Anne M. Hammond wrote:

> Our ring does not show all the nodes in the cluster
> (nodes 12, 13, 14, 15, 16, 17, 19, and 20 are missing):
>
> [root@master]# mpdtrace -l
> master.corp.com_34571 (10.0.0.185)
> node11.cl.corp.com_57072 (10.0.0.11)
> node18.cl.corp.com_51834 (10.0.0.18)
> node21.cl.corp.com_36328 (10.0.0.21)
> node22.cl.corp.com_55311 (10.0.0.22)
>
> However, if I rsh to node12, PID 2229 is an mpd that is bound to
> the master (10.0.0.185), using the correct port:
>
>    1  2229  2228  2228 ?           -1 S        0   4:47 python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
> 2229 17009 17009  2228 ?           -1 S      654   0:01  \_ python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
> 17009 17013 17013  2228 ?           -1 R      654 110:32  |   \_ ./bbsim3d.x
> 2229 17010 17010  2228 ?           -1 S      654   0:00  \_ python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
> 17010 17012 17012  2228 ?           -1 R      654 111:45      \_ ./bbsim3d.x
>
> This is the same on the other nodes running this job.
>
> Is there a way to have the 8 nodes not currently in the ring
> re-enter the ring without killing the job from the queue?
>
> Thanks in advance.
>
> Anne


