[mpich-discuss] mpich2 1.0.8
Dave Goodell
goodell at mcs.anl.gov
Mon Feb 9 15:37:47 CST 2009
Hi Anne,
Have you tried the troubleshooting steps listed in Appendix A of the
MPICH2 Installer's Guide [1]?
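In case it saves you a lookup, the first connectivity test that
appendix walks through is roughly the following (a sketch from
memory, so check the guide for the exact options; the port is
whatever "mpdcheck -s" prints):

    # on the master: start a bare test server; it prints a
    # hostname and port to use in the client step
    [root@master]# mpdcheck -s

    # on one of the missing nodes: try to reach that server
    [root@node12]# mpdcheck -c master.corp.com <port>

If the client step hangs or fails, the culprit is usually name
resolution or a firewall between that node and the master rather
than mpd itself.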
-Dave
[1] http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf
On Feb 9, 2009, at 10:08 AM, Anne M. Hammond wrote:
> Our ring does not show all the nodes in the cluster:
> (missing 12, 13, 14, 15, 16, 17, 19, 20):
>
> [root@master]# mpdtrace -l
> master.corp.com_34571 (10.0.0.185)
> node11.cl.corp.com_57072 (10.0.0.11)
> node18.cl.corp.com_51834 (10.0.0.18)
> node21.cl.corp.com_36328 (10.0.0.21)
> node22.cl.corp.com_55311 (10.0.0.22)
>
> However, if I rsh to node12, PID 2229 is an mpd that is connected
> back to the master (10.0.0.185) on the correct port:
>
>     1  2229  2228  2228 ?  -1 S    0   4:47 python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
>  2229 17009 17009  2228 ?  -1 S  654   0:01  \_ python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
> 17009 17013 17013  2228 ?  -1 R  654 110:32  |   \_ ./bbsim3d.x
>  2229 17010 17010  2228 ?  -1 S  654   0:00  \_ python2.5 /usr/local/mpich/bin/mpd.py -h 10.0.0.185 -p 34571 --ncpus=1 -e -d
> 17010 17012 17012  2228 ?  -1 R  654 111:45      \_ ./bbsim3d.x
>
> This is the same on the other nodes running this job.
>
> Is there a way to have the 8 nodes not currently in the ring
> reenter it without killing the job from the queue?
>
> Thanks in advance.
>
> Anne
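On the rejoining question: an mpd joins an existing ring by being
pointed at any current member's host and port, which is exactly
what the "-h 10.0.0.185 -p 34571" arguments in your ps output do.
A minimal sketch, assuming a missing node's old mpd is really
gone, would be to start a fresh one by hand:

    # on the missing node, as the user that owns the ring; host
    # and port come from "mpdtrace -l" on the master (the _34571
    # suffix in its output)
    [root@node12]# mpd -h master.corp.com -p 34571 --ncpus=1 -e -d

Since your listing shows node12 still has a live mpd, though, I
would first run mpdtrace on node12 itself to see what that mpd
thinks the ring looks like. I have not tried re-joining nodes
under a running job, so test this on one node before doing all
eight.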