[mpich-discuss] mpdboot handshake problem

Rajeev Thakur thakur at mcs.anl.gov
Wed May 7 10:38:41 CDT 2008


Sometimes this is caused by a stray mpd already running on the machine. Run
ps on n002 and "kill -9" any mpd process you find there.
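
A minimal sketch of that check, assuming a POSIX shell with ps and awk on n002. The helper name stray_mpd_pids is hypothetical and not part of MPICH2; it only filters ps output, so you can review the list before killing anything. It matches any process whose command line contains a word ending in "mpd" or "mpd.py", which may be broader than you want.

```shell
# Print the PIDs of mpd daemons found in `ps -o pid=,args=` style input.
# stray_mpd_pids is a hypothetical helper name, not an MPICH2 command.
stray_mpd_pids() {
    awk '{
        # Field 1 is the PID; scan the command words for a basename
        # of "mpd" or "mpd.py" (e.g. /opt/mpich2/bin/mpd.py).
        for (i = 2; i <= NF; i++) {
            n = split($i, parts, "/")
            if (parts[n] == "mpd" || parts[n] == "mpd.py") {
                print $1
                break
            }
        }
    }'
}

# On n002, list the candidates first:
#   ps -u "$USER" -o pid=,args= | stray_mpd_pids
# Then, once the list looks right (xargs -r is a GNU extension):
#   ps -u "$USER" -o pid=,args= | stray_mpd_pids | xargs -r kill -9
```

Listing before killing is deliberate: a working mpd that you started by hand would match as well.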

Rajeev

> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Qi Ying
> Sent: Wednesday, May 07, 2008 9:37 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] mpdboot handshake problem
> 
> I did follow the guide and everything works as it should.
> What confuses me is that I can start mpd manually without a
> problem, for example:
> 
> [qying at n002 ~]$ mpd &
> [qying at n002 ~]$ mpdtrace -l
> n002.newwolf.edu_39682 (192.168.0.2)
> 
> [qying at n001 ~]$ mpd -h n002 -p 39682 &
> [qying at n001 ~]$ mpdtrace
> n001
> n002
> 
> [qying at n001 ~]$ mpdringtest
> time for 1 loops = 0.000439882278442 seconds
> 
> Everything seems to be working fine. It is only n002 that does not work.
> I have 19 additional nodes running on the system and they are 
> all working fine.
> 
> Qi
> 
> On Tue, May 6, 2008 at 12:21 PM, Rajeev Thakur 
> <thakur at mcs.anl.gov> wrote:
> > It could be something with the networking configuration on the
> > machines. You can debug the problem by using the mpdcheck utility and
> > following all the steps described in the installation guide.
> >
> > Rajeev
> >
> >
> >
> > > -----Original Message-----
> > > From: owner-mpich-discuss at mcs.anl.gov 
> > > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Qi Ying
> > > Sent: Tuesday, May 06, 2008 12:04 PM
> > > To: mpich-discuss at mcs.anl.gov
> > > Subject: [mpich-discuss] mpdboot handshake problem
> > >
> > > Hi All,
> > >
> > > Recently I had trouble starting mpd using mpdboot on my cluster.
> > > This seems to be caused by a single node in the system (n002). The
> > > debugging output is below. However, I can manually start mpd on n001
> > > and n002 (and have them join the ring), and there is no problem. Any
> > > insights or suggestions?
> > >
> > > Thanks,
> > >
> > > Qi Ying
> > >
> > > [qying at n001 ~]$ mpdboot -n 2 -f ~/mpd.hosts --rsh=/usr/bin/rsh -v -d
> > >
> > > running mpdallexit on n001.newwolf.edu
> > > LAUNCHED mpd on n001.newwolf.edu  via
> > > debug: launch cmd= /opt/mpich2/bin/mpd.py   --ncpus=1 -e -d
> > > debug: mpd on n001.newwolf.edu  on port 55059
> > > RUNNING: mpd on n001.newwolf.edu
> > > debug: info for running mpd: {'ncpus': 1, 'list_port': 55059,
> > > 'entry_port': '', 'host': 'n001.newwolf.edu', 'entry_host': '',
> > > 'ifhn': ''}
> > > LAUNCHED mpd on n002  via  n001.newwolf.edu
> > > debug: launch cmd= /usr/bin/rsh -n n002 '/opt/mpich2/bin/mpd.py  -h n001.newwolf.edu -p 55059  --ncpus=1 -e -d'
> > > debug: mpd on n002  on port 54319
> > > mpdboot_n001.newwolf.edu (handle_mpd_output 385): failed to handshake with mpd on n002; recvd output={}