[mpich-discuss] mpdboot handshake problem

Rajeev Thakur thakur at mcs.anl.gov
Tue May 6 12:21:35 CDT 2008


This could be a problem with the network configuration on the machines. You
can debug it with the mpdcheck utility by following the steps described in
the MPICH2 installation guide.
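
If it helps to see what kind of thing is worth verifying, here is a rough
sketch in Python. It is not mpdcheck and does not reproduce its checks. It
only assumes that the handshake needs each host to resolve the other's name
and to open a TCP connection to the port the other mpd is listening on. The
function names are made up for illustration.

```python
import socket


def check_name_resolution(hostname):
    """Resolve a hostname to an IP address and back again.

    Returns a dict so that a partial result (forward lookup worked,
    reverse lookup failed) is still visible to the caller.
    """
    result = {"hostname": hostname, "ip": None, "reverse": None, "error": None}
    try:
        result["ip"] = socket.gethostbyname(hostname)
        result["reverse"] = socket.gethostbyaddr(result["ip"])[0]
    except OSError as exc:  # covers socket.gaierror and socket.herror
        result["error"] = str(exc)
    return result


def check_tcp_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def listen_on_ephemeral_port(bind_host=""):
    """Open a listening socket on an OS-chosen port.

    Returns (socket, port). The caller must close the socket.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((bind_host, 0))
    server.listen(1)
    return server, server.getsockname()[1]
```

One way to use it: run check_name_resolution() on each node with the other
node's name exactly as it appears in mpd.hosts, then call
listen_on_ephemeral_port() on one node and check_tcp_connect() from the
other, and repeat in the opposite direction. A failure in only one direction
would point toward name resolution or a firewall on that host, but that is a
guess until mpdcheck confirms it.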

Rajeev


> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Qi Ying
> Sent: Tuesday, May 06, 2008 12:04 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] mpdboot handshake problem
> 
> Hi All,
> 
> Recently I have had trouble starting mpd using mpdboot on my cluster. The
> problem seems to be caused by a single node in the system (n002). The
> debugging output is below. However, I can manually start mpd on n001 and
> n002 (and have them join the ring) without any problem. Any insights or
> suggestions?
> 
> Thanks,
> 
> Qi Ying
> 
> [qying@n001~] $ mpdboot -n 2 -f ~/mpd.hosts --rsh=/usr/bin/rsh -v -d
> 
> running mpdallexit on n001.newwolf.edu
> LAUNCHED mpd on n001.newwolf.edu  via
> debug: launch cmd= /opt/mpich2/bin/mpd.py   --ncpus=1 -e -d
> debug: mpd on n001.newwolf.edu  on port 55059
> RUNNING: mpd on n001.newwolf.edu
> debug: info for running mpd: {'ncpus': 1, 'list_port': 55059,
> 'entry_port': '', 'host': 'n001.newwolf.edu', 'entry_host': '',
> 'ifhn': ''}
> LAUNCHED mpd on n002  via  n001.newwolf.edu
> debug: launch cmd= /usr/bin/rsh -n n002 '/opt/mpich2/bin/mpd.py  -h
> n001.newwolf.edu -p 55059  --ncpus=1 -e -d'
> debug: mpd on n002  on port 54319
> mpdboot_n001.newwolf.edu (handle_mpd_output 385): failed to handshake
> with mpd on n002; recvd output={}
> 