[mpich-discuss] unable to get mpich2 1.0.7 working

Kamaraju Kusumanchi kamaraju at gmail.com
Wed Sep 10 10:55:31 CDT 2008


On Wed, Sep 10, 2008 at 5:12 AM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> We see those errors with asynchronous I/O and Fortran/C++ sometimes but
> don't know what causes them. For some reason MPD died later on in the tests,
> which caused the remaining tests to fail. I think your installation is ok.
> Try running your application.
>
> Rajeev


I tried to mpdboot a bunch of nodes, but that also gives an error.

$cat mpd.hosts
node2
node4
node5
node8
node9
node10

$mpdboot -n 6 -f mpd.hosts
mpdboot_ank.mae.cornell.edu (handle_mpd_output 392): failed to
handshake with mpd on node2; recvd output={}

The node named in the above error message changes from run to run.
Sometimes it is node2, sometimes it is node5, etc.
But there is nothing wrong with these nodes as such. For example, if my
mpd.hosts contains just 2 lines

node2
node5

instead of 6 lines, then if I do

$mpdboot -n 2 -f mpd.hosts

then there are no errors.

On each of these nodes (node2, node4, node5, node8, node9, node10) I
am able to start/stop mpd individually. The problem arises only
when I try to boot the nodes all together.
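
Since 2 hosts work and 6 do not, one way to narrow this down would be to
boot growing subsets of mpd.hosts and note the first size that fails. Below
is a minimal sketch of that idea, not something I have run on this cluster.
It uses only the mpdboot -n/-f options and the mpdallexit command shown in
this mail. The helper names (read_hosts, prefixes, try_boot) are made up for
illustration, and treating a non-zero exit status from mpdboot as "failed"
is an assumption.

```python
import os
import subprocess
import tempfile


def read_hosts(path):
    """Return the non-empty lines of an mpd.hosts-style file, stripped."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


def prefixes(hosts):
    """Yield growing prefixes of the host list: 1 host, then 2, and so on."""
    for n in range(1, len(hosts) + 1):
        yield hosts[:n]


def try_boot(hosts):
    """Run mpdboot on a temporary hosts file; True if it exits with status 0.

    Assumption: mpdboot signals a failed handshake with a non-zero status.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".hosts", delete=False) as fh:
        fh.write("\n".join(hosts) + "\n")
    try:
        subprocess.call(["mpdallexit"])  # clear any ring left from the last try
        rc = subprocess.call(["mpdboot", "-n", str(len(hosts)), "-f", fh.name])
    finally:
        os.unlink(fh.name)
    return rc == 0


if __name__ == "__main__":
    for subset in prefixes(read_hosts("mpd.hosts")):
        ok = try_boot(subset)
        print(len(subset), "hosts, last added:", subset[-1],
              "ok" if ok else "FAILED")
        if not ok:
            break
```

Because the failing node changes from run to run, each subset size would
probably need to be tried several times before drawing a conclusion.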

I also can't figure anything out from the debugging messages:

$cat mpd.hosts
node2
node4
node5
node8
node9
node10

$mpdboot -n 6 -f mpd.hosts -v -d
debug: starting
running mpdallexit on ank.mae.cornell.edu
LAUNCHED mpd on ank.mae.cornell.edu  via
debug: launch cmd=
/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
  --ncpus=1 -e -d
debug: mpd on ank.mae.cornell.edu  on port 46161
RUNNING: mpd on ank.mae.cornell.edu
debug: info for running mpd: {'ncpus': 1, 'list_port': 46161,
'entry_port': '', 'host': 'ank.mae.cornell.edu', 'entry_host': '',
'ifhn': ''}
LAUNCHED mpd on node2  via  ank.mae.cornell.edu
debug: launch cmd= ssh -x -n -q node2
'/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
 -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
LAUNCHED mpd on node4  via  ank.mae.cornell.edu
debug: launch cmd= ssh -x -n -q node4
'/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
 -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
LAUNCHED mpd on node5  via  ank.mae.cornell.edu
debug: launch cmd= ssh -x -n -q node5
'/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
 -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
LAUNCHED mpd on node8  via  ank.mae.cornell.edu
debug: launch cmd= ssh -x -n -q node8
'/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
 -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
debug: mpd on node2  on port 60704
mpdboot_ank.mae.cornell.edu (handle_mpd_output 392): failed to
handshake with mpd on node2; recvd output={}


Any other suggestions?

thanks
raju



