[mpich-discuss] unable to get mpich2 1.0.7 working
Kamaraju Kusumanchi
kamaraju at gmail.com
Wed Sep 10 10:55:31 CDT 2008
On Wed, Sep 10, 2008 at 5:12 AM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> We see those errors with asynchronous I/O and Fortran/C++ sometimes but
> don't know what causes them. For some reason MPD died later on in the tests,
> which caused the remaining tests to fail. I think your installation is ok.
> Try running your application.
>
> Rajeev
I tried to mpdboot a bunch of nodes, but that also gives an error.
$cat mpd.hosts
node2
node4
node5
node8
node9
node10
$mpdboot -n 6 -f mpd.hosts
mpdboot_ank.mae.cornell.edu (handle_mpd_output 392): failed to
handshake with mpd on node2; recvd output={}
The node named in the above error message changes from run to run.
Sometimes it is node2, sometimes it is node5, etc.
But there is nothing wrong with these nodes as such. For example, if my
mpd.hosts contains just 2 lines
node2
node5
instead of 6 lines, then if I do
$mpdboot -n 2 -f mpd.hosts
then there are no errors.
On each of these nodes (node2, node4, node5, node8, node9, node10) I
am able to start/stop mpd individually. The problem arises only
when I try to boot the nodes all together.
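To double-check that each node is reachable on its own, the per-node test can be scripted. This is only a rough sketch: read_hosts and check_host are hypothetical helper names, the remote "hostname" command and the 15 second timeout are arbitrary choices, and the handling of "host:ncpus" entries is an assumption about the hosts file. The ssh flags are the ones mpdboot prints in its debug output.

```python
import subprocess
import time


def read_hosts(path):
    """Return host names from an mpd.hosts-style file.

    Blank lines and '#' comments are skipped. A trailing ':ncpus'
    is dropped (assumed hosts-file convention).
    """
    hosts = []
    with open(path) as fh:
        for line in fh:
            line = line.split("#", 1)[0].strip()
            if line:
                hosts.append(line.split()[0].split(":", 1)[0])
    return hosts


def check_host(host, timeout=15, run=subprocess.run):
    """Run a trivial remote command over ssh and time it.

    Uses the same ssh flags that mpdboot shows in its debug output
    (-x -n -q). Returns (ok, seconds_elapsed). The 'run' parameter
    exists only so the ssh call can be replaced when testing.
    """
    cmd = ["ssh", "-x", "-n", "-q", host, "hostname"]
    start = time.time()
    try:
        result = run(cmd, stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE, timeout=timeout)
        ok = result.returncode == 0
    except subprocess.TimeoutExpired:
        ok = False
    return ok, time.time() - start
```

Calling check_host(h) for every h in read_hosts("mpd.hosts") shows whether any single node fails or is slow to answer over ssh. It does not reproduce what mpdboot does when it boots several nodes at once.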
I also can't figure out anything from the debugging messages:
$cat mpd.hosts
node2
node4
node5
node8
node9
node10
$mpdboot -n 6 -f mpd.hosts -v -d
debug: starting
running mpdallexit on ank.mae.cornell.edu
LAUNCHED mpd on ank.mae.cornell.edu via
debug: launch cmd=
/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
--ncpus=1 -e -d
debug: mpd on ank.mae.cornell.edu on port 46161
RUNNING: mpd on ank.mae.cornell.edu
debug: info for running mpd: {'ncpus': 1, 'list_port': 46161,
'entry_port': '', 'host': 'ank.mae.cornell.edu', 'entry_host': '',
'ifhn': ''}
LAUNCHED mpd on node2 via ank.mae.cornell.edu
debug: launch cmd= ssh -x -n -q node2
'/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
-h ank.mae.cornell.edu -p 46161 --ncpus=1 -e -d'
LAUNCHED mpd on node4 via ank.mae.cornell.edu
debug: launch cmd= ssh -x -n -q node4
'/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
-h ank.mae.cornell.edu -p 46161 --ncpus=1 -e -d'
LAUNCHED mpd on node5 via ank.mae.cornell.edu
debug: launch cmd= ssh -x -n -q node5
'/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
-h ank.mae.cornell.edu -p 46161 --ncpus=1 -e -d'
LAUNCHED mpd on node8 via ank.mae.cornell.edu
debug: launch cmd= ssh -x -n -q node8
'/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
-h ank.mae.cornell.edu -p 46161 --ncpus=1 -e -d'
debug: mpd on node2 on port 60704
mpdboot_ank.mae.cornell.edu (handle_mpd_output 392): failed to
handshake with mpd on node2; recvd output={}
Any other suggestions?
thanks
raju