[mpich-discuss] unable to get mpich2 1.0.7 working

Anthony Chan chan at mcs.anl.gov
Wed Sep 10 11:32:19 CDT 2008


You may want to try "mpdcheck" to see whether there is a network issue.
mpdcheck is described in Appendix A of the installer's guide.
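
Before running mpdcheck, it can also help to confirm that every name in
mpd.hosts resolves from the machine where you run mpdboot. The following is
only a rough sketch, not a replacement for mpdcheck: the function names
parse_hosts and check_resolution are made up for illustration, and the
handling of "host:ncpus" style entries is an assumption about how your
hosts file is written.

```python
import socket


def parse_hosts(text):
    """Return the host names found in mpd.hosts-style text (one entry per line)."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        # Assumption: an entry may carry extras such as "node2:4";
        # keep only the host name itself.
        hosts.append(line.split()[0].split(":", 1)[0])
    return hosts


def check_resolution(hosts, resolver=socket.gethostbyname):
    """Map each host to its resolved address, or None if the lookup fails."""
    results = {}
    for host in hosts:
        try:
            results[host] = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[host] = None
    return results
```

To use it, read mpd.hosts, pass the contents to parse_hosts, and print the
result of check_resolution; any host mapped to None does not resolve from
that machine. Running the same check on each node (including a lookup of the
head node's name) may show whether name resolution differs between nodes.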

A.Chan
----- "Kamaraju Kusumanchi" <kamaraju at gmail.com> wrote:

> On Wed, Sep 10, 2008 at 5:12 AM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> > We see those errors with asynchronous I/O and Fortran/C++ sometimes but
> > don't know what causes them. For some reason MPD died later on in the tests,
> > which caused the remaining tests to fail. I think your installation is ok.
> > Try running your application.
> >
> > Rajeev
> 
> 
> I tried to mpdboot a bunch of nodes but it also gives an error.
> 
> $cat mpd.hosts
> node2
> node4
> node5
> node8
> node9
> node10
> 
> $mpdboot -n 6 -f mpd.hosts
> mpdboot_ank.mae.cornell.edu (handle_mpd_output 392): failed to
> handshake with mpd on node2; recvd output={}
> 
> The node number in the above error message changes. Sometimes it is
> node2, sometimes it is node5, etc.
> But there is nothing wrong with these nodes as such. For example, if my
> mpd.hosts contains just 2 lines
> 
> node2
> node5
> 
> instead of 6 lines, then if I do
> 
> $mpdboot -n 2 -f mpd.hosts
> 
> then there are no errors.
> 
> On each of these nodes (node2, node4, node5, node8, node9, node10) I
> am able to start/stop mpd individually. The problem arises only
> when I try to boot the nodes all together.
> 
> I also can't figure out anything from the debugging messages:
> 
> $cat mpd.hosts
> node2
> node4
> node5
> node8
> node9
> node10
> 
> $mpdboot -n 6 -f mpd.hosts -v -d
> debug: starting
> running mpdallexit on ank.mae.cornell.edu
> LAUNCHED mpd on ank.mae.cornell.edu  via
> debug: launch cmd=
> /home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
>   --ncpus=1 -e -d
> debug: mpd on ank.mae.cornell.edu  on port 46161
> RUNNING: mpd on ank.mae.cornell.edu
> debug: info for running mpd: {'ncpus': 1, 'list_port': 46161,
> 'entry_port': '', 'host': 'ank.mae.cornell.edu', 'entry_host': '',
> 'ifhn': ''}
> LAUNCHED mpd on node2  via  ank.mae.cornell.edu
> debug: launch cmd= ssh -x -n -q node2
> '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
>  -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
> LAUNCHED mpd on node4  via  ank.mae.cornell.edu
> debug: launch cmd= ssh -x -n -q node4
> '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
>  -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
> LAUNCHED mpd on node5  via  ank.mae.cornell.edu
> debug: launch cmd= ssh -x -n -q node5
> '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
>  -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
> LAUNCHED mpd on node8  via  ank.mae.cornell.edu
> debug: launch cmd= ssh -x -n -q node8
> '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
>  -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
> debug: mpd on node2  on port 60704
> mpdboot_ank.mae.cornell.edu (handle_mpd_output 392): failed to
> handshake with mpd on node2; recvd output={}
> 
> 
> Any other suggestions?
> 
> thanks
> raju



