[mpich-discuss] unable to get mpich2 1.0.7 working

Anthony Chan chan at mcs.anl.gov
Wed Sep 10 19:20:19 CDT 2008


Our mpd expert asks whether you have done the following:

Try using mpdcheck as a client and server on each pair of machines in
question, and then reversing the roles. Also, after using mpdcheck, they
can try starting a set of mpds by hand and using mpiexec (not mpdtrace)
to run a program. If they can start a ring of mpds by hand and run
mpiexec using the entire ring, there really should be no problem using
mpdboot. Of course, all existing mpd processes have to be killed first.
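
To make the suggestion above concrete, here is a rough sketch (not an
official MPICH2 tool) that only prints a test plan: every ordered
server/client pair for mpdcheck, then the commands for building a ring by
hand. The function names are made up for illustration. The "mpdcheck -s" /
"mpdcheck -c <host> <port>" usage and the "mpdtrace -l" step are my
recollection of Appendix A of the installer's guide, so please check them
against your copy; the "mpd -h <host> -p <port>" form matches the launch
command in the mpdboot debug output quoted below. The port values are
placeholders that you read off the server or the first mpd.

```python
from itertools import permutations

# Hosts taken from the mpd.hosts file quoted in this thread.
HOSTS = ["node2", "node4", "node5", "node8", "node9", "node10"]


def mpdcheck_plan(hosts):
    """Return every ordered (server, client) pair, so that each pair of
    machines is tested in both directions."""
    return list(permutations(hosts, 2))


def manual_ring_commands(first, others, port="<port>"):
    """Return (host, command) steps for starting a ring by hand.

    `port` is a placeholder for the listening port reported by the first
    mpd. The flags are assumptions to verify against the installer's guide.
    """
    steps = [
        (first, "mpdallexit"),   # stop any ring already running from this host
        (first, "mpd &"),        # start the first mpd
        (first, "mpdtrace -l"),  # assumed to report host_port of the first mpd
    ]
    for host in others:
        # Each additional mpd joins the ring through the first one.
        steps.append((host, f"mpd -h {first} -p {port} &"))
    # Run a trivial program across the entire ring with mpiexec.
    steps.append((first, f"mpiexec -n {1 + len(others)} hostname"))
    return steps


if __name__ == "__main__":
    for server, client in mpdcheck_plan(HOSTS):
        print(f"on {server}: mpdcheck -s")
        print(f"on {client}: mpdcheck -c {server} <port printed by server>")
    for host, cmd in manual_ring_commands("ank.mae.cornell.edu", HOSTS):
        print(f"on {host}: {cmd}")
```

With six hosts this gives 30 mpdcheck runs, which is tedious but covers
both directions for every pair rather than only the pairs involving the
head node.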




----- "Kamaraju Kusumanchi" <kamaraju at gmail.com> wrote:

> I forgot to mention this in my mail. I did run mpdcheck and verified
> that it is not a networking issue.
> 
> The problem seems to arise only when I try to boot a lot of nodes (say
> 6 nodes). But there are no problems if I mpdboot only 2 nodes (any 2
> out of the 6). So it is not a networking issue AFAICT.
> 
> raju
> 
> On Wed, Sep 10, 2008 at 12:32 PM, Anthony Chan <chan at mcs.anl.gov>
> wrote:
> > You may want to try "mpdcheck" to see if there is any network issue.
> > mpdcheck is described in Appendix A of the installer's guide.
> >
> > A.Chan
> > ----- "Kamaraju Kusumanchi" <kamaraju at gmail.com> wrote:
> >
> >> On Wed, Sep 10, 2008 at 5:12 AM, Rajeev Thakur <thakur at mcs.anl.gov>
> >> wrote:
> >> > We see those errors with asynchronous I/O and Fortran/C++ sometimes but
> >> > don't know what causes them. For some reason MPD died later on in the
> >> > tests, which caused the remaining tests to fail. I think your
> >> > installation is ok.
> >> > Try running your application.
> >> >
> >> > Rajeev
> >>
> >>
> >> I tried to mpdboot a bunch of nodes but it also gives an error.
> >>
> >> $cat mpd.hosts
> >> node2
> >> node4
> >> node5
> >> node8
> >> node9
> >> node10
> >>
> >> $mpdboot -n 6 -f mpd.hosts
> >> mpdboot_ank.mae.cornell.edu (handle_mpd_output 392): failed to
> >> handshake with mpd on node2; recvd output={}
> >>
> >> The node number in the above error message changes. Sometimes it is
> >> node2, sometimes it is node5, etc.
> >> But there is nothing wrong with these nodes as such. For example, if my
> >> mpd.hosts contains just 2 lines
> >>
> >> node2
> >> node5
> >>
> >> instead of 6 lines, then if I do
> >>
> >> $mpdboot -n 2 -f mpd.hosts
> >>
> >> then there are no errors.
> >>
> >> On each of these nodes (node2, node4, node5, node8, node9, node10) I
> >> am able to start/stop mpd individually. But the problem arises only
> >> when I try to boot the nodes all together.
> >>
> >> I also can't figure out anything by enabling the debugging messages
> >>
> >> $cat mpd.hosts
> >> node2
> >> node4
> >> node5
> >> node8
> >> node9
> >> node10
> >>
> >> $mpdboot -n 6 -f mpd.hosts -v -d
> >> debug: starting
> >> running mpdallexit on ank.mae.cornell.edu
> >> LAUNCHED mpd on ank.mae.cornell.edu  via
> >> debug: launch cmd= /home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py --ncpus=1 -e -d
> >> debug: mpd on ank.mae.cornell.edu  on port 46161
> >> RUNNING: mpd on ank.mae.cornell.edu
> >> debug: info for running mpd: {'ncpus': 1, 'list_port': 46161, 'entry_port': '', 'host': 'ank.mae.cornell.edu', 'entry_host': '', 'ifhn': ''}
> >> LAUNCHED mpd on node2  via  ank.mae.cornell.edu
> >> debug: launch cmd= ssh -x -n -q node2 '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
> >> LAUNCHED mpd on node4  via  ank.mae.cornell.edu
> >> debug: launch cmd= ssh -x -n -q node4 '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
> >> LAUNCHED mpd on node5  via  ank.mae.cornell.edu
> >> debug: launch cmd= ssh -x -n -q node5 '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
> >> LAUNCHED mpd on node8  via  ank.mae.cornell.edu
> >> debug: launch cmd= ssh -x -n -q node8 '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
> >> debug: mpd on node2  on port 60704
> >> mpdboot_ank.mae.cornell.edu (handle_mpd_output 392): failed to handshake with mpd on node2; recvd output={}
> >>
> >>
> >> Any other suggestions?
> >>
> >> thanks
> >> raju
> >
> >
