[mpich-discuss] unable to get mpich2 1.0.7 working

Kamaraju Kusumanchi kamaraju at gmail.com
Mon Oct 20 02:20:01 CDT 2008


I forgot to answer your question. Sorry!

Let's say I log in to node10
raju@node10:~ 1  1698 03:13 AM
$mpdcheck -s
server listening at INADDR_ANY on: node10.ank.mae.cornell.edu 54080

Now I open another window and perform
raju@node10:~ 1  1698 03:13 AM
$mpdcheck -c node10 54080
client successfully recvd ack from server: ack_from_server_to_client

Now, on the server, I have
$mpdcheck -s
server listening at INADDR_ANY on: node10.ank.mae.cornell.edu 54080
server has conn on <socket._socketobject object at 0x2aaaaab36770> from ('127.0.0.1', 56822)
server successfully recvd msg from client: hello_from_client_to_server


Could the 127.0.0.1 in the server output above be the cause of all
these problems? Should the client address instead be the node's
network address, something like 192.168.1.1?
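
One way to look into this is to ask the resolver what the node's own
hostname maps to. The following is only a minimal sketch in Python 3,
not part of mpdcheck or mpd; the helper names (is_loopback,
filter_loopback, resolved_addresses) are made up for illustration. It
assumes it is run on the node in question (for example node10) and
simply reports whether the hostname resolves to a loopback address such
as 127.0.0.1, which often points at an /etc/hosts entry that maps the
hostname to the loopback interface.

```python
import ipaddress
import socket


def is_loopback(address):
    """Return True if the address string is a loopback address (e.g. 127.0.0.1)."""
    return ipaddress.ip_address(address).is_loopback


def filter_loopback(addresses):
    """Return only the loopback entries from a list of address strings."""
    return [a for a in addresses if is_loopback(a)]


def resolved_addresses(hostname):
    """Return the IPv4 addresses the local resolver reports for hostname.

    An empty list means the name could not be resolved at all.
    """
    try:
        _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        return []
    return addresses


if __name__ == "__main__":
    # Assumption: this is run on the node being checked.
    host = socket.gethostname()
    addresses = resolved_addresses(host)
    print("hostname:", host)
    print("resolves to:", addresses)
    bad = filter_loopback(addresses)
    if bad:
        print("loopback entries:", bad)
    else:
        print("no loopback entries found")
```

If the hostname does show up under "loopback entries", other nodes that
are told to connect back to this host by name may end up with an
address that only works locally, so it would be worth comparing against
/etc/hosts on each node.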

thanks
raju

On Wed, Sep 10, 2008 at 8:20 PM, Anthony Chan <chan at mcs.anl.gov> wrote:
> Our mpd expert asked whether you have done the following:
>
> Try using mpdcheck as a client and server on each pair of machines in
> question, and then reversing the roles. Also, after using mpdcheck,
> they can try starting a set of mpds by hand and using mpiexec (not
> mpdtrace) to run a program.  If they can start a ring of mpds by hand
> and run mpiexec using the entire ring, there really should be no
> problem using mpdboot.  Of course all existing mpd processes have to
> be killed first.
>
>
>
>
> ----- "Kamaraju Kusumanchi" <kamaraju at gmail.com> wrote:
>
>> I forgot to mention this in my mail. I did do the mpdcheck and
>> verified that it is not a networking issue.
>>
>> The problem seems to arise only when I try to boot a lot of nodes
>> (say 6 nodes). But there are no problems if I mpdboot only 2 nodes
>> (any 2 out of the 6). So it is not a networking issue AFAICT.
>>
>> raju
>>
>> On Wed, Sep 10, 2008 at 12:32 PM, Anthony Chan <chan at mcs.anl.gov> wrote:
>> > You may want to try "mpdcheck" to see if there is any network issue.
>> > mpdcheck is described in Appendix A of the installer's guide.
>> >
>> > A.Chan
>> > ----- "Kamaraju Kusumanchi" <kamaraju at gmail.com> wrote:
>> >
>> >> On Wed, Sep 10, 2008 at 5:12 AM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>> >> > We see those errors with asynchronous I/O and Fortran/C++ sometimes but
>> >> > don't know what causes them. For some reason MPD died later on in the tests,
>> >> > which caused the remaining tests to fail. I think your installation is ok.
>> >> > Try running your application.
>> >> >
>> >> > Rajeev
>> >>
>> >>
>> >> I tried to mpdboot a bunch of nodes, but it also gives an error.
>> >>
>> >> $cat mpd.hosts
>> >> node2
>> >> node4
>> >> node5
>> >> node8
>> >> node9
>> >> node10
>> >>
>> >> $mpdboot -n 6 -f mpd.hosts
>> >> mpdboot_ank.mae.cornell.edu (handle_mpd_output 392): failed to handshake with mpd on node2; recvd output={}
>> >>
>> >> The node number in the above error message changes. Sometimes it is
>> >> node2, sometimes it is node5, etc.
>> >> But there is nothing wrong with these nodes as such. For example, if
>> >> my mpd.hosts contains just 2 lines
>> >>
>> >> node2
>> >> node5
>> >>
>> >> instead of 6 lines, then if I do
>> >>
>> >> $mpdboot -n 2 -f mpd.hosts
>> >>
>> >> then there are no errors.
>> >>
>> >> On each of these nodes (node2, node4, node5, node8, node9, node10) I
>> >> am able to start/stop mpd individually. But the problem arises only
>> >> when I try to boot the nodes all together.
>> >>
>> >> I also can't figure out anything by enabling the debugging messages:
>> >>
>> >> $cat mpd.hosts
>> >> node2
>> >> node4
>> >> node5
>> >> node8
>> >> node9
>> >> node10
>> >>
>> >> $mpdboot -n 6 -f mpd.hosts -v -d
>> >> debug: starting
>> >> running mpdallexit on ank.mae.cornell.edu
>> >> LAUNCHED mpd on ank.mae.cornell.edu  via
>> >> debug: launch cmd= /home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py --ncpus=1 -e -d
>> >> debug: mpd on ank.mae.cornell.edu  on port 46161
>> >> RUNNING: mpd on ank.mae.cornell.edu
>> >> debug: info for running mpd: {'ncpus': 1, 'list_port': 46161, 'entry_port': '', 'host': 'ank.mae.cornell.edu', 'entry_host': '', 'ifhn': ''}
>> >> LAUNCHED mpd on node2  via  ank.mae.cornell.edu
>> >> debug: launch cmd= ssh -x -n -q node2 '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
>> >> LAUNCHED mpd on node4  via  ank.mae.cornell.edu
>> >> debug: launch cmd= ssh -x -n -q node4 '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
>> >> LAUNCHED mpd on node5  via  ank.mae.cornell.edu
>> >> debug: launch cmd= ssh -x -n -q node5 '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
>> >> LAUNCHED mpd on node8  via  ank.mae.cornell.edu
>> >> debug: launch cmd= ssh -x -n -q node8 '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
>> >> debug: mpd on node2  on port 60704
>> >> mpdboot_ank.mae.cornell.edu (handle_mpd_output 392): failed to handshake with mpd on node2; recvd output={}
>> >>
>> >>
>> >> Any other suggestions?
>> >>
>> >> thanks
>> >> raju
>> >
>> >
>
>



