[mpich-discuss] unable to get mpich2 1.0.7 working

Kamaraju Kusumanchi kamaraju at gmail.com
Mon Oct 20 02:51:36 CDT 2008


Ok. Reading the installation manual again offered some hints. I feel
stupid for having missed this information before... Sorry for wasting
all your time. I should have read the manual more carefully...

It looks like some of the nodes on our cluster are not configured
correctly. For example, if I log into node2 and run "mpdcheck -v -l",
I get

raju at node2:~ 1  1737 03:45 AM
$mpdcheck -v -l
obtaining hostname via gethostname and getfqdn
gethostname gives  node2.ank.mae.cornell.edu
getfqdn gives  node2.ank.mae.cornell.edu
checking out unqualified hostname; make sure is not "localhost", etc.
checking out qualified hostname; make sure is not "localhost", etc.
obtain IP addrs via qualified and unqualified hostnames;  make sure
other than 127.0.0.1
gethostbyname_ex:  ('node2.ank.mae.cornell.edu', [], ['172.18.0.2'])
gethostbyname_ex:  ('node2.ank.mae.cornell.edu', [], ['172.18.0.2'])
checking that IP addrs resolve to same host
now do some gethostbyaddr and gethostbyname_ex for machines in hosts file


If I log into node6 and run mpdcheck, I get

$mpdcheck -v -l
obtaining hostname via gethostname and getfqdn
gethostname gives  node6.ank.mae.cornell.edu
getfqdn gives  node6.ank.mae.cornell.edu
checking out unqualified hostname; make sure is not "localhost", etc.
checking out qualified hostname; make sure is not "localhost", etc.
obtain IP addrs via qualified and unqualified hostnames;  make sure
other than 127.0.0.1
gethostbyname_ex:  ('node6.ank.mae.cornell.edu', ['node6',
'localhost.localdomain', 'localhost'], ['127.0.0.1'])

    **********
    Your unqualified hostname resolves to 127.0.0.1, which is
    the IP address reserved for localhost. This likely means that
    you have a line similar to this one in your /etc/hosts file:
    127.0.0.1   $uqhn
    This should perhaps be changed to the following:
    127.0.0.1   localhost.localdomain localhost
    **********

gethostbyname_ex:  ('node6.ank.mae.cornell.edu', ['node6',
'localhost.localdomain', 'localhost'], ['127.0.0.1'])
checking that IP addrs resolve to same host
now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
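
For reference, a corrected /etc/hosts on node6 might look roughly like the
following (this is only a sketch; I am assuming node6's real address follows
the 172.18.0.x pattern seen on node2, and the actual IP may differ):

127.0.0.1     localhost.localdomain localhost
172.18.0.6    node6.ank.mae.cornell.edu node6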


I have sent an email to our system administrator about this problem. I
will update this thread if there is any progress.
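
Once the hosts files are fixed, the plan is to repeat the earlier checks,
roughly as follows (the exact commands and output may differ):

$mpdcheck -v -l              # on node6; the hostname should now resolve
                             # to a non-loopback address
$mpdboot -n 6 -f mpd.hosts
$mpdtrace                    # should list all 6 nodes
$mpdallexit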

regards
raju



On Mon, Oct 20, 2008 at 3:20 AM, Kamaraju Kusumanchi <kamaraju at gmail.com> wrote:
> I forgot to answer your question. Sorry!
>
> Let's say I log into node10
> raju at node10:~ 1  1698 03:13 AM
> $mpdcheck -s
> server listening at INADDR_ANY on: node10.ank.mae.cornell.edu 54080
>
> Now I open another window on the same node and run
> raju at node10:~ 1  1698 03:13 AM
> $mpdcheck -c node10 54080
> client successfully recvd ack from server: ack_from_server_to_client
>
> Now, on the server, I have
> $mpdcheck -s
> server listening at INADDR_ANY on: node10.ank.mae.cornell.edu 54080
> server has conn on <socket._socketobject object at 0x2aaaaab36770>
> from ('127.0.0.1', 56822)
> server successfully recvd msg from client: hello_from_client_to_server
>
>
> Could the 127.0.0.1 in the above message be the cause of all these
> problems? Should it instead be the node's real IP address (something
> like 192.168.1.1)?
>
> thanks
> raju
>
> On Wed, Sep 10, 2008 at 8:20 PM, Anthony Chan <chan at mcs.anl.gov> wrote:
>> Our mpd expert asked whether you've done the following:
>>
>> Try using mpdcheck as a client and server on each pair of machines in
>> question, and then reversing the roles. Also, after using mpdcheck, they
>> can try starting a set of mpds by hand and using mpiexec (not mpdtrace)
>> to run a program. If they can start a ring of mpds by hand and run
>> mpiexec using the entire ring, there really should be no problem using
>> mpdboot. Of course all existing mpd processes have to be killed first.
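>>
>> A rough sketch of that sequence (hostnames, ports, and the test program
>> below are only illustrative, not from an actual run):
>>
>>   # pair test: server on node2, client on node5, then swap the roles
>>   node2$ mpdcheck -s              # prints the port it is listening on
>>   node5$ mpdcheck -c node2 <port>
>>
>>   # ring by hand: start the first mpd, find its port with "mpdtrace -l",
>>   # then point the mpds on the other nodes at it
>>   node2$ mpd &
>>   node2$ mpdtrace -l
>>   node4$ mpd -h node2 -p <port> &
>>   (repeat for the remaining hosts in mpd.hosts)
>>
>>   # then run a program across the whole ring with mpiexec
>>   node2$ mpiexec -n 6 hostname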
>>
>>
>>
>>
>> ----- "Kamaraju Kusumanchi" <kamaraju at gmail.com> wrote:
>>
>>> I forgot to mention this in my mail. I did run mpdcheck and verified
>>> that it is not a networking issue.
>>>
>>> The problem seems to arise only when I try to boot a lot of nodes (say
>>> 6 nodes). But there are no problems if I mpdboot only 2 nodes (any 2
>>> out of the 6). So it is not a networking issue AFAICT.
>>>
>>> raju
>>>
>>> On Wed, Sep 10, 2008 at 12:32 PM, Anthony Chan <chan at mcs.anl.gov>
>>> wrote:
>>> > You may want to try "mpdcheck" to see if there is any network issue.
>>> > mpdcheck is described in Appendix A of the installer's guide.
>>> >
>>> > A.Chan
>>> > ----- "Kamaraju Kusumanchi" <kamaraju at gmail.com> wrote:
>>> >
>>> >> On Wed, Sep 10, 2008 at 5:12 AM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>>> >> > We see those errors with asynchronous I/O and Fortran/C++ sometimes
>>> >> > but don't know what causes them. For some reason MPD died later on
>>> >> > in the tests, which caused the remaining tests to fail. I think your
>>> >> > installation is ok. Try running your application.
>>> >> >
>>> >> > Rajeev
>>> >>
>>> >>
>>> >> I tried to mpdboot a bunch of nodes but it also gives an error.
>>> >>
>>> >> $cat mpd.hosts
>>> >> node2
>>> >> node4
>>> >> node5
>>> >> node8
>>> >> node9
>>> >> node10
>>> >>
>>> >> $mpdboot -n 6 -f mpd.hosts
>>> >> mpdboot_ank.mae.cornell.edu (handle_mpd_output 392): failed to
>>> >> handshake with mpd on node2; recvd output={}
>>> >>
>>> >> The node number in the above error message changes. Sometimes it is
>>> >> node2, sometimes it is node5, etc. But there is nothing wrong with
>>> >> these nodes as such. For example, if my mpd.hosts contains just 2
>>> >> lines
>>> >>
>>> >> node2
>>> >> node5
>>> >>
>>> >> instead of 6 lines, then if I do
>>> >>
>>> >> $mpdboot -n 2 -f mpd.hosts
>>> >>
>>> >> then there are no errors.
>>> >>
>>> >> On each of these nodes (node2, node4, node5, node8, node9, node10) I
>>> >> am able to start/stop mpd individually. But the problem arises only
>>> >> when I try to boot the nodes all together.
>>> >>
>>> >> I also can't figure out anything by enabling the debugging messages
>>> >>
>>> >> $cat mpd.hosts
>>> >> node2
>>> >> node4
>>> >> node5
>>> >> node8
>>> >> node9
>>> >> node10
>>> >>
>>> >> $mpdboot -n 6 -f mpd.hosts -v -d
>>> >> debug: starting
>>> >> running mpdallexit on ank.mae.cornell.edu
>>> >> LAUNCHED mpd on ank.mae.cornell.edu  via
>>> >> debug: launch cmd=
>>> >>
>>> /home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
>>> >>   --ncpus=1 -e -d
>>> >> debug: mpd on ank.mae.cornell.edu  on port 46161
>>> >> RUNNING: mpd on ank.mae.cornell.edu
>>> >> debug: info for running mpd: {'ncpus': 1, 'list_port': 46161,
>>> >> 'entry_port': '', 'host': 'ank.mae.cornell.edu', 'entry_host': '',
>>> >> 'ifhn': ''}
>>> >> LAUNCHED mpd on node2  via  ank.mae.cornell.edu
>>> >> debug: launch cmd= ssh -x -n -q node2
>>> >>
>>> '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
>>> >>  -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
>>> >> LAUNCHED mpd on node4  via  ank.mae.cornell.edu
>>> >> debug: launch cmd= ssh -x -n -q node4
>>> >>
>>> '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
>>> >>  -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
>>> >> LAUNCHED mpd on node5  via  ank.mae.cornell.edu
>>> >> debug: launch cmd= ssh -x -n -q node5
>>> >>
>>> '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
>>> >>  -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
>>> >> LAUNCHED mpd on node8  via  ank.mae.cornell.edu
>>> >> debug: launch cmd= ssh -x -n -q node8
>>> >>
>>> '/home6/raju/software/compiledLibs/mpich2_1.0.7_gcc_4.3.2_gfortran_4.3.2/bin/mpd.py
>>> >>  -h ank.mae.cornell.edu -p 46161  --ncpus=1 -e -d'
>>> >> debug: mpd on node2  on port 60704
>>> >> mpdboot_ank.mae.cornell.edu (handle_mpd_output 392): failed to
>>> >> handshake with mpd on node2; recvd output={}
>>> >>
>>> >>
>>> >> Any other suggestions?
>>> >>
>>> >> thanks
>>> >> raju
>>> >
>>> >
>>
>>
>



