[MPICH] Problem setting up a ring
Rajeev Thakur
thakur at mcs.anl.gov
Wed Apr 4 13:17:17 CDT 2007
From veritas, can you do "ssh elaine date"?
Rajeev
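
If ssh works in that direction, the next thing worth isolating is the plain TCP connection that the mpdcheck client on veritas makes back to elaine, since that is the call that raised "No route to host" in the quoted traceback. Here is a minimal sketch for testing that step on its own. try_connect is a hypothetical helper, not part of MPICH2, and the host and port you pass are whatever "mpdcheck -s" prints on the other machine.

```python
import errno
import socket


def try_connect(host, port, timeout=5.0):
    """Attempt one TCP connection and return (ok, detail).

    Hypothetical helper for illustration; it only reports whether the
    connect succeeded and, if not, which OS-level error came back.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        return False, "timed out"
    except OSError as exc:
        # exc.errno can be None or a resolver code, so look it up defensively.
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, "%s: %s" % (name, exc)
```

Run from veritas against the host and port of an "mpdcheck -s" started on elaine, a result of EHOSTUNREACH while ssh still works would suggest that something between the two machines (often a firewall on the server side) is rejecting the high-numbered port rather than a routing or name problem.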
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Brett Gordon
> Sent: Wednesday, April 04, 2007 11:07 AM
> To: Ralph Butler; mpich-discuss at mcs.anl.gov
> Subject: Re: [MPICH] Problem setting up a ring
>
> Ralph,
>
> I went back to the install guide and ran through the steps from 1 - 5
> without any problems.
>
> But there was a problem trying to run mpdcheck -f mpd.hosts. Here is
> the output, showing me trying to connect from 'elaine' to 'veritas'.
> It seems like there is an ssh problem, but as you can see in the last
> call, I was able to ssh with no password.
>
> -------------------------------------------------------
> [brgordon at elaine ~]$ less mpd.hosts
> veritas
>
> [brgordon at elaine ~]$ mpdcheck -f mpd.hosts
>
> [brgordon at elaine ~]$ mpdcheck -v -f mpd.hosts //First without -ssh
> obtaining hostname via gethostname and getfqdn
> gethostname gives elaine.tepper.cmu.edu
> getfqdn gives elaine.tepper.cmu.edu
> checking out unqualified hostname; make sure is not "localhost", etc.
> checking out qualified hostname; make sure is not "localhost", etc.
> obtain IP addrs via qualified and unqualified hostnames; make sure other than 127.0.0.1
> gethostbyname_ex: ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> gethostbyname_ex: ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> checking that IP addrs resolve to same host
> now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
> checking gethostbyXXX for unqualified veritas
> gethostbyname_ex: ('veritas.TEPPER.cmu.edu', [], ['128.2.93.142'])
> checking gethostbyXXX for qualified veritas
> gethostbyname_ex: ('VERITAS.TEPPER.cmu.edu', [], ['128.2.93.142'])
>
> [brgordon at elaine ~]$ mpdcheck -v -f mpd.hosts -ssh //and now with
> obtaining hostname via gethostname and getfqdn
> gethostname gives elaine.tepper.cmu.edu
> getfqdn gives elaine.tepper.cmu.edu
> checking out unqualified hostname; make sure is not "localhost", etc.
> checking out qualified hostname; make sure is not "localhost", etc.
> obtain IP addrs via qualified and unqualified hostnames; make sure other than 127.0.0.1
> gethostbyname_ex: ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> gethostbyname_ex: ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> checking that IP addrs resolve to same host
> now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
> checking gethostbyXXX for unqualified veritas
> gethostbyname_ex: ('veritas.TEPPER.cmu.edu', [], ['128.2.93.142'])
> checking gethostbyXXX for qualified veritas
> gethostbyname_ex: ('VERITAS.TEPPER.cmu.edu', [], ['128.2.93.142'])
> trying: ssh veritas -x -n /bin/echo hello
> starting server: /home/brgordon/mpich2-install/bin/mpdcheck.py -s
> starting client: ssh veritas -x -n /home/brgordon/mpich2-install/bin/mpdcheck.py -c elaine.tepper.cmu.edu 36031
> client on veritas failed to access the server
> here is the output:
> Traceback (most recent call last):
> File "/home/brgordon/mpich2-install/bin/mpdcheck.py", line 103, in ?
> sock.connect((argv[argidx+1],int(argv[argidx+2]))) # note double parens
> File "<string>", line 1, in connect
> socket.error: (113, 'No route to host')
>
> [brgordon at elaine ~]$ ssh veritas date
> Mon Apr 9 00:42:41 EDT 2007
> -------------------------------------------------------
>
> I tried running the ssh command by hand from elaine, and then tried
> doing the mpdcheck command from veritas, but neither worked. Both
> returned a "no route to host" error.
>
> Here is the first part of the output from mpdcheck -pc. Perhaps I need
> to edit the /etc/hosts or /etc/resolv.conf files on veritas?
>
> Elaine
> -------
> [brgordon at elaine ~]$ mpdcheck -pc
> --- print results of: gethostbyname_ex(gethostname())
> ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> --- try to run /bin/hostname
> elaine.tepper.cmu.edu
> --- try to run uname -a
> Linux elaine.tepper.cmu.edu 2.6.18-1.2257.fc5 #1 SMP Fri Dec 15 16:07:14 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
> --- try to print /etc/hosts
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
> 127.0.0.1 localhost.localdomain localhost
> 128.2.89.115 elaine.tepper.cmu.edu elaine
> 128.2.89.114 austen.tepper.cmu.edu austen
> 128.2.89.117 puddy.tepper.cmu.edu puddy.tepper
> 128.2.89.118 rlg.tepper.cmu.edu rlg
> 128.2.13.161 unix31.andrew.cmu.edu unix31
> 128.2.13.162 unix32.andrew.cmu.edu unix32
> 128.2.92.151 bigp.tepper.cmu.edu bigp
> 128.32.66.92 bear.haas.berkeley.edu bear
> --- try to print /etc/resolv.conf
> search TEPPER.cmu.edu
> nameserver 128.2.1.11
> nameserver 128.2.1.10
>
> [.....]
>
>
> Veritas
> ----------
> brgordon at veritas:~> mpdcheck -pc
> --- print results of: gethostbyname_ex(gethostname())
> ('veritas.tepper.cmu.edu', [], ['128.2.93.142'])
> --- try to run /bin/hostname
> veritas
> --- try to run uname -a
> Linux veritas 2.6.13-15.15-smp #1 SMP Mon Feb 26 14:11:33 UTC 2007 x86_64 x86_64 x86_64 GNU/Linux
> --- try to print /etc/hosts
> #
> # hosts This file describes a number of hostname-to-address
> # mappings for the TCP/IP subsystem. It is mostly
> # used at boot time, when no name servers are running.
> # On small systems, this file can be used instead of a
> # "named" name server.
> # Syntax:
> #
> # IP-Address Full-Qualified-Hostname Short-Hostname
> #
>
> 127.0.0.1 localhost
>
> # special IPv6 addresses
> ::1 localhost ipv6-localhost ipv6-loopback
>
> fe00::0 ipv6-localnet
>
> ff00::0 ipv6-mcastprefix
> ff02::1 ipv6-allnodes
> ff02::2 ipv6-allrouters
> ff02::3 ipv6-allhosts
> 127.0.0.2 linux.site linux
> --- try to print /etc/resolv.conf
> ### BEGIN INFO
> #
> # Modified_by: dhcpcd
> # Backup: /etc/resolv.conf.saved.by.dhcpcd.eth0
> # Process: dhcpcd
> # Process_id: 5677
> # Script: /sbin/modify_resolvconf
> # Saveto:
> # Info: This is a temporary resolv.conf created by service dhcpcd.
> # The previous file has been saved and will be restored later.
> #
> # If you don't like your resolv.conf to be changed, you
> # can set MODIFY_{RESOLV,NAMED}_CONF_DYNAMICALLY=no. This
> # variables are placed in /etc/sysconfig/network/config.
> #
> # You can also configure service dhcpcd not to modify it.
> #
> # If you don't like dhcpcd to change your nameserver
> # settings
> # then either set DHCLIENT_MODIFY_RESOLV_CONF=no
> # in /etc/sysconfig/network/dhcp, or
> # set MODIFY_RESOLV_CONF_DYNAMICALLY=no in
> # /etc/sysconfig/network/config or (manually) use dhcpcd
> # with -R. If you only want to keep your searchlist, set
> # DHCLIENT_KEEP_SEARCHLIST=yes in /etc/sysconfig/network/dhcp or
> # (manually) use the -K option.
> #
> ### END INFO
> search tepper.cmu.edu
> nameserver 128.2.1.10
> nameserver 128.2.1.11
>
> [.....]
>
> -------------------------------------------------------------
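
The name checks that mpdcheck reports in the output above (gethostname, getfqdn, gethostbyname_ex, and the "make sure other than 127.0.0.1" step) can be approximated with the standard library. What follows is a rough sketch and not mpdcheck's actual code: check_hostname and loopback_addrs are hypothetical names, and the resolve parameter exists only so the check can be exercised without DNS.

```python
import ipaddress
import socket


def loopback_addrs(addrs):
    """Return the addresses in addrs that are loopback (127.0.0.0/8 or ::1)."""
    return [a for a in addrs if ipaddress.ip_address(a).is_loopback]


def check_hostname(name, resolve=socket.gethostbyname_ex):
    """Resolve name and flag any loopback address it maps to.

    resolve must behave like socket.gethostbyname_ex and return
    (canonical_name, alias_list, address_list).
    """
    canonical, aliases, addrs = resolve(name)
    bad = loopback_addrs(addrs)
    return {
        "name": name,
        "canonical": canonical,
        "aliases": list(aliases),
        "addrs": list(addrs),
        "loopback": bad,
        # Usable for a ring only if it resolves and never to loopback.
        "ok": bool(addrs) and not bad,
    }
```

Calling check_hostname(socket.gethostname()) and check_hostname(socket.getfqdn()) on each machine should give "ok" as True with the same address in both results. An /etc/hosts entry like "127.0.0.2 linux.site linux" would only show up here if the machine's own hostname resolved through it.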
>
> Thanks,
>
> Brett
>
>
>
> On 4/4/07, Ralph Butler <rbutler at mtsu.edu> wrote:
> > First I should point out that mpdallexit will probably fail if you
> > have not successfully debugged host/net config problems; use kill -9.
> > Then, I note that you are trying mpdboot with 2 hosts. But the manual
> > suggests that when there are multiple hosts, the config has to be
> > verified for all. This includes running mpdcheck as server and client
> > on each, then reversing their roles, running mpd by hand on each, etc.
> > Here is a typical blurb I send out as a reminder:
> >
> > Sometimes there are problems with mpd or mpdboot while following
> > the Quick Start portion of the mpich2 install guide. This typically
> > happens somewhere during Steps 10-13, but may occur during other
> > steps as well. The guide suggests that when mpd/mpdboot problems
> > arise, you follow the procedures in Appendix A (Troubleshooting MPDs).
> >
> > Section A.1 (Getting Started with MPD) provides a 7-step procedure
> > to follow to get one or more mpds working, first by hand, and
> > then via mpdboot. However, some of the early steps begin with a
> > pre-MPD program called mpdcheck. That program is designed to help
> > determine in advance if there will be problems associated with host
> > or network configuration. The instructions in section A.1 suggest
> > first using mpdcheck on individual machines, and then pair-wise.
> > It is particularly important to try the pair-wise experiments where
> > one machine plays the role of the server and the other the client,
> > and then to reverse the roles.
> >
> > Sometimes the procedures in A.1 indicate that MPDs are not likely
> > to run on your systems due to problems with host and/or network
> > configuration. At those points, you are referred to subsequent
> > sections, e.g. A.2 Debugging host/network configuration problems,
> > or A.3 Firewalls, etc.
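
The pair-wise experiment described in the blurb boils down to one side listening, the other side connecting, and a short message going each way. Below is a minimal sketch of that exchange, assuming plain TCP and reusing the two message strings that appear in the mpdcheck output quoted further down. The function names are hypothetical, and this is not how mpdcheck itself is implemented.

```python
import socket

HELLO = b"hello_from_client_to_server"
ACK = b"ack_from_server_to_client"


def open_listener(host="", port=0):
    """Bind a TCP listener; port 0 lets the OS pick a free port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(1)
    return listener


def serve_once(listener):
    """Accept one client, read its greeting, and ack it if recognised."""
    conn, peer = listener.accept()
    with conn:
        # The messages are short, so one recv is assumed to deliver them whole.
        msg = conn.recv(64)
        if msg == HELLO:
            conn.sendall(ACK)
    return msg, peer


def run_client(host, port, timeout=5.0):
    """Connect, send the greeting, and return whatever the server replies."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(HELLO)
        return sock.recv(64)
```

To mirror the procedure, one host would call open_listener() and serve_once() and report listener.getsockname()[1], the other would call run_client() with that host and port, and then the two would swap roles. A failure in only one direction points at the machine that was acting as server in the failing run.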
> >
> >
> > On Wed Apr 4, at 9:41AM, Brett Gordon wrote:
> >
> > > Hi Ralph,
> > >
> > > Thanks for your response.
> > >
> > > I did as you suggested, and it seems to work, but I still can't get
> > > the ring running.
> > >
> > > Terminal 1:
> > > brgordon at veritas:~> mpdallexit //Just to make sure nothing else is running
> > > mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_brgordon);
> > > possible causes:
> > > 1. no mpd is running on this host
> > > 2. an mpd is running but was started without a "console" (-n option)
> > > In case 1, you can start an mpd on this host with:
> > > mpd &
> > > and you will be able to run jobs just on this host.
> > > For more details on starting mpds on a set of hosts, see
> > > the MPICH2 Installation Guide.
> > > brgordon at veritas:~> mpdcheck
> > > brgordon at veritas:~> mpdcheck -s
> > > server listening at INADDR_ANY on: veritas 23768
> > > server has conn on <socket._socketobject object at 0x2aaaaab42650> from ('128.2.93.142', 25125)
> > > server successfully recvd msg from client: hello_from_client_to_server
> > > brgordon at veritas:~>
> > >
> > > Terminal 2:
> > > brgordon at veritas:~> mpdcheck -c veritas 23768
> > > client successfully recvd ack from server: ack_from_server_to_client
> > > brgordon at veritas:~>
> > >
> > > I then tried to run mpdboot from the computer 'veritas', hoping to
> > > bring up a ring with 'veritas' and 'elaine', and got the following:
> > >
> > > brgordon at veritas:~> mpdboot -n 2 -f mpd.hosts --user=brgordon --verbose --chkup
> > > checking elaine
> > > there are 2 hosts up (counting local)
> > > running mpdallexit on veritas
> > > LAUNCHED mpd on veritas via
> > > RUNNING: mpd on veritas
> > > LAUNCHED mpd on elaine via veritas
> > > mpdboot_veritas (handle_mpd_output 383): failed to connect to mpd on elaine
> > >
> > > brgordon at veritas:~> less mpd.hosts
> > > elaine
> > > brgordon at veritas:~> less .mpd.conf
> > > secretword=<my password>
> > >
> > > Same files exist on 'elaine', but the host is listed as 'veritas'.
> > >
> > > Thanks,
> > > Brett
> > >
> > >
> > >
> > > On 4/4/07, Ralph Butler <rbutler at mtsu.edu> wrote:
> > >> In your output for mpdcheck below, there are lots of error msgs
> > >> from an mpd jumbled in as well. Apparently you had started an mpd
> > >> in that same window at some point. Anyway, it is best to make sure
> > >> that all mpd processes are killed before doing the mpdcheck. Then,
> > >> try it again. As the manual suggests, it is pointless to try using
> > >> mpdboot unless you have cleared up all issues first. Even starting
> > >> an mpd ring by hand is only recommended after successfully debugging
> > >> with mpdcheck. So, I suggest trying mpdcheck again, first with no
> > >> options. Then, with -s in one window and -c in another.
> > >>
> > >> On Tue Apr 3, at 10:34PM, Brett Gordon wrote:
> > >>
> > >> > Hello,
> > >> >
> > >> > I have successfully installed mpich2-1.0.5 on two Linux boxes. Both
> > >> > succeed in the standard tests involving one host solving the 'cpi'
> > >> > program.
> > >> >
> > >> > However, I'm running into two (probably related) problems:
> > >> >
> > >> > 1) When I try to run mpd as a server and client on the same
> > >> > computer (as on page 31 of the install documentation), I get the
> > >> > following:
> > >> >
> > >> > brgordon at veritas:~> mpdcheck -s
> > >> > server listening at INADDR_ANY on: veritas 23761
> > >> > brgordon at veritas:~> mpdcheck -c veritas 23761
> > >> > veritas_23761 (recv_dict_msg 549):recv_dict_msg: errmsg=:invalid literal for int(): hello_fr:
> > >> > mpdtb:
> > >> > /home/brgordon/mpich2-install/bin/mpdlib.py, 549, recv_dict_msg
> > >> > /home/brgordon/mpich2-install/bin/mpdlib.py, 989, handle_ring_listener_connection
> > >> > /home/brgordon/mpich2-install/bin/mpdlib.py, 743, handle_active_streams
> > >> > /home/brgordon/mpich2-install/bin/mpd, 286, runmainloop
> > >> > /home/brgordon/mpich2-install/bin/mpd, 255, run
> > >> > /home/brgordon/mpich2-install/bin/mpd, 1470, ?
> > >> >
> > >> > veritas_23761 (handle_ring_listener_connection 993): INVALID msg from new connection :('128.2.93.142', 16587): msg=:{}:
> > >> > Traceback (most recent call last):
> > >> > File "/home/brgordon/mpich2-install/bin/mpdcheck", line 105, in ?
> > >> > msg = sock.recv(64)
> > >> > socket.error: (104, 'Connection reset by peer')
> > >> >
> > >> > 2) I also can't get a ring to work. I have set up ssh to work
> > >> > without using passwords ('ssh veritas date' works fine). The
> > >> > workaround for mpdboot on page 9 of the install doc does not work
> > >> > for me, nor does running 'mpdcheck -f mpd.hosts -ssh'.
> > >> >
> > >> > When I try to run mpdboot, I get
> > >> > [brgordon at elaine ~]$ mpdboot -n 2 -f mpd.hosts
> > >> > mpdboot_elaine.tepper.cmu.edu (handle_mpd_output 383): failed to connect to mpd on veritas
> > >> >
> > >> >
> > >> > I feel like I'm getting close to having this working, so I would
> > >> > greatly appreciate any help. Please let me know if there is more
> > >> > information I can provide.
> > >> >
> > >> > Thanks,
> > >> > Brett
> > >> >
> > >>
> > >>
> >
> >
>
>