[MPICH] Problem setting up a ring

Brett Gordon brgordon at gmail.com
Wed Apr 4 13:39:20 CDT 2007


Hi Rajeev,

Yes, I can, which is why I'm still confused.

brgordon at veritas:~> ssh elaine date
Wed Apr  4 14:35:32 EDT 2007
[brgordon at elaine ~]$ ssh veritas date
Mon Apr  9 03:25:29 EDT 2007

(Note that the date on veritas is wrong. This is a separate problem
with openSUSE 10.0, unless you know of a reason it would cause
issues.)
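
To put a number on that clock difference, here is a small sketch that parses the two `date` outputs shown above and subtracts them. The helper name parse_date_output is made up for illustration, and the result is only approximate because the two ssh commands were run moments apart. Whether a skew of this size affects mpd is not something this thread establishes.

```python
from datetime import datetime


def parse_date_output(line):
    """Parse `date` output such as 'Wed Apr  4 14:35:32 EDT 2007'."""
    fields = line.split()
    del fields[4]  # drop the zone name; both hosts report EDT here
    return datetime.strptime(" ".join(fields), "%a %b %d %H:%M:%S %Y")


# The two timestamps are copied from the ssh output above.
elaine = parse_date_output("Wed Apr  4 14:35:32 EDT 2007")
veritas = parse_date_output("Mon Apr  9 03:25:29 EDT 2007")

skew = veritas - elaine  # how far veritas runs ahead of elaine
print(skew)
```

So veritas is running roughly four and a half days ahead of elaine.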

I have added each computer to the /etc/hosts files and included them
in /etc/hosts.allow. ssh was already open to ALL in the latter file,
but just in case, I added each of them explicitly.

Unfortunately, I don't have a sysadmin to go to, and I don't know much
about these types of issues. Any ideas?
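
One way to narrow this down, offered as a sketch rather than a diagnosis: ssh only proves that port 22 is reachable, while the mpdcheck client in the quoted output below failed on a different, high-numbered port with errno 113 ('No route to host'). The function below (probe is a hypothetical name, not part of MPICH2) attempts a single TCP connection to a given port and reports the symbolic errno. On Linux, a firewall that rejects the connection is one common source of that error even when routing is fine; that is an assumption about these hosts, not something the output confirms.

```python
import errno
import socket


def probe(host, port, timeout=3.0):
    """Try one TCP connection; return (ok, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        return False, "timed out"
    except OSError as exc:
        # Map the numeric errno (e.g. 113 on Linux) to its symbolic name.
        return False, errno.errorcode.get(exc.errno, "unknown")


# Example call, with a listener (such as mpdcheck -s) already waiting
# on the other machine; the port number is whatever that listener printed:
#   probe("elaine.tepper.cmu.edu", 36031)
```

If port 22 connects but the listener's port reports EHOSTUNREACH, the problem lies with what is allowed between the hosts on that port rather than with name resolution or ssh keys.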

Thanks,
Brett


On 4/4/07, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> From veritas, can you do "ssh elaine date"?
>
> Rajeev
>
> > -----Original Message-----
> > From: owner-mpich-discuss at mcs.anl.gov
> > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Brett Gordon
> > Sent: Wednesday, April 04, 2007 11:07 AM
> > To: Ralph Butler; mpich-discuss at mcs.anl.gov
> > Subject: Re: [MPICH] Problem setting up a ring
> >
> > Ralph,
> >
> > I went back to the install guide and ran through the steps from 1 - 5
> > without any problems.
> >
> > But there was a problem trying to run mpdcheck -f mpd.hosts. Here is
> > the output, showing me trying to connect from 'elaine' to 'veritas'.
> > It seems like there is an ssh problem, but as you can see in the last
> > call, I was able to ssh with no password.
> >
> > -------------------------------------------------------
> > [brgordon at elaine ~]$ less mpd.hosts
> > veritas
> >
> > [brgordon at elaine ~]$ mpdcheck -f mpd.hosts
> >
> > [brgordon at elaine ~]$ mpdcheck -v -f mpd.hosts   //First without -ssh
> > obtaining hostname via gethostname and getfqdn
> > gethostname gives  elaine.tepper.cmu.edu
> > getfqdn gives  elaine.tepper.cmu.edu
> > checking out unqualified hostname; make sure is not "localhost", etc.
> > checking out qualified hostname; make sure is not "localhost", etc.
> > obtain IP addrs via qualified and unqualified hostnames;  make sure other than 127.0.0.1
> > gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> > gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> > checking that IP addrs resolve to same host
> > now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
> > checking gethostbyXXX for unqualified veritas
> > gethostbyname_ex:  ('veritas.TEPPER.cmu.edu', [], ['128.2.93.142'])
> > checking gethostbyXXX for qualified veritas
> > gethostbyname_ex:  ('VERITAS.TEPPER.cmu.edu', [], ['128.2.93.142'])
> >
> > [brgordon at elaine ~]$ mpdcheck -v -f mpd.hosts -ssh  //and now with
> > obtaining hostname via gethostname and getfqdn
> > gethostname gives  elaine.tepper.cmu.edu
> > getfqdn gives  elaine.tepper.cmu.edu
> > checking out unqualified hostname; make sure is not "localhost", etc.
> > checking out qualified hostname; make sure is not "localhost", etc.
> > obtain IP addrs via qualified and unqualified hostnames;  make sure other than 127.0.0.1
> > gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> > gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> > checking that IP addrs resolve to same host
> > now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
> > checking gethostbyXXX for unqualified veritas
> > gethostbyname_ex:  ('veritas.TEPPER.cmu.edu', [], ['128.2.93.142'])
> > checking gethostbyXXX for qualified veritas
> > gethostbyname_ex:  ('VERITAS.TEPPER.cmu.edu', [], ['128.2.93.142'])
> > trying: ssh veritas -x -n /bin/echo hello
> > starting server: /home/brgordon/mpich2-install/bin/mpdcheck.py -s
> > starting client: ssh veritas -x -n /home/brgordon/mpich2-install/bin/mpdcheck.py -c elaine.tepper.cmu.edu 36031
> > client on veritas failed to access the server
> > here is the output:
> > Traceback (most recent call last):
> >   File "/home/brgordon/mpich2-install/bin/mpdcheck.py", line 103, in ?
> >     sock.connect((argv[argidx+1],int(argv[argidx+2])))  # note double parens
> >   File "<string>", line 1, in connect
> > socket.error: (113, 'No route to host')
> >
> > [brgordon at elaine ~]$ ssh veritas date
> > Mon Apr  9 00:42:41 EDT 2007
> > -------------------------------------------------------
> >
> > I tried running the ssh command by hand from elaine, and then tried
> > doing the mpdcheck command from veritas, but neither worked. Both
> > returned a "no route to host" error.
> >
> > Here is the first part of the output from mpdcheck -pc. Perhaps I need
> > to edit the /etc/hosts or /etc/resolv.conf files on veritas?
> >
> > Elaine
> > -------
> > [brgordon at elaine ~]$ mpdcheck -pc
> > --- print results of: gethostbyname_ex(gethostname())
> > ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> > --- try to run /bin/hostname
> > elaine.tepper.cmu.edu
> > --- try to run uname -a
> > Linux elaine.tepper.cmu.edu 2.6.18-1.2257.fc5 #1 SMP Fri Dec 15 16:07:14 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
> > --- try to print /etc/hosts
> > # Do not remove the following line, or various programs
> > # that require network functionality will fail.
> > 127.0.0.1               localhost.localdomain localhost
> > 128.2.89.115            elaine.tepper.cmu.edu elaine
> > 128.2.89.114    austen.tepper.cmu.edu      austen
> > 128.2.89.117    puddy.tepper.cmu.edu       puddy.tepper
> > 128.2.89.118    rlg.tepper.cmu.edu         rlg
> > 128.2.13.161    unix31.andrew.cmu.edu      unix31
> > 128.2.13.162    unix32.andrew.cmu.edu      unix32
> > 128.2.92.151    bigp.tepper.cmu.edu        bigp
> > 128.32.66.92    bear.haas.berkeley.edu     bear
> > --- try to print /etc/resolv.conf
> > search TEPPER.cmu.edu
> > nameserver 128.2.1.11
> > nameserver 128.2.1.10
> >
> > [.....]
> >
> >
> > Veritas
> > ----------
> > brgordon at veritas:~> mpdcheck -pc
> > --- print results of: gethostbyname_ex(gethostname())
> > ('veritas.tepper.cmu.edu', [], ['128.2.93.142'])
> > --- try to run /bin/hostname
> > veritas
> > --- try to run uname -a
> > Linux veritas 2.6.13-15.15-smp #1 SMP Mon Feb 26 14:11:33 UTC 2007 x86_64 x86_64 x86_64 GNU/Linux
> > --- try to print /etc/hosts
> > #
> > # hosts         This file describes a number of hostname-to-address
> > #               mappings for the TCP/IP subsystem.  It is mostly
> > #               used at boot time, when no name servers are running.
> > #               On small systems, this file can be used instead of a
> > #               "named" name server.
> > # Syntax:
> > #
> > # IP-Address  Full-Qualified-Hostname  Short-Hostname
> > #
> >
> > 127.0.0.1       localhost
> >
> > # special IPv6 addresses
> > ::1             localhost ipv6-localhost ipv6-loopback
> >
> > fe00::0         ipv6-localnet
> >
> > ff00::0         ipv6-mcastprefix
> > ff02::1         ipv6-allnodes
> > ff02::2         ipv6-allrouters
> > ff02::3         ipv6-allhosts
> > 127.0.0.2       linux.site linux
> > --- try to print /etc/resolv.conf
> > ### BEGIN INFO
> > #
> > # Modified_by:  dhcpcd
> > # Backup:       /etc/resolv.conf.saved.by.dhcpcd.eth0
> > # Process:      dhcpcd
> > # Process_id:   5677
> > # Script:       /sbin/modify_resolvconf
> > # Saveto:
> > # Info:         This is a temporary resolv.conf created by service dhcpcd.
> > #               The previous file has been saved and will be restored later.
> > #
> > #               If you don't like your resolv.conf to be changed, you
> > #               can set MODIFY_{RESOLV,NAMED}_CONF_DYNAMICALLY=no. This
> > #               variables are placed in /etc/sysconfig/network/config.
> > #
> > #               You can also configure service dhcpcd not to modify it.
> > #
> > #               If you don't like dhcpcd to change your nameserver
> > #               settings
> > #               then either set DHCLIENT_MODIFY_RESOLV_CONF=no
> > #               in /etc/sysconfig/network/dhcp, or
> > #               set MODIFY_RESOLV_CONF_DYNAMICALLY=no in
> > #               /etc/sysconfig/network/config or (manually) use dhcpcd
> > #               with -R.  If you only want to keep your searchlist, set
> > #               DHCLIENT_KEEP_SEARCHLIST=yes in /etc/sysconfig/network/dhcp or
> > #               (manually) use the -K option.
> > #
> > ### END INFO
> > search tepper.cmu.edu
> > nameserver 128.2.1.10
> > nameserver 128.2.1.11
> >
> > [.....]
> >
> > -------------------------------------------------------------
> >
> > Thanks,
> >
> > Brett
> >
> >
> >
> > On 4/4/07, Ralph Butler <rbutler at mtsu.edu> wrote:
> > > First I should point out that mpdallexit will probably fail if you
> > > have not successfully debugged host/net config problems; use kill -9.
> > > Then, I note that you are trying mpdboot with 2 hosts.  But the manual
> > > suggests that when there are multiple hosts, the config has to be
> > > verified for all.  This includes running mpdcheck as server and client
> > > on each, then reversing their roles, running mpd by hand on each, etc.
> > > Here is a typical blurb I send out as a reminder:
> > >
> > > Sometimes there are problems with mpd or mpdboot while following
> > > the Quick Start portion of the mpich2 install guide.  This typically
> > > happens somewhere during Steps 10-13, but may occur during other
> > > steps as well.  The guide suggests that when mpd/mpdboot problems
> > > arise, you follow the procedures in Appendix A (Troubleshooting
> > > MPDs).
> > >
> > > Section A.1 (Getting Started with MPD) provides a 7-step procedure
> > > to follow to get one or more mpds working, first by hand, and
> > > then via mpdboot.  However, some of the early steps begin with a
> > > pre-MPD program called mpdcheck.  That program is designed to help
> > > determine in advance if there will be problems associated with host
> > > or network configuration.  The instructions in section A.1 suggest
> > > first using mpdcheck on individual machines, and then pair-wise.
> > > It is particularly important to try the pair-wise experiments where
> > > one machine plays the role of the server and the other the client,
> > > and then to reverse the roles.
> > >
> > > Sometimes the procedures in A.1 indicate that MPDs are not likely
> > > to run on your systems due to problems with host and/or network
> > > configuration.  At those points, you are referred to subsequent
> > > sections, e.g. A.2 Debugging host/network configuration problems,
> > > or A.3 Firewalls, etc.
> > >
> > >
> > > On Wed Apr 4, at 9:41AM, Brett Gordon wrote:
> > >
> > > > Hi Ralph,
> > > >
> > > > Thanks for your response.
> > > >
> > > > I did as you suggested, and it seems to work, but I still can't get
> > > > the ring running.
> > > >
> > > > Terminal 1:
> > > > brgordon at veritas:~> mpdallexit    //Just to make sure nothing else is running
> > > > mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_brgordon);
> > > > possible causes:
> > > >  1. no mpd is running on this host
> > > >  2. an mpd is running but was started without a "console" (-n option)
> > > > In case 1, you can start an mpd on this host with:
> > > >    mpd &
> > > > and you will be able to run jobs just on this host.
> > > > For more details on starting mpds on a set of hosts, see
> > > > the MPICH2 Installation Guide.
> > > > brgordon at veritas:~> mpdcheck
> > > > brgordon at veritas:~> mpdcheck -s
> > > > server listening at INADDR_ANY on: veritas 23768
> > > > server has conn on <socket._socketobject object at 0x2aaaaab42650> from ('128.2.93.142', 25125)
> > > > server successfully recvd msg from client: hello_from_client_to_server
> > > > brgordon at veritas:~>
> > > >
> > > > Terminal 2:
> > > > brgordon at veritas:~> mpdcheck -c veritas 23768
> > > > client successfully recvd ack from server: ack_from_server_to_client
> > > > brgordon at veritas:~>
> > > >
> > > > I then tried to run mpdboot from the computer 'veritas', hoping to
> > > > bring up a ring with 'veritas' and 'elaine', and got the following:
> > > >
> > > > brgordon at veritas:~> mpdboot -n 2 -f mpd.hosts --user=brgordon --verbose --chkup
> > > > checking elaine
> > > > there are 2 hosts up (counting local)
> > > > running mpdallexit on veritas
> > > > LAUNCHED mpd on veritas  via
> > > > RUNNING: mpd on veritas
> > > > LAUNCHED mpd on elaine  via  veritas
> > > > mpdboot_veritas (handle_mpd_output 383): failed to connect to mpd on elaine
> > > >
> > > > brgordon at veritas:~> less mpd.hosts
> > > > elaine
> > > > brgordon at veritas:~> less .mpd.conf
> > > > secretword=<my password>
> > > >
> > > > Same files exist on 'elaine', but the host is listed as 'veritas'.
> > > >
> > > > Thanks,
> > > > Brett
> > > >
> > > >
> > > >
> > > > On 4/4/07, Ralph Butler <rbutler at mtsu.edu> wrote:
> > > >> In your output for mpdcheck below, there are lots of jumbled error
> > > >> msgs from an mpd as well.  Apparently you had started an mpd in that
> > > >> same window at some point.  Anyway, it is best to make sure that all
> > > >> mpd processes are killed before doing the mpdcheck.  Then, try it
> > > >> again.  As the manual suggests, it is pointless to try using mpdboot
> > > >> unless you have cleared up all issues first.  Even starting an mpd
> > > >> ring by hand is only recommended after successfully debugging with
> > > >> mpdcheck.  So, I suggest trying mpdcheck again, first with no
> > > >> options.  Then, with -s in one window and -c in another.
> > > >>
> > > >> On Tue Apr 3, at 10:34PM, Brett Gordon wrote:
> > > >>
> > > >> > Hello,
> > > >> >
> > > >> > I have successfully installed mpich2-1.0.5 on two Linux boxes. Both
> > > >> > succeed in the standard tests involving one host solving the 'cpi'
> > > >> > program.
> > > >> >
> > > >> > However, I'm running into two (probably related) problems:
> > > >> >
> > > >> > 1) When I try to run mpdcheck as a server and client on the same
> > > >> > computer (as on page 31 of the install documentation), I get the
> > > >> > following:
> > > >> >
> > > >> > brgordon at veritas:~> mpdcheck -s
> > > >> > server listening at INADDR_ANY on: veritas 23761
> > > >> > brgordon at veritas:~> mpdcheck -c veritas 23761
> > > >> > veritas_23761 (recv_dict_msg 549):recv_dict_msg: errmsg=:invalid literal for int(): hello_fr:
> > > >> >  mpdtb:
> > > >> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  549,  recv_dict_msg
> > > >> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  989,  handle_ring_listener_connection
> > > >> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  743,  handle_active_streams
> > > >> >    /home/brgordon/mpich2-install/bin/mpd,  286,  runmainloop
> > > >> >    /home/brgordon/mpich2-install/bin/mpd,  255,  run
> > > >> >    /home/brgordon/mpich2-install/bin/mpd,  1470,  ?
> > > >> >
> > > >> > veritas_23761 (handle_ring_listener_connection 993): INVALID msg from new connection :('128.2.93.142', 16587): msg=:{}:
> > > >> > Traceback (most recent call last):
> > > >> >  File "/home/brgordon/mpich2-install/bin/mpdcheck", line 105, in ?
> > > >> >    msg = sock.recv(64)
> > > >> > socket.error: (104, 'Connection reset by peer')
> > > >> >
> > > >> > 2) I also can't get a ring to work. I have set up ssh to work
> > > >> > without passwords ('ssh veritas date' works fine). The workaround
> > > >> > for mpdboot on page 9 of the install doc does not work for me, nor
> > > >> > does running 'mpdcheck -f mpd.hosts -ssh'.
> > > >> >
> > > >> > When I try to run mpdboot, I get
> > > >> > [brgordon at elaine ~]$ mpdboot -n 2 -f mpd.hosts
> > > >> > mpdboot_elaine.tepper.cmu.edu (handle_mpd_output 383): failed to connect to mpd on veritas
> > > >> >
> > > >> >
> > > >> > I feel like I'm getting close to having this working, so I would
> > > >> > greatly appreciate any help. Please let me know if there is more
> > > >> > information I can provide.
> > > >> >
> > > >> > Thanks,
> > > >> > Brett
> > > >> >
> > > >>
> > > >>
> > >
> > >
> >
> >
>
>
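
Ralph's reminder earlier in the thread recommends pair-wise checks with one machine acting as server and the other as client, then reversing the roles. The sketch below shows the general shape of such a check. It is not mpdcheck's actual code: the function names serve_once and check_server are invented for illustration, and only the two message strings are taken from the mpdcheck output quoted above. For a self-contained demonstration it runs both roles over loopback.

```python
import socket
import threading


def serve_once(listener):
    """Accept one client, read its greeting, and send an acknowledgement."""
    conn, _addr = listener.accept()
    with conn:
        msg = conn.recv(64)
        if msg == b"hello_from_client_to_server":
            conn.sendall(b"ack_from_server_to_client")


def check_server(host, port, timeout=5.0):
    """Connect to a waiting server and return its reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"hello_from_client_to_server")
        return sock.recv(64).decode()


# Loopback demonstration: both roles on one machine.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("", 0))  # "" means INADDR_ANY; port 0 lets the OS choose
listener.listen(1)
port = listener.getsockname()[1]

server = threading.Thread(target=serve_once, args=(listener,))
server.start()
reply = check_server("127.0.0.1", port)
server.join()
listener.close()
print(reply)
```

Across two machines, the server half would run on one host and check_server on the other, using the real hostname and the port the server reports. A run that succeeds over loopback but fails between hosts points at the network path between them rather than at either installation.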



