[MPICH] Problem setting up a ring

Ralph Butler rbutler at mtsu.edu
Wed Apr 4 16:13:08 CDT 2007


I guess I am confused about what you tried when.  This comment by you:
 > I tried running the ssh command by hand from elaine, and then tried
 > doing the mpdcheck command from veritas, but neither worked. Both
 > returned a "no route to host" error.

says to me that ssh from elaine failed with the "no route" message.
This indicates that the problem exists outside mpd.
As I mentioned earlier, there are a variety of techniques for
addressing this kind of issue.  A small amount of it is covered in the
manual section on Debugging host/net Problems and the section on
Firewalls.  Unfortunately, those issues are so large that our manual
can only scratch the surface.  (There are large books written on the
topics.)  If you are striving for a quick-and-dirty fix, adding BOTH
machines to the /etc/hosts file on BOTH machines may get it done.
But, if there are firewalls set up somewhere (perhaps even
inadvertently on the machines themselves), then it is a bigger problem.
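
One way to look at this outside of mpd is to try a plain TCP connect
between the two machines and see which error comes back.  What follows
is only a sketch for a current Python 3, not code taken from mpdcheck;
the name classify_connect and the result labels are made up for
illustration.  errno 113 (EHOSTUNREACH) is the "No route to host" in the
traceback.  When ssh on port 22 works but a connect to some other port
gives 113, a packet filter rejecting the connection is one common cause,
though not the only possible one.

```python
import errno
import socket
import sys


def classify_connect(host, port, timeout=3.0):
    """Attempt one TCP connect and return a short label for the outcome."""
    try:
        # create_connection resolves the name and then connects.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.gaierror:
        # The name could not be resolved (DNS or /etc/hosts problem).
        return "name-resolution-failed"
    except socket.timeout:
        # No answer at all; packets may be silently dropped.
        return "timeout"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            return "no-route-to-host"      # errno 113 on Linux
        if exc.errno == errno.ECONNREFUSED:
            return "connection-refused"    # host reached, nothing listening
        return "other-error"


if __name__ == "__main__":
    # Hypothetical usage: python3 conncheck.py elaine.tepper.cmu.edu 36031
    host, port = sys.argv[1], int(sys.argv[2])
    print(host, port, classify_connect(host, port))
```

Run it from one machine against a port where something on the other
machine is listening (for example, the port printed by mpdcheck -s),
then swap the roles of the two machines.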

--ralph

On Wed Apr 4, at 1:39PM, Brett Gordon wrote:

> Hi Rajeev,
>
> Yes, I can, which is why I'm still confused.
>
> brgordon at veritas:~> ssh elaine date
> Wed Apr  4 14:35:32 EDT 2007
> [brgordon at elaine ~]$ ssh veritas date
> Mon Apr  9 03:25:29 EDT 2007
>
> (Note that the date on veritas is wrong - this is a separate problem
> with openSUSE 10.0...... unless you know of a reason that this would
> cause issues).
>
> I have added each computer to the /etc/hosts files and included them
> in /etc/hosts.allow. ssh was already open to ALL in the latter file,
> but just in case, I added each of them explicitly.
>
> Unfortunately, I don't have a sysadmin to go to, and I don't know much
> about these types of issues. Any ideas?
>
> Thanks,
> Brett
>
>
> On 4/4/07, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>> From veritas, can you do "ssh elaine date"?
>>
>> Rajeev
>>
>> > -----Original Message-----
>> > From: owner-mpich-discuss at mcs.anl.gov
>> > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Brett Gordon
>> > Sent: Wednesday, April 04, 2007 11:07 AM
>> > To: Ralph Butler; mpich-discuss at mcs.anl.gov
>> > Subject: Re: [MPICH] Problem setting up a ring
>> >
>> > Ralph,
>> >
>> > I went back to the install guide and ran through the steps from 1 - 5
>> > without any problems.
>> >
>> > But there was a problem trying to run mpdcheck -f mpd.hosts. Here is
>> > the output, showing me trying to connect from 'elaine' to 'veritas'.
>> > It seems like there is an ssh problem, but as you can see in the last
>> > call, I was able to ssh with no password.
>> >
>> > -------------------------------------------------------
>> > [brgordon at elaine ~]$ less mpd.hosts
>> > veritas
>> >
>> > [brgordon at elaine ~]$ mpdcheck -f mpd.hosts
>> >
>> > [brgordon at elaine ~]$ mpdcheck -v -f mpd.hosts   //First without -ssh
>> > obtaining hostname via gethostname and getfqdn
>> > gethostname gives  elaine.tepper.cmu.edu
>> > getfqdn gives  elaine.tepper.cmu.edu
>> > checking out unqualified hostname; make sure is not "localhost", etc.
>> > checking out qualified hostname; make sure is not "localhost", etc.
>> > obtain IP addrs via qualified and unqualified hostnames;  make sure
>> > other than 127.0.0.1
>> > gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'],
>> > ['128.2.89.115'])
>> > gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'],
>> > ['128.2.89.115'])
>> > checking that IP addrs resolve to same host
>> > now do some gethostbyaddr and gethostbyname_ex for machines
>> > in hosts file
>> > checking gethostbyXXX for unqualified veritas
>> > gethostbyname_ex:  ('veritas.TEPPER.cmu.edu', [], ['128.2.93.142'])
>> > checking gethostbyXXX for qualified veritas
>> > gethostbyname_ex:  ('VERITAS.TEPPER.cmu.edu', [], ['128.2.93.142'])
>> >
>> > [brgordon at elaine ~]$ mpdcheck -v -f mpd.hosts -ssh  //and now with
>> > obtaining hostname via gethostname and getfqdn
>> > gethostname gives  elaine.tepper.cmu.edu
>> > getfqdn gives  elaine.tepper.cmu.edu
>> > checking out unqualified hostname; make sure is not "localhost", etc.
>> > checking out qualified hostname; make sure is not "localhost", etc.
>> > obtain IP addrs via qualified and unqualified hostnames;  make sure
>> > other than 127.0.0.1
>> > gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'],
>> > ['128.2.89.115'])
>> > gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'],
>> > ['128.2.89.115'])
>> > checking that IP addrs resolve to same host
>> > now do some gethostbyaddr and gethostbyname_ex for machines
>> > in hosts file
>> > checking gethostbyXXX for unqualified veritas
>> > gethostbyname_ex:  ('veritas.TEPPER.cmu.edu', [], ['128.2.93.142'])
>> > checking gethostbyXXX for qualified veritas
>> > gethostbyname_ex:  ('VERITAS.TEPPER.cmu.edu', [], ['128.2.93.142'])
>> > trying: ssh veritas -x -n /bin/echo hello
>> > starting server: /home/brgordon/mpich2-install/bin/mpdcheck.py -s
>> > starting client: ssh veritas -x -n /home/brgordon/mpich2-install/bin/mpdcheck.py -c elaine.tepper.cmu.edu 36031
>> > client on veritas failed to access the server
>> > here is the output:
>> > Traceback (most recent call last):
>> >   File "/home/brgordon/mpich2-install/bin/mpdcheck.py", line 103, in ?
>> >     sock.connect((argv[argidx+1],int(argv[argidx+2])))  # note double parens
>> >   File "<string>", line 1, in connect
>> > socket.error: (113, 'No route to host')
>> >
>> > [brgordon at elaine ~]$ ssh veritas date
>> > Mon Apr  9 00:42:41 EDT 2007
>> > -------------------------------------------------------
>> >
>> > I tried running the ssh command by hand from elaine, and then tried
>> > doing the mpdcheck command from veritas, but neither worked. Both
>> > returned a "no route to host" error.
>> >
>> > Here is the first part of the output from mpdcheck -pc. Perhaps I need
>> > to edit the /etc/hosts or /etc/resolv.conf files on veritas?
>> >
>> > Elaine
>> > -------
>> > [brgordon at elaine ~]$ mpdcheck -pc
>> > --- print results of: gethostbyname_ex(gethostname())
>> > ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
>> > --- try to run /bin/hostname
>> > elaine.tepper.cmu.edu
>> > --- try to run uname -a
>> > Linux elaine.tepper.cmu.edu 2.6.18-1.2257.fc5 #1 SMP Fri Dec 15
>> > 16:07:14 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
>> > --- try to print /etc/hosts
>> > # Do not remove the following line, or various programs
>> > # that require network functionality will fail.
>> > 127.0.0.1               localhost.localdomain localhost
>> > 128.2.89.115            elaine.tepper.cmu.edu elaine
>> > 128.2.89.114    austen.tepper.cmu.edu      austen
>> > 128.2.89.117    puddy.tepper.cmu.edu       puddy.tepper
>> > 128.2.89.118    rlg.tepper.cmu.edu         rlg
>> > 128.2.13.161    unix31.andrew.cmu.edu      unix31
>> > 128.2.13.162    unix32.andrew.cmu.edu      unix32
>> > 128.2.92.151    bigp.tepper.cmu.edu        bigp
>> > 128.32.66.92    bear.haas.berkeley.edu     bear
>> > --- try to print /etc/resolv.conf
>> > search TEPPER.cmu.edu
>> > nameserver 128.2.1.11
>> > nameserver 128.2.1.10
>> >
>> > [.....]
>> >
>> >
>> > Veritas
>> > ----------
>> > brgordon at veritas:~> mpdcheck -pc
>> > --- print results of: gethostbyname_ex(gethostname())
>> > ('veritas.tepper.cmu.edu', [], ['128.2.93.142'])
>> > --- try to run /bin/hostname
>> > veritas
>> > --- try to run uname -a
>> > Linux veritas 2.6.13-15.15-smp #1 SMP Mon Feb 26 14:11:33 UTC 2007
>> > x86_64 x86_64 x86_64 GNU/Linux
>> > --- try to print /etc/hosts
>> > #
>> > # hosts         This file describes a number of hostname-to-address
>> > #               mappings for the TCP/IP subsystem.  It is mostly
>> > #               used at boot time, when no name servers are running.
>> > #               On small systems, this file can be used instead of a
>> > #               "named" name server.
>> > # Syntax:
>> > #
>> > # IP-Address  Full-Qualified-Hostname  Short-Hostname
>> > #
>> >
>> > 127.0.0.1       localhost
>> >
>> > # special IPv6 addresses
>> > ::1             localhost ipv6-localhost ipv6-loopback
>> >
>> > fe00::0         ipv6-localnet
>> >
>> > ff00::0         ipv6-mcastprefix
>> > ff02::1         ipv6-allnodes
>> > ff02::2         ipv6-allrouters
>> > ff02::3         ipv6-allhosts
>> > 127.0.0.2       linux.site linux
>> > --- try to print /etc/resolv.conf
>> > ### BEGIN INFO
>> > #
>> > # Modified_by:  dhcpcd
>> > # Backup:       /etc/resolv.conf.saved.by.dhcpcd.eth0
>> > # Process:      dhcpcd
>> > # Process_id:   5677
>> > # Script:       /sbin/modify_resolvconf
>> > # Saveto:
>> > # Info:         This is a temporary resolv.conf created by service dhcpcd.
>> > #               The previous file has been saved and will be restored later.
>> > #
>> > #               If you don't like your resolv.conf to be changed, you
>> > #               can set MODIFY_{RESOLV,NAMED}_CONF_DYNAMICALLY=no. This
>> > #               variables are placed in /etc/sysconfig/network/config.
>> > #
>> > #               You can also configure service dhcpcd not to modify it.
>> > #
>> > #               If you don't like dhcpcd to change your nameserver
>> > #               settings
>> > #               then either set DHCLIENT_MODIFY_RESOLV_CONF=no
>> > #               in /etc/sysconfig/network/dhcp, or
>> > #               set MODIFY_RESOLV_CONF_DYNAMICALLY=no in
>> > #               /etc/sysconfig/network/config or (manually) use dhcpcd
>> > #               with -R.  If you only want to keep your searchlist, set
>> > #               DHCLIENT_KEEP_SEARCHLIST=yes in /etc/sysconfig/network/dhcp or
>> > #               (manually) use the -K option.
>> > #
>> > ### END INFO
>> > search tepper.cmu.edu
>> > nameserver 128.2.1.10
>> > nameserver 128.2.1.11
>> >
>> > [.....]
>> >
>> > -------------------------------------------------------------
>> >
>> > Thanks,
>> >
>> > Brett
>> >
>> >
>> >
>> > On 4/4/07, Ralph Butler <rbutler at mtsu.edu> wrote:
>> > > First I should point out that mpdallexit will probably fail if you
>> > > have not successfully debugged host/net config problems; use kill -9.
>> > > Then, I note that you are trying mpdboot with 2 hosts.  But, the manual
>> > > suggests that when there are multiple hosts, the config has to be
>> > > verified for all.  This includes running mpdcheck as server and client
>> > > on each, then reversing their roles, running mpd by hand on each, etc.
>> > > Here is a typical blurb I send out as a reminder:
>> > >
>> > > Sometimes there are problems with mpd or mpdboot while following
>> > > the Quick Start portion of the mpich2 install guide.  This typically
>> > > happens somewhere during Steps 10-13, but may occur during other
>> > > steps as well.  The guide suggests that when mpd/mpdboot problems
>> > > arise, you follow the procedures in Appendix A (Troubleshooting MPDs).
>> > >
>> > > Section A.1 (Getting Started with MPD) provides a 7-step procedure
>> > > to follow to get one or more mpds working, first by hand, and
>> > > then via mpdboot.  However, some of the early steps begin with a
>> > > pre-MPD program called mpdcheck.  That program is designed to help
>> > > determine in advance if there will be problems associated with host
>> > > or network configuration.  The instructions in section A.1 suggest
>> > > first using mpdcheck on individual machines, and then pair-wise.
>> > > It is particularly important to try the pair-wise experiments where
>> > > one machine plays the role of the server and the other the client,
>> > > and then to reverse the roles.
>> > >
>> > > Sometimes the procedures in A.1 indicate that MPDs are not likely
>> > > to run on your systems due to problems with host and/or network
>> > > configuration.  At those points, you are referred to subsequent
>> > > sections, e.g. A.2 Debugging host/network configuration problems,
>> > > or A.3 Firewalls, etc.
>> > >
>> > >
>> > > On Wed Apr 4, at 9:41AM, Brett Gordon wrote:
>> > >
>> > > > Hi Ralph,
>> > > >
>> > > > Thanks for your response.
>> > > >
>> > > > I did as you suggested, and it seems to work, but I still can't get
>> > > > the ring running.
>> > > >
>> > > > Terminal 1:
>> > > > brgordon at veritas:~> mpdallexit    //Just to make sure nothing else is running
>> > > > mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_brgordon);
>> > > > possible causes:
>> > > >  1. no mpd is running on this host
>> > > >  2. an mpd is running but was started without a "console" (-n option)
>> > > > In case 1, you can start an mpd on this host with:
>> > > >    mpd &
>> > > > and you will be able to run jobs just on this host.
>> > > > For more details on starting mpds on a set of hosts, see
>> > > > the MPICH2 Installation Guide.
>> > > > brgordon at veritas:~> mpdcheck
>> > > > brgordon at veritas:~> mpdcheck -s
>> > > > server listening at INADDR_ANY on: veritas 23768
>> > > > server has conn on <socket._socketobject object at 0x2aaaaab42650>
>> > > > from ('128.2.93.142', 25125)
>> > > > server successfully recvd msg from client: hello_from_client_to_server
>> > > > brgordon at veritas:~>
>> > > >
>> > > > Terminal 2:
>> > > > brgordon at veritas:~> mpdcheck -c veritas 23768
>> > > > client successfully recvd ack from server: ack_from_server_to_client
>> > > > brgordon at veritas:~>
>> > > >
>> > > > I then tried to run mpdboot from the computer 'veritas', hoping to
>> > > > bring up a ring with 'veritas' and 'elaine', and got the following:
>> > > >
>> > > > brgordon at veritas:~> mpdboot -n 2 -f mpd.hosts --user=brgordon --verbose --chkup
>> > > > checking elaine
>> > > > there are 2 hosts up (counting local)
>> > > > running mpdallexit on veritas
>> > > > LAUNCHED mpd on veritas  via
>> > > > RUNNING: mpd on veritas
>> > > > LAUNCHED mpd on elaine  via  veritas
>> > > > mpdboot_veritas (handle_mpd_output 383): failed to connect to mpd
>> > > > on elaine
>> > > >
>> > > > brgordon at veritas:~> less mpd.hosts
>> > > > elaine
>> > > > brgordon at veritas:~> less .mpd.conf
>> > > > secretword=<my password>
>> > > >
>> > > > Same files exist on 'elaine', but the host is listed as 'veritas'.
>> > > >
>> > > > Thanks,
>> > > > Brett
>> > > >
>> > > >
>> > > >
>> > > > On 4/4/07, Ralph Butler <rbutler at mtsu.edu> wrote:
>> > > >> In your output for mpdcheck below, there are lots of jumbled error
>> > > >> msgs from an mpd as well.  Apparently you had started an mpd in that
>> > > >> same window at some point.  Anyway, it is best to make sure that all
>> > > >> mpd processes are killed before doing the mpdcheck.  Then, try it
>> > > >> again.  As the manual suggests, it is pointless to try using mpdboot
>> > > >> unless you have cleared up all issues first.  Even starting an mpd
>> > > >> ring by hand is only recommended after successfully debugging with
>> > > >> mpdcheck.  So, I suggest trying mpdcheck again, first with no
>> > > >> options.  Then, with -s in one window and -c in another.
>> > > >>
>> > > >> On Tue Apr 3, at 10:34PM, Brett Gordon wrote:
>> > > >>
>> > > >> > Hello,
>> > > >> >
>> > > >> > I have successfully installed mpich2-1.0.5 on two linux boxes. Both
>> > > >> > succeed in the standard tests involving one host solving the 'cpi'
>> > > >> > program.
>> > > >> >
>> > > >> > However, I'm running into two (probably related) problems:
>> > > >> >
>> > > >> > 1) When I try to run mpd as a server and client on the same computer
>> > > >> > (as on page 31 of the install documentation), I get the following:
>> > > >> >
>> > > >> > brgordon at veritas:~> mpdcheck -s
>> > > >> > server listening at INADDR_ANY on: veritas 23761
>> > > >> > brgordon at veritas:~> mpdcheck -c veritas 23761
>> > > >> > veritas_23761 (recv_dict_msg 549):recv_dict_msg: errmsg=:invalid literal for int(): hello_fr:
>> > > >> >  mpdtb:
>> > > >> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  549,  recv_dict_msg
>> > > >> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  989,  handle_ring_listener_connection
>> > > >> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  743,  handle_active_streams
>> > > >> >    /home/brgordon/mpich2-install/bin/mpd,  286,  runmainloop
>> > > >> >    /home/brgordon/mpich2-install/bin/mpd,  255,  run
>> > > >> >    /home/brgordon/mpich2-install/bin/mpd,  1470,  ?
>> > > >> >
>> > > >> > veritas_23761 (handle_ring_listener_connection 993): INVALID msg from new connection :('128.2.93.142', 16587): msg=:{}:
>> > > >> > Traceback (most recent call last):
>> > > >> >  File "/home/brgordon/mpich2-install/bin/mpdcheck", line 105, in ?
>> > > >> >    msg = sock.recv(64)
>> > > >> > socket.error: (104, 'Connection reset by peer')
>> > > >> >
>> > > >> > 2) I also can't get a ring to work. I have set up ssh to work without
>> > > >> > using passwords ('ssh veritas date' works fine). The workaround for
>> > > >> > mpdboot on page 9 of the install doc does not work for me, nor does
>> > > >> > running 'mpdcheck -f mpd.hosts -ssh'.
>> > > >> >
>> > > >> > When I try to run mpdboot, I get
>> > > >> > [brgordon at elaine ~]$ mpdboot -n 2 -f mpd.hosts
>> > > >> > mpdboot_elaine.tepper.cmu.edu (handle_mpd_output 383): failed to
>> > > >> > connect to mpd on veritas
>> > > >> >
>> > > >> >
>> > > >> > I feel like I'm getting close to having this working, so I would
>> > > >> > greatly appreciate any help. Please let me know if there is more
>> > > >> > information I can provide.
>> > > >> >
>> > > >> > Thanks,
>> > > >> > Brett
>> > > >> >
>> > > >>
>> > > >>
>> > >
>> > >
>> >
>> >
>>
>>
>
