[MPICH] Problem setting up a ring

Rajeev Thakur thakur at mcs.anl.gov
Wed Apr 4 13:17:17 CDT 2007


From veritas, can you do "ssh elaine date"?

Rajeev 
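
If ssh works in that direction too, it may also be worth checking whether
veritas can open a plain TCP connection back to elaine, since the traceback
quoted below shows the client on veritas getting "No route to host" when it
tries to reach the mpdcheck server on elaine. Here is a minimal Python
sketch of such a check. It is not part of mpdcheck; the names try_connect
and describe and the 5-second timeout are illustrative, and the port has to
be whatever a listener on the other machine (for example mpdcheck -s)
actually reports.

```python
import errno
import socket


def try_connect(host, port, timeout=5.0):
    """Try one TCP connection to (host, port).

    Returns (True, None) on success or (False, exc) on failure, where exc
    is the OSError raised by the connection attempt.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return False, exc
    sock.close()
    return True, None


def describe(exc):
    """Map a few common errno values to a hint about the likely cause."""
    hints = {
        errno.EHOSTUNREACH: "no route to host (often a firewall rejecting the port)",
        errno.ECONNREFUSED: "connection refused (nothing listening on that port)",
        errno.ETIMEDOUT: "timed out (packets may be silently dropped)",
    }
    return hints.get(getattr(exc, "errno", None), str(exc))
```

For instance, with "mpdcheck -s" running on elaine, calling
try_connect("elaine.tepper.cmu.edu", <port>) on veritas and passing any
returned exception to describe() would show whether ordinary TCP ports are
reachable even though ssh (port 22) is. Then repeat with the roles reversed.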

> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Brett Gordon
> Sent: Wednesday, April 04, 2007 11:07 AM
> To: Ralph Butler; mpich-discuss at mcs.anl.gov
> Subject: Re: [MPICH] Problem setting up a ring
> 
> Ralph,
> 
> I went back to the install guide and ran through the steps from 1 - 5
> without any problems.
> 
> But there was a problem trying to run mpdcheck -f mpd.hosts. Here is
> the output, showing me trying to connect from 'elaine' to 'veritas'.
> It seems like there is an ssh problem, but as you can see in the last
> call, I was able to ssh with no password.
> 
> -------------------------------------------------------
> [brgordon at elaine ~]$ less mpd.hosts
> veritas
> 
> [brgordon at elaine ~]$ mpdcheck -f mpd.hosts
> 
> [brgordon at elaine ~]$ mpdcheck -v -f mpd.hosts   //First without -ssh
> obtaining hostname via gethostname and getfqdn
> gethostname gives  elaine.tepper.cmu.edu
> getfqdn gives  elaine.tepper.cmu.edu
> checking out unqualified hostname; make sure is not "localhost", etc.
> checking out qualified hostname; make sure is not "localhost", etc.
> obtain IP addrs via qualified and unqualified hostnames;  make sure other than 127.0.0.1
> gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> checking that IP addrs resolve to same host
> now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
> checking gethostbyXXX for unqualified veritas
> gethostbyname_ex:  ('veritas.TEPPER.cmu.edu', [], ['128.2.93.142'])
> checking gethostbyXXX for qualified veritas
> gethostbyname_ex:  ('VERITAS.TEPPER.cmu.edu', [], ['128.2.93.142'])
> 
> [brgordon at elaine ~]$ mpdcheck -v -f mpd.hosts -ssh  //and now with
> obtaining hostname via gethostname and getfqdn
> gethostname gives  elaine.tepper.cmu.edu
> getfqdn gives  elaine.tepper.cmu.edu
> checking out unqualified hostname; make sure is not "localhost", etc.
> checking out qualified hostname; make sure is not "localhost", etc.
> obtain IP addrs via qualified and unqualified hostnames;  make sure other than 127.0.0.1
> gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> checking that IP addrs resolve to same host
> now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
> checking gethostbyXXX for unqualified veritas
> gethostbyname_ex:  ('veritas.TEPPER.cmu.edu', [], ['128.2.93.142'])
> checking gethostbyXXX for qualified veritas
> gethostbyname_ex:  ('VERITAS.TEPPER.cmu.edu', [], ['128.2.93.142'])
> trying: ssh veritas -x -n /bin/echo hello
> starting server: /home/brgordon/mpich2-install/bin/mpdcheck.py -s
> starting client: ssh veritas -x -n /home/brgordon/mpich2-install/bin/mpdcheck.py -c elaine.tepper.cmu.edu 36031
> client on veritas failed to access the server
> here is the output:
> Traceback (most recent call last):
>   File "/home/brgordon/mpich2-install/bin/mpdcheck.py", line 103, in ?
>     sock.connect((argv[argidx+1],int(argv[argidx+2])))  # note double parens
>   File "<string>", line 1, in connect
> socket.error: (113, 'No route to host')
> 
> [brgordon at elaine ~]$ ssh veritas date
> Mon Apr  9 00:42:41 EDT 2007
> -------------------------------------------------------
> 
> I tried running the ssh command by hand from elaine, and then tried
> doing the mpdcheck command from veritas, but neither worked. Both
> returned a "no route to host" error.
> 
> Here is the first part of the output from mpdcheck -pc. Perhaps I need
> to edit the /etc/hosts or /etc/resolv.conf files on veritas?
> 
> Elaine
> -------
> [brgordon at elaine ~]$ mpdcheck -pc
> --- print results of: gethostbyname_ex(gethostname())
> ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> --- try to run /bin/hostname
> elaine.tepper.cmu.edu
> --- try to run uname -a
> Linux elaine.tepper.cmu.edu 2.6.18-1.2257.fc5 #1 SMP Fri Dec 15 16:07:14 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
> --- try to print /etc/hosts
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
> 127.0.0.1               localhost.localdomain localhost
> 128.2.89.115            elaine.tepper.cmu.edu elaine
> 128.2.89.114    austen.tepper.cmu.edu      austen
> 128.2.89.117    puddy.tepper.cmu.edu       puddy.tepper
> 128.2.89.118    rlg.tepper.cmu.edu         rlg
> 128.2.13.161    unix31.andrew.cmu.edu      unix31
> 128.2.13.162    unix32.andrew.cmu.edu      unix32
> 128.2.92.151    bigp.tepper.cmu.edu        bigp
> 128.32.66.92    bear.haas.berkeley.edu     bear
> --- try to print /etc/resolv.conf
> search TEPPER.cmu.edu
> nameserver 128.2.1.11
> nameserver 128.2.1.10
> 
> [.....]
> 
> 
> Veritas
> ----------
> brgordon at veritas:~> mpdcheck -pc
> --- print results of: gethostbyname_ex(gethostname())
> ('veritas.tepper.cmu.edu', [], ['128.2.93.142'])
> --- try to run /bin/hostname
> veritas
> --- try to run uname -a
> Linux veritas 2.6.13-15.15-smp #1 SMP Mon Feb 26 14:11:33 UTC 2007 x86_64 x86_64 x86_64 GNU/Linux
> --- try to print /etc/hosts
> #
> # hosts         This file describes a number of hostname-to-address
> #               mappings for the TCP/IP subsystem.  It is mostly
> #               used at boot time, when no name servers are running.
> #               On small systems, this file can be used instead of a
> #               "named" name server.
> # Syntax:
> #
> # IP-Address  Full-Qualified-Hostname  Short-Hostname
> #
> 
> 127.0.0.1       localhost
> 
> # special IPv6 addresses
> ::1             localhost ipv6-localhost ipv6-loopback
> 
> fe00::0         ipv6-localnet
> 
> ff00::0         ipv6-mcastprefix
> ff02::1         ipv6-allnodes
> ff02::2         ipv6-allrouters
> ff02::3         ipv6-allhosts
> 127.0.0.2       linux.site linux
> --- try to print /etc/resolv.conf
> ### BEGIN INFO
> #
> # Modified_by:  dhcpcd
> # Backup:       /etc/resolv.conf.saved.by.dhcpcd.eth0
> # Process:      dhcpcd
> # Process_id:   5677
> # Script:       /sbin/modify_resolvconf
> # Saveto:
> # Info:         This is a temporary resolv.conf created by service dhcpcd.
> #               The previous file has been saved and will be restored later.
> #
> #               If you don't like your resolv.conf to be changed, you
> #               can set MODIFY_{RESOLV,NAMED}_CONF_DYNAMICALLY=no. This
> #               variables are placed in /etc/sysconfig/network/config.
> #
> #               You can also configure service dhcpcd not to modify it.
> #
> #               If you don't like dhcpcd to change your nameserver
> #               settings
> #               then either set DHCLIENT_MODIFY_RESOLV_CONF=no
> #               in /etc/sysconfig/network/dhcp, or
> #               set MODIFY_RESOLV_CONF_DYNAMICALLY=no in
> #               /etc/sysconfig/network/config or (manually) use dhcpcd
> #               with -R.  If you only want to keep your searchlist, set
> #               DHCLIENT_KEEP_SEARCHLIST=yes in /etc/sysconfig/network/dhcp or
> #               (manually) use the -K option.
> #
> ### END INFO
> search tepper.cmu.edu
> nameserver 128.2.1.10
> nameserver 128.2.1.11
> 
> [.....]
> 
> -------------------------------------------------------------
> 
> Thanks,
> 
> Brett
> 
> 
> 
> On 4/4/07, Ralph Butler <rbutler at mtsu.edu> wrote:
> > First I should point out that mpdallexit will probably fail if you
> > have not successfully debugged host/net config problems; use kill -9
> > instead.
> > Then, I note that you are trying mpdboot with 2 hosts.  But the manual
> > suggests that when there are multiple hosts, the config has to be
> > verified for all of them.  This includes running mpdcheck as server and
> > client on each, then reversing their roles, running mpd by hand on
> > each, etc.
> > Here is a typical blurb I send out as a reminder:
> >
> > Sometimes there are problems with mpd or mpdboot while following
> > the Quick Start portion of the mpich2 install guide.  This typically
> > happens somewhere during Steps 10-13, but may occur during other
> > steps as well.  The guide suggests that when mpd/mpdboot problems
> > arise, you follow the procedures in Appendix A (Troubleshooting MPDs).
> >
> > Section A.1 (Getting Started with MPD) provides a 7-step procedure
> > to follow to get one or more mpds working, first by hand, and
> > then via mpdboot.  However, some of the early steps begin with a
> > pre-MPD program called mpdcheck.  That program is designed to help
> > determine in advance if there will be problems associated with host
> > or network configuration.  The instructions in section A.1 suggest
> > first using mpdcheck on individual machines, and then pair-wise.
> > It is particularly important to try the pair-wise experiments where
> > one machine plays the role of the server and the other the client,
> > and then to reverse the roles.
> >
> > Sometimes the procedures in A.1 indicate that MPDs are not likely
> > to run on your systems due to problems with host and/or network
> > configuration.  At those points, you are referred to subsequent
> > sections, e.g. A.2 Debugging host/network configuration problems,
> > or A.3 Firewalls, etc.
> >
> >
> > On Wed Apr 4, at 9:41AM, Brett Gordon wrote:
> >
> > > Hi Ralph,
> > >
> > > Thanks for your response.
> > >
> > > I did as you suggested, and it seems to work, but I still can't get
> > > the ring running.
> > >
> > > Terminal 1:
> > > brgordon at veritas:~> mpdallexit    //Just to make sure nothing else is running
> > > mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_brgordon);
> > > possible causes:
> > >  1. no mpd is running on this host
> > >  2. an mpd is running but was started without a "console" (-n option)
> > > In case 1, you can start an mpd on this host with:
> > >    mpd &
> > > and you will be able to run jobs just on this host.
> > > For more details on starting mpds on a set of hosts, see
> > > the MPICH2 Installation Guide.
> > > brgordon at veritas:~> mpdcheck
> > > brgordon at veritas:~> mpdcheck -s
> > > server listening at INADDR_ANY on: veritas 23768
> > > server has conn on <socket._socketobject object at 0x2aaaaab42650> from ('128.2.93.142', 25125)
> > > server successfully recvd msg from client: hello_from_client_to_server
> > > brgordon at veritas:~>
> > >
> > > Terminal 2:
> > > brgordon at veritas:~> mpdcheck -c veritas 23768
> > > client successfully recvd ack from server: ack_from_server_to_client
> > > brgordon at veritas:~>
> > >
> > > I then tried to run mpdboot from the computer 'veritas', hoping to
> > > bring up a ring with 'veritas' and 'elaine', and got the following:
> > >
> > > brgordon at veritas:~> mpdboot -n 2 -f mpd.hosts --user=brgordon --verbose --chkup
> > > checking elaine
> > > there are 2 hosts up (counting local)
> > > running mpdallexit on veritas
> > > LAUNCHED mpd on veritas  via
> > > RUNNING: mpd on veritas
> > > LAUNCHED mpd on elaine  via  veritas
> > > mpdboot_veritas (handle_mpd_output 383): failed to connect to mpd on elaine
> > >
> > > brgordon at veritas:~> less mpd.hosts
> > > elaine
> > > brgordon at veritas:~> less .mpd.conf
> > > secretword=<my password>
> > >
> > > Same files exist on 'elaine', but the host is listed as 'veritas'.
> > >
> > > Thanks,
> > > Brett
> > >
> > >
> > >
> > > On 4/4/07, Ralph Butler <rbutler at mtsu.edu> wrote:
> > >> In your output for mpdcheck below, there are lots of jumbled error
> > >> msgs from an mpd as well.  Apparently you had started an mpd in that
> > >> same window at some point.  Anyway, it is best to make sure that all
> > >> mpd processes are killed before doing the mpdcheck.  Then, try it
> > >> again.  As the manual suggests, it is pointless to try using mpdboot
> > >> unless you have cleared up all issues first.  Even starting an mpd
> > >> ring by hand is only recommended after successfully debugging with
> > >> mpdcheck.  So, I suggest trying mpdcheck again, first with no
> > >> options.  Then, with -s in one window and -c in another.
> > >>
> > >> On Tue Apr 3, at 10:34PM, Brett Gordon wrote:
> > >>
> > >> > Hello,
> > >> >
> > >> > I have successfully installed mpich2-1.0.5 on two Linux boxes. Both
> > >> > succeed in the standard tests involving one host solving the 'cpi'
> > >> > program.
> > >> >
> > >> > However, I'm running into two (probably related) problems:
> > >> >
> > >> > 1) When I try to run mpdcheck as a server and client on the same
> > >> > computer (as on page 31 of the install documentation), I get the
> > >> > following:
> > >> >
> > >> > brgordon at veritas:~> mpdcheck -s
> > >> > server listening at INADDR_ANY on: veritas 23761
> > >> > brgordon at veritas:~> mpdcheck -c veritas 23761
> > >> > veritas_23761 (recv_dict_msg 549):recv_dict_msg: errmsg=:invalid literal for int(): hello_fr:
> > >> >  mpdtb:
> > >> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  549,  recv_dict_msg
> > >> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  989,  handle_ring_listener_connection
> > >> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  743,  handle_active_streams
> > >> >    /home/brgordon/mpich2-install/bin/mpd,  286,  runmainloop
> > >> >    /home/brgordon/mpich2-install/bin/mpd,  255,  run
> > >> >    /home/brgordon/mpich2-install/bin/mpd,  1470,  ?
> > >> >
> > >> > veritas_23761 (handle_ring_listener_connection 993): INVALID msg from new connection :('128.2.93.142', 16587): msg=:{}:
> > >> > Traceback (most recent call last):
> > >> >  File "/home/brgordon/mpich2-install/bin/mpdcheck", line 105, in ?
> > >> >    msg = sock.recv(64)
> > >> > socket.error: (104, 'Connection reset by peer')
> > >> >
> > >> > 2) I also can't get a ring to work. I have set up ssh to work without
> > >> > using passwords ('ssh veritas date' works fine). The workaround for
> > >> > mpdboot on page 9 of the install doc does not work for me, nor does
> > >> > running 'mpdcheck -f mpd.hosts -ssh'.
> > >> >
> > >> > When I try to run mpdboot, I get
> > >> > [brgordon at elaine ~]$ mpdboot -n 2 -f mpd.hosts
> > >> > mpdboot_elaine.tepper.cmu.edu (handle_mpd_output 383): failed to connect to mpd on veritas
> > >> >
> > >> >
> > >> > I feel like I'm getting close to having this working, so I would
> > >> > greatly appreciate any help. Please let me know if there is more
> > >> > information I can provide.
> > >> >
> > >> > Thanks,
> > >> > Brett
> > >> >
> > >>
> > >>
> >
> >
> 
> 
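
On the /etc/hosts question in the quoted message: the verbose mpdcheck
output above is checking, among other things, that each hostname resolves
to something other than a loopback address. A rough Python sketch of that
one idea follows. It is not mpdcheck's actual code, and loopback_addrs is a
hypothetical helper name.

```python
import ipaddress
import socket


def loopback_addrs(hostname):
    """Return the IPv4 addresses hostname resolves to that are loopback (127.x.x.x)."""
    # gethostbyname_ex returns (canonical_name, alias_list, ipv4_address_list)
    _name, _aliases, addrs = socket.gethostbyname_ex(hostname)
    return [a for a in addrs if ipaddress.ip_address(a).is_loopback]
```

Running loopback_addrs(socket.gethostname()) on each machine should give an
empty list; a non-empty result would point at an /etc/hosts entry mapping
the machine's own name to 127.x.x.x.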



