[MPICH] Problem setting up a ring

Brett Gordon brgordon at gmail.com
Wed Apr 4 11:07:15 CDT 2007


Ralph,

I went back to the install guide and ran through the steps from 1 - 5
without any problems.

But there was a problem trying to run mpdcheck -f mpd.hosts. Here is
the output, showing me trying to connect from 'elaine' to 'veritas'.
It seems like there is an ssh problem, but as you can see in the last
call, I was able to ssh with no password.

-------------------------------------------------------
[brgordon at elaine ~]$ less mpd.hosts
veritas

[brgordon at elaine ~]$ mpdcheck -f mpd.hosts

[brgordon at elaine ~]$ mpdcheck -v -f mpd.hosts   //First without -ssh
obtaining hostname via gethostname and getfqdn
gethostname gives  elaine.tepper.cmu.edu
getfqdn gives  elaine.tepper.cmu.edu
checking out unqualified hostname; make sure is not "localhost", etc.
checking out qualified hostname; make sure is not "localhost", etc.
obtain IP addrs via qualified and unqualified hostnames;  make sure
other than 127.0.0.1
gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
checking that IP addrs resolve to same host
now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
checking gethostbyXXX for unqualified veritas
gethostbyname_ex:  ('veritas.TEPPER.cmu.edu', [], ['128.2.93.142'])
checking gethostbyXXX for qualified veritas
gethostbyname_ex:  ('VERITAS.TEPPER.cmu.edu', [], ['128.2.93.142'])

[brgordon at elaine ~]$ mpdcheck -v -f mpd.hosts -ssh  //and now with
obtaining hostname via gethostname and getfqdn
gethostname gives  elaine.tepper.cmu.edu
getfqdn gives  elaine.tepper.cmu.edu
checking out unqualified hostname; make sure is not "localhost", etc.
checking out qualified hostname; make sure is not "localhost", etc.
obtain IP addrs via qualified and unqualified hostnames;  make sure
other than 127.0.0.1
gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
checking that IP addrs resolve to same host
now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
checking gethostbyXXX for unqualified veritas
gethostbyname_ex:  ('veritas.TEPPER.cmu.edu', [], ['128.2.93.142'])
checking gethostbyXXX for qualified veritas
gethostbyname_ex:  ('VERITAS.TEPPER.cmu.edu', [], ['128.2.93.142'])
trying: ssh veritas -x -n /bin/echo hello
starting server: /home/brgordon/mpich2-install/bin/mpdcheck.py -s
starting client: ssh veritas -x -n
/home/brgordon/mpich2-install/bin/mpdcheck.py -c elaine.tepper.cmu.edu
36031
client on veritas failed to access the server
here is the output:
Traceback (most recent call last):
  File "/home/brgordon/mpich2-install/bin/mpdcheck.py", line 103, in ?
    sock.connect((argv[argidx+1],int(argv[argidx+2])))  # note double parens
  File "<string>", line 1, in connect
socket.error: (113, 'No route to host')

[brgordon at elaine ~]$ ssh veritas date
Mon Apr  9 00:42:41 EDT 2007
-------------------------------------------------------

I tried running the ssh command by hand from elaine, and then tried
doing the mpdcheck command from veritas, but neither worked. Both
returned a "no route to host" error.
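
For anyone reproducing this: the failing step is just a TCP connect from the client host back to the port that the mpdcheck server printed. Below is a minimal sketch of that one step, not mpdcheck's actual code. It assumes Python 3 (the mpdcheck in this thread ran under an older Python), and the function name probe and its return labels are made up for illustration. On Linux, errno 113 is EHOSTUNREACH, which is what the traceback above reports. A firewall on the target host is one possible cause, but that is an assumption and not a diagnosis.

```python
import errno
import socket


def probe(host, port, timeout=3.0):
    """Try one TCP connection to host:port and return a short outcome label."""
    try:
        # create_connection resolves the name and connects, like the
        # sock.connect((host, port)) call shown in the traceback.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        return "timeout"
    except socket.gaierror:
        return "name lookup failed"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            return "no route to host"  # errno 113 on Linux
        if exc.errno == errno.ECONNREFUSED:
            return "refused"  # host reachable, nothing listening on that port
        return "error: %s" % exc
```

To use it, start "mpdcheck -s" on one host, note the port it prints, and call probe(server_host, port) on the other host. Then swap the roles. A "refused" result at least shows the host is reachable, while "no route to host" points at routing or packet filtering rather than at ssh.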

Here is the first part of the output from mpdcheck -pc. Perhaps I need
to edit the /etc/hosts or /etc/resolv.conf files on veritas?

Elaine
-------
[brgordon at elaine ~]$ mpdcheck -pc
--- print results of: gethostbyname_ex(gethostname())
('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
--- try to run /bin/hostname
elaine.tepper.cmu.edu
--- try to run uname -a
Linux elaine.tepper.cmu.edu 2.6.18-1.2257.fc5 #1 SMP Fri Dec 15
16:07:14 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
--- try to print /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
128.2.89.115            elaine.tepper.cmu.edu elaine
128.2.89.114    austen.tepper.cmu.edu      austen
128.2.89.117    puddy.tepper.cmu.edu       puddy.tepper
128.2.89.118    rlg.tepper.cmu.edu         rlg
128.2.13.161    unix31.andrew.cmu.edu      unix31
128.2.13.162    unix32.andrew.cmu.edu      unix32
128.2.92.151    bigp.tepper.cmu.edu        bigp
128.32.66.92    bear.haas.berkeley.edu     bear
--- try to print /etc/resolv.conf
search TEPPER.cmu.edu
nameserver 128.2.1.11
nameserver 128.2.1.10

[.....]


Veritas
----------
brgordon at veritas:~> mpdcheck -pc
--- print results of: gethostbyname_ex(gethostname())
('veritas.tepper.cmu.edu', [], ['128.2.93.142'])
--- try to run /bin/hostname
veritas
--- try to run uname -a
Linux veritas 2.6.13-15.15-smp #1 SMP Mon Feb 26 14:11:33 UTC 2007
x86_64 x86_64 x86_64 GNU/Linux
--- try to print /etc/hosts
#
# hosts         This file describes a number of hostname-to-address
#               mappings for the TCP/IP subsystem.  It is mostly
#               used at boot time, when no name servers are running.
#               On small systems, this file can be used instead of a
#               "named" name server.
# Syntax:
#
# IP-Address  Full-Qualified-Hostname  Short-Hostname
#

127.0.0.1       localhost

# special IPv6 addresses
::1             localhost ipv6-localhost ipv6-loopback

fe00::0         ipv6-localnet

ff00::0         ipv6-mcastprefix
ff02::1         ipv6-allnodes
ff02::2         ipv6-allrouters
ff02::3         ipv6-allhosts
127.0.0.2       linux.site linux
--- try to print /etc/resolv.conf
### BEGIN INFO
#
# Modified_by:  dhcpcd
# Backup:       /etc/resolv.conf.saved.by.dhcpcd.eth0
# Process:      dhcpcd
# Process_id:   5677
# Script:       /sbin/modify_resolvconf
# Saveto:
# Info:         This is a temporary resolv.conf created by service dhcpcd.
#               The previous file has been saved and will be restored later.
#
#               If you don't like your resolv.conf to be changed, you
#               can set MODIFY_{RESOLV,NAMED}_CONF_DYNAMICALLY=no. This
#               variables are placed in /etc/sysconfig/network/config.
#
#               You can also configure service dhcpcd not to modify it.
#
#               If you don't like dhcpcd to change your nameserver
#               settings
#               then either set DHCLIENT_MODIFY_RESOLV_CONF=no
#               in /etc/sysconfig/network/dhcp, or
#               set MODIFY_RESOLV_CONF_DYNAMICALLY=no in
#               /etc/sysconfig/network/config or (manually) use dhcpcd
#               with -R.  If you only want to keep your searchlist, set
#               DHCLIENT_KEEP_SEARCHLIST=yes in /etc/sysconfig/network/dhcp or
#               (manually) use the -K option.
#
### END INFO
search tepper.cmu.edu
nameserver 128.2.1.10
nameserver 128.2.1.11

[.....]

-------------------------------------------------------------
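
On the /etc/hosts question: the verbose mpdcheck output above says it wants each hostname to resolve to something "other than 127.0.0.1". Here is a rough sketch of that kind of check, assuming Python 3. The helper names loopback_addrs and check_name are hypothetical and this is not how mpdcheck itself is written. In the output above both hosts resolve to their 128.2.x.x addresses, so a check like this would come back clean for them.

```python
import ipaddress
import socket


def loopback_addrs(addrs):
    """Return the entries of addrs that are loopback (127.0.0.0/8 or ::1)."""
    return [a for a in addrs if ipaddress.ip_address(a).is_loopback]


def check_name(name):
    """Resolve name and report canonical name, aliases, addresses, loopback hits."""
    # Same resolver call whose results appear in the mpdcheck output.
    canonical, aliases, addrs = socket.gethostbyname_ex(name)
    return {
        "canonical": canonical,
        "aliases": aliases,
        "addrs": addrs,
        "loopback": loopback_addrs(addrs),
    }
```

Running check_name(socket.gethostname()) on each host, and then check_name on the other host's name, gives the same information as the first lines of "mpdcheck -pc". A non-empty "loopback" list would mean the name maps to a 127.x address, for example through a hosts entry like the "127.0.0.2 linux.site linux" line on veritas if the hostname were still "linux".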

Thanks,

Brett



On 4/4/07, Ralph Butler <rbutler at mtsu.edu> wrote:
> First I should point out that mpdallexit will probably fail if you
> have not successfully debugged host/net config problems; use kill -9.
> Then, I note that you are trying mpdboot with 2 hosts.  But, the manual
> suggests that when there are multiple hosts, the config has to be
> verified for all.  This includes running mpdcheck as server and client
> on each, then reversing their roles, running mpd by hand on each, etc.
> Here is a typical blurb I send out as a reminder:
>
> Sometimes there are problems with mpd or mpdboot while following
> the Quick Start portion of the mpich2 install guide.  This typically
> happens somewhere during Steps 10-13, but may occur during other
> steps as well.  The guide suggests that when mpd/mpdboot problems
> arise, you follow the procedures in Appendix A (Troubleshooting MPDs).
>
> Section A.1 (Getting Started with MPD) provides a 7-step procedure
> to follow to get one or more mpds working, first by hand, and
> then via mpdboot.  However, some of the early steps begin with a
> pre-MPD program called mpdcheck.  That program is designed to help
> determine in advance if there will be problems associated with host
> or network configuration.  The instructions in section A.1 suggest
> first using mpdcheck on individual machines, and then pair-wise.
> It is particularly important to try the pair-wise experiments where
> one machine plays the role of the server and the other the client,
> and then to reverse the roles.
>
> Sometimes the procedures in A.1 indicate that MPDs are not likely
> to run on your systems due to problems with host and/or network
> configuration.  At those points, you are referred to subsequent
> sections, e.g. A.2 Debugging host/network configuration problems,
> or A.3 Firewalls, etc.
>
>
> On Wed Apr 4, at 9:41AM, Brett Gordon wrote:
>
> > Hi Ralph,
> >
> > Thanks for your response.
> >
> > I did as you suggested, and it seems to work, but I still can't get
> > the ring running.
> >
> > Terminal 1:
> > brgordon at veritas:~> mpdallexit    //Just to make sure nothing else
> > is running
> > mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_brgordon);
> > possible causes:
> >  1. no mpd is running on this host
> >  2. an mpd is running but was started without a "console" (-n option)
> > In case 1, you can start an mpd on this host with:
> >    mpd &
> > and you will be able to run jobs just on this host.
> > For more details on starting mpds on a set of hosts, see
> > the MPICH2 Installation Guide.
> > brgordon at veritas:~> mpdcheck
> > brgordon at veritas:~> mpdcheck -s
> > server listening at INADDR_ANY on: veritas 23768
> > server has conn on <socket._socketobject object at 0x2aaaaab42650>
> > from ('128.2.93.142', 25125)
> > server successfully recvd msg from client: hello_from_client_to_server
> > brgordon at veritas:~>
> >
> > Terminal 2:
> > brgordon at veritas:~> mpdcheck -c veritas 23768
> > client successfully recvd ack from server: ack_from_server_to_client
> > brgordon at veritas:~>
> >
> > I then tried to run mpdboot from the computer 'veritas', hoping to
> > bring up a ring with 'veritas' and 'elaine', and got the following:
> >
> > brgordon at veritas:~> mpdboot -n 2 -f mpd.hosts --user=brgordon --verbose --chkup
> > checking elaine
> > there are 2 hosts up (counting local)
> > running mpdallexit on veritas
> > LAUNCHED mpd on veritas  via
> > RUNNING: mpd on veritas
> > LAUNCHED mpd on elaine  via  veritas
> > mpdboot_veritas (handle_mpd_output 383): failed to connect to mpd
> > on elaine
> >
> > brgordon at veritas:~> less mpd.hosts
> > elaine
> > brgordon at veritas:~> less .mpd.conf
> > secretword=<my password>
> >
> > The same files exist on 'elaine', except that mpd.hosts there lists
> > 'veritas'.
> >
> > Thanks,
> > Brett
> >
> >
> >
> > On 4/4/07, Ralph Butler <rbutler at mtsu.edu> wrote:
> >> In your output for mpdcheck below, there are lots of jumbled error
> >> msgs from an mpd as well.  Apparently you had started an mpd in that
> >> same window at some point.  Anyway, it is best to make sure that all
> >> mpd processes are killed before doing the mpdcheck.  Then, try it
> >> again.  As the manual suggests, it is pointless to try using mpdboot
> >> unless you have cleared up all issues first.  Even starting an mpd
> >> ring by hand is only recommended after successfully debugging with
> >> mpdcheck.  So, I suggest trying mpdcheck again, first with no
> >> options.  Then, with -s in one window and -c in another.
> >>
> >> On Tue Apr 3, at 10:34PM, Brett Gordon wrote:
> >>
> >> > Hello,
> >> >
> >> > I have successfully installed mpich2-1.0.5 on two linux boxes. Both
> >> > succeed in the standard tests involving one host solving the 'cpi'
> >> > program.
> >> >
> >> > However, I'm running into two (probably related) problems:
> >> >
> >> > 1) When I try to run mpdcheck as a server and client on the same
> >> > computer (as on page 31 of the install documentation), I get the
> >> > following:
> >> >
> >> > brgordon at veritas:~> mpdcheck -s
> >> > server listening at INADDR_ANY on: veritas 23761
> >> > brgordon at veritas:~> mpdcheck -c veritas 23761
> >> > veritas_23761 (recv_dict_msg 549):recv_dict_msg: errmsg=:invalid
> >> > literal for int(): hello_fr:
> >> >  mpdtb:
> >> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  549,  recv_dict_msg
> >> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  989,  handle_ring_listener_connection
> >> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  743,  handle_active_streams
> >> >    /home/brgordon/mpich2-install/bin/mpd,  286,  runmainloop
> >> >    /home/brgordon/mpich2-install/bin/mpd,  255,  run
> >> >    /home/brgordon/mpich2-install/bin/mpd,  1470,  ?
> >> >
> >> > veritas_23761 (handle_ring_listener_connection 993): INVALID msg from
> >> > new connection :('128.2.93.142', 16587): msg=:{}:
> >> > Traceback (most recent call last):
> >> >  File "/home/brgordon/mpich2-install/bin/mpdcheck", line 105, in ?
> >> >    msg = sock.recv(64)
> >> > socket.error: (104, 'Connection reset by peer')
> >> >
> >> > 2) I also can't get a ring to work. I have set up ssh to work without
> >> > using passwords ('ssh veritas date' works fine). The workaround for
> >> > mpdboot on page 9 of the install doc does not work for me, nor does
> >> > running 'mpdcheck -f mpd.hosts -ssh'.
> >> >
> >> > When I try to run mpdboot, I get
> >> > [brgordon at elaine ~]$ mpdboot -n 2 -f mpd.hosts
> >> > mpdboot_elaine.tepper.cmu.edu (handle_mpd_output 383): failed to
> >> > connect to mpd on veritas
> >> >
> >> >
> >> > I feel like I'm getting close to having this working, so I would
> >> > greatly appreciate any help. Please let me know if there is more
> >> > information I can provide.
> >> >
> >> > Thanks,
> >> > Brett
> >> >
> >>
> >>
>
>



