[MPICH] Problem setting up a ring

Ralph Butler rbutler at mtsu.edu
Wed Apr 4 13:03:55 CDT 2007


Yes, mpdcheck has now identified the problem.  As you point out below,
you get "no route to host" msgs when trying to reach one host from the
other.  You need to configure the machines so that they can find each
other.  There are a variety of ways to accomplish that.  Some folks put
info into /etc/hosts, others use DNS servers.  This is of course outside
the bounds of mpd and/or mpdcheck.  I would say it is time to turn to
your sysadmin and/or netadmin for configuration assistance.
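
As a rough way to re-check things from each machine after the
configuration changes, something like the Python sketch below could be
used.  It is not part of mpd or mpdcheck: the function names resolve and
can_connect are made up for illustration, and it assumes Python 3.  It
looks up a host name with gethostbyname_ex (the same call that shows up
in the mpdcheck output) and then attempts a plain TCP connection,
reporting EHOSTUNREACH as "no route to host".

```python
import errno
import socket


def resolve(host):
    """Return (canonical_name, aliases, addresses) for host, or None on failure."""
    try:
        return socket.gethostbyname_ex(host)
    except OSError:
        # gaierror and herror are both subclasses of OSError in Python 3.
        return None


def can_connect(host, port, timeout=5.0):
    """Attempt a TCP connection to (host, port); return (ok, message)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            # Same condition as the (113, 'No route to host') error on Linux.
            return False, "no route to host"
        return False, str(exc)
```

For example, with "mpdcheck -s" listening on elaine, calling
can_connect("elaine.tepper.cmu.edu", port) on veritas, where port is the
number printed by the server, should return (True, "connected") once the
two machines can reach each other.  The roles should then be reversed.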

On Wed Apr 4, at 11:07AM, Brett Gordon wrote:

> Ralph,
>
> I went back to the install guide and ran through the steps from 1 - 5
> without any problems.
>
> But there was a problem trying to run mpdcheck -f mpd.hosts. Here is
> the output, showing me trying to connect from 'elaine' to 'veritas'.
> It seems like there is an ssh problem, but as you can see in the last
> call, I was able to ssh with no password.
>
> -------------------------------------------------------
> [brgordon at elaine ~]$ less mpd.hosts
> veritas
>
> [brgordon at elaine ~]$ mpdcheck -f mpd.hosts
>
> [brgordon at elaine ~]$ mpdcheck -v -f mpd.hosts   //First without -ssh
> obtaining hostname via gethostname and getfqdn
> gethostname gives  elaine.tepper.cmu.edu
> getfqdn gives  elaine.tepper.cmu.edu
> checking out unqualified hostname; make sure is not "localhost", etc.
> checking out qualified hostname; make sure is not "localhost", etc.
> obtain IP addrs via qualified and unqualified hostnames;  make sure other than 127.0.0.1
> gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> checking that IP addrs resolve to same host
> now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
> checking gethostbyXXX for unqualified veritas
> gethostbyname_ex:  ('veritas.TEPPER.cmu.edu', [], ['128.2.93.142'])
> checking gethostbyXXX for qualified veritas
> gethostbyname_ex:  ('VERITAS.TEPPER.cmu.edu', [], ['128.2.93.142'])
>
> [brgordon at elaine ~]$ mpdcheck -v -f mpd.hosts -ssh  //and now with
> obtaining hostname via gethostname and getfqdn
> gethostname gives  elaine.tepper.cmu.edu
> getfqdn gives  elaine.tepper.cmu.edu
> checking out unqualified hostname; make sure is not "localhost", etc.
> checking out qualified hostname; make sure is not "localhost", etc.
> obtain IP addrs via qualified and unqualified hostnames;  make sure other than 127.0.0.1
> gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> gethostbyname_ex:  ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> checking that IP addrs resolve to same host
> now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
> checking gethostbyXXX for unqualified veritas
> gethostbyname_ex:  ('veritas.TEPPER.cmu.edu', [], ['128.2.93.142'])
> checking gethostbyXXX for qualified veritas
> gethostbyname_ex:  ('VERITAS.TEPPER.cmu.edu', [], ['128.2.93.142'])
> trying: ssh veritas -x -n /bin/echo hello
> starting server: /home/brgordon/mpich2-install/bin/mpdcheck.py -s
> starting client: ssh veritas -x -n /home/brgordon/mpich2-install/bin/mpdcheck.py -c elaine.tepper.cmu.edu 36031
> client on veritas failed to access the server
> here is the output:
> Traceback (most recent call last):
>  File "/home/brgordon/mpich2-install/bin/mpdcheck.py", line 103, in ?
>    sock.connect((argv[argidx+1],int(argv[argidx+2])))  # note double parens
>  File "<string>", line 1, in connect
> socket.error: (113, 'No route to host')
>
> [brgordon at elaine ~]$ ssh veritas date
> Mon Apr  9 00:42:41 EDT 2007
> -------------------------------------------------------
>
> I tried running the ssh command by hand from elaine, and then tried
> doing the mpdcheck command from veritas, but neither worked. Both
> returned a "no route to host" error.
>
> Here is the first part of the output from mpdcheck -pc. Perhaps I need
> to edit the /etc/hosts or /etc/resolv.conf files on veritas?
>
> Elaine
> -------
> [brgordon at elaine ~]$ mpdcheck -pc
> --- print results of: gethostbyname_ex(gethostname())
> ('elaine.tepper.cmu.edu', ['elaine'], ['128.2.89.115'])
> --- try to run /bin/hostname
> elaine.tepper.cmu.edu
> --- try to run uname -a
> Linux elaine.tepper.cmu.edu 2.6.18-1.2257.fc5 #1 SMP Fri Dec 15 16:07:14 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
> --- try to print /etc/hosts
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
> 127.0.0.1               localhost.localdomain localhost
> 128.2.89.115            elaine.tepper.cmu.edu elaine
> 128.2.89.114    austen.tepper.cmu.edu      austen
> 128.2.89.117    puddy.tepper.cmu.edu       puddy.tepper
> 128.2.89.118    rlg.tepper.cmu.edu         rlg
> 128.2.13.161    unix31.andrew.cmu.edu      unix31
> 128.2.13.162    unix32.andrew.cmu.edu      unix32
> 128.2.92.151    bigp.tepper.cmu.edu        bigp
> 128.32.66.92    bear.haas.berkeley.edu     bear
> --- try to print /etc/resolv.conf
> search TEPPER.cmu.edu
> nameserver 128.2.1.11
> nameserver 128.2.1.10
>
> [.....]
>
>
> Veritas
> ----------
> brgordon at veritas:~> mpdcheck -pc
> --- print results of: gethostbyname_ex(gethostname())
> ('veritas.tepper.cmu.edu', [], ['128.2.93.142'])
> --- try to run /bin/hostname
> veritas
> --- try to run uname -a
> Linux veritas 2.6.13-15.15-smp #1 SMP Mon Feb 26 14:11:33 UTC 2007 x86_64 x86_64 x86_64 GNU/Linux
> --- try to print /etc/hosts
> #
> # hosts         This file describes a number of hostname-to-address
> #               mappings for the TCP/IP subsystem.  It is mostly
> #               used at boot time, when no name servers are running.
> #               On small systems, this file can be used instead of a
> #               "named" name server.
> # Syntax:
> #
> # IP-Address  Full-Qualified-Hostname  Short-Hostname
> #
>
> 127.0.0.1       localhost
>
> # special IPv6 addresses
> ::1             localhost ipv6-localhost ipv6-loopback
>
> fe00::0         ipv6-localnet
>
> ff00::0         ipv6-mcastprefix
> ff02::1         ipv6-allnodes
> ff02::2         ipv6-allrouters
> ff02::3         ipv6-allhosts
> 127.0.0.2       linux.site linux
> --- try to print /etc/resolv.conf
> ### BEGIN INFO
> #
> # Modified_by:  dhcpcd
> # Backup:       /etc/resolv.conf.saved.by.dhcpcd.eth0
> # Process:      dhcpcd
> # Process_id:   5677
> # Script:       /sbin/modify_resolvconf
> # Saveto:
> # Info:         This is a temporary resolv.conf created by service dhcpcd.
> #               The previous file has been saved and will be restored later.
> #
> #               If you don't like your resolv.conf to be changed, you
> #               can set MODIFY_{RESOLV,NAMED}_CONF_DYNAMICALLY=no.  This
> #               variables are placed in /etc/sysconfig/network/config.
> #
> #               You can also configure service dhcpcd not to modify it.
> #
> #               If you don't like dhcpcd to change your nameserver
> #               settings
> #               then either set DHCLIENT_MODIFY_RESOLV_CONF=no
> #               in /etc/sysconfig/network/dhcp, or
> #               set MODIFY_RESOLV_CONF_DYNAMICALLY=no in
> #               /etc/sysconfig/network/config or (manually) use dhcpcd
> #               with -R.  If you only want to keep your searchlist, set
> #               DHCLIENT_KEEP_SEARCHLIST=yes in /etc/sysconfig/network/dhcp or
> #               (manually) use the -K option.
> #
> ### END INFO
> search tepper.cmu.edu
> nameserver 128.2.1.10
> nameserver 128.2.1.11
>
> [.....]
>
> -------------------------------------------------------------
>
> Thanks,
>
> Brett
>
>
>
> On 4/4/07, Ralph Butler <rbutler at mtsu.edu> wrote:
>> First I should point out that mpdallexit will probably fail if you
>> have not successfully debugged host/net config problems; use kill -9.
>> Then, I note that you are trying mpdboot with 2 hosts.  But, the manual
>> suggests that when there are multiple hosts, config has to be
>> verified for all.  This includes running mpdcheck as server and client
>> on each, then reversing their roles, running mpd by hand on each, etc.
>> Here is a typical blurb I send out as a reminder:
>>
>> Sometimes there are problems with mpd or mpdboot while following
>> the Quick Start portion of the mpich2 install guide.  This typically
>> happens somewhere during Steps 10-13, but may occur during other
>> steps as well.  The guide suggests that when mpd/mpdboot problems
>> arise, you follow the procedures in Appendix A (Troubleshooting MPDs).
>>
>> Section A.1 (Getting Started with MPD) provides a 7-step procedure
>> to follow to get one or more mpds working, first by hand, and
>> then via mpdboot.  However, some of the early steps begin with a
>> pre-MPD program called mpdcheck.  That program is designed to help
>> determine in advance if there will be problems associated with host
>> or network configuration.  The instructions in section A.1 suggest
>> first using mpdcheck on individual machines, and then pair-wise.
>> It is particularly important to try the pair-wise experiments where
>> one machine plays the role of the server and the other the client,
>> and then to reverse the roles.
>>
>> Sometimes the procedures in A.1 indicate that MPDs are not likely
>> to run on your systems due to problems with host and/or network
>> configuration.  At those points, you are referred to subsequent
>> sections, e.g. A.2 Debugging host/network configuration problems,
>> or A.3 Firewalls, etc.
>>
>>
>> On Wed Apr 4, at 9:41AM, Brett Gordon wrote:
>>
>> > Hi Ralph,
>> >
>> > Thanks for your response.
>> >
>> > I did as you suggested, and it seems to work, but I still can't get
>> > the ring running.
>> >
>> > Terminal 1:
>> > brgordon at veritas:~> mpdallexit    //Just to make sure nothing else is running
>> > mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_brgordon);
>> > possible causes:
>> >  1. no mpd is running on this host
>> >  2. an mpd is running but was started without a "console" (-n option)
>> > In case 1, you can start an mpd on this host with:
>> >    mpd &
>> > and you will be able to run jobs just on this host.
>> > For more details on starting mpds on a set of hosts, see
>> > the MPICH2 Installation Guide.
>> > brgordon at veritas:~> mpdcheck
>> > brgordon at veritas:~> mpdcheck -s
>> > server listening at INADDR_ANY on: veritas 23768
>> > server has conn on <socket._socketobject object at 0x2aaaaab42650> from ('128.2.93.142', 25125)
>> > server successfully recvd msg from client: hello_from_client_to_server
>> > brgordon at veritas:~>
>> >
>> > Terminal 2:
>> > brgordon at veritas:~> mpdcheck -c veritas 23768
>> > client successfully recvd ack from server: ack_from_server_to_client
>> > brgordon at veritas:~>
>> >
>> > I then tried to run mpdboot from the computer 'veritas', hoping to
>> > bring up a ring with 'veritas' and 'elaine', and got the following:
>> >
>> > brgordon at veritas:~> mpdboot -n 2 -f mpd.hosts --user=brgordon --verbose  --chkup
>> > checking elaine
>> > there are 2 hosts up (counting local)
>> > running mpdallexit on veritas
>> > LAUNCHED mpd on veritas  via
>> > RUNNING: mpd on veritas
>> > LAUNCHED mpd on elaine  via  veritas
>> > mpdboot_veritas (handle_mpd_output 383): failed to connect to mpd on elaine
>> >
>> > brgordon at veritas:~> less mpd.hosts
>> > elaine
>> > brgordon at veritas:~> less .mpd.conf
>> > secretword=<my password>
>> >
>> > Same files exist on 'elaine', but the host is listed as 'veritas'.
>> >
>> > Thanks,
>> > Brett
>> >
>> >
>> >
>> > On 4/4/07, Ralph Butler <rbutler at mtsu.edu> wrote:
>> >> In your output for mpdcheck below, there are lots of jumbled error
>> >> msgs from an mpd as well.  Apparently you had started an mpd in that
>> >> same window at some point.  Anyway, it is best to make sure that all
>> >> mpd processes are killed before doing the mpdcheck.  Then, try it
>> >> again.  As the manual suggests, it is pointless to try using mpdboot
>> >> unless you have cleared up all issues first.  Even starting an mpd
>> >> ring by hand is only recommended after successfully debugging with
>> >> mpdcheck.  So, I suggest trying mpdcheck again, first with no
>> >> options.  Then, with -s in one window and -c in another.
>> >>
>> >> On Tue Apr 3, at 10:34PM, Brett Gordon wrote:
>> >>
>> >> > Hello,
>> >> >
>> >> > I have successfully installed mpich2-1.0.5 on two Linux boxes. Both
>> >> > succeed in the standard tests involving one host solving the 'cpi'
>> >> > program.
>> >> >
>> >> > However, I'm running into two (probably related) problems:
>> >> >
>> >> > 1) When I try to run mpd as a server and client on the same computer
>> >> > (as on page 31 of the install documentation), I get the following:
>> >> >
>> >> > brgordon at veritas:~> mpdcheck -s
>> >> > server listening at INADDR_ANY on: veritas 23761
>> >> > brgordon at veritas:~> mpdcheck -c veritas 23761
>> >> > veritas_23761 (recv_dict_msg 549):recv_dict_msg: errmsg=:invalid
>> >> > literal for int(): hello_fr:
>> >> >  mpdtb:
>> >> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  549,  recv_dict_msg
>> >> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  989,  handle_ring_listener_connection
>> >> >    /home/brgordon/mpich2-install/bin/mpdlib.py,  743,  handle_active_streams
>> >> >    /home/brgordon/mpich2-install/bin/mpd,  286,  runmainloop
>> >> >    /home/brgordon/mpich2-install/bin/mpd,  255,  run
>> >> >    /home/brgordon/mpich2-install/bin/mpd,  1470,  ?
>> >> >
>> >> > veritas_23761 (handle_ring_listener_connection 993): INVALID msg from new connection :('128.2.93.142', 16587): msg=:{}:
>> >> > Traceback (most recent call last):
>> >> >  File "/home/brgordon/mpich2-install/bin/mpdcheck", line 105, in ?
>> >> >    msg = sock.recv(64)
>> >> > socket.error: (104, 'Connection reset by peer')
>> >> >
>> >> > 2) I also can't get a ring to work. I have set up ssh to work without
>> >> > using passwords ('ssh veritas date' works fine). The workaround for
>> >> > mpdboot on page 9 of the install doc does not work for me, nor does
>> >> > running 'mpdcheck -f mpd.hosts -ssh'.
>> >> >
>> >> > When I try to run mpdboot, I get
>> >> > [brgordon at elaine ~]$ mpdboot -n 2 -f mpd.hosts
>> >> > mpdboot_elaine.tepper.cmu.edu (handle_mpd_output 383): failed to connect to mpd on veritas
>> >> >
>> >> >
>> >> > I feel like I'm getting close to having this working, so I would
>> >> > greatly appreciate any help. Please let me know if there is more
>> >> > information I can provide.
>> >> >
>> >> > Thanks,
>> >> > Brett
>> >> >
>> >>
>> >>
>>
>>



