[MPICH] mpdboot and mpdcheck problems
Ralph Butler
rbutler at mtsu.edu
Wed Aug 2 16:30:59 CDT 2006
The key is to concentrate on getting 2 nodes working together first.
So, if there are problems between the head node and a compute node,
start with them. As the manual suggests, you should be able to run
mpdcheck as a server (-s option) on the head node and run it as a
client (-c ... args) on the compute node. If that fails, the output
may be useful. If it succeeds, reverse the roles and try it again,
i.e. server on the compute node and client on the head node.

mpdcheck is not really an mpd program; it is a pre-mpd program that
tries to see if your system is OK. If any of these runs fail, you
probably have configuration problems that you need to resolve. We may
be able to offer some help depending on the system. You have attached
the result of "mpdcheck -pc" from one node, but if two are failing to
work together, we really need the output from both to make any
educated guesses.

Typical problems include having some form of firewalling turned on,
blocking ports from one machine to another. We may be able to guess
that this is what you have, but not necessarily how you have done it.
The manual addresses these issues somewhat.
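
To illustrate the kind of two-node test described above, here is a
minimal Python sketch of a server/client socket check. It is not
mpdcheck's actual code; the function names (start_server, serve_once,
client_check) and the greeting strings are made up for illustration.
The idea is to run the server half on one node, note the port it
reports, run the client half on the other node with that hostname and
port, and then swap roles. The __main__ part only exercises both
halves over loopback.

```python
import socket
import threading


def start_server(bind_addr="", port=0):
    """Listen on bind_addr; port=0 lets the OS pick a free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((bind_addr, port))
    srv.listen(1)
    return srv


def serve_once(srv, timeout=10.0):
    """Accept one client, read its greeting, reply, and return the greeting."""
    srv.settimeout(timeout)
    conn, _peer = srv.accept()
    with conn:
        conn.settimeout(timeout)
        msg = conn.recv(64)
        conn.sendall(b"hello_from_server")
    return msg


def client_check(host, port, timeout=10.0):
    """Connect to host:port, send a greeting, and return the server's reply.

    Raises OSError (for example 'No route to host') if the connection
    cannot be made.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"hello_from_client")
        return sock.recv(64)


if __name__ == "__main__":
    # Loopback self-test only. On a real cluster, run the server half on
    # one node and call client_check(server_hostname, port) on the other.
    srv = start_server("127.0.0.1")
    port = srv.getsockname()[1]
    t = threading.Thread(target=serve_once, args=(srv,))
    t.start()
    print(client_check("127.0.0.1", port))
    t.join()
    srv.close()
```

If client_check raises an OSError between two nodes that can ping each
other, a firewall rejecting the port is one possible cause, in line
with the firewalling problem mentioned above.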
--ralph
On Wed Aug 2, at 2:59 PM, Zach Ponder wrote:
> I'm having some trouble getting MPICH2-1.0.3 up and running on a
> three-computer setup: one master and two compute nodes. I've seen
> a mailing list archive post from someone who seemed to have a
> similar problem, and they were able to correct it in some manner.
>
> http://www-unix.mcs.anl.gov/web-mail-archive/lists/mpich-discuss/2006/04/msg00037.html
>
> It seemed to be a problem with the mpd being addressed to
> 127.0.0.1. Not entirely sure if I'm in the same situation, but I
> am stuck on how to fix it. I'm afraid that it is some sort of
> simple networking issue, but since this is my first venture into
> cluster computing everything is posing a challenge.
>
> Things I'm able to do or have done:
>
> ping between boxes
> ssh between boxes without password
> bring up an mpd on each box
> made the changes to mpd.py (commented two lines)
>
> Things I'm unable to do:
>
> use mpdboot to bring up a ring of mpds
> manually start a server/client mpd on two machines (gives an error
> along the lines of unable to ping)
>
> I don't receive any errors when running mpdcheck, but I do when I
> run mpdcheck -f ~/Desktop/mpd.hosts -ssh:
>
> [cobalt@bhead home]$ mpdcheck -f ~/Desktop/mpd.hosts -ssh
> ** timed out waiting for client on b1.aero.nd.edu to produce output
> client on b1.aero.nd.edu failed to access the server
> here is the output:
> Traceback (most recent call last):
>   File "/home/cobalt/mpich2-install/bin/mpdcheck.py", line 103, in ?
>     sock.connect((argv[argidx+1],int(argv[argidx+2])))  # note double parens
>   File "<string>", line 1, in connect
> socket.error: (113, 'No route to host')
>
> And here is the output from mpdcheck -pc:
>
> [cobalt@bhead home]$ mpdcheck -pc
> --- print results of: gethostbyname_ex(gethostname())
> ('bhead.aero.nd.edu', ['bhead'], ['192.168.2.1'])
> --- try to run /bin/hostname
> bhead.aero.nd.edu
> --- try to run uname -a
> Linux bhead.aero.nd.edu 2.6.9-34.EL #1 Mon Mar 13 11:31:17 CST 2006 i686 i686 i386 GNU/Linux
> --- try to print /etc/hosts
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
> 192.168.2.102 b2.aero.nd.edu b2
> 192.168.2.101 b1.aero.nd.edu b1
> 192.168.2.1 bhead.aero.nd.edu bhead
> --- try to print /etc/resolv.conf
> ; generated by /sbin/dhclient-script
> search aero.nd.edu
> nameserver 192.168.2.1
> --- try to run /sbin/ifconfig -a
> eth0 Link encap:Ethernet HWaddr 00:11:11:95:8F:63
> inet addr:192.168.2.1 Bcast:192.168.2.255 Mask:255.255.255.0
> inet6 addr: fe80::211:11ff:fe95:8f63/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:263 errors:0 dropped:0 overruns:0 frame:0
> TX packets:293 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:40718 (39.7 KiB) TX bytes:39246 (38.3 KiB)
>
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> inet6 addr: ::1/128 Scope:Host
> UP LOOPBACK RUNNING MTU:16436 Metric:1
> RX packets:1475 errors:0 dropped:0 overruns:0 frame:0
> TX packets:1475 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:2939426 (2.8 MiB) TX bytes:2939426 (2.8 MiB)
>
> sit0 Link encap:IPv6-in-IPv4
> NOARP MTU:1480 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
>
> --- try to print /etc/nsswitch.conf
> #
> # /etc/nsswitch.conf
> #
> # An example Name Service Switch config file. This file should be
> # sorted with the most-used services at the beginning.
> #
> # The entry '[NOTFOUND=return]' means that the search for an
> # entry should stop if the search in the previous entry turned
> # up nothing. Note that if the search failed due to some other reason
> # (like no NIS server responding) then the search continues with the
> # next entry.
> #
> # Legal entries are:
> #
> # nis or yp Use NIS (NIS version 2), also called YP
> # dns Use DNS (Domain Name Service)
> # files Use the local files
> # db Use the local database (.db) files
> # compat Use NIS on compat mode
> # hesiod Use Hesiod for user lookups
> # ldap Use LDAP (only if nss_ldap is installed)
> # nisplus or nis+ Use NIS+ (NIS version 3), unsupported
> # [NOTFOUND=return] Stop searching if not found so far
> #
>
> # To use db, put the "db" in front of "files" for entries you want to be
> # looked up first in the databases
> #
> # Example:
> #passwd: db files ldap nis
> #shadow: db files ldap nis
> #group: db files ldap nis
>
> passwd: files
> shadow: files
> group: files
>
> #hosts: db files ldap nis dns
> hosts: files dns
>
> # Example - obey only what ldap tells us...
> #services: ldap [NOTFOUND=return] files
> #networks: ldap [NOTFOUND=return] files
> #protocols: ldap [NOTFOUND=return] files
> #rpc: ldap [NOTFOUND=return] files
> #ethers: ldap [NOTFOUND=return] files
>
> bootparams: files
> ethers: files
> netmasks: files
> networks: files
> protocols: files
> rpc: files
> services: files
> netgroup: files
> publickey: files
> automount: files
> aliases: files
> [cobalt@bhead home]$
>
>
> Thanks for your attention,
>
> Zach Ponder
> Graduate Student
> University of Notre Dame
> Department of Aerospace and Mechanical Engineering
> zponder at nd.edu
>
>