[MPICH] mpdboot and mpdcheck problems
Michele Trenti
trenti at stsci.edu
Wed Aug 2 16:10:04 CDT 2006
Hi Zach,
I recently had similar trouble, which in my case was caused by the
network settings: the firewall in the cluster was configured to
block all ports except those used by ssh/sshd (see the MPICH2 Installer's
Guide, version 1.0.3, Sec. A.1, page 25).
If this is what is happening in your cluster (the "No route to host" error
from mpdcheck, with ping and ssh both working, would be consistent with it),
here is how I solved the problem: (1) I asked the network administrator to
open a small range of ports (about 30 seems to be enough) to allow MPI
communication between the nodes, and (2) I set the MPICH_PORT_RANGE
variable to match the range of open ports (but you will need to upgrade to
version 1.0.4 for this!).
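Before and after the ports are opened, it may help to check from one node
which ports on another node are actually reachable. The following is only a
minimal sketch of such a check, not part of MPICH2 or mpdcheck: the function
name probe_ports and its status strings are my own, and it assumes a Python 3
interpreter is available on the node. A firewall that rejects traffic
typically shows up as "No route to host", one that silently drops it as a
timeout, and an open but unused port as "connection refused" (some firewall
configurations reject with a TCP reset, which looks the same as the last case).

```python
import socket


def probe_ports(host, ports, timeout=2.0):
    """Attempt a TCP connection to each port and return {port: status}."""
    results = {}
    for port in ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
            # Something is listening and the firewall let us through.
            results[port] = "open"
        except socket.timeout:
            # No answer at all: packets are probably being dropped.
            results[port] = "no answer (timed out)"
        except ConnectionRefusedError:
            # The host answered, but nothing is listening on this port.
            results[port] = "reachable, nothing listening"
        except OSError as exc:
            # Includes errno 113, 'No route to host', and name lookup errors.
            results[port] = "blocked or unreachable: %s" % exc
        finally:
            sock.close()
    return results
```

For example, calling probe_ports("b1", range(50000, 50030)) on bhead would
test a 30-port range on b1 (the range 50000-50029 is a made-up value; use
whatever range your administrator opens). Ports reported as "reachable,
nothing listening" are good candidates for the range given to
MPICH_PORT_RANGE.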
Hope that this will be of help.
Cheers,
Michele
Michele Trenti
Space Telescope Science Institute
3700 San Martin Drive Phone: +1 410 338 4987
Baltimore MD 21218 U.S. Fax: +1 410 338 4767
" We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started
And know the place for the first time. "
T. S. Eliot
On Wed, 2 Aug 2006, Zach Ponder wrote:
> I'm having some trouble getting MPICH2 1.0.3 up and running on a
> three-computer setup: one master and two computation nodes. I found a
> mailing list archive post from someone who seemed to have a similar problem
> and was able to correct it in some manner.
>
> http://www-unix.mcs.anl.gov/web-mail-archive/lists/mpich-discuss/2006/04/msg00037.html
>
> It seemed to be a problem with the mpd being addressed to 127.0.0.1. Not
> entirely sure if I'm in the same situation, but I am stuck on how to fix it.
> I'm afraid that it is some sort of simple networking issue, but since this is
> my first venture into cluster computing everything is posing a challenge.
>
> Things I'm able to do or have done:
>
> ping between boxes
> ssh between boxes without password
> bring up an mpd on each box
> made the changes to mpd.py (commented out two lines)
>
> Things I'm unable to do:
>
> use mpdboot to bring up a ring of mpds
> manually start a server/client mpd on two machines (gives an error along
> the lines of "unable to ping")
>
> I don't receive any errors when running mpdcheck by itself, but I do when
> I run mpdcheck -f ~/Desktop/mpd.hosts -ssh
>
> [cobalt@bhead home]$ mpdcheck -f ~/Desktop/mpd.hosts -ssh
> ** timed out waiting for client on b1.aero.nd.edu to produce output
> client on b1.aero.nd.edu failed to access the server
> here is the output:
> Traceback (most recent call last):
>   File "/home/cobalt/mpich2-install/bin/mpdcheck.py", line 103, in ?
>     sock.connect((argv[argidx+1],int(argv[argidx+2]))) # note double parens
>   File "<string>", line 1, in connect
> socket.error: (113, 'No route to host')
>
> And here is the output from mpdcheck -pc:
>
> [cobalt@bhead home]$ mpdcheck -pc
> --- print results of: gethostbyname_ex(gethostname())
> ('bhead.aero.nd.edu', ['bhead'], ['192.168.2.1'])
> --- try to run /bin/hostname
> bhead.aero.nd.edu
> --- try to run uname -a
> Linux bhead.aero.nd.edu 2.6.9-34.EL #1 Mon Mar 13 11:31:17 CST 2006 i686 i686 i386 GNU/Linux
> --- try to print /etc/hosts
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
> 192.168.2.102 b2.aero.nd.edu b2
> 192.168.2.101 b1.aero.nd.edu b1
> 192.168.2.1 bhead.aero.nd.edu bhead
> --- try to print /etc/resolv.conf
> ; generated by /sbin/dhclient-script
> search aero.nd.edu
> nameserver 192.168.2.1
> --- try to run /sbin/ifconfig -a
> eth0      Link encap:Ethernet  HWaddr 00:11:11:95:8F:63
>           inet addr:192.168.2.1  Bcast:192.168.2.255  Mask:255.255.255.0
>           inet6 addr: fe80::211:11ff:fe95:8f63/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:263 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:293 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:40718 (39.7 KiB)  TX bytes:39246 (38.3 KiB)
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:1475 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:1475 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:2939426 (2.8 MiB)  TX bytes:2939426 (2.8 MiB)
>
> sit0      Link encap:IPv6-in-IPv4
>           NOARP  MTU:1480  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>
> --- try to print /etc/nsswitch.conf
> #
> # /etc/nsswitch.conf
> #
> # An example Name Service Switch config file. This file should be
> # sorted with the most-used services at the beginning.
> #
> # The entry '[NOTFOUND=return]' means that the search for an
> # entry should stop if the search in the previous entry turned
> # up nothing. Note that if the search failed due to some other reason
> # (like no NIS server responding) then the search continues with the
> # next entry.
> #
> # Legal entries are:
> #
> # nis or yp Use NIS (NIS version 2), also called YP
> # dns Use DNS (Domain Name Service)
> # files Use the local files
> # db Use the local database (.db) files
> # compat Use NIS on compat mode
> # hesiod Use Hesiod for user lookups
> # ldap Use LDAP (only if nss_ldap is installed)
> # nisplus or nis+ Use NIS+ (NIS version 3), unsupported
> # [NOTFOUND=return] Stop searching if not found so far
> #
>
> # To use db, put the "db" in front of "files" for entries you want to be
> # looked up first in the databases
> #
> # Example:
> #passwd: db files ldap nis
> #shadow: db files ldap nis
> #group: db files ldap nis
>
> passwd: files
> shadow: files
> group: files
>
> #hosts: db files ldap nis dns
> hosts: files dns
>
> # Example - obey only what ldap tells us...
> #services: ldap [NOTFOUND=return] files
> #networks: ldap [NOTFOUND=return] files
> #protocols: ldap [NOTFOUND=return] files
> #rpc: ldap [NOTFOUND=return] files
> #ethers: ldap [NOTFOUND=return] files
>
> bootparams: files
> ethers: files
> netmasks: files
> networks: files
> protocols: files
> rpc: files
> services: files
> netgroup: files
> publickey: files
> automount: files
> aliases: files
> [cobalt@bhead home]$
>
>
> Thanks for your attention,
>
> Zach Ponder
> Graduate Student
> University of Notre Dame
> Department of Aerospace and Mechanical Engineering
> zponder at nd.edu
>
>