[MPICH] mpdboot and mpdcheck problems

Michele Trenti trenti at stsci.edu
Wed Aug 2 16:10:04 CDT 2006


Hi Zach,

I recently had similar trouble, which in my case was caused by the 
network settings: the cluster firewall was configured to block all 
ports except those used by ssh/sshd (see the MPICH2 Installer's Guide, 
version 1.0.3, Sec. A.1, page 25).

If this is what is happening in your cluster, I solved the problem by 
(1) asking the network administrator to open a small range of ports 
(about 30 seems to be enough) for MPI communication between the nodes 
and (2) setting the MPICH_PORT_RANGE environment variable to match the 
range of open ports (note that this requires upgrading to version 1.0.4).
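
A rough way to check whether an opened range is actually reachable 
between two nodes is sketched below in Python. This is not part of 
MPICH2: parse_port_range and probe are hypothetical helper names, and 
the low:high spelling of the range (for example 50000:50029) is an 
assumption that should be checked against the 1.0.4 documentation. The 
classification is only a heuristic. A timeout or a "No route to host" 
error from a host that answers ping usually points at a firewall, while 
"connection refused" means the port is reachable but nothing is 
listening on it.

```python
import socket


def parse_port_range(value):
    """Parse a 'low:high' string (assumed format) into two ints."""
    low_text, high_text = value.split(":")
    low, high = int(low_text), int(high_text)
    if not 0 < low <= high <= 65535:
        raise ValueError("invalid port range: %r" % value)
    return low, high


def probe(host, port, timeout=2.0):
    """Classify one TCP port as 'listening', 'refused', or 'blocked'."""
    try:
        # A successful connect means a process is accepting on this port.
        with socket.create_connection((host, port), timeout=timeout):
            return "listening"
    except ConnectionRefusedError:
        # The host answered, but no process is listening on the port.
        return "refused"
    except OSError:
        # Timeout, 'No route to host', name lookup failure, and similar.
        return "blocked"
```

For example, from the head node one could call probe("b1.aero.nd.edu", 
port) for every port in range(low, high + 1), where low and high come 
from parse_port_range. Ports inside the opened range should come back as 
"refused" or "listening" rather than "blocked".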

I hope this helps.
Cheers,

Michele

Michele Trenti
Space Telescope Science Institute
3700 San Martin Drive                       Phone: +1 410 338 4987
Baltimore MD 21218 U.S.                       Fax: +1 410 338 4767


" We shall not cease from exploration
   And the end of all our exploring
   Will be to arrive where we started
   And know the place for the first time. "

                                      T. S. Eliot


On Wed, 2 Aug 2006, Zach Ponder wrote:

> I'm having some trouble getting MPICH2 1.0.3 up and running on a
> three-computer setup: one master and two compute nodes.  I've seen a mailing
> list archive post from someone who seemed to have a similar problem and was
> able to correct it in some manner.
>
> http://www-unix.mcs.anl.gov/web-mail-archive/lists/mpich-discuss/2006/04/msg00037.html
>
> It seemed to be a problem with the mpd being addressed to 127.0.0.1.  Not 
> entirely sure if I'm in the same situation, but I am stuck on how to fix it. 
> I'm afraid that it is some sort of simple networking issue, but since this is 
> my first venture into cluster computing everything is posing a challenge.
>
> Things I'm able to do or have done:
>
> 	ping between boxes
> 	ssh between boxes without password
> 	bring up an mpd on each box
> 	made the changes to mpd.py (commented two lines)
>
> Things I'm unable to do:
>
> 	use mpdboot to bring up a ring of mpds
> 	manually start a server/client mpd on two machines (gives an error
> 	along the lines of unable to ping)
>
> I don't receive any errors when running mpdcheck, but I do when I run
> mpdcheck -f ~/Desktop/mpd.hosts -ssh:
>
> [cobalt@bhead home]$ mpdcheck -f ~/Desktop/mpd.hosts -ssh
> ** timed out waiting for client on b1.aero.nd.edu to produce output
> client on b1.aero.nd.edu failed to access the server
> here is the output:
> Traceback (most recent call last):
> File "/home/cobalt/mpich2-install/bin/mpdcheck.py", line 103, in ?
>   sock.connect((argv[argidx+1],int(argv[argidx+2])))  # note double parens
> File "<string>", line 1, in connect
> socket.error: (113, 'No route to host')
>
> And here is the output from mpdcheck -pc:
>
> [cobalt@bhead home]$ mpdcheck -pc
> --- print results of: gethostbyname_ex(gethostname())
> ('bhead.aero.nd.edu', ['bhead'], ['192.168.2.1'])
> --- try to run /bin/hostname
> bhead.aero.nd.edu
> --- try to run uname -a
> Linux bhead.aero.nd.edu 2.6.9-34.EL #1 Mon Mar 13 11:31:17 CST 2006 i686 i686 i386 GNU/Linux
> --- try to print /etc/hosts
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
> 192.168.2.102   b2.aero.nd.edu  b2
> 192.168.2.101   b1.aero.nd.edu  b1
> 192.168.2.1     bhead.aero.nd.edu       bhead
> --- try to print /etc/resolv.conf
> ; generated by /sbin/dhclient-script
> search aero.nd.edu
> nameserver 192.168.2.1
> --- try to run /sbin/ifconfig -a
> eth0      Link encap:Ethernet  HWaddr 00:11:11:95:8F:63
>         inet addr:192.168.2.1  Bcast:192.168.2.255  Mask:255.255.255.0
>         inet6 addr: fe80::211:11ff:fe95:8f63/64 Scope:Link
>         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>         RX packets:263 errors:0 dropped:0 overruns:0 frame:0
>         TX packets:293 errors:0 dropped:0 overruns:0 carrier:0
>         collisions:0 txqueuelen:1000
>         RX bytes:40718 (39.7 KiB)  TX bytes:39246 (38.3 KiB)
>
> lo        Link encap:Local Loopback
>         inet addr:127.0.0.1  Mask:255.0.0.0
>         inet6 addr: ::1/128 Scope:Host
>         UP LOOPBACK RUNNING  MTU:16436  Metric:1
>         RX packets:1475 errors:0 dropped:0 overruns:0 frame:0
>         TX packets:1475 errors:0 dropped:0 overruns:0 carrier:0
>         collisions:0 txqueuelen:0
>         RX bytes:2939426 (2.8 MiB)  TX bytes:2939426 (2.8 MiB)
>
> sit0      Link encap:IPv6-in-IPv4
>         NOARP  MTU:1480  Metric:1
>         RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>         TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>         collisions:0 txqueuelen:0
>         RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>
> --- try to print /etc/nsswitch.conf
> #
> # /etc/nsswitch.conf
> #
> # An example Name Service Switch config file. This file should be
> # sorted with the most-used services at the beginning.
> #
> # The entry '[NOTFOUND=return]' means that the search for an
> # entry should stop if the search in the previous entry turned
> # up nothing. Note that if the search failed due to some other reason
> # (like no NIS server responding) then the search continues with the
> # next entry.
> #
> # Legal entries are:
> #
> #       nis or yp               Use NIS (NIS version 2), also called YP
> #       dns                     Use DNS (Domain Name Service)
> #       files                   Use the local files
> #       db                      Use the local database (.db) files
> #       compat                  Use NIS on compat mode
> #       hesiod                  Use Hesiod for user lookups
> #       ldap                    Use LDAP (only if nss_ldap is installed)
> #       nisplus or nis+         Use NIS+ (NIS version 3), unsupported
> #       [NOTFOUND=return]       Stop searching if not found so far
> #
>
> # To use db, put the "db" in front of "files" for entries you want to be
> # looked up first in the databases
> #
> # Example:
> #passwd:    db files ldap nis
> #shadow:    db files ldap nis
> #group:     db files ldap nis
>
> passwd:     files
> shadow:     files
> group:      files
>
> #hosts:     db files ldap nis dns
> hosts:      files dns
>
> # Example - obey only what ldap tells us...
> #services:  ldap [NOTFOUND=return] files
> #networks:  ldap [NOTFOUND=return] files
> #protocols: ldap [NOTFOUND=return] files
> #rpc:       ldap [NOTFOUND=return] files
> #ethers:    ldap [NOTFOUND=return] files
>
> bootparams: files
> ethers:     files
> netmasks:   files
> networks:   files
> protocols:  files
> rpc:        files
> services:   files
> netgroup:   files
> publickey:  files
> automount:  files
> aliases:    files
> [cobalt@bhead home]$
>
>
> Thanks for your attention,
>
> Zach Ponder
> Graduate Student
> University of Notre Dame
> Department of Aerospace and Mechanical Engineering
> zponder at nd.edu
>
>



