[MPICH] mpdboot and mpdcheck problems

Ralph Butler rbutler at mtsu.edu
Wed Aug 2 16:30:59 CDT 2006


The key is to concentrate on getting two nodes working together first.
So, if there are problems between the head node and a compute node,
start with them.  As the manual suggests, you should be able to run
mpdcheck as a server (-s option) on the head node and run it as a
client (-c ... args) on the compute node.  If that fails, then the
output may be useful.  If it succeeds, then reverse the roles and try
it again, i.e. server on the compute node and client on the head node.
mpdcheck is not really an mpd program; it is a pre-mpd program that
tries to see if your system is OK.  If any of these runs fail, you
probably have configuration problems that you need to resolve.  We may
be able to offer some help depending on the system.

You have attached the result of "mpdcheck -pc" from one node, but if
two are failing to work together, we really need the output from both
to make any educated guesses.  Typical problems include having some
form of firewalling turned on, blocking ports from one machine to
another.  We may be able to guess that that is what you have, but not
necessarily how you have done it.  The manual addresses these issues
somewhat.
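
To make the two-node test concrete, here is a minimal Python sketch of
the same idea: one side listens on a port, the other side tries to
connect to it.  This is not mpdcheck itself, and the function names
(open_listener, serve_once, probe) are made up for illustration.  It
only assumes what the traceback in the quoted message shows, namely
that the client side connects to a host name and a port number.

```python
import socket


def open_listener(bind_addr="", port=0):
    """Listen on bind_addr; port=0 asks the OS for any free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((bind_addr, port))
    srv.listen(1)
    return srv


def serve_once(srv, timeout=30.0):
    """Accept a single client, send a greeting, return the client's address."""
    srv.settimeout(timeout)
    conn, peer = srv.accept()
    with conn:
        conn.sendall(b"hello from server\n")
    return peer


def probe(host, port, timeout=5.0):
    """Try to reach host:port; return (True, reply) or (False, error text)."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            reply = sock.recv(64).decode().strip()
        return True, reply
    except OSError as exc:
        # Covers refused connections, timeouts and "No route to host".
        return False, f"{type(exc).__name__}: {exc}"
```

On the head node you would call open_listener(), print the port from
srv.getsockname()[1], and then call serve_once(srv).  On the compute
node you would call probe("bhead", port) with that port.  Then swap
the roles.  If ping works but probe() fails in one direction, a
firewall on the listening side is a likely suspect.
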
--ralph

On Wed Aug 2, at 2:59PM, Zach Ponder wrote:

> I'm having some troubles getting Mpich2-1.0.3 up and running on a  
> three computer setup, one master two computation nodes.  I've seen  
> a mailing archive of someone that seemed to have a similar problem,  
> and they were able to correct it in some manner.
>
> http://www-unix.mcs.anl.gov/web-mail-archive/lists/mpich-discuss/2006/04/msg00037.html
>
> It seemed to be a problem with the mpd being addressed to  
> 127.0.0.1.  Not entirely sure if I'm in the same situation, but I  
> am stuck on how to fix it.  I'm afraid that it is some sort of  
> simple networking issue, but since this is my first venture into  
> cluster computing everything is posing a challenge.
>
> Things I'm able to do or have done:
>
> 	ping between boxes
> 	ssh between boxes without password
> 	bring up an mpd on each box
> 	made the changes to mpd.py (commented two lines)
> 	
> Things I'm unable to do:
>
> 	use mpdboot to bring up a ring of mpds
> 	manually start a server/client mpd on two machines (gives an error along the lines of "unable to ping")
>
> I don't receive any errors when running mpdcheck by itself, but I  
> do when I run mpdcheck -f ~/Desktop/mpd.hosts -ssh
>
> [cobalt@bhead home]$ mpdcheck -f ~/Desktop/mpd.hosts -ssh
> ** timed out waiting for client on b1.aero.nd.edu to produce output
> client on b1.aero.nd.edu failed to access the server
> here is the output:
> Traceback (most recent call last):
>   File "/home/cobalt/mpich2-install/bin/mpdcheck.py", line 103, in ?
>     sock.connect((argv[argidx+1],int(argv[argidx+2])))  # note double parens
>   File "<string>", line 1, in connect
> socket.error: (113, 'No route to host')
>
> And here is the output from mpdcheck -pc:
>
> [cobalt@bhead home]$ mpdcheck -pc
> --- print results of: gethostbyname_ex(gethostname())
> ('bhead.aero.nd.edu', ['bhead'], ['192.168.2.1'])
> --- try to run /bin/hostname
> bhead.aero.nd.edu
> --- try to run uname -a
> Linux bhead.aero.nd.edu 2.6.9-34.EL #1 Mon Mar 13 11:31:17 CST 2006 i686 i686 i386 GNU/Linux
> --- try to print /etc/hosts
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
> 192.168.2.102   b2.aero.nd.edu  b2
> 192.168.2.101   b1.aero.nd.edu  b1
> 192.168.2.1     bhead.aero.nd.edu       bhead
> --- try to print /etc/resolv.conf
> ; generated by /sbin/dhclient-script
> search aero.nd.edu
> nameserver 192.168.2.1
> --- try to run /sbin/ifconfig -a
> eth0      Link encap:Ethernet  HWaddr 00:11:11:95:8F:63
>           inet addr:192.168.2.1  Bcast:192.168.2.255  Mask:255.255.255.0
>           inet6 addr: fe80::211:11ff:fe95:8f63/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:263 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:293 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:40718 (39.7 KiB)  TX bytes:39246 (38.3 KiB)
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:1475 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:1475 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:2939426 (2.8 MiB)  TX bytes:2939426 (2.8 MiB)
>
> sit0      Link encap:IPv6-in-IPv4
>           NOARP  MTU:1480  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>
> --- try to print /etc/nsswitch.conf
> #
> # /etc/nsswitch.conf
> #
> # An example Name Service Switch config file. This file should be
> # sorted with the most-used services at the beginning.
> #
> # The entry '[NOTFOUND=return]' means that the search for an
> # entry should stop if the search in the previous entry turned
> # up nothing. Note that if the search failed due to some other reason
> # (like no NIS server responding) then the search continues with the
> # next entry.
> #
> # Legal entries are:
> #
> #       nis or yp               Use NIS (NIS version 2), also called YP
> #       dns                     Use DNS (Domain Name Service)
> #       files                   Use the local files
> #       db                      Use the local database (.db) files
> #       compat                  Use NIS on compat mode
> #       hesiod                  Use Hesiod for user lookups
> #       ldap                    Use LDAP (only if nss_ldap is installed)
> #       nisplus or nis+         Use NIS+ (NIS version 3), unsupported
> #       [NOTFOUND=return]       Stop searching if not found so far
> #
>
> # To use db, put the "db" in front of "files" for entries you want to be
> # looked up first in the databases
> #
> # Example:
> #passwd:    db files ldap nis
> #shadow:    db files ldap nis
> #group:     db files ldap nis
>
> passwd:     files
> shadow:     files
> group:      files
>
> #hosts:     db files ldap nis dns
> hosts:      files dns
>
> # Example - obey only what ldap tells us...
> #services:  ldap [NOTFOUND=return] files
> #networks:  ldap [NOTFOUND=return] files
> #protocols: ldap [NOTFOUND=return] files
> #rpc:       ldap [NOTFOUND=return] files
> #ethers:    ldap [NOTFOUND=return] files
>
> bootparams: files
> ethers:     files
> netmasks:   files
> networks:   files
> protocols:  files
> rpc:        files
> services:   files
> netgroup:   files
> publickey:  files
> automount:  files
> aliases:    files
> [cobalt@bhead home]$
>
>
> Thanks for your attention,
>
> Zach Ponder
> Graduate Student
> University of Notre Dame
> Department of Aerospace and Mechanical Engineering
> zponder at nd.edu
>
>
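
The first check in the "mpdcheck -pc" output above prints the result
of gethostbyname_ex(gethostname()).  On bhead it resolves to
192.168.2.1, which looks fine, but the same check is worth repeating
on b1 and b2, since a host name that resolves to 127.0.0.1 is the kind
of problem described in the linked archive thread.  A minimal Python
sketch of that check follows; the helper names are hypothetical and
are not part of mpdcheck.

```python
import ipaddress
import socket


def loopback_addresses(addresses):
    """Return the addresses that are loopback (127.0.0.0/8 or ::1)."""
    return [a for a in addresses if ipaddress.ip_address(a).is_loopback]


def check_local_name():
    """Resolve this machine's own host name and flag loopback results."""
    name, aliases, addrs = socket.gethostbyname_ex(socket.gethostname())
    return name, aliases, addrs, loopback_addresses(addrs)
```

If the last element returned by check_local_name() is not empty on
some node, that node's own name points back at itself, and other
machines told to contact it by that address would not be able to
reach it.  The usual place to look in that case is the node's
/etc/hosts.
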



