[MPICH] mpdboot and mpdcheck problems

Zach Ponder zponder at nd.edu
Thu Aug 3 12:49:32 CDT 2006


Hi everyone,

Wanted to follow up on my inquiry.  It turns out that my problems
were stemming from the Linux firewall blocking the necessary ports.
Since my network is completely disconnected from the outside world, I
disabled the firewall entirely, for the time being.  This allowed me
to get through the installation manual's tests without error.  I plan
on leaving the firewall off for now, at least until I become more
knowledgeable about networking, unless, of course, that could lead to
problems with MPICH.
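Since the cluster sits on a private subnet (192.168.2.0/24, per the
/etc/hosts output quoted below), a middle ground between disabling the
firewall and mastering its configuration is to accept all traffic from
cluster peers.  A minimal sketch for an iptables-based firewall,
assuming eth0 is the cluster-facing interface on each node:

```shell
# Trust everything arriving from the private cluster subnet on eth0.
# mpd and MPI jobs open ports dynamically, so a blanket subnet rule is
# simpler than trying to enumerate individual ports.
iptables -I INPUT -i eth0 -s 192.168.2.0/24 -j ACCEPT

# On Red Hat-style systems (as here), persist the rule across reboots:
service iptables save
```

This keeps the firewall active against any other interface while leaving
the nodes free to talk to each other.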

Thanks to everyone for their help,

Zach Ponder
Graduate Student
University of Notre Dame
Department of Aerospace and Mechanical Engineering
zponder at nd.edu



On Aug 2, 2006, at 5:30 PM, Ralph Butler wrote:

> The key is to concentrate on getting two nodes to work together
> first.  So, if there are problems between the head node and a
> compute node, then start with them.  As the manual suggests, you
> should be able to run mpdcheck as a server (-s option) on the head
> node and run it as a client (-c ... args) on the compute node.  If
> that fails, the output may be useful.  If it succeeds, reverse the
> roles and try again, i.e. server on the compute node and client on
> the head.  mpdcheck is not really an mpd program; it is a pre-mpd
> program that tries to see whether your system is OK.  If any of
> these runs fails, you probably have configuration problems that you
> need to resolve.  We may be able to offer some help, depending on
> the system.  You have attached the result of "mpdcheck -pc" from one
> node, but if two nodes are failing to work together, we really need
> the output from both to make any educated guesses.  Typical problems
> include having some form of firewalling turned on, blocking ports
> from one machine to another.  We may be able to guess that that is
> what you have, but not necessarily how you have done it.  The manual
> addresses these issues somewhat.
> --ralph
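In concrete terms, the two-node check described above looks like the
following (host names match this cluster; the port is chosen by the
server at run time, and the exact message wording may vary by MPICH2
version):

```shell
# On the head node (bhead): run mpdcheck as a server.  It prints the
# host and port the client should connect back to, e.g.:
#   server listening at INADDR_ANY on: bhead.aero.nd.edu 1234
mpdcheck -s

# On the compute node (b1): connect using the host/port printed above
# (1234 here is just an example; use whatever the server printed).
mpdcheck -c bhead.aero.nd.edu 1234

# If that succeeds, swap the roles: server on b1, client on bhead.
```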
>
> On Wed, Aug 2, 2006, at 2:59 PM, Zach Ponder wrote:
>
>> I'm having some trouble getting MPICH2-1.0.3 up and running on a
>> three-computer setup: one master and two compute nodes.  I've seen
>> a mailing-list archive post from someone who seemed to have a
>> similar problem, and who was able to correct it in some manner.
>>
>> http://www-unix.mcs.anl.gov/web-mail-archive/lists/mpich-discuss/2006/04/msg00037.html
>>
>> It seemed to be a problem with the mpd being addressed to
>> 127.0.0.1.  I'm not entirely sure if I'm in the same situation, but
>> I am stuck on how to fix it.  I'm afraid it is some sort of simple
>> networking issue, but since this is my first venture into cluster
>> computing, everything is posing a challenge.
>>
>> Things I'm able to do or have done:
>>
>> 	ping between boxes
>> 	ssh between boxes without password
>> 	bring up an mpd on each box
>> 	made the changes to mpd.py (commented out two lines)
>> 	
>> Things I'm unable to do:
>>
>> 	use mpdboot to bring up a ring of mpds
>> 	manually start a server/client mpd pair on two machines (gives an
>> 	error along the lines of "unable to ping")
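For reference, the ring bring-up being attempted here normally goes
like this (using the same hosts file; mpdtrace and mpdallexit are the
standard MPICH2 companions to mpdboot):

```shell
# Start a ring of 3 mpds: the local machine plus the hosts listed in
# mpd.hosts (started via ssh, which is already password-free here).
mpdboot -n 3 -f ~/Desktop/mpd.hosts

# Verify the ring: this should list bhead, b1, and b2.
mpdtrace

# Shut the ring down when finished.
mpdallexit
```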
>>
>> I don't receive any errors when running mpdcheck by itself, but
>> that's not the case when I run mpdcheck -f ~/Desktop/mpd.hosts -ssh:
>>
>> [cobalt at bhead home]$ mpdcheck -f ~/Desktop/mpd.hosts -ssh
>> ** timed out waiting for client on b1.aero.nd.edu to produce output
>> client on b1.aero.nd.edu failed to access the server
>> here is the output:
>> Traceback (most recent call last):
>>   File "/home/cobalt/mpich2-install/bin/mpdcheck.py", line 103, in ?
>>     sock.connect((argv[argidx+1],int(argv[argidx+2])))  # note double parens
>>   File "<string>", line 1, in connect
>> socket.error: (113, 'No route to host')
>>
>> And here is the output from mpdcheck -pc:
>>
>> [cobalt at bhead home]$ mpdcheck -pc
>> --- print results of: gethostbyname_ex(gethostname())
>> ('bhead.aero.nd.edu', ['bhead'], ['192.168.2.1'])
>> --- try to run /bin/hostname
>> bhead.aero.nd.edu
>> --- try to run uname -a
>> Linux bhead.aero.nd.edu 2.6.9-34.EL #1 Mon Mar 13 11:31:17 CST 2006 i686 i686 i386 GNU/Linux
>> --- try to print /etc/hosts
>> # Do not remove the following line, or various programs
>> # that require network functionality will fail.
>> 192.168.2.102   b2.aero.nd.edu  b2
>> 192.168.2.101   b1.aero.nd.edu  b1
>> 192.168.2.1     bhead.aero.nd.edu       bhead
>> --- try to print /etc/resolv.conf
>> ; generated by /sbin/dhclient-script
>> search aero.nd.edu
>> nameserver 192.168.2.1
>> --- try to run /sbin/ifconfig -a
>> eth0      Link encap:Ethernet  HWaddr 00:11:11:95:8F:63
>>           inet addr:192.168.2.1  Bcast:192.168.2.255  Mask:255.255.255.0
>>           inet6 addr: fe80::211:11ff:fe95:8f63/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:263 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:293 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:40718 (39.7 KiB)  TX bytes:39246 (38.3 KiB)
>>
>> lo        Link encap:Local Loopback
>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>           inet6 addr: ::1/128 Scope:Host
>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>           RX packets:1475 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:1475 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:0
>>           RX bytes:2939426 (2.8 MiB)  TX bytes:2939426 (2.8 MiB)
>>
>> sit0      Link encap:IPv6-in-IPv4
>>           NOARP  MTU:1480  Metric:1
>>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:0
>>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>
>> --- try to print /etc/nsswitch.conf
>> #
>> # /etc/nsswitch.conf
>> #
>> # An example Name Service Switch config file. This file should be
>> # sorted with the most-used services at the beginning.
>> #
>> # The entry '[NOTFOUND=return]' means that the search for an
>> # entry should stop if the search in the previous entry turned
>> # up nothing. Note that if the search failed due to some other reason
>> # (like no NIS server responding) then the search continues with the
>> # next entry.
>> #
>> # Legal entries are:
>> #
>> #       nis or yp               Use NIS (NIS version 2), also called YP
>> #       dns                     Use DNS (Domain Name Service)
>> #       files                   Use the local files
>> #       db                      Use the local database (.db) files
>> #       compat                  Use NIS on compat mode
>> #       hesiod                  Use Hesiod for user lookups
>> #       ldap                    Use LDAP (only if nss_ldap is installed)
>> #       nisplus or nis+         Use NIS+ (NIS version 3), unsupported
>> #       [NOTFOUND=return]       Stop searching if not found so far
>> #
>>
>> # To use db, put the "db" in front of "files" for entries you want to be
>> # looked up first in the databases
>> #
>> # Example:
>> #passwd:    db files ldap nis
>> #shadow:    db files ldap nis
>> #group:     db files ldap nis
>>
>> passwd:     files
>> shadow:     files
>> group:      files
>>
>> #hosts:     db files ldap nis dns
>> hosts:      files dns
>>
>> # Example - obey only what ldap tells us...
>> #services:  ldap [NOTFOUND=return] files
>> #networks:  ldap [NOTFOUND=return] files
>> #protocols: ldap [NOTFOUND=return] files
>> #rpc:       ldap [NOTFOUND=return] files
>> #ethers:    ldap [NOTFOUND=return] files
>>
>> bootparams: files
>> ethers:     files
>> netmasks:   files
>> networks:   files
>> protocols:  files
>> rpc:        files
>> services:   files
>> netgroup:   files
>> publickey:  files
>> automount:  files
>> aliases:    files
>> [cobalt at bhead home]$
>>
>>
>> Thanks for your attention,
>>
>> Zach Ponder
>> Graduate Student
>> University of Notre Dame
>> Department of Aerospace and Mechanical Engineering
>> zponder at nd.edu
>>
>>
>

