[MPICH] mpdboot and mpdcheck problems

Rajeev Thakur thakur at mcs.anl.gov
Wed Aug 2 15:32:27 CDT 2006


Why don't you first try the latest release, 1.0.4? There were some bugs
in MPD that are fixed there.
 
Rajeev
 


  _____  

From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Zach Ponder
Sent: Wednesday, August 02, 2006 2:59 PM
To: mpich-discuss at mcs.anl.gov
Subject: [MPICH] mpdboot and mpdcheck problems


I'm having some trouble getting MPICH2 1.0.3 up and running on a
three-computer setup: one master and two compute nodes. I found a mailing
list archive post from someone who seemed to have a similar problem and was
able to correct it:

http://www-unix.mcs.anl.gov/web-mail-archive/lists/mpich-discuss/2006/04/msg00037.html

In that case the problem seemed to be the mpd being addressed to 127.0.0.1.
I'm not entirely sure I'm in the same situation, but I am stuck on how to
fix it. I'm afraid it is some sort of simple networking issue, but since
this is my first venture into cluster computing, everything is posing a
challenge.
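
For reference, the 127.0.0.1 situation mentioned above can be checked with a
few lines of Python. This is only a sketch, assuming Python 3 (the mpdcheck.py
shipped with 1.0.3 is older Python); the function names are made up for
illustration and are not part of MPICH2. It repeats the
gethostbyname_ex(gethostname()) lookup that mpdcheck -pc prints and flags any
loopback address in the result.

```python
import socket


def loopback_addresses(addresses):
    """Return the IPv4 address strings that fall in the 127.0.0.0/8 loopback range."""
    return [addr for addr in addresses if addr.startswith("127.")]


def local_resolution():
    """Resolve this machine's own hostname, as mpdcheck -pc does.

    Returns (canonical_name, addresses, loopback_subset).
    Raises socket.gaierror if the hostname does not resolve at all.
    """
    canonical, _aliases, addresses = socket.gethostbyname_ex(socket.gethostname())
    return canonical, addresses, loopback_addresses(addresses)
```

Calling local_resolution() on each box and finding a non-empty third element
would suggest that the hostname maps to loopback there, usually through an
/etc/hosts entry. An empty list, as with the 192.168.2.1 result shown for
bhead further down, points away from that particular cause.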


Things I'm able to do or have done:

ping between boxes
ssh between boxes without password
bring up an mpd on each box
made the changes to mpd.py (commented two lines)

Things I'm unable to do:

use mpdboot to bring up a ring of mpds
manually start a server/client mpd on two machines (gives an error along the
lines of "unable to ping")

I don't receive any errors when running mpdcheck, but I do when I run
mpdcheck -f ~/Desktop/mpd.hosts -ssh:

[cobalt at bhead home]$ mpdcheck -f ~/Desktop/mpd.hosts -ssh
** timed out waiting for client on b1.aero.nd.edu to produce output
client on b1.aero.nd.edu failed to access the server
here is the output:
Traceback (most recent call last):
  File "/home/cobalt/mpich2-install/bin/mpdcheck.py", line 103, in ?
    sock.connect((argv[argidx+1],int(argv[argidx+2]))) # note double parens
  File "<string>", line 1, in connect
socket.error: (113, 'No route to host')
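
Error 113 on Linux is EHOSTUNREACH. The failing call in the traceback is a
plain TCP connect, so it can be reproduced outside mpdcheck. The following is a
minimal sketch, assuming Python 3; probe is a hypothetical helper, not an
MPICH2 function. It attempts one TCP connection and reports the symbolic
errno name instead of raising.

```python
import errno
import socket


def probe(host, port, timeout=5.0):
    """Attempt one TCP connection to (host, port).

    Returns "ok" on success, "timeout" if the attempt timed out, or the
    symbolic errno name (for example "EHOSTUNREACH" or "ECONNREFUSED").
    Errors without a known errno, such as name lookup failures, are
    returned as their message text.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc))
    sock.close()
    return "ok"
```

For example, probe("bhead.aero.nd.edu", 50000) run from b1 would test the
direction that failed above; the port number here is an arbitrary placeholder,
and something must be listening on it for "ok" to be possible. One possible
reading, not confirmed anywhere in this thread: if ping and ssh work but
arbitrary ports return EHOSTUNREACH, a packet filter on the target that rejects
other ports may be responsible rather than routing.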

And here is the output from mpdcheck -pc:

[cobalt at bhead home]$ mpdcheck -pc
--- print results of: gethostbyname_ex(gethostname())
('bhead.aero.nd.edu', ['bhead'], ['192.168.2.1'])
--- try to run /bin/hostname
bhead.aero.nd.edu
--- try to run uname -a
Linux bhead.aero.nd.edu 2.6.9-34.EL #1 Mon Mar 13 11:31:17 CST 2006 i686 i686 i386 GNU/Linux
--- try to print /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
192.168.2.102 b2.aero.nd.edu b2
192.168.2.101 b1.aero.nd.edu b1
192.168.2.1 bhead.aero.nd.edu bhead
--- try to print /etc/resolv.conf
; generated by /sbin/dhclient-script
search aero.nd.edu
nameserver 192.168.2.1
--- try to run /sbin/ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:11:11:95:8F:63
inet addr:192.168.2.1 Bcast:192.168.2.255 Mask:255.255.255.0
inet6 addr: fe80::211:11ff:fe95:8f63/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:263 errors:0 dropped:0 overruns:0 frame:0
TX packets:293 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:40718 (39.7 KiB) TX bytes:39246 (38.3 KiB)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:1475 errors:0 dropped:0 overruns:0 frame:0
TX packets:1475 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:2939426 (2.8 MiB) TX bytes:2939426 (2.8 MiB)

sit0 Link encap:IPv6-in-IPv4
NOARP MTU:1480 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

--- try to print /etc/nsswitch.conf
#
# /etc/nsswitch.conf
#
# An example Name Service Switch config file. This file should be
# sorted with the most-used services at the beginning.
#
# The entry '[NOTFOUND=return]' means that the search for an
# entry should stop if the search in the previous entry turned
# up nothing. Note that if the search failed due to some other reason
# (like no NIS server responding) then the search continues with the
# next entry.
#
# Legal entries are:
#
# nis or yp Use NIS (NIS version 2), also called YP
# dns Use DNS (Domain Name Service)
# files Use the local files
# db Use the local database (.db) files
# compat Use NIS on compat mode
# hesiod Use Hesiod for user lookups
# ldap Use LDAP (only if nss_ldap is installed)
# nisplus or nis+ Use NIS+ (NIS version 3), unsupported
# [NOTFOUND=return] Stop searching if not found so far
#

# To use db, put the "db" in front of "files" for entries you want to be
# looked up first in the databases
#
# Example:
#passwd: db files ldap nis
#shadow: db files ldap nis
#group: db files ldap nis

passwd: files
shadow: files
group: files

#hosts: db files ldap nis dns
hosts: files dns

# Example - obey only what ldap tells us...
#services: ldap [NOTFOUND=return] files
#networks: ldap [NOTFOUND=return] files
#protocols: ldap [NOTFOUND=return] files
#rpc: ldap [NOTFOUND=return] files
#ethers: ldap [NOTFOUND=return] files

bootparams: files
ethers: files
netmasks: files
networks: files
protocols: files
rpc: files
services: files
netgroup: files
publickey: files
automount: files
aliases: files
[cobalt at bhead home]$


Thanks for your attention,


Zach Ponder
Graduate Student
University of Notre Dame
Department of Aerospace and Mechanical Engineering
zponder at nd.edu


