[MPICH] MPICH2 cluster, Local Network
Ralph M. Butler
rbutler at mtsu.edu
Tue Mar 21 20:57:27 CST 2006
The manual suggests that you get mpd working by starting 2 (or more) mpds
by hand before using mpdboot. If there are problems along the way to
doing that, the manual suggests that you go to the Troubleshooting
section and follow the instructions there. Generally, that is enough for
folks, but if you are still having problems, it recommends commands to run
and output to forward that might be helpful.
Having said all that, it is likely that you are merely running into a
known bug that a simple patch will fix. I am attaching the patch.
You should apply it to the installed copy of mpd.py, which is used by
mpdboot when starting mpds. If you are likely to re-install several
times, you may want to apply the patch to the source tree as well. This
patch really just comments out 2 lines in the mpd.py file.
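If applying the attached diff is inconvenient, the same two-line change can be made by a small script. What follows is only a sketch of that idea, not part of MPICH2: the function name comment_out_debug_lines is made up for illustration, and it assumes the two debugging statements appear in your mpd.py exactly as they do in the patch context. It comments out the stderr print, plus the sys.stderr.flush() that immediately follows it, and leaves every other line alone.

```python
# Sketch only: mirrors what the attached patch does to mpd.py.
# Assumption: the two statements below match the installed file exactly.
PRINT_STMT = "print >>sys.stderr, self.parmdb['MPD_LISTEN_PORT']"
FLUSH_STMT = "sys.stderr.flush()"


def comment_out_debug_lines(source):
    """Return source with the two stderr debugging lines commented out."""
    result = []
    previous_was_print = False
    for line in source.splitlines(True):  # True keeps the line endings
        stripped = line.strip()
        is_print = stripped == PRINT_STMT
        # Only touch a flush that directly follows the debugging print,
        # so any other sys.stderr.flush() in the file is left as it is.
        is_flush = previous_was_print and stripped == FLUSH_STMT
        if is_print or is_flush:
            body = line.lstrip()
            indent = line[: len(line) - len(body)]
            result.append(indent + "## " + body)
        else:
            result.append(line)
        previous_was_print = is_print
    return "".join(result)
```

To use it, read the installed mpd.py (its location depends on your install prefix), pass the contents through the function, and write the result back after saving a backup copy. Running it a second time changes nothing, since the lines no longer match once they are commented out.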
> Date: Tue, 21 Mar 2006 19:35:28 -0500
> From: Samuel Winchenbach <swinchen at eece.maine.edu>
> To: mpich-discuss at mcs.anl.gov
> Subject: [MPICH] MPICH2 cluster, Local Network
>
> Hello all,
>
> I have been trying to set up a small cluster for a school project and I
> almost had it working. When I first started I had two computers that had
> fully qualified domain names - everything worked fine. Next I tried adding 6
> more computers and networking them together with a gigabit switch. At this
> point all the computers had 192.168.2.X addresses and all of them had the
> hostname "localhost.localdomain"
>
> For a small test I started with 2 nodes.
> I configured the headnode to export "/home/cluster" and the compute node to
> mount it. After setting up SSH everything was going smoothly... I could SSH
> from the head node to the compute node without a password, and NFS seemed to
> be working fine.
>
> Now, just to make things more clear I set the hostname of the head node to
> "node0" and the hostname of the compute node to "node3" and I modified the
> hosts file and added the IP-name mapping. I added "node3" to the mpd.hosts
> file, and gave it a try.
>
> [cluster at node0 ~]$ mpdboot -n 2
> mpdboot_node0 (handle_mpd_output 359): failed to ping mpd on node3; recvd
> output={}
>
> That is no good. But I can access the computer:
>
> [cluster at node0 ~]$ ping node3
> PING node3 (192.168.2.3) 56(84) bytes of data.
> 64 bytes from node3 (192.168.2.3): icmp_seq=0 ttl=64 time=0.313 ms
>
>
> And mpd loads fine on the head node:
> [cluster at node0 ~]$ mpdboot
> [cluster at node0 ~]$ mpdtrace
> node0
>
> Alright, at this point I guess I need to ask if anyone has any ideas? I am
> in bad need of some help. I thought I would be able to figure it out but the
> deadline for the project is quickly approaching.
>
> Here are the config files I needed to modify on the head node:
>
> /etc/exports:
> /home/cluster 192.168.2.0/255.255.255.0(rw)
>
> /etc/hosts.allow:
> portmap: 192.168.2.
> lockd: 192.168.2.
> rquotad: 192.168.2.
> mountd: 192.168.2.
> statd: 192.168.2.
>
> /etc/hosts.deny:
> portmap: ALL
> lockd: ALL
> mountd: ALL
> rquotad: ALL
> statd: ALL
>
> /etc/hosts:
> 127.0.0.1 localhost.localdomain localhost node0
> 192.168.2.3 node3
>
> /etc/sysconfig/network:
> NETWORKING=yes
> HOSTNAME=node0
>
> /home/cluster/mpd.hosts:
> node0
> node3
>
> /home/cluster/.mpd.conf:
> MPD_SECRETWORD=dynamite
>
> On the compute node I really only needed to modify the hosts and network
> files, along with adding the following line to the fstab file:
> 192.168.2.4:/home/cluster /home/cluster nfs rw,hard,intr 0 0
>
> I guess that is it. Thanks for any help you might be able to give me.
>
> Sam
-------------- next part --------------
*** mpd.py Mon Nov 28 17:40:19 2005
--- mpd.new Wed Nov 30 13:56:54 2005
***************
*** 173,180 ****
print self.parmdb['MPD_LISTEN_PORT']
sys.stdout.flush()
##### NEXT 2 for debugging
! print >>sys.stderr, self.parmdb['MPD_LISTEN_PORT']
! sys.stderr.flush()
self.myRealUsername = mpd_get_my_username()
self.currRingSize = 1 # default
self.currRingNCPUs = 1 # default
--- 173,180 ----
print self.parmdb['MPD_LISTEN_PORT']
sys.stdout.flush()
##### NEXT 2 for debugging
! ## print >>sys.stderr, self.parmdb['MPD_LISTEN_PORT']
! ## sys.stderr.flush()
self.myRealUsername = mpd_get_my_username()
self.currRingSize = 1 # default
self.currRingNCPUs = 1 # default