[MPICH] Machine with multiple IP addresses

Wed Sep 21 16:14:43 CDT 2005

> Date: Wed, 21 Sep 2005 14:44:27 -0400
> From: Sophie Duchesne <sophie.duchesne at ete.inrs.ca>
> To: mpich-discuss at mcs.anl.gov
> Subject: [MPICH] Machine with multiple IP addresses
>
> Hello all,
> I am very new to MPICH and to parallel computing. One machine in our cluster
> has 2 IP addresses. I am still trying to figure out how I can configure
> MPICH in order for it to know which address to use. Currently, it does not
> use the right address and I get this kind of messages:
>             Error 10065, process 5, host <hostname> Unable to connect to <IP
> address> on port <port number>
>
> Thank you for your help!
>
> Sophie Duchesne

Hi:

Some of the more confusing situations we encounter involve machines
with multiple interfaces.  A typical case is a cluster whose head node
has two interfaces (say eth0 and eth1).  This becomes a problem because
it is common to think of the hostname as being associated with an
interface.  When there is just one interface, things are usually simple
and straightforward.  However, attempts to associate a single hostname
with two or more interfaces often lead to confusion.

Sometimes the waters can be muddied a bit more if the head node's "first"
interface (eth0) is the one attached to the Internet, and the "second"
interface (eth1) is the one attached to the switch (or hub) for the rest
of the cluster.  What can make this particularly confusing is that eth0
is often associated with the hostname and yet internal cluster nodes,
which must reach the head node via eth1, wish to refer to it by that same
hostname.
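
If you are unsure which address a name resolves to on a particular
machine, a quick check from Python can help.  This is only a sketch: the
function name addresses_for is my own and is not part of MPD, and it
simply asks the local resolver, so the answer depends on that machine's
own configuration.

```python
import socket

def addresses_for(name):
    """Return the sorted IPv4 addresses the local resolver gives for name."""
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.  A set removes duplicates.
    return sorted({info[4][0] for info in infos})

# "localhost" is used here only because it resolves almost everywhere.
print(addresses_for("localhost"))
```

Passing socket.gethostname() instead of "localhost" shows which address
the machine associates with its own hostname.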

There are probably dozens of ways to configure things so that they work
properly.  Unfortunately, there are at least as many ways to get it wrong.
So I am going to demonstrate one method below that is sufficient to
support the use of MPD on such a cluster.  The demonstration has several
parts, beginning with part I.

In this sample cluster, there are two hosts, heckle and jeckle.
Jeckle is the head node and has two interfaces.  Part I shows the
ifconfig output from jeckle and shows both interfaces, eth0 and eth1.
Part II is the ifconfig output from heckle and shows that it has just
the one interface, eth0.  On both machines, the lo interface is the
loopback interface.

I.  ----------------------------------------------------------------

ifconfig from jeckle (head node):

Note that eth0 has a "real" IP address and is attached to the Internet.
Interface eth1 has a private subnet address 192.168.1.1 that is used only
locally on the switch to which all cluster machines are attached.

eth0      Link encap:Ethernet  HWaddr 00:07:E9:08:DC:EA
          inet addr:161.45.164.113  Bcast:161.45.167.255  Mask:255.255.248.0
          inet6 addr: fe80::207:e9ff:fe08:dcea/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:20770444 errors:0 dropped:0 overruns:0 frame:0
          TX packets:87273 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2235669962 (2.0 GiB)  TX bytes:14936504 (14.2 MiB)
          Base address:0xece0 Memory:fe1e0000-fe200000

eth1      Link encap:Ethernet  HWaddr 00:08:74:40:1E:C5
          inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::208:74ff:fe40:1ec5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:34581 errors:0 dropped:0 overruns:0 frame:0
          TX packets:41778 errors:0 dropped:0 overruns:0 carrier:4
          collisions:59 txqueuelen:1000
          RX bytes:4380292 (4.1 MiB)  TX bytes:4768788 (4.5 MiB)
          Interrupt:18 Base address:0xec00

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:2315 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2315 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:249879 (244.0 KiB)  TX bytes:249879 (244.0 KiB)

II.  ----------------------------------------------------------------

ifconfig from heckle (cluster node):

Note that there is just one interface, eth0.  It has the private subnet
address 192.168.1.2 and is attached to the switch for the cluster.

eth0      Link encap:Ethernet  HWaddr 00:B0:D0:F7:FB:14
          inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::2b0:d0ff:fef7:fb14/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:35068 errors:0 dropped:0 overruns:1 frame:0
          TX packets:34569 errors:0 dropped:0 overruns:0 carrier:3
          collisions:20 txqueuelen:1000
          RX bytes:4085058 (3.8 MiB)  TX bytes:4377654 (4.1 MiB)
          Interrupt:11 Base address:0xdc00

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:579 errors:0 dropped:0 overruns:0 frame:0
          TX packets:579 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:58564 (57.1 KiB)  TX bytes:58564 (57.1 KiB)
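
The addresses and netmasks in parts I and II already tell us which of
jeckle's addresses heckle can reach directly.  Here is a small sketch of
that check using Python's standard ipaddress module.  The constant names
and the helper directly_reachable are mine, chosen for illustration, and
"directly reachable" here just means "in the same subnet"; any routing
is ignored.

```python
import ipaddress

# Addresses and netmasks copied from the ifconfig output in parts I and II.
JECKLE_ETH0 = ipaddress.ip_interface("161.45.164.113/255.255.248.0")
JECKLE_ETH1 = ipaddress.ip_interface("192.168.1.1/255.255.255.0")
HECKLE_ETH0 = ipaddress.ip_interface("192.168.1.2/255.255.255.0")

def directly_reachable(local, remote):
    """Return True if remote's address lies in local's own subnet."""
    return remote.ip in local.network

print(JECKLE_ETH0.network)                           # subnet of jeckle's eth0
print(directly_reachable(HECKLE_ETH0, JECKLE_ETH1))  # heckle -> 192.168.1.1
print(directly_reachable(HECKLE_ETH0, JECKLE_ETH0))  # heckle -> 161.45.164.113
```

Only jeckle's eth1 address shares a subnet with heckle, which is why the
cluster nodes have to reach the head node at 192.168.1.1.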

III.  ----------------------------------------------------------------

In /etc/nsswitch.conf on both jeckle and heckle:
    hosts:          files dns

This entry indicates that we wish to resolve host info from files
first (e.g. /etc/hosts), then dns.  Of course, only the head node
really has access to a dns server, via its eth0.

IV.  ----------------------------------------------------------------

/etc/hosts on heckle (cluster node):
127.0.0.1  localhost.localdomain localhost
192.168.1.2  heckle
192.168.1.1  jeckle

V.  ----------------------------------------------------------------

/etc/hosts on jeckle (head node):
127.0.0.1  localhost.localdomain localhost
161.45.162.50 torvalds torvalds.cs.mtsu.edu  # DNS server
161.45.164.113  jeckle
192.168.1.2 heckle
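
With the two /etc/hosts files from parts IV and V, the name jeckle maps
to a different address depending on which machine does the lookup.  The
sketch below makes that visible.  The parser parse_hosts is a simplified
stand-in of my own, not the system resolver: it keeps the first address
listed for each name and ignores everything after a "#".

```python
def parse_hosts(text):
    """Map each name in /etc/hosts-style text to its first listed address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name, address)   # first entry for a name wins
    return table

# Contents copied from part IV (heckle) and part V (jeckle).
HECKLE_HOSTS = """\
127.0.0.1  localhost.localdomain localhost
192.168.1.2  heckle
192.168.1.1  jeckle
"""

JECKLE_HOSTS = """\
127.0.0.1  localhost.localdomain localhost
161.45.162.50 torvalds torvalds.cs.mtsu.edu  # DNS server
161.45.164.113  jeckle
192.168.1.2 heckle
"""

on_heckle = parse_hosts(HECKLE_HOSTS)
on_jeckle = parse_hosts(JECKLE_HOSTS)
print(on_heckle["jeckle"])  # address heckle uses for the head node
print(on_jeckle["jeckle"])  # address jeckle uses for itself
```

On heckle the name resolves to the eth1 address, while on jeckle it
resolves to the eth0 address.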

VI.  ----------------------------------------------------------------

Now, we are ready to run an MPD ring and use it to execute programs
on both nodes of our cluster.  First, we will start the MPDs by hand.
Below, there is another example where I start them via mpdboot.  Really,
you should probably start them by hand during initial testing.

On jeckle in window 1:
rbutler@jeckle:~/mpd$ ./mpdallexit.py  <--- make sure old ring is down
rbutler@jeckle:~/mpd$ ./mpd.py -e --ifhn=192.168.1.1 &
[1] 22696  <--- This shows us the pid for the background job
33431  <--- This is the port# printed by the starting mpd

On heckle in window 2:
heckle:~/mpd> ./mpd.py -h jeckle -p 33431 &  <-- note the port#

Back on jeckle in window 1:
rbutler@jeckle:~/mpd$ ./mpiexec.py -n 2 hostname
jeckle
heckle
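
If the second mpd cannot join the ring, it can help to test plain TCP
connectivity to the first mpd's address and port before looking at MPD
itself.  The following is a minimal sketch using only Python's standard
socket module.  The helper name can_connect and the 3 second default
timeout are my own choices.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False

# Example for the session above, run on heckle.  The port number is the
# one printed by the starting mpd and changes from run to run:
#     can_connect("192.168.1.1", 33431)
```

A False result for the private address suggests a network or hosts-file
problem rather than an MPD problem.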

VII.  ----------------------------------------------------------------

The same demo again using mpdboot to start the ring.

rbutler@jeckle:~/mpd$ ./mpdallexit.py  <--- make sure old ring is down
rbutler@jeckle:~/mpd$ cat mpd.hosts
heckle
rbutler@jeckle:~/mpd$ ./mpdboot.py --ifhn=192.168.1.1 -n 2
rbutler@jeckle:~/mpd$ ./mpdtrace.py -l
jeckle_33409 (192.168.1.1)
heckle_32900 (192.168.1.1)
rbutler@jeckle:~/mpd$
rbutler@jeckle:~/mpd$ ./mpiexec.py -n 2 hostname
jeckle
heckle