[mpich2-dev] HYDRA: Using Multiple Ethernet Interfaces
Cody R. Brown
cody at cs.ubc.ca
Tue Aug 31 17:15:46 CDT 2010
Just a small follow-up.
I am using PBS (Torque). A possible reason is that it always uses the PBS
nodes that were requested, even if I tell it to use the other eth2 IPs or
unset the HYDRA_RMK environment variable. Looking at it a bit more, the
-rmk flag defaults to "pbs", so it may still be using PBS. The help
(mpiexec -h) says "-rmk dummy" is a valid option, but when I use it,
mpiexec errors out saying it is not a valid option. I can't seem to change
-rmk to anything other than pbs. Could this parameter be taking precedence
over all the others (-hosts, -f, -iface), so that it always uses the node
IPs PBS gave me (the 10GigE network in this case)?
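Independent of the -rmk question, the kernel's own routing choice can be checked directly with `ip route get`, which prints the device and source address that would be used to reach a given peer IP. This is a generic Linux diagnostic, not an MPICH feature; 172.20.101.2 is the second node's eth2 address from the commands quoted below, while 192.168.20.2 is an assumed eth0 address for that node.

```shell
# Ask the routing table which device would carry traffic to each peer;
# the "dev" field in the output names the interface the kernel picked.
ip route get 172.20.101.2   # should go out dev eth2 (the GigE subnet)
ip route get 192.168.20.2   # should go out dev eth0 (the 10GigE subnet)
```

If both destinations route out the same device, the bandwidth numbers would look identical regardless of which host file is passed to mpiexec.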
~cody
On Tue, Aug 31, 2010 at 2:11 PM, Cody R. Brown <cody at cs.ubc.ca> wrote:
> Hello;
>
> I have a system with 2 Ethernets (eth0 and eth2). eth0 is connected to a
> 10GigE switch, and eth2 is connected to a separate GigE switch.
>
> Using HYDRA in version 1.2.1p1, I get my desired results when I want to
> use a different interface. The file "hostsGigE" has 2 host names with the
> GigE interface IPs, while "hosts10GigE" has 2 10GigE IPs. The following
> commands work:
> # mpiexec -f hostsGigE -n 2 `pwd`/osu_bw      --- shows bandwidth around 117 MB/s
> # mpiexec -f hosts10GigE -n 2 `pwd`/osu_bw    --- shows bandwidth around 900 MB/s
>
> When using the latest MPICH2 (1.3a2 and 1.3b1), it seems to always use
> the 10GigE network:
> # mpiexec -f hostsGigE -n 2 `pwd`/osu_bw      --- shows bandwidth around 900 MB/s
> # mpiexec -f hosts10GigE -n 2 `pwd`/osu_bw    --- shows bandwidth around 900 MB/s
>
> These commands also show bandwidth around 900 MB/s (including using the
> IP addresses instead of hostnames, i.e. using the -iface, -hosts, and -f
> flags):
> # mpiexec -f hosts10GigE -n 2 -iface eth2 `pwd`/osu_bw
> # mpiexec -hosts node01-eth2,node02-eth2 -iface eth2 -n 2 `pwd`/osu_bw
> # mpiexec -hosts 172.20.101.1,172.20.101.2 -n 2 `pwd`/osu_bw
>
>
> Does anyone know what I am doing wrong, and why it works as expected in
> HYDRA 1.2.1p1 but not in the latest 1.3b1? I am a little confused about
> how it even knows about the 10GigE network when I only gave it GigE
> hostnames. Perhaps my system is routing the traffic out on the 10GigE
> network, but then why does it work fine in 1.2.1p1?
>
> The system I am running on is Linux (CentOS 5.5). It is a cluster running
> with PBS (Torque). I do have HYDRA_RMK set to "pbs", but I also tried it
> with this environment variable unset; it seems the command line parameters
> take precedence. The info at
> http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager#Hydra_with_Non-Ethernet_Networks
> suggests that what I am doing should work. My "ifconfig" output is below.
>
> Any help would be appreciated.
>
> ~cody
>
>
> eth0      Link encap:Ethernet  HWaddr 00:1B:21:69:79:A0
>           inet addr:192.168.20.1  Bcast:192.168.20.255  Mask:255.255.255.0
>           inet6 addr: fe80::21b:21ff:fe69:79a0/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:7454839 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:149930410 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:45436437528 (42.3 GiB)  TX bytes:221935890089 (206.6 GiB)
>
> eth2      Link encap:Ethernet  HWaddr E4:1F:13:4D:13:0E
>           inet addr:172.20.101.1  Bcast:172.20.101.255  Mask:255.255.255.0
>           inet6 addr: fe80::e61f:13ff:fe4d:130e/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:556581 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:8745499 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:39219489 (37.4 MiB)  TX bytes:12766433186 (11.8 GiB)
>           Memory:92b60000-92b80000
>
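A practical way to confirm which NIC actually carried the benchmark traffic is to watch the per-interface TX byte counters in sysfs during an osu_bw run. A minimal sketch (a generic diagnostic, not part of MPICH; it defaults to lo so it runs anywhere, and you would pass eth0 or eth2 on the cluster nodes):

```shell
#!/bin/sh
# Sample an interface's TX byte counter over a one-second window.
# Run on a node while osu_bw is going: the interface whose counter
# jumps by hundreds of MB is the one actually carrying the traffic.
IFACE=${1:-lo}   # pass eth0 or eth2 on the cluster; lo is a safe default
BEFORE=$(cat /sys/class/net/"$IFACE"/statistics/tx_bytes)
sleep 1
AFTER=$(cat /sys/class/net/"$IFACE"/statistics/tx_bytes)
echo "$IFACE sent $((AFTER - BEFORE)) bytes in 1s"
```

Comparing the deltas on eth0 and eth2 side by side would show directly whether the -iface/-f settings are being honored, independent of what the bandwidth number suggests.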