[mpich2-dev] HYDRA: Using Multiple Ethernet Interfaces

Pavan Balaji balaji at mcs.anl.gov
Tue Aug 31 18:33:16 CDT 2010


Cody,

Thanks for the report. Yes, this was indeed a bug. I've fixed it in 
r7124 [http://trac.mcs.anl.gov/projects/mpich2/changeset/7124].

I've created a "nightly snapshot" for you to try out 
[http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/hydra]. 
The fix will also be in the next release.

FYI, Hydra tries to autodetect any available RMK (resource management 
kernel). In your case, it autodetected PBS and tried to use it. If you 
explicitly specify a hostfile (using the -f option), it'll ignore the 
RMK and use the user-specified host file instead. Similarly, you can 
force Hydra not to autodetect any RMK by using "-rmk none" (I changed 
the name to "none" from "dummy", as it's more intuitive).

  -- Pavan

On 08/31/2010 05:15 PM, Cody R. Brown wrote:
> Just a small follow-up.
>
> I am using PBS (Torque). It seems to me that a possible cause is that
> it always uses the PBS nodes that were requested, even if I tell it to
> use the other eth2 IPs or unset the HYDRA_RMK environment variable.
>
> Looking at it a bit more, the -rmk flag defaults to "pbs", so it may
> still be using PBS. The help output (mpiexec -h) lists "-rmk dummy" as
> a valid option, but when I use it, mpiexec errors out saying it is not
> a valid option. I can't seem to change -rmk to anything other than pbs.
> Could this parameter be taking precedence over all the others (-hosts,
> -f, -iface), so that it always uses the node IPs PBS gave me (the
> 10GigE network in this case)?
>
> ~cody
>
> On Tue, Aug 31, 2010 at 2:11 PM, Cody R. Brown <cody at cs.ubc.ca> wrote:
>
>     Hello;
>
>     I have a system with two Ethernet interfaces (eth0 and eth2). eth0
>     is connected to a 10GigE switch, and eth2 is connected to a
>     separate GigE switch.
>
>     Using HYDRA in version 1.2.1p1, when I want to use a different
>     interface, I get the desired results. The file "hostsGigE" has two
>     host names with the GigE interface IPs, while "hosts10GigE" has two
>     10GigE IPs. The following commands work:
>          # mpiexec -f hostsGigE -n 2 `pwd`/osu_bw         ---Shows
>     bandwidth around 117MB/s
>          # mpiexec -f hosts10GigE -n 2 `pwd`/osu_bw      ---Shows
>     bandwidth around 900MB/s
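>
>     (For reference, each hostfile just lists one host per line, in
>     Hydra's plain hostfile format; e.g. "hostsGigE" contains something
>     like:)
>          node01-eth2
>          node02-eth2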
>
>     When using the latest MPICH2 (1.3a2 and 1.3b1), it seems to always
>     be using the 10GigE network:
>          # mpiexec -f hostsGigE -n 2 `pwd`/osu_bw
>     --Shows bandwidth around 900MB/s
>          # mpiexec -f hosts10GigE -n 2 `pwd`/osu_bw
>     --Shows bandwidth around 900MB/s
>
>     These commands also show bandwidth around 900MB/s (i.e., using the
>     -iface, -hosts, and -f flags, including using IPs instead of
>     hostnames):
>          # mpiexec -f hosts10GigE -n 2 -iface eth2 `pwd`/osu_bw
>          # mpiexec -hosts node01-eth2,node02-eth2 -iface eth2 -n 2
>     `pwd`/osu_bw
>          # mpiexec -hosts 172.20.101.1,172.20.101.2 -n 2 `pwd`/osu_bw
>
>
>     Does anyone know what I am doing wrong, and why it works as
>     expected in the HYDRA 1.2.1p1 version but not in the latest 1.3b1?
>     I am a little confused about how it even knows about the 10GigE
>     network when I only gave it GigE hostnames. Perhaps my system is
>     routing the traffic out on the 10GigE network, but then why does it
>     work fine in 1.2.1p1?
>
>     The system I am running on is Linux, CentOS 5.5. It is a cluster
>     running with PBS (Torque). I do have HYDRA_RMK set to "pbs", but I
>     also tried it with this environment variable unset; the
>     command-line parameters seem to take precedence. The info here
>     "http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager#Hydra_with_Non-Ethernet_Networks"
>     suggests that what I am doing should work. My "ifconfig" output is
>     below.
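>
>     (Concretely, I tried both of these before re-running mpiexec:)
>          # export HYDRA_RMK=pbs
>          # unset HYDRA_RMK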
>
>     Any help would be appreciated.
>
>     ~cody
>
>
>     eth0      Link encap:Ethernet  HWaddr 00:1B:21:69:79:A0
>               inet addr:192.168.20.1  Bcast:192.168.20.255  Mask:255.255.255.0
>               inet6 addr: fe80::21b:21ff:fe69:79a0/64 Scope:Link
>               UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>               RX packets:7454839 errors:0 dropped:0 overruns:0 frame:0
>               TX packets:149930410 errors:0 dropped:0 overruns:0 carrier:0
>               collisions:0 txqueuelen:1000
>               RX bytes:45436437528 (42.3 GiB)  TX bytes:221935890089 (206.6 GiB)
>
>     eth2      Link encap:Ethernet  HWaddr E4:1F:13:4D:13:0E
>               inet addr:172.20.101.1  Bcast:172.20.101.255  Mask:255.255.255.0
>               inet6 addr: fe80::e61f:13ff:fe4d:130e/64 Scope:Link
>               UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>               RX packets:556581 errors:0 dropped:0 overruns:0 frame:0
>               TX packets:8745499 errors:0 dropped:0 overruns:0 carrier:0
>               collisions:0 txqueuelen:1000
>               RX bytes:39219489 (37.4 MiB)  TX bytes:12766433186 (11.8 GiB)
>               Memory:92b60000-92b80000
>
>

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

