[mpich2-dev] HYDRA: Using Multiple Ethernet Interfaces
Pavan Balaji
balaji at mcs.anl.gov
Tue Aug 31 18:33:16 CDT 2010
Cody,
Thanks for the report. Yes, this was indeed a bug. I've fixed it in
r7124 [http://trac.mcs.anl.gov/projects/mpich2/changeset/7124].
I've created a "nightly snapshot" for you to try it out
[http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/hydra].
It'll also be there in the next release.
FYI, Hydra tries to autodetect any available RMK. In your case, it
autodetected PBS and tried to use it. If you explicitly specify a
hostfile (using the -f option), it'll ignore the RMK and use the
user-specified host file instead. Similarly, you can force Hydra not to
autodetect any RMK by using "-rmk none" (I changed the name from "dummy"
to "none", as it's more intuitive).
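For reference, the two ways to bypass the autodetected RMK described above would look roughly like this (the hostfile name and interface are taken from the thread below; the "-rmk none" spelling applies to builds at or after r7124):

```shell
# Option 1: an explicit hostfile (-f) makes Hydra ignore the
# autodetected RMK (PBS in this case) and launch on the hosts
# listed in the file, over the requested interface.
mpiexec -f hostsGigE -iface eth2 -n 2 ./osu_bw

# Option 2: turn off RMK autodetection entirely.
mpiexec -rmk none -f hostsGigE -n 2 ./osu_bw
```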
-- Pavan
On 08/31/2010 05:15 PM, Cody R. Brown wrote:
> Just a small follow up.
>
> I am using PBS (Torque). It seems to me that a possible reason is that it
> always uses the PBS nodes that were requested, even if I tell it to use
> the other eth2 IPs or unset the HYDRA_RMK environment variable.
>
> Looking at it a bit more, the -rmk flag defaults to "pbs", so it may
> still be using PBS. The help (mpiexec -h) says "-rmk dummy" is a valid
> option, but when I use it, it errors out saying it is not a valid
> option. I can't seem to change -rmk to anything other than pbs. Could
> this parameter be taking precedence over all the others (-hosts, -f,
> -iface), so it always uses the node IPs PBS gave me (the 10GigE
> network in this case)?
>
> ~cody
>
> On Tue, Aug 31, 2010 at 2:11 PM, Cody R. Brown <cody at cs.ubc.ca
> <mailto:cody at cs.ubc.ca>> wrote:
>
> Hello;
>
> I have a system with 2 Ethernets (eth0 and eth2). eth0 is connected
> to a 10GigE switch, and eth2 is connected to a separate GigE switch.
>
> Using HYDRA in version 1.2.1p1, when I want to use a different
> interface, I get my desired results. The file "hostsGigE" has 2 host
> names with the GigE interface IPs, while "hosts10GigE" has 2 10GigE
> IPs. The following commands work:
> # mpiexec -f hostsGigE -n 2 `pwd`/osu_bw ---Shows bandwidth around 117MB/s
> # mpiexec -f hosts10GigE -n 2 `pwd`/osu_bw ---Shows bandwidth around 900MB/s
>
> When using the latest MPICH2 (1.3a2 and 1.3b1), it always seems to
> be using the 10GigE network:
> # mpiexec -f hostsGigE -n 2 `pwd`/osu_bw --Shows bandwidth around 900MB/s
> # mpiexec -f hosts10GigE -n 2 `pwd`/osu_bw --Shows bandwidth around 900MB/s
>
> These commands also show bandwidth around 900MB/s (i.e., using the
> -iface, -hosts, and -f flags, and using IPs instead of hostnames):
> # mpiexec -f hosts10GigE -n 2 -iface eth2 `pwd`/osu_bw
> # mpiexec -hosts node01-eth2,node02-eth2 -iface eth2 -n 2
> `pwd`/osu_bw
> # mpiexec -hosts 172.20.101.1,172.20.101.2 -n 2 `pwd`/osu_bw
>
>
> Anyone know what I am doing wrong? And why does it work as expected in
> the HYDRA 1.2.1p1 version but not in the latest 1.3b1? I am a little
> confused about how it even knows about the 10GigE network when I only
> gave it GigE hostnames. Perhaps my system is sending the traffic out on
> the 10GigE network, but then why does it work fine in 1.2.1p1?
>
> The system I am running on is Linux: CentOS 5.5. It is a cluster
> running with PBS (Torque). I do have HYDRA_RMK set to "pbs", but I
> also tried it with this environment variable unset; the command-line
> parameters seem to make no difference either way. The info here
> "http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager#Hydra_with_Non-Ethernet_Networks"
> shows that what I am doing should work. My "ifconfig" output is below.
>
> Any help would be appreciated.
>
> ~cody
>
>
> eth0      Link encap:Ethernet  HWaddr 00:1B:21:69:79:A0
>           inet addr:192.168.20.1  Bcast:192.168.20.255  Mask:255.255.255.0
>           inet6 addr: fe80::21b:21ff:fe69:79a0/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:7454839 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:149930410 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:45436437528 (42.3 GiB)  TX bytes:221935890089 (206.6 GiB)
>
> eth2      Link encap:Ethernet  HWaddr E4:1F:13:4D:13:0E
>           inet addr:172.20.101.1  Bcast:172.20.101.255  Mask:255.255.255.0
>           inet6 addr: fe80::e61f:13ff:fe4d:130e/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:556581 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:8745499 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:39219489 (37.4 MiB)  TX bytes:12766433186 (11.8 GiB)
>           Memory:92b60000-92b80000
>
>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji