Thanks for the hint. The problem still occurs with the -disable-hostname-propagation option. I tried the short hostname, the long hostname, and the IP address directly. It's odd that OpenMPI works fine.<div><br><div><br></div>
<div><div>$ mpiexec -disable-hostname-propagation -n 2 -hosts host1 `pwd`/networld</div><div>Hello world (Rank: 0 / Host: host1)</div><div>Hello world (Rank: 1 / Host: host1)</div><div>Msg from 1: 'Hello from node rank 1.'</div>
<div><br></div><div>(this was executed on host1, and it ran fine on host2)</div><div><div>$ mpiexec -disable-hostname-propagation -n 2 -hosts host2 `pwd`/networld</div><div>Hello world (Rank: 0 / Host: host2)</div><div>Hello world (Rank: 1 / Host: host2)</div>
<div>Msg from 1: 'Hello from node rank 1.'</div></div><div><br></div><div>$ mpiexec -disable-hostname-propagation -n 2 -hosts host1,host2 `pwd`/networld</div><div><div>Hello world (Rank: 0 / Host: host1)</div><div>
Hello world (Rank: 1 / Host: host2)</div><div>Fatal error in MPI_Send: Other MPI error, error stack:</div><div>MPI_Send(173)..............: MPI_Send(buf=0x7fff619fc1e0, count=50, MPI_CHARACTER, dest=0, tag=1, MPI_COMM_WORLD) failed</div>
<div>MPID_nem_tcp_connpoll(1826): Communication error with rank 0: Connection refused</div></div><div><div><br></div><div>--</div>Cody R. Brown, M.Sc. Student<br> UBC Department of Computer Science<br> 201-2366 Main Mall, Vancouver, BC, V6T 1Z4<br>
Office: ICCS x409<br>
<br><br><div class="gmail_quote">On Tue, Nov 29, 2011 at 5:04 PM, Pavan Balaji <span dir="ltr"><<a href="mailto:balaji@mcs.anl.gov">balaji@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<br>
Can you try running mpiexec with the option -disable-hostname-propagation to see if it helps?<br>
<br>
-- Pavan<div class="HOEnZb"><div class="h5"><br>
<br>
On 11/30/2011 08:04 AM, Cody R. Brown wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hello;<br>
<br>
I am trying to install MPICH2 on our department machines. I can run a<br>
simple helloworld example (no MPI_Send). However, when I run an MPI<br>
program which requires an MPI_Send (or other TCP connection), it errors<br>
out with the following. The example is a simple helloworld example using<br>
an MPI_Send:<br>
<br>
cody$ mpiexec -n 2 -hosts host1,host2 ./networld<br>
Hello world (Rank: 0 / Host: host1)<br>
Hello world (Rank: 1 / Host: host2)<br>
Fatal error in MPI_Send: Other MPI error, error stack:<br>
MPI_Send(173)..............: MPI_Send(buf=0x7fff26b4cb80, count=50,<br>
MPI_CHARACTER, dest=0, tag=1, MPI_COMM_WORLD) failed<br>
MPID_nem_tcp_connpoll(1826): Communication error with rank 0: Connection<br>
refused<br>
<br>
<br>
We have determined there is no firewall between the machines, and<br>
passwordless ssh is set up, etc. I can telnet into the hydra daemon from<br>
the 2nd host. Interestingly, I can install OpenMPI, and it works fine.<br>
MPICH2 also runs fine on a single host (even if I run it purely on the<br>
remote host2 from the local host1). It fails only when we use 2+ hosts,<br>
so that it needs to make the TCP connection.<br>
<br>
For some reason MPICH2 can't seem to get the TCP connection info it<br>
needs to make the TCP connection between the machines.<br>
<br>
I am not too sure if there is much info you guys can give. I was just<br>
curious if you have seen or heard of this before. The system is an<br>
"openSUSE 11.4 (x86_64)". The MPICH2 version is 1.4.1p1.<br>
<br>
--<br>
Cody R. Brown<br>
UBC Department of Computer Science<br>
201-2366 Main Mall, Vancouver, BC, V6T 1Z4<br>
Office: ICCS x409<br>
</blockquote>
<br></div></div><span class="HOEnZb"><font color="#888888">
-- <br>
Pavan Balaji<br>
<a href="http://www.mcs.anl.gov/~balaji" target="_blank">http://www.mcs.anl.gov/~balaji</a><br>
<br>
</font></span></blockquote></div><br></div></div></div>
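The checks mentioned in the thread (trying the short hostname, the long hostname, and the IP, plus the telnet test against the hydra daemon) could be scripted roughly as follows. This is only a sketch and not a diagnosis of the failure above: the function names `resolved_addresses`, `is_loopback`, and `can_connect` are hypothetical helpers, not part of MPICH2 or Hydra. The idea is to see what addresses each host's name resolves to (a name that resolves only to a loopback address would not be reachable from the other machine) and to confirm that a plain TCP connect to a given port succeeds.

```python
import ipaddress
import socket


def resolved_addresses(hostname):
    """Return the sorted, de-duplicated IP addresses that `hostname` resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def is_loopback(address):
    """Return True if `address` (for example "127.0.0.2") is a loopback address."""
    # Drop an IPv6 scope suffix such as "%eth0" before parsing.
    return ipaddress.ip_address(address.split("%", 1)[0]).is_loopback


def can_connect(host, port, timeout=3.0):
    """Attempt a plain TCP connect, similar in spirit to the telnet test."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Running `resolved_addresses(socket.gethostname())` on host1 and on host2, and passing each result through `is_loopback`, would show whether either machine maps its own name to a loopback address. `can_connect("host1", port)` run from host2 (with `port` being whatever port is under test) repeats the telnet-style check without needing telnet.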