[mpich2-dev] Problem making a TCP Connection

Cody R. Brown cody at cs.ubc.ca
Wed Nov 30 02:32:02 CST 2011


Thanks for the hint. The problem still exists with
-disable-hostname-propagation. I tried the short hostname, the
long hostname, and the direct IP address. It's weird that OpenMPI works fine.


$ mpiexec -disable-hostname-propagation -n 2 -hosts host1 `pwd`/networld
Hello world (Rank: 0 / Host: host1)
Hello world (Rank: 1 / Host: host1)
Msg from 1: 'Hello from node rank 1.'

(this was executed on host1 and ran fine on host2)
$ mpiexec -disable-hostname-propagation -n 2 -hosts host2 `pwd`/networld
Hello world (Rank: 0 / Host: host2)
Hello world (Rank: 1 / Host: host2)
Msg from 1: 'Hello from node rank 1.'

$ mpiexec -disable-hostname-propagation -n 2 -hosts host1,host2 `pwd`/networld
Hello world (Rank: 0 / Host: host1)
Hello world (Rank: 1 / Host: host2)
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)..............: MPI_Send(buf=0x7fff619fc1e0, count=50, MPI_CHARACTER, dest=0, tag=1, MPI_COMM_WORLD) failed
MPID_nem_tcp_connpoll(1826): Communication error with rank 0: Connection refused
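
A "Connection refused" from the remote rank can happen when a host
advertises an address that its peers cannot reach. Some distributions map
the machine's own hostname to a loopback address in /etc/hosts, so one
thing worth ruling out on each host is what the local hostname resolves
to. The following is only a minimal sketch under that assumption, not a
confirmed diagnosis of this failure; the function names are made up for
illustration and are not part of MPICH2 or Hydra.

```python
import ipaddress
import socket


def resolved_addresses(hostname):
    """Return the sorted, de-duplicated IP address strings for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def loopback_addresses(addresses):
    """Return only the addresses that are loopback (127.0.0.0/8 or ::1)."""
    result = []
    for addr in addresses:
        # Drop any IPv6 zone suffix such as "%eth0" before parsing.
        bare = addr.split("%", 1)[0]
        if ipaddress.ip_address(bare).is_loopback:
            result.append(addr)
    return result


def main():
    name = socket.gethostname()
    try:
        addrs = resolved_addresses(name)
    except socket.gaierror as exc:
        print(f"{name} does not resolve: {exc}")
        return
    print(f"{name} resolves to: {', '.join(addrs)}")
    bad = loopback_addresses(addrs)
    if bad:
        # A peer on another host cannot connect to a loopback address.
        print(f"warning: loopback address(es) in the result: {', '.join(bad)}")


if __name__ == "__main__":
    main()
```

Running this on host1 and on host2 shows which addresses each machine
associates with its own name. A loopback entry in the output would be a
reason to look at /etc/hosts on that machine.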

--
Cody R. Brown, M.Sc. Student
  UBC Department of Computer Science
  201-2366 Main Mall, Vancouver, BC, V6T 1Z4
  Office: ICCS x409


On Tue, Nov 29, 2011 at 5:04 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:

>
> Can you try running mpiexec with the option -disable-hostname-propagation
> to see if it helps?
>
>  -- Pavan
>
>
> On 11/30/2011 08:04 AM, Cody R. Brown wrote:
>
>> Hello;
>>
>> I am trying to install MPICH2 on our department machines. I can run a
>> simple helloworld example (no MPI_Send). However, when I run an MPI
>> program which requires an MPI_Send (or other TCP connection), it errors
>> out with the following. The example is a simple helloworld example using
>> an MPI_Send:
>>
>> cody$ mpiexec -n 2 -hosts host1,host2 ./networld
>> Hello world (Rank: 0 / Host: host1)
>> Hello world (Rank: 1 / Host: host2)
>> Fatal error in MPI_Send: Other MPI error, error stack:
>> MPI_Send(173)..............: MPI_Send(buf=0x7fff26b4cb80, count=50, MPI_CHARACTER, dest=0, tag=1, MPI_COMM_WORLD) failed
>> MPID_nem_tcp_connpoll(1826): Communication error with rank 0: Connection refused
>>
>>
>> We have determined there is no firewall between the machines, and
>> passwordless ssh is set up, etc. I can telnet into the hydra daemon from
>> the 2nd host. Interestingly, I can install OpenMPI, and it works fine.
>> MPICH2 runs fine on a single host (even if I run it purely on the remote
>> host2 from the local host1, it works). It fails only when we use 2+ hosts,
>> so that it needs to make the TCP connection.
>>
>> For some reason MPICH2 can't seem to get the TCP connection info to make
>> the TCP connection between the machines.
>>
>> I am not too sure if there is much info you guys can give. I was just
>> curious if you have seen or heard of this before. The system is
>> "openSUSE 11.4 (x86_64)". The MPICH2 version is 1.4.1p1.
>>
>> --
>> Cody R. Brown
>>   UBC Department of Computer Science
>>   201-2366 Main Mall, Vancouver, BC, V6T 1Z4
>>   Office: ICCS x409
>>
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>
>