[mpich-discuss] RE : Connection refused with 3 processes, no issue with 2 processes.

BOUVIER Benjamin benjamin.bouvier at thalesgroup.com
Tue Jun 12 09:04:37 CDT 2012


Hi,

Thank you for your answer. I have since found the source of the problem, on the OpenMPI mailing list: the network configuration between the nodes was wrong because of their network interfaces.
Briefly, in the hope that it helps other people with the same issue: each node had 2 network interfaces, eth0 and eth1. One pair of nodes would communicate over eth0, while another pair could use eth1. As a result, if a triplet contained pairs that communicated over different interfaces, the launch failed (for instance, if node1 and node2 communicate through eth0 and node1 and node3 through eth1, then node1, node2 and node3 cannot all communicate together).
The MPICH issue had the same cause. The solution was to specify which network interface to use when calling mpiexec (with the OpenMPI implementation, you add the parameter "-mca btl_tcp_if_include eth1" to the mpiexec command line).

Hope this helps.
--
Benjamin BOUVIER

________________________________________

This is almost certainly a firewall or network configuration issue.  I would start by disabling any firewall while debugging and then reenabling it later in order to figure out which rules should be altered.

What does your hostfile look like?  How are you invoking mpiexec?  Depending on the answers to these questions, it's entirely reasonable for you to have a problem with 3 processes but not 2.

-Dave

On Jun 11, 2012, at 9:37 AM CDT, BOUVIER Benjamin wrote:

> Hi everybody,
>
> I'm a new user of MPICH2 and I'm running into an issue with a very simple program. When I launch the program locally, there's no problem. When I launch it on two network-connected machines, there's no problem either. But when I launch it on three network-connected machines, I get the error "Connection refused".
> The three machines are correctly connected: I can connect over SSH from any one of them to another.
>
> I first tried OpenMPI instead of MPICH2, but with OMPI the program hangs as soon as 2 different machines are involved. The fact that it fails with 2 different MPI implementations makes me think that it's not a library bug but rather a network or local system issue.
>
> Here is the sample program:
>
> # include <mpi.h>
> # include <stdio.h>
> # include <string.h>
>
> int main(int argc, char **argv)
> {
>    int rank, size;
>    const char someString[] = "Can haz cheezburgerz?";
>
>    MPI_Init(&argc, &argv);
>
>    MPI_Comm_rank( MPI_COMM_WORLD, & rank );
>    MPI_Comm_size( MPI_COMM_WORLD, & size );
>
>    if ( rank == 0 )
>    {
>        int n = 42;
>        int i;
>        for( i = 1; i < size; ++i)
>        {
>            MPI_Send( &n, 1, MPI_INT, i, 0, MPI_COMM_WORLD );
>            MPI_Send( &someString, strlen( someString )+1, MPI_CHAR, i, 0, MPI_COMM_WORLD );
>        }
>    } else {
>        char buffer[ 128 ];
>        int received;
>        MPI_Status stat;
>        MPI_Recv( &received, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat );
>        printf( "[Worker] Number : %d\n", received );
>        MPI_Recv( buffer, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat );
>        printf( "[Worker] String : %s\n", buffer );
>    }
>
>    MPI_Finalize();
>    return 0;
> }
>
> When I launch the program on 3 machines connected by network, here is the message I get:
>
> [Worker] Number : 42
> [Worker] String : Can haz cheezburgerz?
> Fatal error in MPI_Send: Other MPI error, error stack:
> MPI_Send(173)..............: MPI_Send(buf=0x7fff6c4bac2c, count=1, MPI_INT, dest=2, tag=0, MPI_COMM_WORLD) failed
> MPID_nem_tcp_connpoll(1826): Communication error with rank 2: Connection refused
>
> Launching the program on 2 machines doesn't show any particular issue.
>
> I'm using MPICH2 version 1.4.1p1, locally compiled.
> $ uname -a
> Linux trtp7097 2.6.32-220.13.1.el6.x86_64 #1 SMP Thu Mar 29 11:46:40 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
> $ cat /etc/redhat-release
> Red Hat Enterprise Linux Workstation release 6.2 (Santiago)
>
> I saw in another thread that it could be a firewall issue. That seems surprising: if there were a rule blocking connections, it would also apply with 2 processes.
> Does anybody have any idea?
> Thanks in advance for your help,
> --
> Benjamin Bouvier
> _______________________________________________
> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
