[mpich-discuss] child node can't contact parent?

Jayesh Krishna jayesh at mcs.anl.gov
Wed Aug 27 17:12:11 CDT 2008


 Hi,
  Looks like something is wrong with the setup of your machines.

# Can you ping from one machine to the other?

  - From Computer1 try pinging Computer2
  - From Computer2 try pinging Computer1
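
   For example, using the static addresses you assigned:

     ping 192.168.5.200     (from Computer1)
     ping 192.168.5.100     (from Computer2)

   Also try pinging by hostname (ping Computer2), since your error stack
   below shows a name-resolution (gethostbyname) failure.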

# Start debugging by running a non-MPI program (like hostname)

   mpiexec -hosts 2 IPAddress_Of_Computer1 IPAddress_Of_Computer2 hostname

# Then debug with a simple hello world program (don't debug your setup
with a complex program)

----------------- hello world ---------------
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]){
	int rank=-1;
	MPI_Init(&argc, &argv);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	printf("[%d] Hello world\n", rank);
	MPI_Finalize();
	return 0;
}
----------------- hello world ---------------
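
 Build the program on both machines (or copy the binary to the same path
 on each machine; the path below is only a placeholder) and launch it the
 same way as hostname:

   mpiexec -hosts 2 IPAddress_Of_Computer1 IPAddress_Of_Computer2 C:\Temp\hello.exe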

 Let us know the results.

Regards,
Jayesh

-----Original Message-----
From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Tony Bathgate
Sent: Wednesday, August 27, 2008 3:31 PM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] child node can't contact parent?

Hi All,

I apologize in advance for the length of this email; I'm new to the world
of MPI and I want to include everything that might be relevant.  I have
the Win32 IA32 binary of MPICH2 installed on two machines.  They are
running Windows XP Pro. x64 Edition with Service Pack 2 and they each have
an Intel Xeon processor.  To simplify things I took them off our network,
gave them their own router, and dropped their Windows firewalls.  I have
assigned the machines static IPs with the router (192.168.5.100 for
Computer1, and 192.168.5.200 for Computer2).  I've registered the local
Administrator accounts (which have identical passwords and credentials)
with mpiexec on each machine.  And everything below was attempted from the
Administrator account.
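
For reference, I registered the credentials on each machine with mpiexec's
-register option, entering the Administrator name and password when
prompted:

    mpiexec -register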

I've tried running the cpi.exe example but it just hangs:

    C:\Program Files (x86)\MPICH2\examples> mpiexec -hosts 2 Computer1 Computer2 .\cpi.exe
    Enter the number of intervals: (0 quits) 1

(Here I waited about 20 minutes, then pressed Ctrl+C)

    mpiexec aborting job
   
    job aborted:
    rank: node: exit code[: error message]
    0: Computer1: 123: mpiexec aborting job
    1: Computer2: 123

It runs perfectly fine if I execute it with the -localonly flag.
To explore this issue I wrote a simple program that uses MPI_Comm_spawn to
spawn a worker program.  The manager then sends the worker a message and
both programs exit.  The manager node runs the following code:
    #include <mpi.h>
    #include <stdio.h>

    int main (int argc, char* argv[])
    {
        int         someVariable = 10;

        MPI_Info    info;
        MPI_Comm    workercomm;
        MPI_Request request;
        MPI_Status  status;

        MPI_Init( &argc, &argv );

        fprintf( stdout, "In Master - someVariable = %i\n", someVariable );
        fflush( stdout );

        /* Place the single worker process on Computer2 */
        MPI_Info_create( &info );
        MPI_Info_set( info, "host", "Computer2" );
        MPI_Comm_spawn( "C:\\MPIworker\\Debug\\MPIworker.exe", MPI_ARGV_NULL,
            1, info, 0, MPI_COMM_SELF, &workercomm, MPI_ERRCODES_IGNORE );

        MPI_Info_free( &info );

        /* Send someVariable to the worker, then wait for completion */
        MPI_Isend( &someVariable, 1, MPI_INT, 0, 0, workercomm, &request );
        MPI_Waitall( 1, &request, &status );

        fprintf( stdout, "Done sending\n" );
        fflush( stdout );

        MPI_Finalize();
        return 0;
    }
The worker code follows here:
    #include <mpi.h>
    #include <stdio.h>

    int main (int argc, char* argv[])
    {
        int         someVariable = 0;
        MPI_Comm    parentcomm;
        MPI_Request request;
        MPI_Status  status;

        MPI_Init( &argc, &argv );

        fprintf( stdout, "In Worker: Before receive - someVariable = %i\n", someVariable );
        fflush( stdout );

        MPI_Comm_get_parent( &parentcomm );
        MPI_Irecv( &someVariable, 1, MPI_INT, 0, 0, parentcomm, &request );

        MPI_Wait( &request, &status );
        fprintf( stdout, "After receive - someVariable = %i\n", someVariable );
        fflush( stdout );

        MPI_Finalize();
        return 0;
    }

When I run this code I get the following results:
    C:\MPImanager\Debug\> mpiexec -n 1 MPImanager.exe
    In Master - someVariable = 10
    Fatal error in MPI_Init: Other MPI error, error stack:
    MPIR_Init_thread<294>............................:Initialization failed
    MPID_Init<242>...................................:Spawned process group was unable to connect back to parent on port <tag=0 port=8673 description=computer1.usask.ca ifname=192.168.5.100>
    MPID_Comm_connect<187>...........................:
    MPIDI_Comm_connect<369>..........................:
    MPIDI_Create_inter_root_communicator_connect<133>:
    MPIDI_CH3I_Connect_to_root_sock<289>.............:
    MPIDU_Sock_post_connect<1228>....................: unable to connect to computer1.usask.ca on port 8673, exhuasted all endpoints <errno -1>
    MPIDU_Sock_post_connect<1244>....................: gethostbyname failed, The requested name is valid, but no data of the requested type was found. <errno 11004>

    Job aborted:
    rank: node: exit code[: error message]
    0: computer2: 1: fatal error in MPI_Init: other MPI error, error stack:
    MPIR_Init_thread<294>............................:Initialization failed
    MPID_Init<242>...................................:Spawned process group was unable to connect back to parent on port <tag=0 port=8673 description=computer1.usask.ca ifname=192.168.5.100>
    MPID_Comm_connect<187>...........................:
    MPIDI_Comm_connect<369>..........................:
    MPIDI_Create_inter_root_communicator_connect<133>:
    MPIDI_CH3I_Connect_to_root_sock<289>.............:
    MPIDU_Sock_post_connect<1228>....................: unable to connect to computer1.usask.ca on port 8673, exhuasted all endpoints <errno -1>
    MPIDU_Sock_post_connect<1244>....................: gethostbyname failed, The requested name is valid, but no data of the requested type was found. <errno 11004>

(Here I waited several minutes before pressing Ctrl+C)

    mpiexec aborting job ...

(Here I waited several more minutes before pressing Ctrl+C and returning
to the command prompt)

So the manager is able to spawn a process on the worker machine, but
MPI_Init fails when the worker is unable to connect back to the manager
node.  The error stack shows that it has the correct IP address and tries
to use port 8673.  At first I thought the problem might be that it was
appending the domain name (usask.ca) from the machines' old network, but
the IP address is still correct, so now I'm not sure.
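
To check name resolution outside of MPI, a small Winsock test like the
following (a minimal sketch of my own, not MPICH2 code; it assumes the
hostname reported in the error stack) reproduces the lookup in question:

    #include <winsock2.h>
    #include <stdio.h>

    /* Link with ws2_32.lib.  Attempts the same lookup the MPICH2 error
       stack complains about. */
    int main( void )
    {
        WSADATA         wsa;
        struct hostent *he;

        WSAStartup( MAKEWORD(2,2), &wsa );
        he = gethostbyname( "computer1.usask.ca" );
        if ( he == NULL )
            printf( "gethostbyname failed, WSA error %d\n", WSAGetLastError() );
        else
            printf( "resolved to %s\n",
                    inet_ntoa( *(struct in_addr *)he->h_addr_list[0] ) );
        WSACleanup();
        return 0;
    }

If this also reports error 11004, the failure is in Windows name
resolution rather than in MPICH2 itself.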

If I swap the roles so Computer2 is the manager and Computer1 is the
worker, the results are the same.  But, just like cpi.exe, it performs
perfectly if I confine both the worker and the manager to the local host.
I assume this is an issue with either the way I've set up my network or
the way I've set up MPICH2 on the computers.  Does anyone know what would
cause an error like this?
