[mpich-discuss] child node can't contact parent?

Tony Bathgate tony.bathgate at usask.ca
Wed Aug 27 15:31:01 CDT 2008


Hi All,

I apologize in advance for the length of this email; I'm new to the 
world of MPI and I want to include everything that might be relevant.  I 
have the Win32 IA32 binary of MPICH2 installed on two machines.  They 
are running Windows XP Pro. x64 Edition with Service Pack 2 and they 
each have an Intel Xeon processor.  To simplify things I took them off 
our network, gave them their own router, and dropped their Windows 
firewalls.  I have assigned the machines static IPs with the router 
(192.168.5.100 for Computer1, and 192.168.5.200 for Computer2).  I've 
registered the local Administrator accounts (which have identical 
passwords and credentials) with mpiexec on each machine.  Everything 
below was attempted from the Administrator account.

I've tried running the cpi.exe example but it just hangs:

    C:\Program Files (x86)\MPICH2\examples> mpiexec -hosts 2 Computer1 Computer2 .\cpi.exe
    Enter the number of intervals: (0 quits) 1

(here I waited about 20 minutes, then Ctrl+C)

    mpiexec aborting job
   
    job aborted:
    rank: node: exit code[: error message]
    0: Computer1: 123: mpiexec aborting job
    1: Computer2: 123

It runs perfectly fine if I execute it with the -localonly option.  
To explore this issue I wrote a simple program that uses MPI_Comm_spawn 
to spawn a worker program.  The master then sends the worker a message 
and they both exit.  The manager node runs the code that follows here:
    #include <mpi.h>
    #include <stdio.h>

    int main (int argc, char* argv[])
    {
        int                     someVariable = 10;

        MPI_Info         info;
        MPI_Comm     workercomm;
        MPI_Request   request;
        MPI_Status      status;
   
        MPI_Init( &argc, &argv );
    
        fprintf( stdout, "In Master - someVariable = %i  \n", someVariable );
        fflush( stdout );

        MPI_Info_create( &info );
        MPI_Info_set( info, "host", "Computer2" );
        MPI_Comm_spawn( "C:\\MPIworker\\Debug\\MPIworker.exe", MPI_ARGV_NULL,
            1, info, 0, MPI_COMM_SELF, &workercomm, MPI_ERRCODES_IGNORE );
   
        MPI_Info_free( &info );
 
        MPI_Isend( &someVariable, 1, MPI_INT, 0, 0, workercomm, &request );
        /* MPI_Waitall expects arrays, so pass the addresses of the single
           request and status. */
        MPI_Waitall( 1, &request, &status );

        fprintf(stdout,"Done sending\n");
        fflush(stdout);

        MPI_Finalize();
        return 0;
    }
The worker code follows here:
    #include <mpi.h>
    #include <stdio.h>

    int main (int argc, char* argv[])
    {
        int                      someVariable = 0;
        MPI_Comm      parentcomm;
        MPI_Request    request;
        MPI_Status       status;
   
        MPI_Init( &argc, &argv );
   
        fprintf( stdout, "In Worker: Before receive - someVariable = %i  \n",
            someVariable );
        fflush( stdout );

        MPI_Comm_get_parent( &parentcomm );
        MPI_Irecv( &someVariable, 1, MPI_INT, 0, 0, parentcomm, &request );

        MPI_Wait( &request, &status );
        fprintf( stdout, "After receive - someVariable = %i\n", someVariable );
        fflush( stdout );

        MPI_Finalize();
        return 0;
    }

When I run this code I get the following results:
    C:\MPImanager\Debug\> mpiexec -n 1 MPImanager.exe
    In Master - someVariable = 10
    Fatal error in MPI_Init: Other MPI error, error stack:
    MPIR_Init_thread<294>............................:Initialization failed
    MPID_Init<242>...................................:Spawned process group was unable to connect back to parent on port <tag=0 port=8673 description=computer1.usask.ca ifname=192.168.5.100>
    MPID_Comm_connect<187>...........................:
    MPIDI_Comm_connect<369>..........................:
    MPIDI_Create_inter_root_communicator_connect<133>:
    MPIDI_CH3I_Connect_to_root_sock<289>.............:
    MPIDU_Sock_post_connect<1228>....................: unable to connect to computer1.usask.ca on port 8673, exhuasted all endpoints <errno -1>
    MPIDU_Sock_post_connect<1244>....................: gethostbyname failed, The requested name is valid, but no data of the requested type was found. <errno 11004>

    Job aborted:
    rank: node: exit code[: error message]
    0: computer2: 1: fatal error in MPI_Init: other MPI error, error stack:
    MPIR_Init_thread<294>............................:Initialization failed
    MPID_Init<242>...................................:Spawned process group was unable to connect back to parent on port <tag=0 port=8673 description=computer1.usask.ca ifname=192.168.5.100>
    MPID_Comm_connect<187>...........................:
    MPIDI_Comm_connect<369>..........................:
    MPIDI_Create_inter_root_communicator_connect<133>:
    MPIDI_CH3I_Connect_to_root_sock<289>.............:
    MPIDU_Sock_post_connect<1228>....................: unable to connect to computer1.usask.ca on port 8673, exhuasted all endpoints <errno -1>
    MPIDU_Sock_post_connect<1244>....................: gethostbyname failed, The requested name is valid, but no data of the requested type was found. <errno 11004>

(Here I waited several minutes before pressing ctrl+c)

    mpiexec aborting job ...

(Here I waited several more minutes before pressing ctrl+c and returning 
to the command prompt)

So the program is able to spawn a process on the worker machine, but 
MPI_Init then fails there because the worker cannot connect back to the 
manager node.  The error stack shows that it has the correct IP address 
(192.168.5.100) and tries to use port 8673.  At first I thought the 
problem might be that it was appending the domain name (usask.ca) from 
the machines' old network, but the IP address is still correct, so now 
I'm not sure.
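
Since the last entry in the error stack is a failed gethostbyname call 
for computer1.usask.ca, one way to narrow this down might be to test 
name resolution on its own, outside of MPI.  The following is only a 
minimal sketch under that assumption, not part of the programs above: 
can_resolve and report_host are hypothetical helper names, and the code 
uses the standard getaddrinfo call rather than whatever MPICH2 does 
internally.

```c
/* Sketch only: check whether a host name resolves to an IPv4 address.
   can_resolve() and report_host() are hypothetical helpers, not MPICH2 API. */
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200112L   /* expose getaddrinfo under strict C modes */
#endif

#include <stdio.h>
#include <string.h>

#ifdef _WIN32
#include <winsock2.h>
#include <ws2tcpip.h>             /* link with ws2_32.lib */
#else
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#endif

/* Returns 1 if name resolves to at least one IPv4 address, 0 otherwise. */
int can_resolve( const char *name )
{
    struct addrinfo  hints;
    struct addrinfo *result = NULL;
    int              rc;

#ifdef _WIN32
    WSADATA wsa;
    if ( WSAStartup( MAKEWORD( 2, 2 ), &wsa ) != 0 )
        return 0;
#endif

    memset( &hints, 0, sizeof hints );
    hints.ai_family   = AF_INET;      /* IPv4 only, like the static addresses */
    hints.ai_socktype = SOCK_STREAM;

    rc = getaddrinfo( name, NULL, &hints, &result );
    if ( rc == 0 )
        freeaddrinfo( result );

#ifdef _WIN32
    WSACleanup();
#endif
    return rc == 0;
}

/* Prints one line saying whether name resolved. */
void report_host( const char *name )
{
    printf( "%s: %s\n", name,
            can_resolve( name ) ? "resolves" : "does NOT resolve" );
    fflush( stdout );
}
```

Calling report_host( "computer1.usask.ca" ), report_host( "Computer1" ) 
and report_host( "192.168.5.100" ) from a small main() on Computer2 
would show which of the three forms can be resolved there.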

If I change the code so that Computer2 is the manager and Computer1 is 
the worker, the results are the same.  But, just as with cpi.exe, if I 
confine both the worker and the manager to the local host, it runs 
perfectly.  I assume this is an issue with either the way I've set up my 
network or the way I've set up MPICH2 on the computers.  Does anyone 
know what would cause an error like this?
