[mpich-discuss] child node can't contact parent?

Tony Bathgate tony.bathgate at usask.ca
Wed Aug 27 18:27:49 CDT 2008


Hi,

Thanks for the reply.
I had actually already pinged Computer1 from Computer2 and vice versa.  
The ping works fine.  I tried using hostname with mpiexec like you 
suggested and it works fine too, from both computers.  Now I'm baffled.  
I also tried your Hello World program and it crashed.  Here are the 
error messages I got:

C:\helloworld\Debug\> mpiexec -hosts 2 192.168.5.100 192.168.5.200 helloworld.exe
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize<255>............: MPI_Finalize failed
MPI_Finalize<154>............:
MPID_Finalize<94>............:
MPI_Barrier<406>.............: MPI_Barrier <comm=0x44000002>
MPIR_Barrier<77>.............:
MPIC_Sendrecv<120>...........:
MPID_Isend<103>..............: failure occured while attempting to send 
an eager message
MPIDI_CH3_iSend<172>.........:
MPIDI_CH3I_Sock_connect<1191>: unable to connect to rank 0 using 
business card <port=8673 description=computer1.usask.ca 
ifname=192.168.5.100>
MPIDU_Sock_post_connect<1244>: gethostbyname failed, The requested name 
is valid, but no data of the requested type was found. <errno 11004>
job aborted:
rank: node: exit code[: error message]
0: 192.168.5.100: 1
1: 192.168.5.200: 1: Fatal error in MPI_Finalize: Other MPI error, error 
stack:
MPI_Finalize<255>............: MPI_Finalize failed
MPI_Finalize<154>............:
MPID_Finalize<94>............:
MPI_Barrier<406>.............: MPI_Barrier <comm=0x44000002>
MPIR_Barrier<77>.............:
MPIC_Sendrecv<120>...........:
MPID_Isend<103>..............: failure occured while attempting to send 
an eager message
MPIDI_CH3_iSend<172>.........:
MPIDI_CH3I_Sock_connect<1191>: unable to connect to rank 1 using 
business card <port=8673 description=computer1.usask.ca 
ifname=192.168.5.100>
MPIDU_Sock_post_connect<1244>: gethostbyname failed, The requested name 
is valid, but no data of the requested type was found. <errno 11004>

So it seems to me that mpiexec can execute programs remotely, but not 
when the program relies on the MPICH2 C libraries.  Does that make 
sense, and how could it be remedied?
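
In case it helps narrow things down, here is a small stand-alone lookup 
test I can run on each machine (just a sketch on my part, assuming 
Windows/Winsock and linking against ws2_32.lib).  It repeats the 
gethostbyname call that the error stack complains about, outside of MPI, 
so a failure here would reproduce errno 11004 directly:

    /* Stand-alone name-resolution check (assumes Windows/Winsock and
       linking with ws2_32.lib).  Repeats the gethostbyname lookup that
       MPICH2 performs on the business-card hostname; errno 11004
       (WSANO_DATA) means the name resolved to no address. */
    #include <stdio.h>
    #include <winsock2.h>

    int main( int argc, char* argv[] )
    {
        WSADATA wsa;
        struct hostent* he;
        const char* name = ( argc > 1 ) ? argv[1] : "computer1.usask.ca";

        WSAStartup( MAKEWORD( 2, 2 ), &wsa );
        he = gethostbyname( name );
        if ( he == NULL ) {
            printf( "gethostbyname(\"%s\") failed, WSAGetLastError = %d\n",
                    name, WSAGetLastError() );
        } else {
            printf( "%s resolves to %s\n", name,
                    inet_ntoa( *(struct in_addr*)he->h_addr_list[0] ) );
        }
        WSACleanup();
        return 0;
    }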

Thanks again,
Tony

Jayesh Krishna wrote:
>
>  Hi,
>   Looks like something is wrong with the setup of your machines.
>
> # Can you ping from one machine to the other ?
>
>   - From Computer1 try pinging Computer2
>   - From Computer2 try pinging Computer1
>
> # Start debugging by running a non-MPI program (like hostname)
>
>    mpiexec -hosts 2 IPAddress_Of_Computer1 IPAddress_Of_Computer2 hostname
>
> # Then debug with a simple hello world program (don't debug your setup 
> with a complex program)
>
> ----------------- hello world ---------------
> #include <stdio.h>
> #include "mpi.h"
>
> int main(int argc, char *argv[]){
>         int rank=-1;
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         printf("[%d] Hello world\n", rank);
>         MPI_Finalize();
>         return 0;
> }
> ----------------- hello world ---------------
>
>  Let us know the results.
>
> Regards,
> Jayesh
>
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Tony Bathgate
> Sent: Wednesday, August 27, 2008 3:31 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] child node can't contact parent?
>
> Hi All,
>
> I apologize in advance for the length of this email; I'm new to the 
> world of MPI and I want to include everything that might be relevant.  
> I have the Win32 IA32 binary of MPICH2 installed on two machines.  
> They are running Windows XP Pro. x64 Edition with Service Pack 2 and 
> they each have an Intel Xeon processor.  To simplify things I took 
> them off our network, gave them their own router, and dropped their 
> Windows firewalls.  I have assigned the machines static IP's with the 
> router (192.168.5.100 for Computer1, and 192.168.5.200 for 
> Computer2).  I've registered the local Administrator accounts (which 
> have identical passwords and credentials) with mpiexec on each 
> machine.  And everything below was attempted from the Administrator 
> account.
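>
> (For reference, the registration on each machine was done roughly like 
> this, if I'm recalling the options correctly:
>
>     mpiexec -register
>     mpiexec -validate
>
> where -register prompts for the account name and password to store, and 
> -validate checks that the stored credentials are accepted.)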
>
> I've tried running the cpi.exe example but it just hangs:
>
>     C:\Program Files (x86)\MPICH2\examples> mpiexec -hosts 2 Computer1 Computer2 .\cpi.exe
>     Enter the number of intervals: (0 quits) 1
>
> (here I waited about 20 minutes, then Ctrl+C)
>
>     mpiexec aborting job
>   
>     job aborted:
>     rank: node: exit code[: error message]
>     0: Computer1: 123: mpiexec aborting job
>     1: Computer2: 123
>
> It runs perfectly fine if I execute it with the -localonly option.  To 
> explore this issue I wrote a simple program that uses MPI_Comm_spawn to 
> spawn a worker program.  The master then sends the worker a message and 
> they both exit.  The manager node runs the code that follows here:
>     #include <mpi.h>
>     #include <stdio.h>
>
>     int main( int argc, char* argv[] )
>     {
>         int         someVariable = 10;
>
>         MPI_Info    info;
>         MPI_Comm    workercomm;
>         MPI_Request request;
>         MPI_Status  status;
>
>         MPI_Init( &argc, &argv );
>
>         fprintf( stdout, "In Master - someVariable = %i\n", someVariable );
>         fflush( stdout );
>
>         MPI_Info_create( &info );
>         MPI_Info_set( info, "host", "Computer2" );
>         MPI_Comm_spawn( "C:\\MPIworker\\Debug\\MPIworker.exe", MPI_ARGV_NULL,
>             1, info, 0, MPI_COMM_SELF, &workercomm, MPI_ERRCODES_IGNORE );
>
>         MPI_Info_free( &info );
>
>         MPI_Isend( &someVariable, 1, MPI_INT, 0, 0, workercomm, &request );
>         MPI_Waitall( 1, &request, &status );
>
>         fprintf( stdout, "Done sending\n" );
>         fflush( stdout );
>
>         MPI_Finalize();
>         return 0;
>     }
> The worker code follows here:
>     #include <mpi.h>
>     #include <stdio.h>
>
>     int main( int argc, char* argv[] )
>     {
>         int         someVariable = 0;
>
>         MPI_Comm    parentcomm;
>         MPI_Request request;
>         MPI_Status  status;
>
>         MPI_Init( &argc, &argv );
>
>         fprintf( stdout, "In Worker: Before receive - someVariable = %i\n",
>             someVariable );
>         fflush( stdout );
>
>         MPI_Comm_get_parent( &parentcomm );
>         MPI_Irecv( &someVariable, 1, MPI_INT, 0, 0, parentcomm, &request );
>         MPI_Wait( &request, &status );
>
>         fprintf( stdout, "After receive - someVariable = %i\n", someVariable );
>         fflush( stdout );
>
>         MPI_Finalize();
>         return 0;
>     }
>
> When I run this code I get the following results:
>     C:\MPImanager\Debug\> mpiexec -n 1 MPImanager.exe
>     In Master - someVariable = 10
>     Fatal error in MPI_Init: Other MPI error, error stack:
>     MPIR_Init_thread<294>............................:Initialization 
> failed
>     MPID_Init<242>...................................:Spawned process 
> group was unable to connect back to parent on port <tag=0 port=8673 
> description=computer1.usask.ca ifname=192.168.5.100>
>     MPID_Comm_connect<187>...........................:
>     MPIDI_Comm_connect<369>..........................:
>     MPIDI_Create_inter_root_communicator_connect<133>:
>     MPIDI_CH3I_Connect_to_root_sock<289>.............:
>     MPIDU_Sock_post_connect<1228>....................: unable to 
> connect to computer1.usask.ca on port 8673, exhuasted all endpoints 
> <errno -1>
>     MPIDU_Sock_post_connect<1244>....................: gethostbyname 
> failed, The requested name is valid, but no data of the requested type 
> was found. <errno 11004>
>
>     Job aborted:
>     rank: node: exit code[: error message]
>     0: computer2: 1: fatal error in MPI_Init: other MPI error, error 
> stack:
>     MPIR_Init_thread<294>............................:Initialization 
> failed
>     MPID_Init<242>...................................:Spawned process 
> group was unable to connect back to parent on port <tag=0 port=8673 
> description=computer1.usask.ca ifname=192.168.5.100>
>     MPID_Comm_connect<187>...........................:
>     MPIDI_Comm_connect<369>..........................:
>     MPIDI_Create_inter_root_communicator_connect<133>:
>     MPIDI_CH3I_Connect_to_root_sock<289>.............:
>     MPIDU_Sock_post_connect<1228>....................: unable to 
> connect to computer1.usask.ca on port 8673, exhuasted all endpoints 
> <errno -1>
>     MPIDU_Sock_post_connect<1244>....................: gethostbyname 
> failed, The requested name is valid, but no data of the requested type 
> was found. <errno 11004>
>
> (Here I waited several minutes before pressing ctrl+c)
>
>     mpiexec aborting job ...
>
> (Here I waited several more minutes before pressing ctrl+c and 
> returning to the command prompt)
>
> So the program is able to spawn a process on the worker, but MPI_Init 
> then fails because the worker cannot contact the manager node.  The 
> error stack shows that it has the correct IP address and tries to use 
> port 8673.  At first I thought the problem might be that it was still 
> appending the domain name (usask.ca) from the machines' old network, 
> but the IP address is still correct, so now I'm not sure.
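>
> (One workaround I may try, on the assumption that it is the lookup of 
> computer1.usask.ca that fails now that the machines are off the old 
> network, is to add entries to each machine's 
> C:\Windows\System32\drivers\etc\hosts file so the archived names still 
> resolve, for example:
>
>     192.168.5.100    computer1.usask.ca    computer1
>     192.168.5.200    computer2.usask.ca    computer2
>
> I have not confirmed yet whether that clears the errno 11004.)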
>
> If I change the code so that Computer2 is the manager and Computer1 is 
> the worker, the results are the same.  But, just like cpi.exe, it runs 
> perfectly if I confine both the worker and the manager to the local 
> host.  I assume this is an issue with either the way I've set up my 
> network or the way I've set up MPICH2 on the computers.  Does anyone 
> know what would cause an error like this?
>