<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7036.0">
<TITLE>RE: [mpich-discuss] child node can't contact parent?</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/plain format -->
<P><FONT SIZE=2> Hi,<BR>
Looks like something is wrong with the setup of your machines.<BR>
<BR>
# Can you ping from one machine to the other?<BR>
<BR>
- From Computer1 try pinging Computer2<BR>
- From Computer2 try pinging Computer1<BR>
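<BR>
Ping only exercises ICMP, while the error stack in the quoted message shows a socket connect failing, so a plain TCP check between the two machines may be a useful complement. The following is a minimal sketch, assuming POSIX sockets (a Windows build would need the Winsock headers and WSAStartup instead); can_connect is a hypothetical helper, not part of MPICH2.<BR>

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not an MPICH2 function).
   Returns 1 if a TCP connection to ip:port succeeds, 0 otherwise.
   The connect call is blocking, so an unreachable host may take
   as long as the operating system's connect timeout to report. */
int can_connect(const char *ip, unsigned short port)
{
    struct sockaddr_in addr;
    int fd, ok;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);

    /* Accept only dotted-quad IPv4 strings such as 192.168.5.100. */
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return 0;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return 0;

    ok = (connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0);
    close(fd);
    return ok;
}
```

Calling can_connect with the other machine's IP address and a port on which something is known to be listening there should return 1 in both directions; a 0 in either direction would point at the network setup rather than at MPI.<BR>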
<BR>
# Start debugging by running a non-MPI program (like hostname)<BR>
<BR>
mpiexec -hosts 2 IPAddress_Of_Computer1 IPAddress_Of_Computer2 hostname<BR>
<BR>
# Then debug with a simple hello world program (don't debug your setup with a complex program)<BR>
<BR>
----------------- hello world ---------------<BR>
#include <stdio.h><BR>
#include "mpi.h"<BR>
<BR>
int main(int argc, char *argv[]){<BR>
int rank=-1;<BR>
MPI_Init(&argc, &argv);<BR>
MPI_Comm_rank(MPI_COMM_WORLD, &rank);<BR>
printf("[%d] Hello world\n", rank);<BR>
MPI_Finalize();<BR>
return 0;<BR>
}<BR>
----------------- hello world ---------------<BR>
<BR>
Let us know the results.<BR>
<BR>
Regards,<BR>
Jayesh<BR>
<BR>
-----Original Message-----<BR>
From: owner-mpich-discuss@mcs.anl.gov [<A HREF="mailto:owner-mpich-discuss@mcs.anl.gov">mailto:owner-mpich-discuss@mcs.anl.gov</A>] On Behalf Of Tony Bathgate<BR>
Sent: Wednesday, August 27, 2008 3:31 PM<BR>
To: mpich-discuss@mcs.anl.gov<BR>
Subject: [mpich-discuss] child node can't contact parent?<BR>
<BR>
Hi All,<BR>
<BR>
I apologize in advance for the length of this email; I'm new to the world of MPI and I want to include everything that might be relevant. I have the Win32 IA32 binary of MPICH2 installed on two machines. Both run Windows XP Professional x64 Edition with Service Pack 2, and each has an Intel Xeon processor. To simplify things I took them off our network, gave them their own router, and turned off their Windows firewalls. I have assigned the machines static IP addresses through the router (192.168.5.100 for Computer1, and 192.168.5.200 for Computer2). I've registered the local Administrator accounts (which have identical passwords and credentials) with mpiexec on each machine. Everything below was attempted from the Administrator account.<BR>
<BR>
I've tried running the cpi.exe example but it just hangs:<BR>
<BR>
C:\Program Files (x86)\MPICH2\examples> mpiexec -hosts 2 Computer1 Computer2 .\cpi.exe<BR>
Enter the number of intervals: (0 quits) 1<BR>
<BR>
(here I waited about 20 minutes, then Ctrl+C)<BR>
<BR>
mpiexec aborting job<BR>
<BR>
job aborted:<BR>
rank: node: exit code[: error message]<BR>
0: Computer1: 123: mpiexec aborting job<BR>
1: Computer2: 123<BR>
<BR>
It runs perfectly fine if I execute it with the -localonly option.<BR>
To explore this issue I wrote a simple program that uses MPI_Comm_spawn to spawn a worker program. The manager then sends the worker a message and they both exit. The manager node runs the following code:<BR>
#include <mpi.h><BR>
#include <stdio.h><BR>
<BR>
int main (int argc, char* argv[])<BR>
{<BR>
int someVariable = 10;<BR>
<BR>
MPI_Info info;<BR>
MPI_Comm workercomm;<BR>
MPI_Request request;<BR>
MPI_Status status;<BR>
<BR>
MPI_Init( &argc, &argv );<BR>
<BR>
fprintf( stdout, "In Master - someVariable = %i \n", someVariable );<BR>
fflush( stdout );<BR>
<BR>
MPI_Info_create( &info );<BR>
MPI_Info_set( info, "host", "Computer2" );<BR>
MPI_Comm_spawn( "C:\\MPIworker\\Debug\\MPIworker.exe",<BR>
MPI_ARGV_NULL,<BR>
1, info, 0, MPI_COMM_SELF, &workercomm, MPI_ERRCODES_IGNORE );<BR>
<BR>
MPI_Info_free( &info );<BR>
<BR>
MPI_Isend( &someVariable, 1, MPI_INT, 0, 0, workercomm,<BR>
&(request) );<BR>
MPI_Waitall( 1, &request, &status );<BR>
<BR>
fprintf(stdout,"Done sending\n");<BR>
fflush(stdout);<BR>
<BR>
MPI_Finalize();<BR>
return 0;<BR>
}<BR>
The worker code follows here:<BR>
#include <mpi.h><BR>
#include <stdio.h><BR>
<BR>
int main (int argc, char* argv[])<BR>
{<BR>
int someVariable = 0;<BR>
MPI_Comm parentcomm;<BR>
MPI_Request request;<BR>
MPI_Status status;<BR>
<BR>
MPI_Init( &argc, &argv );<BR>
<BR>
fprintf(stdout, "In Worker: Before receive - someVariable = %i \n",someVariable);<BR>
fflush( stdout );<BR>
<BR>
MPI_Comm_get_parent( &parentcomm );<BR>
MPI_Irecv( &someVariable, 1, MPI_INT, 0, 0, parentcomm, &request );<BR>
<BR>
MPI_Wait( &request, &status );<BR>
fprintf( stdout, "After receive - someVariable = %i\n", someVariable );<BR>
fflush( stdout );<BR>
<BR>
MPI_Finalize();<BR>
return 0;<BR>
}<BR>
<BR>
When I run this code I get the following results:<BR>
C:\MPImanager\Debug\> mpiexec -n 1 MPImanager.exe<BR>
In Master - someVariable = 10<BR>
Fatal error in MPI_Init: Other MPI error, error stack:<BR>
MPIR_Init_thread<294>............................:Initialization failed<BR>
MPID_Init<242>...................................:Spawned process group was unable to connect back to parent on port <tag=0 port=8673 description=computer1.usask.ca ifname=192.168.5.100><BR>
MPID_Comm_connect<187>...........................:<BR>
MPIDI_Comm_connect<369>..........................:<BR>
MPIDI_Create_inter_root_communicator_connect<133>:<BR>
MPIDI_CH3I_Connect_to_root_sock<289>.............:<BR>
MPIDU_Sock_post_connect<1228>....................: unable to connect to computer1.usask.ca on port 8673, exhuasted all endpoints <errno -1><BR>
MPIDU_Sock_post_connect<1244>....................: gethostbyname failed, The requested name is valid, but no data of the requested type was found. <errno 11004><BR>
<BR>
Job aborted:<BR>
rank: node: exit code[: error message]<BR>
0: computer2: 1: fatal error in MPI_Init: other MPI error, error stack:<BR>
MPIR_Init_thread<294>............................:Initialization failed<BR>
MPID_Init<242>...................................:Spawned process group was unable to connect back to parent on port <tag=0 port=8673 description=computer1.usask.ca ifname=192.168.5.100><BR>
MPID_Comm_connect<187>...........................:<BR>
MPIDI_Comm_connect<369>..........................:<BR>
MPIDI_Create_inter_root_communicator_connect<133>:<BR>
MPIDI_CH3I_Connect_to_root_sock<289>.............:<BR>
MPIDU_Sock_post_connect<1228>....................: unable to connect to computer1.usask.ca on port 8673, exhuasted all endpoints <errno -1><BR>
MPIDU_Sock_post_connect<1244>....................: gethostbyname failed, The requested name is valid, but no data of the requested type was found. <errno 11004><BR>
<BR>
(Here I waited several minutes before pressing Ctrl+C)<BR>
<BR>
mpiexec aborting job ...<BR>
<BR>
(Here I waited several more minutes before pressing Ctrl+C and returning to the command prompt)<BR>
<BR>
So the program is able to spawn a process on the worker machine, but MPI_Init then fails there because the worker cannot connect back to the manager node. The error stack shows the correct IP address (192.168.5.100) and port 8673. At first I thought the problem might be that it was appending the domain name (usask.ca) from the machines' old network, but the IP address is still correct, so now I'm not sure.<BR>
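<BR>
The gethostbyname failure in the error stack suggests checking whether computer1.usask.ca resolves at all from Computer2. A minimal sketch of such a check follows, assuming POSIX sockets (on Windows it would need the Winsock headers and WSAStartup instead). resolve_ipv4 is a hypothetical helper, not part of MPICH2, and it uses getaddrinfo rather than the older gethostbyname named in the error stack.<BR>

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not an MPICH2 function).
   Returns 0 if host resolves to at least one IPv4 address,
   otherwise the non-zero error code reported by getaddrinfo. */
int resolve_ipv4(const char *host)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int rc;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;        /* IPv4 only, matching 192.168.5.x */
    hints.ai_socktype = SOCK_STREAM;  /* TCP, one result per address */

    rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc == 0)
        freeaddrinfo(res);            /* only the success/failure is needed */
    return rc;
}
```

Running resolve_ipv4 on Computer2 with both "Computer1" and "computer1.usask.ca", and passing any non-zero result to gai_strerror, would show whether only the fully qualified name fails to resolve.<BR>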
<BR>
If I change the code so that Computer2 is the manager and Computer1 is the worker, the results are the same. But just as with cpi.exe, if I confine both the worker and the manager to the local host, it runs perfectly. I assume this is an issue with either the way I've set up my network or the way I've set up MPICH2 on the computers. Does anyone know what would cause an error like this?<BR>
<BR>
</FONT>
</P>
</BODY>
</HTML>