[mpich-discuss] child node can't contact parent?
Tony Bathgate
tony.bathgate at usask.ca
Wed Aug 27 15:31:01 CDT 2008
Hi All,
I apologize in advance for the length of this email; I'm new to the
world of MPI and I want to include everything that might be relevant. I
have the Win32 IA32 binary of MPICH2 installed on two machines. They
are running Windows XP Pro. x64 Edition with Service Pack 2 and they
each have an Intel Xeon processor. To simplify things I took them off
our network, gave them their own router, and disabled their Windows
firewalls. I assigned the machines static IPs through the router
(192.168.5.100 for Computer1 and 192.168.5.200 for Computer2), and I've
registered the local Administrator accounts (which have identical
passwords and credentials) with mpiexec on each machine. Everything
below was attempted from the Administrator account.
I've tried running the cpi.exe example but it just hangs:
C:\Program Files (x86)\MPICH2\examples> mpiexec -hosts 2 Computer1 Computer2 .\cpi.exe
Enter the number of intervals: (0 quits) 1
(Here I waited about 20 minutes, then pressed Ctrl+C)
mpiexec aborting job
job aborted:
rank: node: exit code[: error message]
0: Computer1: 123: mpiexec aborting job
1: Computer2: 123
It runs perfectly if I execute it with the -localonly flag.
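For what it's worth, I haven't yet tried launching a plain Windows
command across both hosts the same way; I assume something like the
following would show whether remote process launching works at all,
separately from the connections the MPI processes open to each other:

C:\> mpiexec -hosts 2 Computer1 Computer2 hostname

If that printed both machine names, the launcher itself would be fine
and the problem would be confined to the MPI sockets.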
To explore this issue I wrote a simple program that uses MPI_Comm_spawn
to spawn a worker program: the master sends the worker a single integer
and then both exit. The manager node runs the following code:
#include <mpi.h>
#include <stdio.h>

int main( int argc, char* argv[] )
{
    int someVariable = 10;
    MPI_Info info;
    MPI_Comm workercomm;
    MPI_Request request;
    MPI_Status status;

    MPI_Init( &argc, &argv );

    fprintf( stdout, "In Master - someVariable = %i\n", someVariable );
    fflush( stdout );

    /* Ask MPICH2 to place the single spawned worker on Computer2. */
    MPI_Info_create( &info );
    MPI_Info_set( info, "host", "Computer2" );
    MPI_Comm_spawn( "C:\\MPIworker\\Debug\\MPIworker.exe", MPI_ARGV_NULL,
                    1, info, 0, MPI_COMM_SELF, &workercomm,
                    MPI_ERRCODES_IGNORE );
    MPI_Info_free( &info );

    /* MPI_Waitall expects arrays, so the request and status are
       passed by address. */
    MPI_Isend( &someVariable, 1, MPI_INT, 0, 0, workercomm, &request );
    MPI_Waitall( 1, &request, &status );

    fprintf( stdout, "Done sending\n" );
    fflush( stdout );

    MPI_Finalize();
    return 0;
}
The worker runs the following code:
#include <mpi.h>
#include <stdio.h>

int main( int argc, char* argv[] )
{
    int someVariable = 0;
    MPI_Comm parentcomm;
    MPI_Request request;
    MPI_Status status;

    MPI_Init( &argc, &argv );

    fprintf( stdout, "In Worker: Before receive - someVariable = %i\n",
             someVariable );
    fflush( stdout );

    /* Get the intercommunicator that leads back to the master. */
    MPI_Comm_get_parent( &parentcomm );
    MPI_Irecv( &someVariable, 1, MPI_INT, 0, 0, parentcomm, &request );
    MPI_Wait( &request, &status );

    fprintf( stdout, "After receive - someVariable = %i\n", someVariable );
    fflush( stdout );

    MPI_Finalize();
    return 0;
}
When I run this code I get the following results:
C:\MPImanager\Debug> mpiexec -n 1 MPImanager.exe
In Master - someVariable = 10
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread<294>............................:Initialization failed
MPID_Init<242>...................................:Spawned process
group was unable to connect back to parent on port <tag=0 port=8673
description=computer1.usask.ca ifname=192.168.5.100>
MPID_Comm_connect<187>...........................:
MPIDI_Comm_connect<369>..........................:
MPIDI_Create_inter_root_communicator_connect<133>:
MPIDI_CH3I_Connect_to_root_sock<289>.............:
MPIDU_Sock_post_connect<1228>....................: unable to connect
to computer1.usask.ca on port 8673, exhuasted all endpoints <errno -1>
MPIDU_Sock_post_connect<1244>....................: gethostbyname
failed, The requested name is valid, but no data of the requested type
was found. <errno 11004>
Job aborted:
rank: node: exit code[: error message]
0: computer2: 1: fatal error in MPI_Init: other MPI error, error stack:
MPIR_Init_thread<294>............................:Initialization failed
MPID_Init<242>...................................:Spawned process
group was unable to connect back to parent on port <tag=0 port=8673
description=computer1.usask.ca ifname=192.168.5.100>
MPID_Comm_connect<187>...........................:
MPIDI_Comm_connect<369>..........................:
MPIDI_Create_inter_root_communicator_connect<133>:
MPIDI_CH3I_Connect_to_root_sock<289>.............:
MPIDU_Sock_post_connect<1228>....................: unable to connect
to computer1.usask.ca on port 8673, exhuasted all endpoints <errno -1>
MPIDU_Sock_post_connect<1244>....................: gethostbyname
failed, The requested name is valid, but no data of the requested type
was found. <errno 11004>
(Here I waited several minutes before pressing Ctrl+C)
mpiexec aborting job ...
(Here I waited several more minutes before pressing Ctrl+C and returning
to the command prompt)
So the master is able to spawn a process on the worker machine, but
MPI_Init in the worker fails when it cannot connect back to the manager
node. The error stack shows that it has the correct IP address and is
trying to use port 8673. At first I thought the problem might be that
the machines were appending the domain name (usask.ca) from their old
network, but since the IP address shown is still correct I'm no longer
sure.
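To see exactly what the resolver on Computer2 returns for that name, I
could run a small Winsock test there (my own diagnostic sketch, not part
of the MPI programs; the hard-coded hostname is taken from the error
stack, and it needs to be linked with ws2_32.lib):

#include <winsock2.h>
#include <stdio.h>

/* Resolver check: does this machine get an address back for the
   name MPICH2 embeds in the port description? */
int main( void )
{
    WSADATA wsaData;
    struct hostent* he;

    if ( WSAStartup( MAKEWORD( 2, 2 ), &wsaData ) != 0 ) {
        fprintf( stderr, "WSAStartup failed\n" );
        return 1;
    }

    he = gethostbyname( "computer1.usask.ca" );  /* name from the error stack */
    if ( he == NULL ) {
        /* 11004 (WSANO_DATA) here would match the MPICH2 error above. */
        fprintf( stderr, "gethostbyname failed, error %d\n",
                 WSAGetLastError() );
    } else {
        struct in_addr addr;
        addr.s_addr = *(u_long*)he->h_addr_list[0];
        printf( "%s resolves to %s\n", he->h_name, inet_ntoa( addr ) );
    }

    WSACleanup();
    return 0;
}

If this reported error 11004 as well, it would confirm that the failure
is in name resolution rather than in MPICH2 itself.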
If I swap the roles, so Computer2 is the manager and Computer1 is the
worker, the results are the same. And just like cpi.exe, if I confine
both the manager and the worker to the local host it runs perfectly. I
assume this is an issue with either the way I've set up the network or
the way I've set up MPICH2 on these machines. Does anyone know what
would cause an error like this?
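One thing I'm considering trying (just an assumption on my part, I
haven't tested it yet) is adding both names to each machine's hosts
file, C:\WINDOWS\system32\drivers\etc\hosts, so the lookup no longer
depends on the old network's DNS server:

192.168.5.100    computer1.usask.ca    Computer1
192.168.5.200    computer2.usask.ca    Computer2

If the gethostbyname failure really is just stale DNS, that should let
the spawned worker connect back to the manager on port 8673.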