<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7036.0">
<TITLE>RE: [mpich-discuss] child node can't contact parent?</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/plain format -->
<P><FONT SIZE=2>Hi,<BR>
The description of error code 11004 (WSANO_DATA) in the MS docs is:<BR>
<BR>
================================================================<BR>
The requested name is valid and was found in the database, but it does not have the correct associated data being resolved for. The usual example for this is a host name-to-address translation attempt (using gethostbyname or WSAAsyncGetHostByName) which uses the DNS (Domain Name Server). An MX record is returned but no A record—indicating the host itself exists, but is not directly reachable.<BR>
================================================================<BR>
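<BR>
For reference, 11004 is the last of the four Winsock resolver error codes. Here is a minimal lookup sketch, assuming the values from winsock2.h; the function name resolver_error_name is made up for illustration and is not part of Winsock or MPICH2:<BR>

```c
#include <stddef.h>

/* Map a Winsock resolver error number to its macro name.
 * The numeric values and macro names are those in winsock2.h;
 * resolver_error_name itself is a hypothetical helper. */
const char *resolver_error_name(int code)
{
    switch (code) {
    case 11001: return "WSAHOST_NOT_FOUND"; /* no such host is known */
    case 11002: return "WSATRY_AGAIN";      /* temporary failure, retry later */
    case 11003: return "WSANO_RECOVERY";    /* non-recoverable lookup failure */
    case 11004: return "WSANO_DATA";        /* name is valid, no record of the requested type */
    default:    return NULL;                /* not one of the resolver errors */
    }
}
```

Calling resolver_error_name(11004) gives "WSANO_DATA", which is the case described above: the name was found, but without a record of the requested type.<BR>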
<BR>
It looks like the DNS server your machines use has no address record for the hosts in your setup. The name in the business card, computer1.usask.ca, is presumably the one that fails to resolve.<BR>
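<BR>
One way to confirm this outside MPICH2 is to resolve the name from the business card (computer1.usask.ca) directly on each machine. Below is a rough sketch, not MPICH2 code: it uses getaddrinfo rather than the gethostbyname call shown in your error stack, and can_resolve_ipv4 is a hypothetical helper name. On Windows I am assuming you link against ws2_32.lib.<BR>

```c
/* Expose getaddrinfo on strict POSIX builds (not needed on Windows). */
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200112L
#endif

#include <stddef.h>
#include <string.h>
#ifdef _WIN32
#include <winsock2.h>
#include <ws2tcpip.h>
#else
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#endif

/* Returns 0 if `host` resolves to at least one IPv4 address,
 * otherwise a nonzero error code (the getaddrinfo result, or -1
 * if Winsock could not be started).
 * can_resolve_ipv4 is a hypothetical helper, not an MPICH2 call. */
int can_resolve_ipv4(const char *host)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int rc;

#ifdef _WIN32
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        return -1;
#endif

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;        /* ask for IPv4 (A record) results only */
    hints.ai_socktype = SOCK_STREAM;  /* TCP, as used by the sock channel */

    rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc == 0)
        freeaddrinfo(res);            /* only the success/failure matters here */

#ifdef _WIN32
    WSACleanup();
#endif
    return rc;
}
```

can_resolve_ipv4("192.168.5.100") should return 0, since a numeric address needs no DNS lookup. If can_resolve_ipv4("computer1.usask.ca") returns nonzero on Computer2, the problem is in name resolution and not in MPICH2 itself.<BR>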
<BR>
Regards,<BR>
Jayesh<BR>
<BR>
-----Original Message-----<BR>
From: owner-mpich-discuss@mcs.anl.gov [<A HREF="mailto:owner-mpich-discuss@mcs.anl.gov">mailto:owner-mpich-discuss@mcs.anl.gov</A>] On Behalf Of Tony Bathgate<BR>
Sent: Wednesday, August 27, 2008 6:28 PM<BR>
To: Jayesh Krishna<BR>
Cc: mpich-discuss@mcs.anl.gov<BR>
Subject: Re: [mpich-discuss] child node can't contact parent?<BR>
<BR>
Hi,<BR>
<BR>
Thanks for the reply.<BR>
I had actually already pinged Computer1 from Computer2 and vice versa. <BR>
The ping works fine. I tried using hostname with mpiexec like you suggested and it works fine too, from both computers. Now I'm baffled. <BR>
I also tried your Hello World program and it crashed. Here are the error messages I got:<BR>
<BR>
C:\helloworld\Debug\> mpiexec -hosts 2 192.168.5.100 192.168.5.200 helloworld.exe<BR>
Fatal error in MPI_Finalize: Other MPI error, error stack:<BR>
MPI_Finalize<255>............: MPI_Finalize failed<BR>
MPI_Finalize<154>............:<BR>
MPID_Finalize<94>............:<BR>
MPI_Barrier<406>.............: MPI_Barrier <comm=0x44000002><BR>
MPIR_Barrier<77>.............:<BR>
MPIC_Sendrecv<120>...........:<BR>
MPID_Isend<103>..............: failure occurred while attempting to send an eager message<BR>
MPIDI_CH3_iSend<172>.........:<BR>
MPIDI_CH3I_Sock_connect<1191>: unable to connect to rank 0 using business card <port=8673 description=computer1.usask.ca ifname=192.168.5.100><BR>
MPIDU_Sock_post_connect<1244>: gethostbyname failed, The requested name is valid, but no data of the requested type was found. <errno 11004><BR>
job aborted:<BR>
rank: node: exit code[: error message]<BR>
0: 192.168.5.100: 1<BR>
1: 192.168.5.200: 1: Fatal error in MPI_Finalize: Other MPI error, error stack:<BR>
MPI_Finalize<255>............: MPI_Finalize failed<BR>
MPI_Finalize<154>............:<BR>
MPID_Finalize<94>............:<BR>
MPI_Barrier<406>.............: MPI_Barrier <comm=0x44000002><BR>
MPIR_Barrier<77>.............:<BR>
MPIC_Sendrecv<120>...........:<BR>
MPID_Isend<103>..............: failure occurred while attempting to send an eager message<BR>
MPIDI_CH3_iSend<172>.........:<BR>
MPIDI_CH3I_Sock_connect<1191>: unable to connect to rank 1 using business card <port=8673 description=computer1.usask.ca ifname=192.168.5.100><BR>
MPIDU_Sock_post_connect<1244>: gethostbyname failed, The requested name is valid, but no data of the requested type was found. <errno 11004><BR>
<BR>
So it seems that it can execute programs remotely, but not when the program relies on the MPICH2 C libraries. Does that make sense, and how could it be remedied?<BR>
<BR>
Thanks again,<BR>
Tony<BR>
<BR>
Jayesh Krishna wrote:<BR>
><BR>
> Hi,<BR>
> Looks like something is wrong with the setup of your machines.<BR>
><BR>
> # Can you ping from one machine to the other?<BR>
><BR>
> - From Computer1 try pinging Computer2<BR>
> - From Computer2 try pinging Computer1<BR>
><BR>
> # Start debugging by running a non-MPI program (like hostname)<BR>
><BR>
> mpiexec -hosts 2 IPAddress_Of_Computer1 IPAddress_Of_Computer2 hostname<BR>
><BR>
> # Then debug with a simple hello world program (don't debug your setup<BR>
> with a complex program)<BR>
><BR>
> ----------------- hello world ---------------<BR>
> #include <stdio.h><BR>
> #include "mpi.h"<BR>
><BR>
> int main(int argc, char *argv[]){<BR>
> int rank=-1;<BR>
> MPI_Init(&argc, &argv);<BR>
> /* each process asks for its own rank in MPI_COMM_WORLD */<BR>
> MPI_Comm_rank(MPI_COMM_WORLD, &rank);<BR>
> printf("[%d] Hello world\n", rank);<BR>
> MPI_Finalize();<BR>
> return 0;<BR>
> }<BR>
> ----------------- hello world ---------------<BR>
><BR>
> Let us know the results.<BR>
><BR>
> Regards,<BR>
> Jayesh<BR>
><BR>
> -----Original Message-----<BR>
> From: owner-mpich-discuss@mcs.anl.gov<BR>
> [<A HREF="mailto:owner-mpich-discuss@mcs.anl.gov">mailto:owner-mpich-discuss@mcs.anl.gov</A>] On Behalf Of Tony Bathgate<BR>
> Sent: Wednesday, August 27, 2008 3:31 PM<BR>
> To: mpich-discuss@mcs.anl.gov<BR>
> Subject: [mpich-discuss] child node can't contact parent?<BR>
><BR>
> Hi All,<BR>
><BR>
> I apologize in advance for the length of this email; I'm new to the<BR>
> world of MPI and I want to include everything that might be relevant.<BR>
> I have the Win32 IA32 binary of MPICH2 installed on two machines. <BR>
> They are running Windows XP Pro. x64 Edition with Service Pack 2 and<BR>
> they each have an Intel Xeon processor. To simplify things I took<BR>
> them off our network, gave them their own router, and dropped their<BR>
> Windows firewalls. I have assigned the machines static IPs with the<BR>
> router (192.168.5.100 for Computer1, and 192.168.5.200 for Computer2). <BR>
> I've registered the local Administrator accounts (which have identical<BR>
> passwords and credentials) with mpiexec on each machine. And<BR>
> everything below was attempted from the Administrator account.<BR>
><BR>
> I've tried running the cpi.exe example but it just hangs:<BR>
><BR>
> C:\Program Files (x86)\MPICH2\examples> mpiexec -hosts 2 Computer1 Computer2 .\cpi.exe<BR>
> Enter the number of intervals: (0 quits) 1<BR>
><BR>
> (here I waited about 20 minutes, then Ctrl+C)<BR>
><BR>
> mpiexec aborting job<BR>
> <BR>
> job aborted:<BR>
> rank: node: exit code[: error message]<BR>
> 0: Computer1: 123: mpiexec aborting job<BR>
> 1: Computer2: 123<BR>
><BR>
> It runs perfectly fine if I execute it with the -localonly option.<BR>
> To explore this issue I wrote a simple program that uses<BR>
> MPI_Comm_spawn to spawn a worker program. The master then sends the<BR>
> worker a message and they both exit. The manager node runs the code<BR>
> that follows here:<BR>
> #include <mpi.h><BR>
> #include <stdio.h><BR>
><BR>
> int main (int argc, char* argv[])<BR>
> {<BR>
> int someVariable = 10;<BR>
><BR>
> MPI_Info info;<BR>
> MPI_Comm workercomm;<BR>
> MPI_Request request;<BR>
> MPI_Status status;<BR>
> <BR>
> MPI_Init( &argc, &argv );<BR>
> <BR>
> fprintf( stdout, "In Master - someVariable = %i \n",<BR>
> someVariable );<BR>
> fflush( stdout );<BR>
><BR>
> MPI_Info_create( &info );<BR>
> MPI_Info_set( info, "host", "Computer2" );<BR>
> MPI_Comm_spawn( "C:\\MPIworker\\Debug\\MPIworker.exe",<BR>
> MPI_ARGV_NULL,<BR>
> 1, info, 0, MPI_COMM_SELF, &workercomm,<BR>
> MPI_ERRCODES_IGNORE );<BR>
> <BR>
> MPI_Info_free( &info );<BR>
><BR>
> MPI_Isend( &someVariable, 1, MPI_INT, 0, 0, workercomm,<BR>
> &(request) );<BR>
> MPI_Waitall( 1, &request, &status );<BR>
><BR>
> fprintf(stdout,"Done sending\n");<BR>
> fflush(stdout);<BR>
><BR>
> MPI_Finalize();<BR>
> return 0;<BR>
> }<BR>
> The worker code follows here:<BR>
> #include <mpi.h><BR>
> #include <stdio.h><BR>
><BR>
> int main (int argc, char* argv[])<BR>
> {<BR>
> int someVariable = 0;<BR>
> MPI_Comm parentcomm;<BR>
> MPI_Request request;<BR>
> MPI_Status status;<BR>
> <BR>
> MPI_Init( &argc, &argv );<BR>
> <BR>
> fprintf(stdout, "In Worker: Before receive - someVariable = %i \n", someVariable);<BR>
> fflush( stdout );<BR>
><BR>
> MPI_Comm_get_parent( &parentcomm );<BR>
> MPI_Irecv( &someVariable, 1, MPI_INT, 0, 0, parentcomm,<BR>
> &request );<BR>
><BR>
> MPI_Wait( &request, &status );<BR>
> fprintf( stdout, "After receive - someVariable = %i\n",<BR>
> someVariable );<BR>
> fflush( stdout );<BR>
><BR>
> MPI_Finalize();<BR>
> return 0;<BR>
> }<BR>
><BR>
> When I run this code I get the following results:<BR>
> C:\MPImanager\Debug\> mpiexec -n 1 MPImanager.exe<BR>
> In Master - someVariable = 10<BR>
> Fatal error in MPI_Init: Other MPI error, error stack:<BR>
> MPIR_Init_thread<294>............................:Initialization failed<BR>
> MPID_Init<242>...................................:Spawned process group was unable to connect back to parent on port <tag=0 port=8673 description=computer1.usask.ca ifname=192.168.5.100><BR>
> MPID_Comm_connect<187>...........................:<BR>
> MPIDI_Comm_connect<369>..........................:<BR>
> MPIDI_Create_inter_root_communicator_connect<133>:<BR>
> MPIDI_CH3I_Connect_to_root_sock<289>.............:<BR>
> MPIDU_Sock_post_connect<1228>....................: unable to connect to computer1.usask.ca on port 8673, exhausted all endpoints <errno -1><BR>
> MPIDU_Sock_post_connect<1244>....................: gethostbyname failed, The requested name is valid, but no data of the requested type was found. <errno 11004><BR>
><BR>
> Job aborted:<BR>
> rank: node: exit code[: error message]<BR>
> 0: computer2: 1: fatal error in MPI_Init: other MPI error, error stack:<BR>
> MPIR_Init_thread<294>............................:Initialization failed<BR>
> MPID_Init<242>...................................:Spawned process group was unable to connect back to parent on port <tag=0 port=8673 description=computer1.usask.ca ifname=192.168.5.100><BR>
> MPID_Comm_connect<187>...........................:<BR>
> MPIDI_Comm_connect<369>..........................:<BR>
> MPIDI_Create_inter_root_communicator_connect<133>:<BR>
> MPIDI_CH3I_Connect_to_root_sock<289>.............:<BR>
> MPIDU_Sock_post_connect<1228>....................: unable to connect to computer1.usask.ca on port 8673, exhausted all endpoints <errno -1><BR>
> MPIDU_Sock_post_connect<1244>....................: gethostbyname failed, The requested name is valid, but no data of the requested type was found. <errno 11004><BR>
><BR>
> (Here I waited several minutes before pressing ctrl+c)<BR>
><BR>
> mpiexec aborting job ...<BR>
><BR>
> (Here I waited several more minutes before pressing ctrl+c and<BR>
> returning to the command prompt)<BR>
><BR>
> So the program is able to spawn a process on the worker, but when<BR>
> the worker is unable to contact the manager node, MPI_Init fails. The<BR>
> error stack shows that it has the correct IP address and tries to use<BR>
> port 8673. At first I thought the problem might be that it was<BR>
> appending the domain name (usask.ca) from their old network, but the<BR>
> IP address is still correct so now I'm not sure.<BR>
><BR>
> If I change the code so Computer2 is the manager and Computer1 is the<BR>
> worker, the results are the same. But, just like cpi.exe, if I confine<BR>
> both the worker and the manager to the local host it performs<BR>
> perfectly. I assume this is an issue with either the way I've set up<BR>
> my network, or the way I've set up MPICH2 on the computers. Does<BR>
> anyone know what would cause an error like this?<BR>
><BR>
</FONT>
</P>
</BODY>
</HTML>