<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7036.0">
<TITLE>RE: [mpich-discuss] child node can't contact parent?</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/plain format -->
<P><FONT SIZE=2> Hi,<BR>
Great! Let us know if you need any further help.<BR>
<BR>
Regards,<BR>
Jayesh<BR>
<BR>
-----Original Message-----<BR>
From: owner-mpich-discuss@mcs.anl.gov [<A HREF="mailto:owner-mpich-discuss@mcs.anl.gov">mailto:owner-mpich-discuss@mcs.anl.gov</A>] On Behalf Of Tony Bathgate<BR>
Sent: Thursday, August 28, 2008 3:39 PM<BR>
To: mpich-discuss@mcs.anl.gov<BR>
Subject: Re: [mpich-discuss] child node can't contact parent?<BR>
<BR>
Hi,<BR>
<BR>
Thanks for the help! Everything I've tried is working without a hitch now. To solve the DNS problem I simply rejoined the computers to the rest of our work network. Earlier, while they were on the network, I was having permission issues; to fix that I registered the local Administrator account with mpiexec, but I still log on and make MPI calls through my network account (which I didn't realize you could do). I'm still a little wary about leaving holes in the firewall while it's on the work network (just a set of open ports using the MPICH_PORT_RANGE variable), but it's working and that's more important to me right now.<BR>
<BR>
Thanks again,<BR>
Tony<BR>
<BR>
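For anyone opening firewall ports for a setup like the one described above, a small probe can confirm that a given TCP port is actually reachable from the other machine. This is only a sketch using POSIX sockets, not anything taken from MPICH2; the helper name tcp_port_reachable is hypothetical, and on Windows the same calls need the Winsock headers and a WSAStartup call first.<BR>
<BR>
```c
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of MPICH2): returns 0 if a TCP connection
 * to host:port can be established, -1 otherwise. The connect() call is
 * blocking, so a firewall that silently drops packets can make this wait
 * for the system's connect timeout. */
int tcp_port_reachable(const char *host, const char *port)
{
    struct addrinfo hints, *res = NULL, *ai;
    int result = -1;

    assert(host != NULL && port != NULL);

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;     /* accept IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* TCP */

    /* Name resolution happens here; a DNS failure also yields -1. */
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    /* Try each returned address until one accepts the connection. */
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
            result = 0;
            close(fd);
            break;
        }
        close(fd);
    }

    freeaddrinfo(res);
    return result;
}
```
<BR>
Calling tcp_port_reachable("Computer1", "8673") from the other machine while a job is listening there would exercise both the name lookup and the firewall rule; the host name and port are simply the ones that appear in the error stack quoted below.<BR>
<BR>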
Jayesh Krishna wrote:<BR>
><BR>
> Hi,<BR>
> The description of error code 11004 in MS docs is,<BR>
><BR>
> ================================================================<BR>
> The requested name is valid and was found in the database, but it does<BR>
> not have the correct associated data being resolved for. The usual<BR>
> example for this is a host name-to-address translation attempt (using<BR>
> gethostbyname or WSAAsyncGetHostByName) which uses the DNS (Domain<BR>
> Name Server). An MX record is returned but no A record—indicating the<BR>
> host itself exists, but is not directly reachable.<BR>
> ================================================================<BR>
><BR>
> Looks like the DNS server for your machine does not have information<BR>
> about the computers/hosts in your setup.<BR>
><BR>
> Regards,<BR>
> Jayesh<BR>
><BR>
> -----Original Message-----<BR>
> From: owner-mpich-discuss@mcs.anl.gov<BR>
> [<A HREF="mailto:owner-mpich-discuss@mcs.anl.gov">mailto:owner-mpich-discuss@mcs.anl.gov</A>] On Behalf Of Tony Bathgate<BR>
> Sent: Wednesday, August 27, 2008 6:28 PM<BR>
> To: Jayesh Krishna<BR>
> Cc: mpich-discuss@mcs.anl.gov<BR>
> Subject: Re: [mpich-discuss] child node can't contact parent?<BR>
><BR>
> Hi,<BR>
><BR>
> Thanks for the reply.<BR>
> I had actually already pinged Computer1 from Computer2 and vice versa.<BR>
> The ping works fine. I tried using hostname with mpiexec like you<BR>
> suggested and it works fine too, from both computers. Now I'm baffled.<BR>
> I also tried your Hello World program and it crashed. Here's the error<BR>
> messages I got:<BR>
><BR>
> C:\helloworld\Debug\> mpiexec -hosts 2 192.168.5.100 192.168.5.200 helloworld.exe<BR>
> Fatal error in MPI_Finalize: Other MPI error, error stack:<BR>
> MPI_Finalize<255>............: MPI_Finalize failed<BR>
> MPI_Finalize<154>............:<BR>
> MPID_Finalize<94>............:<BR>
> MPI_Barrier<406>.............: MPI_Barrier <comm=0x44000002><BR>
> MPIR_Barrier<77>.............:<BR>
> MPIC_Sendrecv<120>...........:<BR>
> MPID_Isend<103>..............: failure occured while attempting to<BR>
> send an eager message<BR>
> MPIDI_CH3_iSend<172>.........:<BR>
> MPIDI_CH3I_Sock_connect<1191>: unable to connect to rank 0 using<BR>
> business card <port=8673 description=computer1.usask.ca<BR>
> ifname=192.168.5.100><BR>
> MPIDU_Sock_post_connect<1244>: gethostbyname failed, The requested<BR>
> name is valid, but no data of the requested type was found. <errno 11004><BR>
> job aborted:<BR>
> rank: node: exit code[: error message]<BR>
> 0: 192.168.5.100: 1<BR>
> 1: 192.168.5.200: 1: Fatal error in MPI_Finalize: Other MPI error, error stack:<BR>
> MPI_Finalize<255>............: MPI_Finalize failed<BR>
> MPI_Finalize<154>............:<BR>
> MPID_Finalize<94>............:<BR>
> MPI_Barrier<406>.............: MPI_Barrier <comm=0x44000002><BR>
> MPIR_Barrier<77>.............:<BR>
> MPIC_Sendrecv<120>...........:<BR>
> MPID_Isend<103>..............: failure occured while attempting to<BR>
> send an eager message<BR>
> MPIDI_CH3_iSend<172>.........:<BR>
> MPIDI_CH3I_Sock_connect<1191>: unable to connect to rank 1 using<BR>
> business card <port=8673 description=computer1.usask.ca<BR>
> ifname=192.168.5.100><BR>
> MPIDU_Sock_post_connect<1244>: gethostbyname failed, The requested<BR>
> name is valid, but no data of the requested type was found. <errno<BR>
> 11004><BR>
><BR>
> So it seems to me that it can execute programs remotely, but not when<BR>
> the program relies on the MPICH2 C libraries. Does that<BR>
> make sense, and how could it be remedied?<BR>
><BR>
> Thanks again,<BR>
> Tony<BR>
><BR>
> Jayesh Krishna wrote:<BR>
> ><BR>
> > Hi,<BR>
> > Looks like something is wrong with the setup of your machines.<BR>
> ><BR>
> > # Can you ping from one machine to the other ?<BR>
> ><BR>
> > - From Computer1 try pinging Computer2<BR>
> > - From Computer2 try pinging Computer1<BR>
> ><BR>
> > # Start debugging by running a non-MPI program (like hostname)<BR>
> ><BR>
> > mpiexec -hosts 2 IPAddress_Of_Computer1 IPAddress_Of_Computer2<BR>
> > hostname<BR>
> ><BR>
> > # Then debug with a simple hello world program (don't debug your<BR>
> > setup with a complex program)<BR>
> ><BR>
> > ----------------- hello world ---------------<BR>
> > #include <stdio.h><BR>
> > #include "mpi.h"<BR>
> ><BR>
> > int main(int argc, char *argv[]){<BR>
> >     int rank=-1;<BR>
> >     MPI_Init(&argc, &argv);<BR>
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);<BR>
> >     printf("[%d] Hello world\n", rank);<BR>
> >     MPI_Finalize();<BR>
> > }<BR>
> > ----------------- hello world ---------------<BR>
> ><BR>
> > Let us know the results.<BR>
> ><BR>
> > Regards,<BR>
> > Jayesh<BR>
> ><BR>
> > -----Original Message-----<BR>
> > From: owner-mpich-discuss@mcs.anl.gov<BR>
> > [<A HREF="mailto:owner-mpich-discuss@mcs.anl.gov">mailto:owner-mpich-discuss@mcs.anl.gov</A>] On Behalf Of Tony Bathgate<BR>
> > Sent: Wednesday, August 27, 2008 3:31 PM<BR>
> > To: mpich-discuss@mcs.anl.gov<BR>
> > Subject: [mpich-discuss] child node can't contact parent?<BR>
> ><BR>
> > Hi All,<BR>
> ><BR>
> > I apologize in advance for the length of this email; I'm new to the<BR>
> > world of MPI and I want to include everything that might be relevant.<BR>
> > I have the Win32 IA32 binary of MPICH2 installed on two machines.<BR>
> > They are running Windows XP Pro. x64 Edition with Service Pack 2 and<BR>
> > they each have an Intel Xeon processor. To simplify things I took<BR>
> > them off our network, gave them their own router, and dropped their<BR>
> > Windows firewalls. I have assigned the machines static IPs with the<BR>
> > router (192.168.5.100 for Computer1, and 192.168.5.200 for Computer2).<BR>
> > I've registered the local Administrator accounts (which have<BR>
> > identical passwords and credentials) with mpiexec on each machine.<BR>
> > And everything below was attempted from the Administrator account.<BR>
> ><BR>
> > I've tried running the cpi.exe example but it just hangs:<BR>
> ><BR>
> > C:\Program Files (x86)\MPICH2\examples> mpiexec -hosts 2 Computer1<BR>
> > Computer2 .\cpi.exe<BR>
> > Enter the number of intervals: (0 quits) 1<BR>
> ><BR>
> > (here I waited about 20 minutes, then Ctrl+C)<BR>
> ><BR>
> > mpiexec aborting job<BR>
> ><BR>
> > job aborted:<BR>
> > rank: node: exit code[: error message]<BR>
> > 0: Computer1: 123: mpiexec aborting job<BR>
> > 1: Computer2: 123<BR>
> ><BR>
> > It runs perfectly fine if I run it with the -localonly option.<BR>
> > To explore this issue I wrote a simple program that uses<BR>
> > MPI_Comm_spawn to spawn a worker program. The master then sends the<BR>
> > worker a message and they both exit. The manager node runs the code<BR>
> > that follows here:<BR>
> > #include <mpi.h><BR>
> > #include <stdio.h><BR>
> ><BR>
> > int main (int argc, char* argv[])<BR>
> > {<BR>
> > int someVariable = 10;<BR>
> ><BR>
> > MPI_Info info;<BR>
> > MPI_Comm workercomm;<BR>
> > MPI_Request request;<BR>
> > MPI_Status status;<BR>
> ><BR>
> > MPI_Init( &argc, &argv );<BR>
> ><BR>
> > fprintf( stdout, "In Master - someVariable = %i \n", someVariable );<BR>
> > fflush( stdout );<BR>
> ><BR>
> > MPI_Info_create( &info );<BR>
> > MPI_Info_set( info, "host", "Computer2" );<BR>
> > MPI_Comm_spawn( "C:\\MPIworker\\Debug\\MPIworker.exe", MPI_ARGV_NULL,<BR>
> > 1, info, 0, MPI_COMM_SELF, &workercomm, MPI_ERRCODES_IGNORE );<BR>
> ><BR>
> > MPI_Info_free( &info );<BR>
> ><BR>
> > MPI_Isend( &someVariable, 1, MPI_INT, 0, 0, workercomm, &request );<BR>
> > MPI_Wait( &request, &status ); /* MPI_Wait takes pointers to the request and status */<BR>
> ><BR>
> > fprintf(stdout,"Done sending\n");<BR>
> > fflush(stdout);<BR>
> ><BR>
> > MPI_Finalize();<BR>
> > return 0;<BR>
> > }<BR>
> > The worker code follows here:<BR>
> > #include <mpi.h><BR>
> > #include <stdio.h><BR>
> ><BR>
> > int main (int argc, char* argv[])<BR>
> > {<BR>
> > int someVariable = 0;<BR>
> > MPI_Comm parentcomm;<BR>
> > MPI_Request request;<BR>
> > MPI_Status status;<BR>
> ><BR>
> > MPI_Init( &argc, &argv );<BR>
> ><BR>
> > fprintf(stdout, "In Worker: Before receive - someVariable = %i \n", someVariable);<BR>
> > fflush( stdout );<BR>
> ><BR>
> > MPI_Comm_get_parent( &parentcomm );<BR>
> > MPI_Irecv( &someVariable, 1, MPI_INT, 0, 0, parentcomm, &request );<BR>
> ><BR>
> > MPI_Wait( &request, &status );<BR>
> > fprintf( stdout, "After receive - someVariable = %i\n", someVariable );<BR>
> > fflush( stdout );<BR>
> ><BR>
> > MPI_Finalize();<BR>
> > return 0;<BR>
> > }<BR>
> ><BR>
> > When I run this code I get the following results:<BR>
> > C:\MPImanager\Debug\> mpiexec -n 1 MPImanager.exe<BR>
> > In Master - someVariable = 10<BR>
> > Fatal error in MPI_Init: Other MPI error, error stack:<BR>
> > MPIR_Init_thread<294>............................:Initialization<BR>
> > failed<BR>
> > MPID_Init<242>...................................:Spawned process<BR>
> > group was unable to connect back to parent on port <tag=0 port=8673<BR>
> > description=computer1.usask.ca ifname=192.168.5.100><BR>
> > MPID_Comm_connect<187>...........................:<BR>
> > MPIDI_Comm_connect<369>..........................:<BR>
> > MPIDI_Create_inter_root_communicator_connect<133>:<BR>
> > MPIDI_CH3I_Connect_to_root_sock<289>.............:<BR>
> > MPIDU_Sock_post_connect<1228>....................: unable to connect<BR>
> > to computer1.usask.ca on port 8673, exhuasted all endpoints <errno<BR>
> > -1><BR>
> > MPIDU_Sock_post_connect<1244>....................: gethostbyname<BR>
> > failed, The requested name is valid, but no data of the requested<BR>
> > type was found. <errno 11004><BR>
> ><BR>
> > Job aborted:<BR>
> > rank: node: exit code[: error message]<BR>
> > 0: computer2: 1: fatal error in MPI_Init: other MPI error, error stack:<BR>
> > MPIR_Init_thread<294>............................:Initialization<BR>
> > failed<BR>
> > MPID_Init<242>...................................:Spawned process<BR>
> > group was unable to connect back to parent on port <tag=0 port=8673<BR>
> > description=computer1.usask.ca ifname=192.168.5.100><BR>
> > MPID_Comm_connect<187>...........................:<BR>
> > MPIDI_Comm_connect<369>..........................:<BR>
> > MPIDI_Create_inter_root_communicator_connect<133>:<BR>
> > MPIDI_CH3I_Connect_to_root_sock<289>.............:<BR>
> > MPIDU_Sock_post_connect<1228>....................: unable to connect<BR>
> > to computer1.usask.ca on port 8673, exhuasted all endpoints <errno<BR>
> > -1><BR>
> > MPIDU_Sock_post_connect<1244>....................: gethostbyname<BR>
> > failed, The requested name is valid, but no data of the requested<BR>
> > type was found. <errno 11004><BR>
> ><BR>
> > (Here I waited several minutes before pressing ctrl+c)<BR>
> ><BR>
> > mpiexec aborting job ...<BR>
> ><BR>
> > (Here I waited several more minutes before pressing ctrl+c and<BR>
> > returning to the command prompt)<BR>
> ><BR>
> > So the program is able to spawn a process on the worker, but MPI_Init<BR>
> > then fails because the worker is unable to contact the manager node.<BR>
> > The error stack shows that it has the correct IP address and<BR>
> > tries to use port 8673. At first I thought the problem might be that<BR>
> > it was appending the domain name (usask.ca) from their old network,<BR>
> > but the IP address is still correct so now I'm not sure.<BR>
> ><BR>
> > If I change the code so Computer2 is the manager and Computer1 is<BR>
> > the worker, the results are the same. But just like cpi.exe, if I<BR>
> > confine both the worker and the manager to the local host, it<BR>
> > performs perfectly. I assume this is an issue with either the way<BR>
> > I've set up my network, or the way I've set up MPICH2 on the<BR>
> > computers. Does anyone know what would cause an error like this?<BR>
> ><BR>
><BR>
<BR>
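The failure that recurs in the traces quoted above is a host-name lookup (gethostbyname, errno 11004) that fails even though the IP address in the business card is correct. That kind of problem can be checked outside MPICH2 with a small resolver test. The following is a minimal sketch using the POSIX getaddrinfo call; the helper name resolve_ipv4 is hypothetical, and on Windows the same function is available through Winsock but needs its own headers and a WSAStartup call.<BR>
<BR>
```c
#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of MPICH2): look up the first IPv4 address
 * for `host` and write it as dotted-quad text into `out`.
 * Returns 0 on success and a nonzero value on failure. */
int resolve_ipv4(const char *host, char *out, size_t outlen)
{
    struct addrinfo hints, *res = NULL;
    const struct sockaddr_in *sin;
    int rc;

    assert(host != NULL && out != NULL);

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;       /* IPv4 only, i.e. an A-record style lookup */
    hints.ai_socktype = SOCK_STREAM;

    rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc != 0) {
        /* This is the situation the thread ran into: the name cannot be
         * turned into an address by the configured resolver. */
        fprintf(stderr, "lookup of %s failed: %s\n", host, gai_strerror(rc));
        return rc;
    }

    /* Convert the first result to text. */
    sin = (const struct sockaddr_in *)res->ai_addr;
    if (inet_ntop(AF_INET, &sin->sin_addr, out, (socklen_t)outlen) == NULL)
        rc = -1;

    freeaddrinfo(res);
    return rc;
}
```
<BR>
Running resolve_ipv4 on each machine with the fully qualified name that appears in the error stack (computer1.usask.ca) would show whether that name resolves there; a nonzero return points at DNS or the hosts file rather than at MPICH2 itself.<BR>
<BR>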
--<BR>
Tony Bathgate<BR>
BEng Engineering Physics<BR>
BSc Computer Science<BR>
<BR>
Master's Candidate<BR>
University of Saskatchewan<BR>
116 Science Place<BR>
Saskatoon SK,<BR>
S7N 5E2<BR>
966-6452<BR>
<BR>
</FONT>
</P>
</BODY>
</HTML>