[mpich-discuss] child node can't contact parent?

Jayesh Krishna jayesh at mcs.anl.gov
Thu Aug 28 15:58:13 CDT 2008


 Hi,
  Great! Let us know if you need any further help.

Regards,
Jayesh

-----Original Message-----
From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Tony Bathgate
Sent: Thursday, August 28, 2008 3:39 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] child node can't contact parent?

Hi,

Thanks for the help! Everything I've tried is working without a hitch now.
To solve the DNS problem I simply rejoined the computers to the rest of
our work network. Earlier, while they were on the network, I was having
permission issues. To fix that I registered the local Administrator
account with mpiexec, but I still log on and make MPI calls through my
network account (which I didn't realize you could do). I'm still a little
wary about leaving holes in the firewall while they're on the work network
(just a set of open ports using the MPICH_PORT_RANGE variable), but it's
working and that's more important to me right now.

Thanks again,
Tony

Jayesh Krishna wrote:
>
> Hi,
> The description of error code 11004 in MS docs is,
>
> ================================================================
> The requested name is valid and was found in the database, but it does 
> not have the correct associated data being resolved for. The usual 
> example for this is a host name-to-address translation attempt (using 
> gethostbyname or WSAAsyncGetHostByName) which uses the DNS (Domain 
> Name Server). An MX record is returned but no A record, indicating the 
> host itself exists, but is not directly reachable.
> ================================================================
>
> Looks like the DNS server for your machine does not have information 
> about the computers/hosts in your setup.
>
> Regards,
> Jayesh
>
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Tony Bathgate
> Sent: Wednesday, August 27, 2008 6:28 PM
> To: Jayesh Krishna
> Cc: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] child node can't contact parent?
>
> Hi,
>
> Thanks for the reply.
> I had actually already pinged Computer1 from Computer2 and vice versa.
> The ping works fine. I tried using hostname with mpiexec like you 
> suggested and it works fine too, from both computers. Now I'm baffled.
> I also tried your Hello World program and it crashed. Here's the error 
> messages I got:
>
> C:\helloworld\Debug\> mpiexec -hosts 2 192.168.5.100 192.168.5.200 helloworld.exe
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize<255>............: MPI_Finalize failed
> MPI_Finalize<154>............:
> MPID_Finalize<94>............:
> MPI_Barrier<406>.............: MPI_Barrier <comm=0x44000002>
> MPIR_Barrier<77>.............:
> MPIC_Sendrecv<120>...........:
> MPID_Isend<103>..............: failure occured while attempting to 
> send an eager message
> MPIDI_CH3_iSend<172>.........:
> MPIDI_CH3I_Sock_connect<1191>: unable to connect to rank 0 using 
> business card <port=8673 description=computer1.usask.ca 
> ifname=192.168.5.100>
> MPIDU_Sock_post_connect<1244>: gethostbyname failed, The requested 
> name is valid, but no data of the requested type was found. <errno 
> 11004>
>
> job aborted:
> rank: node: exit code[: error message]
> 0: 192.168.5.100: 1
> 1: 192.168.5.200: 1: Fatal error in MPI_Finalize: Other MPI error, 
> error stack:
> MPI_Finalize<255>............: MPI_Finalize failed
> MPI_Finalize<154>............:
> MPID_Finalize<94>............:
> MPI_Barrier<406>.............: MPI_Barrier <comm=0x44000002>
> MPIR_Barrier<77>.............:
> MPIC_Sendrecv<120>...........:
> MPID_Isend<103>..............: failure occured while attempting to 
> send an eager message
> MPIDI_CH3_iSend<172>.........:
> MPIDI_CH3I_Sock_connect<1191>: unable to connect to rank 0 using 
> business card <port=8673 description=computer1.usask.ca 
> ifname=192.168.5.100>
> MPIDU_Sock_post_connect<1244>: gethostbyname failed, The requested 
> name is valid, but no data of the requested type was found. <errno 
> 11004>
>
> So it seems to me that it can execute programs remotely, but not when 
> the program relies on the MPICH2 C implementation libraries. Does that 
> make sense, and how could it be remedied?
>
> Thanks again,
> Tony
>
> Jayesh Krishna wrote:
> >
> > Hi,
> > Looks like something is wrong with the setup of your machines.
> >
> > # Can you ping from one machine to the other ?
> >
> > - From Computer1 try pinging Computer2
> > - From Computer2 try pinging Computer1
> >
> > # Start debugging by running a non-MPI program (like hostname)
> >
> > mpiexec -hosts 2 IPAddress_Of_Computer1 IPAddress_Of_Computer2 
> > hostname
> >
> > # Then debug with a simple hello world program (don't debug your 
> > setup with a complex program)
> >
> > ----------------- hello world ---------------
> > #include <stdio.h>
> > #include "mpi.h"
> >
> > int main(int argc, char *argv[]){
> > int rank=-1;
> > MPI_Init(&argc, &argv);
> > MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > printf("[%d] Hello world\n", rank);
> > MPI_Finalize();
> > }
> > ----------------- hello world ---------------
> >
> > Let us know the results.
> >
> > Regards,
> > Jayesh
> >
> > -----Original Message-----
> > From: owner-mpich-discuss at mcs.anl.gov 
> > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Tony Bathgate
> > Sent: Wednesday, August 27, 2008 3:31 PM
> > To: mpich-discuss at mcs.anl.gov
> > Subject: [mpich-discuss] child node can't contact parent?
> >
> > Hi All,
> >
> > I apologize in advance for the length of this email; I'm new to the 
> > world of MPI and I want to include everything that might be relevant.
> > I have the Win32 IA32 binary of MPICH2 installed on two machines.
> > They are running Windows XP Pro. x64 Edition with Service Pack 2 and 
> > they each have an Intel Xeon processor. To simplify things I took 
> > them off our network, gave them their own router, and dropped their 
> > Windows firewalls. I have assigned the machines static IPs with the 
> > router (192.168.5.100 for Computer1, and 192.168.5.200 for Computer2).
> > I've registered the local Administrator accounts (which have 
> > identical passwords and credentials) with mpiexec on each machine. 
> > And everything below was attempted from the Administrator account.
> >
> > I've tried running the cpi.exe example but it just hangs:
> >
> > C:\Program Files (x86)\MPICH2\examples> mpiexec -hosts 2 Computer1
> > Computer2 .\cpi.exe
> > Enter the number of intervals: (0 quits) 1
> >
> > (here I waited about 20 minutes, then Ctrl+C)
> >
> > mpiexec aborting job
> >
> > job aborted:
> > rank: node: exit code[: error message]
> > 0: Computer1: 123: mpiexec aborting job
> > 1: Computer2: 123
> >
> > It runs perfectly fine if I run it with the -localonly option.
> > To explore this issue I wrote a simple program that uses 
> > MPI_Comm_spawn to spawn a worker program. The master then sends the 
> > worker a message and they both exit. The manager node runs the code 
> > that follows here:
> > #include <mpi.h>
> > #include <stdio.h>
> >
> > int main (int argc, char* argv[])
> > {
> > int someVariable = 10;
> >
> > MPI_Info info;
> > MPI_Comm workercomm;
> > MPI_Request request;
> > MPI_Status status;
> >
> > MPI_Init( &argc, &argv );
> >
> > fprintf( stdout, "In Master - someVariable = %i \n", someVariable );
> > fflush( stdout );
> >
> > MPI_Info_create( &info );
> > MPI_Info_set( info, "host", "Computer2" );
> > MPI_Comm_spawn( "C:\\MPIworker\\Debug\\MPIworker.exe",
> > MPI_ARGV_NULL,
> > 1, info, 0, MPI_COMM_SELF, &workercomm, MPI_ERRCODES_IGNORE );
> >
> > MPI_Info_free( &info );
> >
> > /* Rank 0 of the intercommunicator is the spawned worker. */
> > MPI_Isend( &someVariable, 1, MPI_INT, 0, 0, workercomm,
> > &request );
> > MPI_Waitall( 1, &request, &status );
> >
> > fprintf(stdout,"Done sending\n");
> > fflush(stdout);
> >
> > MPI_Finalize();
> > return 0;
> > }
> > The worker code follows here:
> > #include <mpi.h>
> > #include <stdio.h>
> >
> > int main (int argc, char* argv[])
> > {
> > int someVariable = 0;
> > MPI_Comm parentcomm;
> > MPI_Request request;
> > MPI_Status status;
> >
> > MPI_Init( &argc, &argv );
> >
> > fprintf( stdout, "In Worker: Before receive - someVariable = %i \n", someVariable );
> > fflush( stdout );
> >
> > MPI_Comm_get_parent( &parentcomm );
> > MPI_Irecv( &someVariable, 1, MPI_INT, 0, 0, parentcomm, &request );
> >
> > MPI_Wait( &request, &status );
> > fprintf( stdout, "After receive - someVariable = %i\n", someVariable );
> > fflush( stdout );
> >
> > MPI_Finalize();
> > return 0;
> > }
> >
> > When I run this code I get the following results:
> > C:\MPImanager\Debug\> mpiexec -n 1 MPImanager.exe
> > In Master - someVariable = 10
> > Fatal error in MPI_Init: Other MPI error, error stack:
> > MPIR_Init_thread<294>............................:Initialization failed
> > MPID_Init<242>...................................:Spawned process 
> > group was unable to connect back to parent on port <tag=0 port=8673 
> > description=computer1.usask.ca ifname=192.168.5.100>
> > MPID_Comm_connect<187>...........................:
> > MPIDI_Comm_connect<369>..........................:
> > MPIDI_Create_inter_root_communicator_connect<133>:
> > MPIDI_CH3I_Connect_to_root_sock<289>.............:
> > MPIDU_Sock_post_connect<1228>....................: unable to connect 
> > to computer1.usask.ca on port 8673, exhuasted all endpoints <errno 
> > -1>
> > MPIDU_Sock_post_connect<1244>....................: gethostbyname 
> > failed, The requested name is valid, but no data of the requested 
> > type was found. <errno 11004>
> >
> > Job aborted:
> > rank: node: exit code[: error message]
> > 0: computer2: 1: fatal error in MPI_Init: other MPI error, error stack:
> > MPIR_Init_thread<294>............................:Initialization failed
> > MPID_Init<242>...................................:Spawned process 
> > group was unable to connect back to parent on port <tag=0 port=8673 
> > description=computer1.usask.ca ifname=192.168.5.100>
> > MPID_Comm_connect<187>...........................:
> > MPIDI_Comm_connect<369>..........................:
> > MPIDI_Create_inter_root_communicator_connect<133>:
> > MPIDI_CH3I_Connect_to_root_sock<289>.............:
> > MPIDU_Sock_post_connect<1228>....................: unable to connect 
> > to computer1.usask.ca on port 8673, exhuasted all endpoints <errno 
> > -1>
> > MPIDU_Sock_post_connect<1244>....................: gethostbyname 
> > failed, The requested name is valid, but no data of the requested 
> > type was found. <errno 11004>
> >
> > (Here I waited several minutes before pressing ctrl+c)
> >
> > mpiexec aborting job ...
> >
> > (Here I waited several more minutes before pressing ctrl+c and 
> > returning to the command prompt)
> >
> > So the program is able to spawn a process on the worker, but 
> > MPI_Init then fails because the worker cannot contact the manager 
> > node. The error stack shows that it has the correct IP address and 
> > tries to use port 8673. At first I thought the problem might be that 
> > it was appending the domain name (usask.ca) from their old network, 
> > but the IP address is still correct so now I'm not sure.
> >
> > If I change the code so Computer2 is the manager and Computer1 is 
> > the worker, the results are the same. But, just like cpi.exe, if I 
> > confine both the worker and the manager to the local host it 
> > performs perfectly. I assume this is an issue with either the way 
> > I've set up my network, or the way I've set up MPICH2 on the 
> > computers. Does anyone know what would cause an error like this?
> >
>

--
Tony Bathgate
BEng Engineering Physics
BSc Computer Science

Master's Candidate
University of Saskatchewan
116 Science Place
Saskatoon SK,
S7N 5E2
966-6452
