[mpich-discuss] child node can't contact parent?

Tony Bathgate tony.bathgate at usask.ca
Thu Aug 28 15:39:21 CDT 2008


Hi,

Thanks for the help! Everything I've tried is now working without a 
hitch. To solve the DNS problem I simply rejoined the computers to the 
rest of our work network. Earlier, while they were on the network, I was 
having permission issues. To fix those I registered the local 
Administrator account with mpiexec, but I still log on and make MPI 
calls through my network account (which I didn't realize you could do). 
I'm still a little wary about leaving holes in the firewall while it's 
on the work network (just a set of open ports set with the 
MPICH_PORT_RANGE variable), but it's working and that's more important 
to me right now.

Thanks again,
Tony
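
For anyone who hits the same errno 11004: in the error stacks quoted 
below, gethostbyname fails on the description field of the business card 
(computer1.usask.ca) even though the ifname field carries the correct IP 
address. A quick way to confirm that kind of problem is to check name 
resolution outside of MPI. What follows is only a minimal sketch, not 
part of MPICH2. It assumes a POSIX system and uses getaddrinfo; on 
Windows the same call needs winsock2.h/ws2tcpip.h and a WSAStartup call 
first, which the sketch leaves out. The name resolve_host is a 
hypothetical helper chosen for illustration.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Hypothetical helper (not an MPICH2 function): resolve `name` to its
 * first IPv4 address and write the dotted-quad string into `out`.
 * Returns 0 on success, or a nonzero getaddrinfo/getnameinfo error code. */
int resolve_host(const char *name, char *out, size_t outlen)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int rc;

    assert(name != NULL && out != NULL && outlen > 0);

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;       /* IPv4 only, as in the 192.168.5.x setup */
    hints.ai_socktype = SOCK_STREAM; /* TCP sockets */

    rc = getaddrinfo(name, NULL, &hints, &res);
    if (rc != 0) {
        /* This is the situation the thread ran into: the name cannot be
         * turned into an address by the configured resolver. */
        fprintf(stderr, "cannot resolve %s: %s\n", name, gai_strerror(rc));
        return rc;
    }

    /* NI_NUMERICHOST formats the address itself; no reverse lookup is done. */
    rc = getnameinfo(res->ai_addr, res->ai_addrlen, out, (socklen_t)outlen,
                     NULL, 0, NI_NUMERICHOST);
    freeaddrinfo(res);
    return rc;
}
```

Calling resolve_host on each machine with the name that appears in the 
business card (here computer1.usask.ca) shows whether the resolver on 
that machine can return an address for it. A numeric string such as 
192.168.5.100 resolves without any DNS query, so it will not reproduce 
the failure.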

Jayesh Krishna wrote:
>
> Hi,
> The description of error code 11004 in MS docs is,
>
> ================================================================
> The requested name is valid and was found in the database, but it does 
> not have the correct associated data being resolved for. The usual 
> example for this is a host name-to-address translation attempt (using 
> gethostbyname or WSAAsyncGetHostByName) which uses the DNS (Domain 
> Name Server). An MX record is returned but no A record—indicating the 
> host itself exists, but is not directly reachable.
> ================================================================
>
> Looks like the DNS server for your machine does not have information 
> about the computers/hosts in your setup.
>
> Regards,
> Jayesh
>
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Tony Bathgate
> Sent: Wednesday, August 27, 2008 6:28 PM
> To: Jayesh Krishna
> Cc: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] child node can't contact parent?
>
> Hi,
>
> Thanks for the reply.
> I had actually already pinged Computer1 from Computer2 and vice versa.
> The ping works fine. I tried using hostname with mpiexec like you 
> suggested and it works fine too, from both computers. Now I'm baffled.
> I also tried your Hello World program and it crashed. Here's the error 
> messages I got:
>
> C:\helloworld\Debug\> mpiexec -hosts 2 192.168.5.100 192.168.5.200 helloworld.exe
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize<255>............: MPI_Finalize failed
> MPI_Finalize<154>............:
> MPID_Finalize<94>............:
> MPI_Barrier<406>.............: MPI_Barrier <comm=0x44000002>
> MPIR_Barrier<77>.............:
> MPIC_Sendrecv<120>...........:
> MPID_Isend<103>..............: failure occured while attempting to 
> send an eager message
> MPIDI_CH3_iSend<172>.........:
> MPIDI_CH3I_Sock_connect<1191>: unable to connect to rank 0 using 
> business card <port=8673 description=computer1.usask.ca 
> ifname=192.168.5.100>
> MPIDU_Sock_post_connect<1244>: gethostbyname failed, The requested 
> name is valid, but no data of the requested type was found. <errno 11004>
>
> job aborted:
> rank: node: exit code[: error message]
> 0: 192.168.5.100: 1
> 1: 192.168.5.200: 1: Fatal error in MPI_Finalize: Other MPI error, error
> stack:
> MPI_Finalize<255>............: MPI_Finalize failed
> MPI_Finalize<154>............:
> MPID_Finalize<94>............:
> MPI_Barrier<406>.............: MPI_Barrier <comm=0x44000002>
> MPIR_Barrier<77>.............:
> MPIC_Sendrecv<120>...........:
> MPID_Isend<103>..............: failure occured while attempting to 
> send an eager message
> MPIDI_CH3_iSend<172>.........:
> MPIDI_CH3I_Sock_connect<1191>: unable to connect to rank 1 using 
> business card <port=8673 description=computer1.usask.ca 
> ifname=192.168.5.100>
> MPIDU_Sock_post_connect<1244>: gethostbyname failed, The requested 
> name is valid, but no data of the requested type was found. <errno 11004>
>
> So it seems to me that it can execute programs remotely, but not when 
> the program relies on the MPICH2 C libraries. Does that make sense, 
> and how could it be remedied?
>
> Thanks again,
> Tony
>
> Jayesh Krishna wrote:
> >
> > Hi,
> > Looks like something is wrong with the setup of your machines.
> >
> > # Can you ping from one machine to the other ?
> >
> > - From Computer1 try pinging Computer2
> > - From Computer2 try pinging Computer1
> >
> > # Start debugging by running a non-MPI program (like hostname)
> >
> > mpiexec -hosts 2 IPAddress_Of_Computer1 IPAddress_Of_Computer2
> > hostname
> >
> > # Then debug with a simple hello world program (don't debug your setup
> > with a complex program)
> >
> > ----------------- hello world ---------------
> > #include <stdio.h>
> > #include "mpi.h"
> >
> > int main(int argc, char *argv[]){
> > int rank=-1;
> > MPI_Init(&argc, &argv);
> > MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > printf("[%d] Hello world\n", rank);
> > MPI_Finalize();
> > }
> > ----------------- hello world ---------------
> >
> > Let us know the results.
> >
> > Regards,
> > Jayesh
> >
> > -----Original Message-----
> > From: owner-mpich-discuss at mcs.anl.gov
> > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Tony Bathgate
> > Sent: Wednesday, August 27, 2008 3:31 PM
> > To: mpich-discuss at mcs.anl.gov
> > Subject: [mpich-discuss] child node can't contact parent?
> >
> > Hi All,
> >
> > I apologize in advance for the length of this email; I'm new to the
> > world of MPI and I want to include everything that might be relevant.
> > I have the Win32 IA32 binary of MPICH2 installed on two machines.
> > They are running Windows XP Pro. x64 Edition with Service Pack 2 and
> > they each have an Intel Xeon processor. To simplify things I took
> > them off our network, gave them their own router, and dropped their
> > Windows firewalls. I have assigned the machines static IPs with the
> > router (192.168.5.100 for Computer1, and 192.168.5.200 for Computer2).
> > I've registered the local Administrator accounts (which have identical
> > passwords and credentials) with mpiexec on each machine. And
> > everything below was attempted from the Administrator account.
> >
> > I've tried running the cpi.exe example but it just hangs:
> >
> > C:\Program Files (x86)\MPICH2\examples> mpiexec -hosts 2 Computer1
> > Computer2 .\cpi.exe
> > Enter the number of intervals: (0 quits) 1
> >
> > (here I waited about 20 minutes, then Ctrl+C)
> >
> > mpiexec aborting job
> >
> > job aborted:
> > rank: node: exit code[: error message]
> > 0: Computer1: 123: mpiexec aborting job
> > 1: Computer2: 123
> >
> > It runs perfectly fine if I execute it with the -localonly option.
> > To explore this issue I wrote a simple program that uses
> > MPI_Comm_spawn to spawn a worker program. The master then sends the
> > worker a message and they both exit. The manager node runs the code
> > that follows here:
> > #include <mpi.h>
> > #include <stdio.h>
> >
> > int main (int argc, char* argv[])
> > {
> > int someVariable = 10;
> >
> > MPI_Info info;
> > MPI_Comm workercomm;
> > MPI_Request request;
> > MPI_Status status;
> >
> > MPI_Init( &argc, &argv );
> >
> > fprintf( stdout, "In Master - someVariable = %i \n",
> > someVariable );
> > fflush( stdout );
> >
> > MPI_Info_create( &info );
> > MPI_Info_set( info, "host", "Computer2" );
> > MPI_Comm_spawn( "C:\\MPIworker\\Debug\\MPIworker.exe",
> > MPI_ARGV_NULL,
> > 1, info, 0, MPI_COMM_SELF, &workercomm,
> > MPI_ERRCODES_IGNORE );
> >
> > MPI_Info_free( &info );
> >
> > MPI_Isend( &someVariable, 1, MPI_INT, 0, 0, workercomm,
> > &(request) );
> > MPI_Waitall( 1, &request, &status );
> >
> > fprintf(stdout,"Done sending\n");
> > fflush(stdout);
> >
> > MPI_Finalize();
> > return 0;
> > }
> > The worker code follows here:
> > #include <mpi.h>
> > #include <stdio.h>
> >
> > int main (int argc, char* argv[])
> > {
> > int someVariable = 0;
> > MPI_Comm parentcomm;
> > MPI_Request request;
> > MPI_Status status;
> >
> > MPI_Init( &argc, &argv );
> >
> > fprintf(stdout, "In Worker: Before receive - someVariable = %i\n",
> > someVariable);
> > fflush( stdout );
> >
> > MPI_Comm_get_parent( &parentcomm );
> > MPI_Irecv( &someVariable, 1, MPI_INT, 0, 0, parentcomm,
> > &request );
> >
> > MPI_Wait( &request, &status );
> > fprintf( stdout, "After receive - someVariable = %i\n",
> > someVariable );
> > fflush( stdout );
> >
> > MPI_Finalize();
> > return 0;
> > }
> >
> > When I run this code I get the following results:
> > C:\MPImanager\Debug\> mpiexec -n 1 MPImanager.exe
> > In Master - someVariable = 10
> > Fatal error in MPI_Init: Other MPI error, error stack:
> > MPIR_Init_thread<294>............................:Initialization
> > failed
> > MPID_Init<242>...................................:Spawned process
> > group was unable to connect back to parent on port <tag=0 port=8673
> > description=computer1.usask.ca ifname=192.168.5.100>
> > MPID_Comm_connect<187>...........................:
> > MPIDI_Comm_connect<369>..........................:
> > MPIDI_Create_inter_root_communicator_connect<133>:
> > MPIDI_CH3I_Connect_to_root_sock<289>.............:
> > MPIDU_Sock_post_connect<1228>....................: unable to
> > connect to computer1.usask.ca on port 8673, exhuasted all endpoints
> > <errno -1>
> > MPIDU_Sock_post_connect<1244>....................: gethostbyname
> > failed, The requested name is valid, but no data of the requested type
> > was found. <errno 11004>
> >
> > Job aborted:
> > rank: node: exit code[: error message]
> > 0: computer2: 1: fatal error in MPI_Init: other MPI error, error
> > stack:
> > MPIR_Init_thread<294>............................:Initialization
> > failed
> > MPID_Init<242>...................................:Spawned process
> > group was unable to connect back to parent on port <tag=0 port=8673
> > description=computer1.usask.ca ifname=192.168.5.100>
> > MPID_Comm_connect<187>...........................:
> > MPIDI_Comm_connect<369>..........................:
> > MPIDI_Create_inter_root_communicator_connect<133>:
> > MPIDI_CH3I_Connect_to_root_sock<289>.............:
> > MPIDU_Sock_post_connect<1228>....................: unable to
> > connect to computer1.usask.ca on port 8673, exhuasted all endpoints
> > <errno -1>
> > MPIDU_Sock_post_connect<1244>....................: gethostbyname
> > failed, The requested name is valid, but no data of the requested type
> > was found. <errno 11004>
> >
> > (Here I waited several minutes before pressing ctrl+c)
> >
> > mpiexec aborting job ...
> >
> > (Here I waited several more minutes before pressing ctrl+c and
> > returning to the command prompt)
> >
> > So the program is able to spawn a process on the worker, but MPI_Init
> > then fails because the worker is unable to contact the manager node. The
> > error stack shows that it has the correct IP address and tries to use
> > port 8673. At first I thought the problem might be that it was
> > appending the domain name (usask.ca) from their old network, but the
> > IP address is still correct so now I'm not sure.
> >
> > If I change the code so Computer2 is the manager and Computer1 is the
> > worker the results are the same. But just like cpi.exe if I confine
> > both the worker and the manager to the local host it performs
> > perfectly. I assume this is an issue with either the way I've set up
> > my network, or the way I've set up MPICH2 on the computers. Does
> > anyone know what would cause an error like this?
> >
>

-- 
Tony Bathgate
BEng Engineering Physics
BSc Computer Science

Master's Candidate
University of Saskatchewan
116 Science Place
Saskatoon SK,
S7N 5E2
966-6452
