[mpich-discuss] child node can't contact parent?
Tony Bathgate
tony.bathgate at usask.ca
Wed Aug 27 18:27:49 CDT 2008
Hi,
Thanks for the reply.
I had actually already pinged Computer1 from Computer2 and vice versa.
The ping works fine. I tried using hostname with mpiexec like you
suggested and it works fine too, from both computers. Now I'm baffled.
I also tried your Hello World program and it crashed. Here are the error
messages I got:
C:\helloworld\Debug\> mpiexec -hosts 2 192.168.5.100 192.168.5.200 helloworld.exe
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize<255>............: MPI_Finalize failed
MPI_Finalize<154>............:
MPID_Finalize<94>............:
MPI_Barrier<406>.............: MPI_Barrier <comm=0x44000002>
MPIR_Barrier<77>.............:
MPIC_Sendrecv<120>...........:
MPID_Isend<103>..............: failure occured while attempting to send
an eager message
MPIDI_CH3_iSend<172>.........:
MPIDI_CH3I_Sock_connect<1191>: unable to connect to rank 0 using
business card <port=8673 description=computer1.usask.ca
ifname=192.168.5.100>
MPIDU_Sock_post_connect<1244>: gethostbyname failed, The requested name
is valid, but no data of the requested type was found. <errno 11004>
job aborted:
rank: node: exit code[: error message]
0: 192.168.5.100: 1
1: 192.168.5.200: 1: Fatal error in MPI_Finalize: Other MPI error, error
stack:
MPI_Finalize<255>............: MPI_Finalize failed
MPI_Finalize<154>............:
MPID_Finalize<94>............:
MPI_Barrier<406>.............: MPI_Barrier <comm=0x44000002>
MPIR_Barrier<77>.............:
MPIC_Sendrecv<120>...........:
MPID_Isend<103>..............: failure occured while attempting to send
an eager message
MPIDI_CH3_iSend<172>.........:
MPIDI_CH3I_Sock_connect<1191>: unable to connect to rank 1 using
business card <port=8673 description=computer1.usask.ca
ifname=192.168.5.100>
MPIDU_Sock_post_connect<1244>: gethostbyname failed, The requested name
is valid, but no data of the requested type was found. <errno 11004>
So it seems to me that it can execute programs remotely, but not when the
program relies on the MPICH2 C libraries. Does that make sense, and how
could it be remedied?
Thanks again,
Tony
Jayesh Krishna wrote:
>
> Hi,
> Looks like something is wrong with the setup of your machines.
>
> # Can you ping from one machine to the other ?
>
> - From Computer1 try pinging Computer2
> - From Computer2 try pinging Computer1
>
> # Start debugging by running a non-MPI program (like hostname)
>
> mpiexec -hosts 2 IPAddress_Of_Computer1 IPAddress_Of_Computer2 hostname
>
> # Then debug with a simple hello world program (don't debug your setup
> with a complex program)
>
> ----------------- hello world ---------------
> #include <stdio.h>
> #include "mpi.h"
>
> int main(int argc, char *argv[]){
>     int rank=-1;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     printf("[%d] Hello world\n", rank);
>     MPI_Finalize();
>     return 0;
> }
> ----------------- hello world ---------------
>
> Let us know the results.
>
> Regards,
> Jayesh
>
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Tony Bathgate
> Sent: Wednesday, August 27, 2008 3:31 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] child node can't contact parent?
>
> Hi All,
>
> I apologize in advance for the length of this email; I'm new to the
> world of MPI and I want to include everything that might be relevant.
> I have the Win32 IA32 binary of MPICH2 installed on two machines.
> They are running Windows XP Pro. x64 Edition with Service Pack 2 and
> they each have an Intel Xeon processor. To simplify things I took
> them off our network, gave them their own router, and dropped their
> Windows firewalls. I have assigned the machines static IPs with the
> router (192.168.5.100 for Computer1, and 192.168.5.200 for
> Computer2). I've registered the local Administrator accounts (which
> have identical passwords and credentials) with mpiexec on each
> machine. And everything below was attempted from the Administrator
> account.
>
> I've tried running the cpi.exe example but it just hangs:
>
> C:\Program Files (x86)\MPICH2\examples> mpiexec -hosts 2 Computer1 Computer2 .\cpi.exe
> Enter the number of intervals: (0 quits) 1
>
> (here I waited about 20 minutes, then Ctrl+C)
>
> mpiexec aborting job
>
> job aborted:
> rank: node: exit code[: error message]
> 0: Computer1: 123: mpiexec aborting job
> 1: Computer2: 123
>
> It runs perfectly fine if I run it with the -localonly option.
> To explore this issue I wrote a simple program that uses
> MPI_Comm_spawn to spawn a worker program. The manager then sends the
> worker a message and they both exit. The manager node runs the code
> that follows here:
> #include <mpi.h>
> #include <stdio.h>
>
> int main (int argc, char* argv[])
> {
>     int someVariable = 10;
>
>     MPI_Info info;
>     MPI_Comm workercomm;
>     MPI_Request request;
>     MPI_Status status;
>
>     MPI_Init( &argc, &argv );
>
>     fprintf( stdout, "In Master - someVariable = %i \n", someVariable );
>     fflush( stdout );
>
>     /* Ask for the worker to be started on Computer2. */
>     MPI_Info_create( &info );
>     MPI_Info_set( info, "host", "Computer2" );
>     MPI_Comm_spawn( "C:\\MPIworker\\Debug\\MPIworker.exe", MPI_ARGV_NULL,
>                     1, info, 0, MPI_COMM_SELF, &workercomm,
>                     MPI_ERRCODES_IGNORE );
>
>     MPI_Info_free( &info );
>
>     /* Send to rank 0 of the spawned group over the intercommunicator. */
>     MPI_Isend( &someVariable, 1, MPI_INT, 0, 0, workercomm, &request );
>     /* MPI_Waitall takes arrays, so pass the addresses of the single
>        request and status. */
>     MPI_Waitall( 1, &request, &status );
>
>     fprintf(stdout,"Done sending\n");
>     fflush(stdout);
>
>     MPI_Finalize();
>     return 0;
> }
> The worker code follows here:
> #include <mpi.h>
> #include <stdio.h>
>
> int main (int argc, char* argv[])
> {
>     int someVariable = 0;
>     MPI_Comm parentcomm;
>     MPI_Request request;
>     MPI_Status status;
>
>     MPI_Init( &argc, &argv );
>
>     fprintf(stdout, "In Worker: Before receive - someVariable = %i \n",
>             someVariable);
>     fflush( stdout );
>
>     MPI_Comm_get_parent( &parentcomm );
>     MPI_Irecv( &someVariable, 1, MPI_INT, 0, 0, parentcomm, &request );
>
>     MPI_Wait( &request, &status );
>     fprintf( stdout, "After receive - someVariable = %i\n", someVariable );
>     fflush( stdout );
>
>     MPI_Finalize();
>     return 0;
> }
>
> When I run this code I get the following results:
> C:\MPImanager\Debug\> mpiexec -n 1 MPImanager.exe
> In Master - someVariable = 10
> Fatal error in MPI_Init: Other MPI error, error stack:
> MPIR_Init_thread<294>............................:Initialization
> failed
> MPID_Init<242>...................................:Spawned process
> group was unable to connect back to parent on port <tag=0 port=8673
> description=computer1.usask.ca ifname=192.168.5.100>
> MPID_Comm_connect<187>...........................:
> MPIDI_Comm_connect<369>..........................:
> MPIDI_Create_inter_root_communicator_connect<133>:
> MPIDI_CH3I_Connect_to_root_sock<289>.............:
> MPIDU_Sock_post_connect<1228>....................: unable to
> connect to computer1.usask.ca on port 8673, exhuasted all endpoints
> <errno -1>
> MPIDU_Sock_post_connect<1244>....................: gethostbyname
> failed, The requested name is valid, but no data of the requested type
> was found. <errno 11004>
>
> Job aborted:
> rank: node: exit code[: error message]
> 0: computer2: 1: fatal error in MPI_Init: other MPI error, error
> stack:
> MPIR_Init_thread<294>............................:Initialization
> failed
> MPID_Init<242>...................................:Spawned process
> group was unable to connect back to parent on port <tag=0 port=8673
> description=computer1.usask.ca ifname=192.168.5.100>
> MPID_Comm_connect<187>...........................:
> MPIDI_Comm_connect<369>..........................:
> MPIDI_Create_inter_root_communicator_connect<133>:
> MPIDI_CH3I_Connect_to_root_sock<289>.............:
> MPIDU_Sock_post_connect<1228>....................: unable to
> connect to computer1.usask.ca on port 8673, exhuasted all endpoints
> <errno -1>
> MPIDU_Sock_post_connect<1244>....................: gethostbyname
> failed, The requested name is valid, but no data of the requested type
> was found. <errno 11004>
>
> (Here I waited several minutes before pressing ctrl+c)
>
> mpiexec aborting job ...
>
> (Here I waited several more minutes before pressing ctrl+c and
> returning to the command prompt)
>
> So the program is able to spawn a process on the worker, but MPI_Init
> fails when the worker is unable to contact the manager node. The
> error stack shows that it has the correct IP address and tries to use
> port 8673. At first I thought the problem might be that it was
> appending the domain name (usask.ca) from their old network, but the
> IP address is still correct so now I'm not sure.
>
> If I change the code so Computer2 is the manager and Computer1 is the
> worker, the results are the same. But, just like cpi.exe, if I confine
> both the worker and the manager to the local host, it runs
> perfectly. I assume this is an issue with either the way I've set up
> my network, or the way I've set up MPICH2 on the computers. Does
> anyone know what would cause an error like this?
>