[MPICH2-dev] MPICH2 spawn problem
Rajeev Thakur
thakur at mcs.anl.gov
Tue Feb 13 14:49:32 CST 2007
I am able to reproduce this error. We will look into it.
Rajeev
_____
From: owner-mpich2-dev at mcs.anl.gov [mailto:owner-mpich2-dev at mcs.anl.gov] On
Behalf Of Florin Isaila
Sent: Tuesday, February 13, 2007 7:37 AM
To: mpich2-dev at mcs.anl.gov
Subject: [MPICH2-dev] MPICH2 spawn problem
Hi
Could someone help me with a dynamic processes problem?
The problem occurs with the example provided in the MPICH2 distribution
(parent/child). The example works fine when using local communication, but
blocks when I run the program on two machines with
mpiexec -n 4 parent
The "child" processes are actually launched (they can be seen with the ps
command on both machines), but they all block in MPI_Init.
I further simplified the parent/child program in order to isolate the
error. I use just one parent and one child on different machines (the two
programs are attached at the end). A strange error occurs:
When the program is modified so that the child is the first to send
data to the parent, it works!
But if the parent sends first with MPI_Send, it gets:
[florin at compute-0-0 simple_parent_child]$ mpiexec -n 1 parent
[cli_0]: aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)...............................: MPI_Send(buf=0x808d647,
count=4, MPI_CHAR, dest=0, tag=0, comm=0x84000000) failed
MPIDI_CH3_Progress_wait(212)................: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(772)..:
MPIDI_CH3_Sockconn_handle_connect_event(589): [ch3:sock] failed to connnect
to remote process
MPIDU_Socki_handle_connect(791).............: connection failure
(set=0,sock=1,errno=111:Connection refused)
rank 0 in job 20 compute-0-0.local_41161 caused collective abort of all
ranks
exit status of rank 0: return code 1
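The errno=111 in the last frame of the error stack above is, on Linux, ECONNREFUSED: a TCP connect reached an address where nothing was accepting connections. The sketch below reproduces that same errno outside MPI, on the loopback interface, by connecting to a port that is bound but never put into the listening state. The helper name probe_refused_errno is hypothetical and is not part of MPICH2; the sketch only illustrates what the socket layer reported and does not explain why the remote MPICH2 process was not accepting the connection.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not part of MPICH2): returns the errno produced by
 * connecting to a loopback TCP port that is bound but not listening, or -1
 * if the setup fails or the connect unexpectedly succeeds. */
int probe_refused_errno(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int result = -1;

    /* This socket only reserves a port; listen() is deliberately not called. */
    int holder = socket(AF_INET, SOCK_STREAM, 0);
    if (holder < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    if (bind(holder, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
        getsockname(holder, (struct sockaddr *)&addr, &len) == 0) {
        int client = socket(AF_INET, SOCK_STREAM, 0);
        if (client >= 0) {
            /* Nobody is listening on addr, so the connect should be refused. */
            if (connect(client, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                result = errno;
            close(client);
        }
    }

    close(holder);
    return result;
}
```

Calling probe_refused_errno() from a small main and passing the result to strerror() should give the same "Connection refused" text that appears in the MPIDU_Socki_handle_connect frame.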
I am using mpich2-1.0.5p2. The cpi example runs OK on several nodes.
My cluster has 8 nodes, all PIV with a CentOS distribution, Python 2.4 and
gcc 3.4.
Here are the programs (they fail when run with mpiexec -n 1 parent, giving
the message above, and work when the communication order is changed):
PARENT:
#include <stdio.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
    char str[10];
    int err=0, errcodes[256], rank, nprocs;
    MPI_Comm intercomm;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name,&namelen);
    //printf("I am parent nr %d\n", rank);
    err = MPI_Comm_spawn("child", MPI_ARGV_NULL, nprocs,
                         MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                         &intercomm, errcodes);
    if (err) printf("Error in MPI_Comm_spawn\n");
    err = MPI_Send("bye", 4, MPI_CHAR, rank, 0, intercomm);
    err = MPI_Recv(str, 3, MPI_CHAR, rank, 0, intercomm, MPI_STATUS_IGNORE);
    printf("Parent %d on %s received from child: %s\n", rank, processor_name,
           str);
    fflush(stdout);
    MPI_Finalize();
    return 0;
}
CHILD:
#include <stdio.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
    MPI_Comm intercomm;
    char str[10];
    int err, rank;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name,&namelen);
    //printf("I am child nr %d\n", rank);
    MPI_Comm_get_parent(&intercomm);
    //Child rank sends to and receives from the parent rank
    err = MPI_Recv(str, 4, MPI_CHAR, rank, 0, intercomm, MPI_STATUS_IGNORE);
    printf("Child %d on %s received from parent: %s\n", rank, processor_name,
           str);
    fflush(stdout);
    err = MPI_Send("hi", 3, MPI_CHAR, rank, 0, intercomm);
    MPI_Finalize();
    return 0;
}
Thanks a lot
Florin