[MPICH2-dev] MPICH2 spawn problem
Rajeev Thakur
thakur at mcs.anl.gov
Thu Feb 15 16:55:23 CST 2007
Florin,
You can fix this problem in 1.0.5p2 by making the following change in
the file src/pm/mpd/mpdman.py. At line 164, add the following two lines:

    elif k.startswith('MPICH_INTERFACE_HOSTNAME'):
        continue    ## already put it in above
So the if statement should look like this:
    if k.startswith('MPI_APPNUM'):
        self.appnum = self.clientPgmEnv[k]    # don't put in application env
    elif k.startswith('MPICH_INTERFACE_HOSTNAME'):
        continue    ## already put it in above
    else:
        cli_env[k] = self.clientPgmEnv[k]
Then run make, make install, and start mpd again.
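As a quick sanity check after rebuilding, a remotely started process can
print the interface hostname it inherited. This is only a minimal sketch
(check_iface.c is a hypothetical helper, not part of the distribution; it
assumes, per the comment in the patch, that mpdman exports
MPICH_INTERFACE_HOSTNAME into each process's environment):

    /* check_iface.c -- hypothetical sanity check: print the interface
     * hostname the mpd manager handed to this process. */
    #include <stdio.h>
    #include <stdlib.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int rank;
        char *iface;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        iface = getenv("MPICH_INTERFACE_HOSTNAME");
        printf("rank %d: MPICH_INTERFACE_HOSTNAME=%s\n",
               rank, iface ? iface : "(unset)");
        MPI_Finalize();
        return 0;
    }

If the fix took effect, each rank launched on a remote machine should
report its own host's interface name rather than the launching host's.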
Rajeev
________________________________
From: owner-mpich2-dev at mcs.anl.gov On Behalf Of Florin Isaila
Sent: Tuesday, February 13, 2007 7:37 AM
To: mpich2-dev at mcs.anl.gov
Subject: [MPICH2-dev] MPICH2 spawn problem
Hi,
Could someone help me with a dynamic processes problem?
The problem occurs with the example provided in the MPICH2
distribution (parent/child). The example works fine when using local
communication, but blocks when I run the program on two machines with:

    mpiexec -n 4 parent
The "child" processes are actually launched (can be seen with ps
command on both machines), but they block all in MPI_Init.
I further simplified the parent/child program in order to isolate the
error, using just one parent and one child on different machines (I
attach the two programs at the end). A strange error occurs:
When the program is modified so that the child is the first one to
send data to the parent, it works! But if the parent sends first with
MPI_Send, it gets:
[florin at compute-0-0 simple_parent_child]$ mpiexec -n 1 parent
[cli_0]: aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)...............................: MPI_Send(buf=0x808d647, count=4, MPI_CHAR, dest=0, tag=0, comm=0x84000000) failed
MPIDI_CH3_Progress_wait(212)................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(772)..:
MPIDI_CH3_Sockconn_handle_connect_event(589): [ch3:sock] failed to connnect to remote process
MPIDU_Socki_handle_connect(791).............: connection failure (set=0,sock=1,errno=111:Connection refused)
rank 0 in job 20 compute-0-0.local_41161 caused collective abort of all ranks
exit status of rank 0: return code 1
I am using mpich2-1.0.5p2. The cpi example runs OK with several nodes.
My cluster has 8 nodes, all PIV, with a CentOS distribution, Python
2.4, and gcc 3.4.
Here are the programs (they fail when run with "mpiexec -n 1 parent",
giving the message above, and work when the communication order is
changed; the swapped lines are shown after the child listing):
PARENT:

#include <stdio.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
    char str[10];
    int err = 0, errcodes[256], rank, nprocs;
    MPI_Comm intercomm;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);
    /* printf("I am parent nr %d\n", rank); */

    /* Spawn one child per parent process. */
    err = MPI_Comm_spawn("child", MPI_ARGV_NULL, nprocs,
                         MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                         &intercomm, errcodes);
    if (err) printf("Error in MPI_Comm_spawn\n");

    /* Parent sends first, then receives (the order that fails). */
    err = MPI_Send("bye", 4, MPI_CHAR, rank, 0, intercomm);
    err = MPI_Recv(str, 3, MPI_CHAR, rank, 0, intercomm,
                   MPI_STATUS_IGNORE);
    printf("Parent %d on %s received from child: %s\n", rank,
           processor_name, str);
    fflush(stdout);

    MPI_Finalize();
    return 0;
}
CHILD:

#include <stdio.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
    MPI_Comm intercomm;
    char str[10];
    int err, rank;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);
    /* printf("I am child nr %d\n", rank); */

    MPI_Comm_get_parent(&intercomm);

    /* Child rank receives from and sends to the parent of the same rank. */
    err = MPI_Recv(str, 4, MPI_CHAR, rank, 0, intercomm,
                   MPI_STATUS_IGNORE);
    printf("Child %d on %s received from parent: %s\n", rank,
           processor_name, str);
    fflush(stdout);
    err = MPI_Send("hi", 3, MPI_CHAR, rank, 0, intercomm);

    MPI_Finalize();
    return 0;
}
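For reference, this is the swapped communication order that works for
me; only these lines change relative to the listings above:

    /* In the parent: receive first, then send. */
    err = MPI_Recv(str, 3, MPI_CHAR, rank, 0, intercomm,
                   MPI_STATUS_IGNORE);
    err = MPI_Send("bye", 4, MPI_CHAR, rank, 0, intercomm);

    /* In the child: send first, then receive. */
    err = MPI_Send("hi", 3, MPI_CHAR, rank, 0, intercomm);
    err = MPI_Recv(str, 4, MPI_CHAR, rank, 0, intercomm,
                   MPI_STATUS_IGNORE);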
Thanks a lot
Florin