[MPICH2-dev] MPICH2 spawn problem
Rajeev Thakur
thakur at mcs.anl.gov
Fri Feb 16 12:50:31 CST 2007
We have also issued a patch release, 1.0.5p3, which fixes this problem.
Rajeev
> -----Original Message-----
> From: Rajeev Thakur [mailto:thakur at mcs.anl.gov]
> Sent: Thursday, February 15, 2007 4:55 PM
> To: 'Florin Isaila'; 'mpich2-dev at mcs.anl.gov'
> Subject: RE: [MPICH2-dev] MPICH2 spawn problem
>
> Florin,
> You can fix this problem in 1.0.5p2 by making the
> following change in the file src/pm/mpd/mpdman.py. At line
> 164, add the following two lines:
>
>     elif k.startswith('MPICH_INTERFACE_HOSTNAME'):
>         continue  ## already put it in above
>
> So the if statement should look like this:
>
>     if k.startswith('MPI_APPNUM'):
>         self.appnum = self.clientPgmEnv[k]  # don't put in application env
>     elif k.startswith('MPICH_INTERFACE_HOSTNAME'):
>         continue  ## already put it in above
>     else:
>         cli_env[k] = self.clientPgmEnv[k]
>
> Then run make, make install, and start mpd again.
>
> Rajeev
>
>
> ________________________________
>
> From: owner-mpich2-dev at mcs.anl.gov
> [mailto:owner-mpich2-dev at mcs.anl.gov] On Behalf Of Florin Isaila
> Sent: Tuesday, February 13, 2007 7:37 AM
> To: mpich2-dev at mcs.anl.gov
> Subject: [MPICH2-dev] MPICH2 spawn problem
>
>
>
> Hi
>
> Could someone help me with a dynamic process problem?
>
> The problem occurs with the example provided in the
> MPICH2 distribution (parent/child). The example works fine
> when using local communication, but blocks when I run
> the program on two machines with
>
> mpiexec -n 4 parent
>
> The "child" processes are actually launched (they can be
> seen with the ps command on both machines), but they all
> block in MPI_Init.
>
> I further simplified the parent/child program in order
> to isolate the error. I use just one parent and one child on
> different machines (I attach the two programs at the end),
> and a strange error occurs:
>
> When the program is modified so that the child is the
> first one to send data to the parent, it works!
> But if the parent sends first with MPI_Send, it gets:
>
> [florin at compute-0-0 simple_parent_child]$ mpiexec -n 1 parent
> [cli_0]: aborting job:
> Fatal error in MPI_Send: Other MPI error, error stack:
> MPI_Send(173)...............................: MPI_Send(buf=0x808d647, count=4, MPI_CHAR, dest=0, tag=0, comm=0x84000000) failed
> MPIDI_CH3_Progress_wait(212)................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(772)..:
> MPIDI_CH3_Sockconn_handle_connect_event(589): [ch3:sock] failed to connnect to remote process
> MPIDU_Socki_handle_connect(791).............: connection failure (set=0,sock=1,errno=111:Connection refused)
> rank 0 in job 20 compute-0-0.local_41161 caused collective abort of all ranks
> exit status of rank 0: return code 1
>
> I am using mpich2-1.0.5p2. The cpi example runs OK
> with several nodes.
>
> My cluster has 8 nodes, all PIV machines running a CentOS
> distribution, with Python 2.4 and gcc 3.4.
>
> Here are the programs (they fail with the message above
> when run with mpiexec -n 1 parent, and work when the
> communication order is changed):
>
> PARENT:
>
> #include <stdio.h>
> #include "mpi.h"
>
> int main( int argc, char *argv[] )
> {
>     char str[10];
>     int err=0, errcodes[256], rank, nprocs;
>     MPI_Comm intercomm;
>     int namelen;
>     char processor_name[MPI_MAX_PROCESSOR_NAME];
>
>     MPI_Init(&argc, &argv);
>
>     MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Get_processor_name(processor_name,&namelen);
>
>     //printf("I am parent nr %d\n", rank);
>
>     err = MPI_Comm_spawn("child", MPI_ARGV_NULL, nprocs,
>                          MPI_INFO_NULL, 0, MPI_COMM_WORLD,
>                          &intercomm, errcodes);
>     if (err) printf("Error in MPI_Comm_spawn\n");
>
>     err = MPI_Send("bye", 4, MPI_CHAR, rank, 0, intercomm);
>
>     err = MPI_Recv(str, 3, MPI_CHAR, rank, 0,
>                    intercomm, MPI_STATUS_IGNORE);
>     printf("Parent %d on %s received from child: %s\n",
>            rank, processor_name, str);
>     fflush(stdout);
>
>     MPI_Finalize();
>
>     return 0;
> }
>
> CHILD:
> #include <stdio.h>
> #include "mpi.h"
>
> int main( int argc, char *argv[] )
> {
>     MPI_Comm intercomm;
>     char str[10];
>     int err, rank;
>     int namelen;
>     char processor_name[MPI_MAX_PROCESSOR_NAME];
>
>     MPI_Init(&argc, &argv);
>
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Get_processor_name(processor_name,&namelen);
>     //printf("I am child nr %d\n", rank);
>
>     MPI_Comm_get_parent(&intercomm);
>
>     //Child rank sends to and receives from the parent with the same rank
>
>     err = MPI_Recv(str, 4, MPI_CHAR, rank, 0,
>                    intercomm, MPI_STATUS_IGNORE);
>     printf("Child %d on %s received from parent: %s\n",
>            rank, processor_name, str);
>     fflush(stdout);
>     err = MPI_Send("hi", 3, MPI_CHAR, rank, 0, intercomm);
>
>     MPI_Finalize();
>     return 0;
> }
>
>
> Thanks a lot
> Florin
>
>
>
>
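For anyone applying the mpdman.py change quoted above by hand, here is a minimal standalone sketch of what the patched if statement does. It is not the actual mpdman.py code: the function name build_client_env, its two arguments, and the host names are made up for illustration, and the real code works on self.clientPgmEnv, cli_env and self.appnum inside a method. The reading that the skipped variable keeps the value set earlier is based only on the "already put it in above" comment in the patch.

```python
def build_client_env(client_pgm_env, preset_env):
    """Merge the program's environment into the client environment.

    Hypothetical standalone version of the patched loop; names are
    assumptions, not the real mpdman.py interface.
    """
    cli_env = dict(preset_env)  # values that were "already put in above"
    appnum = None
    for k in client_pgm_env:
        if k.startswith('MPI_APPNUM'):
            # Recorded separately, not exported to the application env.
            appnum = client_pgm_env[k]
        elif k.startswith('MPICH_INTERFACE_HOSTNAME'):
            # The two added lines: keep the value already in cli_env
            # instead of overwriting it with the incoming one.
            continue
        else:
            cli_env[k] = client_pgm_env[k]
    return cli_env, appnum


if __name__ == '__main__':
    # Made-up values, for illustration only.
    preset = {'MPICH_INTERFACE_HOSTNAME': 'node-b'}
    incoming = {
        'MPI_APPNUM': '0',
        'MPICH_INTERFACE_HOSTNAME': 'node-a',
        'PATH': '/usr/bin',
    }
    env, appnum = build_client_env(incoming, preset)
    print(env, appnum)
```

Without the elif branch, the else branch would copy the incoming MPICH_INTERFACE_HOSTNAME over the one set earlier, which is the overwrite the patch avoids.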