[MPICH2-dev] MPICH2 spawn problem

Rajeev Thakur thakur at mcs.anl.gov
Thu Feb 15 16:55:23 CST 2007


Florin,
       You can fix this problem in 1.0.5p2 by making the following change in the file src/pm/mpd/mpdman.py. At line 164, add the following two lines:

            elif k.startswith('MPICH_INTERFACE_HOSTNAME'):
                continue    ## already put it in above
 
So the if statement should look like this:

            if k.startswith('MPI_APPNUM'):
                self.appnum = self.clientPgmEnv[k]    # don't put in application env
            elif k.startswith('MPICH_INTERFACE_HOSTNAME'):
                continue    ## already put it in above
            else:
                cli_env[k] = self.clientPgmEnv[k]

Then run make, make install, and start mpd again.

Rajeev


________________________________

	From: owner-mpich2-dev at mcs.anl.gov On Behalf Of Florin Isaila
	Sent: Tuesday, February 13, 2007 7:37 AM
	To: mpich2-dev at mcs.anl.gov
	Subject: [MPICH2-dev] MPICH2 spawn problem
	
	
	
	Hi

	Could someone help me with a dynamic processes problem?
	
	The problem occurs with the example provided in the MPICH2 distribution (parent/child). The example works fine when using local communication, but blocks when I run the program on two machines with

	
	mpiexec -n 4 parent
	
	The "child" processes are actually launched (they can be seen with the ps command on both machines), but they all block in MPI_Init.
	
	I further simplified the parent/child program in order to isolate the error: I use just one parent and one child on different machines (I attach the two programs at the end). A strange error occurs:
	
	When the program is modified so that the child is the first one to send data to the parent, it works (see the sketch after the child program below)!
	But if the parent sends first with MPI_Send, it gets:
	
	  [florin at compute-0-0 simple_parent_child]$ mpiexec -n 1 parent
	[cli_0]: aborting job:
	Fatal error in MPI_Send: Other MPI error, error stack:
	MPI_Send(173)...............................: MPI_Send(buf=0x808d647, count=4, MPI_CHAR, dest=0, tag=0, comm=0x84000000) failed
	MPIDI_CH3_Progress_wait(212)................: an error occurred while handling an event returned by MPIDU_Sock_Wait()
	MPIDI_CH3I_Progress_handle_sock_event(772)..:
	MPIDI_CH3_Sockconn_handle_connect_event(589): [ch3:sock] failed to connnect to remote process
	MPIDU_Socki_handle_connect(791).............: connection failure (set=0,sock=1,errno=111:Connection refused)
	rank 0 in job 20 compute-0-0.local_41161 caused collective abort of all ranks
	  exit status of rank 0: return code 1
	 
	I am using mpich2-1.0.5p2. The cpi example runs OK with several nodes.
	
	My cluster has 8 nodes, all Pentium 4 machines running CentOS, with Python 2.4 and gcc 3.4.
	
	Here are the programs (they fail when run with mpiexec -n 1 parent with the message above, and work when the communication order is reversed):
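	(The programs are presumably built with the MPICH2 compiler wrapper, e.g. mpicc parent.c -o parent and mpicc child.c -o child, and launched from the directory containing the child binary so that MPI_Comm_spawn("child", ...) can find it.)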
	
	PARENT:
	
	#include <stdio.h>
	#include "mpi.h"
	
	int main( int argc, char *argv[] )
	{
	    char str[10];
	    int err=0, errcodes[256], rank, nprocs;
	    MPI_Comm intercomm;
	    int    namelen;
	    char   processor_name[MPI_MAX_PROCESSOR_NAME];
	
	    MPI_Init(&argc, &argv);
	
	    MPI_Comm_size(MPI_COMM_WORLD,&nprocs); 
	    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	    MPI_Get_processor_name(processor_name,&namelen); 
	
	    //printf("I am parent nr %d\n", rank);
	
	    err = MPI_Comm_spawn("child", MPI_ARGV_NULL, nprocs,
	                         MPI_INFO_NULL, 0, MPI_COMM_WORLD,
	                         &intercomm, errcodes);  
	    if (err) printf("Error in MPI_Comm_spawn\n");
	
	        
	    err = MPI_Send("bye", 4, MPI_CHAR, rank, 0, intercomm); 
	    
	    err = MPI_Recv(str, 3, MPI_CHAR, rank, 0, intercomm, MPI_STATUS_IGNORE);
	    printf("Parent %d on %s received from child: %s\n", rank, processor_name, str);
	    fflush(stdout);
	
	    MPI_Finalize();
	
	    return 0;
	}
	
	CHILD: 
	#include <stdio.h>
	#include "mpi.h"
	
	int main( int argc, char *argv[] )
	{
	    MPI_Comm intercomm;
	    char str[10];
	    int err, rank;
	    int    namelen;
	    char   processor_name[MPI_MAX_PROCESSOR_NAME];
	
	    MPI_Init(&argc, &argv); 
	
	    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	    MPI_Get_processor_name(processor_name,&namelen);
	    //printf("I am child nr %d\n", rank);
	    
	    MPI_Comm_get_parent(&intercomm); 
	
	    // Child rank sends to and receives from the parent of the same rank
	
	        
	    err = MPI_Recv(str, 4, MPI_CHAR, rank, 0, intercomm, MPI_STATUS_IGNORE);
	    printf("Child %d on %s received from parent: %s\n", rank, processor_name, str);
	    fflush(stdout);
	    err = MPI_Send("hi", 3, MPI_CHAR, rank, 0, intercomm);
	    
	
	    MPI_Finalize();
	    return 0;
	}
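	
	For reference, a minimal sketch of the reordering that works: only the communication calls are swapped (the variables are the same as in the listings above; everything else is unchanged):
	
	    /* child: send first, then receive */
	    err = MPI_Send("hi", 3, MPI_CHAR, rank, 0, intercomm);
	    err = MPI_Recv(str, 4, MPI_CHAR, rank, 0, intercomm, MPI_STATUS_IGNORE);
	
	    /* parent: receive first, then send */
	    err = MPI_Recv(str, 3, MPI_CHAR, rank, 0, intercomm, MPI_STATUS_IGNORE);
	    err = MPI_Send("bye", 4, MPI_CHAR, rank, 0, intercomm);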
	
	
	Thanks a lot
	Florin
	





