[MPICH2-dev] MPICH2 spawn problem

Rajeev Thakur thakur at mcs.anl.gov
Fri Feb 16 12:50:31 CST 2007


We have also issued a patch release, 1.0.5p3, which fixes this problem.

Rajeev
 

> -----Original Message-----
> From: Rajeev Thakur [mailto:thakur at mcs.anl.gov] 
> Sent: Thursday, February 15, 2007 4:55 PM
> To: 'Florin Isaila'; 'mpich2-dev at mcs.anl.gov'
> Subject: RE: [MPICH2-dev] MPICH2 spawn problem
> 
> Florin,
>        You can fix this problem in 1.0.5p2 by making the 
> following change in the file src/pm/mpd/mpdman.py. At line 
> 164, add the following two lines
> 
>             elif k.startswith('MPICH_INTERFACE_HOSTNAME'):
>                 continue    ## already put it in above
>  
> So the if statement should look like this:
> 
>             if k.startswith('MPI_APPNUM'):
>                 self.appnum = self.clientPgmEnv[k]    # don't put in application env
>             elif k.startswith('MPICH_INTERFACE_HOSTNAME'):
>                 continue    ## already put it in above
>             else:
>                 cli_env[k] = self.clientPgmEnv[k]
> 
> Then run make, make install, and start mpd again.
> 
> Rajeev
> 
> 
> ________________________________
> 
> 	From: owner-mpich2-dev at mcs.anl.gov 
> [mailto:owner-mpich2-dev at mcs.anl.gov] On Behalf Of Florin Isaila
> 	Sent: Tuesday, February 13, 2007 7:37 AM
> 	To: mpich2-dev at mcs.anl.gov
> 	Subject: [MPICH2-dev] MPICH2 spawn problem
> 	
> 	
> 	
> 	Hi
> 
> 	Could someone help me with a dynamic process problem?
> 	
> 	The problem occurs with the parent/child example provided 
> in the MPICH2 distribution. The example works fine with local 
> communication, but blocks when I run the program on two 
> machines with 
> 	
> 	mpiexec -n 4 parent
> 	
> 	The "child" processes are actually launched (they can be 
> seen with the ps command on both machines), but they all block 
> in MPI_Init. 
> 	
> 	I further simplified the parent/child program in order 
> to isolate the error. I use just one parent and one child on 
> different machines (I attach the two programs at the end), 
> and a strange error occurs:
> 	
> 	When the program is modified so that the child is the 
> first one to send data to the parent, it works!
> 	But if the parent sends first with MPI_Send, it gets: 
> 	
> 	  [florin at compute-0-0 simple_parent_child]$ mpiexec -n 1 parent 
> 	[cli_0]: aborting job:
> 	Fatal error in MPI_Send: Other MPI error, error stack:
> 	MPI_Send(173)...............................: MPI_Send(buf=0x808d647, count=4, MPI_CHAR, dest=0, tag=0, comm=0x84000000) failed
> 	MPIDI_CH3_Progress_wait(212)................: an error occurred while handling an event returned by MPIDU_Sock_Wait() 
> 	MPIDI_CH3I_Progress_handle_sock_event(772)..:
> 	MPIDI_CH3_Sockconn_handle_connect_event(589): [ch3:sock] failed to connnect to remote process
> 	MPIDU_Socki_handle_connect(791).............: connection failure (set=0,sock=1,errno=111:Connection refused) 
> 	rank 0 in job 20  compute-0-0.local_41161   caused collective abort of all ranks
> 	  exit status of rank 0: return code 1
> 	 
> 	I am using mpich2-1.0.5p2. The cpi example runs OK 
> with several nodes. 
> 	
> 	My cluster has 8 nodes, all PIV, with a CentOS 
> distribution, Python 2.4, and gcc 3.4.
> 	
> 	Here are the programs (they fail when run with mpiexec 
> -n 1 parent with the message above, and work when the 
> communication order is changed): 
> 	
> 	PARENT:
> 	
> 	#include <stdio.h>
> 	#include "mpi.h"
> 	
> 	int main( int argc, char *argv[] )
> 	{
> 	    char str[10];
> 	    int err=0, errcodes[256], rank, nprocs;
> 	    MPI_Comm intercomm;
> 	    int    namelen;
> 	    char   processor_name[MPI_MAX_PROCESSOR_NAME];
> 	
> 	    MPI_Init(&argc, &argv);
> 	
> 	    MPI_Comm_size(MPI_COMM_WORLD,&nprocs); 
> 	    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 	    MPI_Get_processor_name(processor_name,&namelen); 
> 	
> 	    //printf("I am parent nr %d\n", rank);
> 	
> 	    err = MPI_Comm_spawn("child", MPI_ARGV_NULL, nprocs,
> 	                         MPI_INFO_NULL, 0, MPI_COMM_WORLD,
> 	                         &intercomm, errcodes);  
> 	    if (err) printf("Error in MPI_Comm_spawn\n");
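> 	    /* At this point intercomm connects the parent group to the
> 	       spawned children; ranks on the remote side of the
> 	       intercommunicator are the children's ranks. */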
> 	
> 	        
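> 	    /* Parent sends first: this is the ordering that triggers
> 	       the connection failure reported above. */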
> 	    err = MPI_Send("bye", 4, MPI_CHAR, rank, 0, intercomm); 
> 	    
> 	    err = MPI_Recv(str, 3, MPI_CHAR, rank, 0, intercomm, MPI_STATUS_IGNORE); 
> 	    printf("Parent %d on %s received from child: %s\n", rank, processor_name, str);
> 	    fflush(stdout);
> 	
> 	    MPI_Finalize();
> 	
> 	    return 0;
> 	}
> 	
> 	CHILD: 
> 	#include <stdio.h>
> 	#include "mpi.h"
> 	
> 	int main( int argc, char *argv[] )
> 	{
> 	    MPI_Comm intercomm;
> 	    char str[10];
> 	    int err, rank;
> 	    int    namelen;
> 	    char   processor_name[MPI_MAX_PROCESSOR_NAME];
> 	
> 	    MPI_Init(&argc, &argv); 
> 	
> 	    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 	    MPI_Get_processor_name(processor_name,&namelen);
> 	    //printf("I am child nr %d\n", rank);
> 	    
> 	    MPI_Comm_get_parent(&intercomm); 
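> 	    /* intercomm is the intercommunicator back to the parents; it
> 	       would be MPI_COMM_NULL if this process had not been spawned. */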
> 	
> 	    // Child rank receives from and then sends to the parent rank
> 	
> 	        
> 	    err = MPI_Recv(str, 4, MPI_CHAR, rank, 0, intercomm, MPI_STATUS_IGNORE);
> 	    printf("Child %d on %s received from parent: %s\n", rank, processor_name, str); 
> 	    fflush(stdout);
> 	    err = MPI_Send("hi", 3, MPI_CHAR, rank, 0, intercomm);
> 	    
> 	
> 	    MPI_Finalize();
> 	    return 0;
> 	}
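> 	
> 	For reference, here is a minimal sketch of the swapped 
> ordering that works for me; only the two communication calls 
> change, and the rest of each program stays as above:
> 	
> 	    /* PARENT variant: receive first, then send. */
> 	    err = MPI_Recv(str, 3, MPI_CHAR, rank, 0, intercomm, MPI_STATUS_IGNORE);
> 	    err = MPI_Send("bye", 4, MPI_CHAR, rank, 0, intercomm);
> 	
> 	    /* CHILD variant: send first, then receive. */
> 	    err = MPI_Send("hi", 3, MPI_CHAR, rank, 0, intercomm);
> 	    err = MPI_Recv(str, 4, MPI_CHAR, rank, 0, intercomm, MPI_STATUS_IGNORE);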
> 	
> 	
> 	Thanks a lot
> 	Florin
> 	
> 