[mpich-discuss] spawning processes with MPI_Comm_spawn: Error in MPI_Finalize

Jitendra Kumar jkumar at ncsu.edu
Mon Feb 9 09:29:23 CST 2009


Hi,
I am trying to start a slave process from my code using MPI_Comm_spawn. The 
process starts and runs fine, but it fails at the MPI_Finalize stage. I 
am including excerpts from the parent and slave codes that launch 
and terminate the process. In between, I make several MPI_Send/MPI_Recv 
calls over the intercommunicator, and they all complete properly.

Parent program:
MPI_Info_create(&hostinfo);
MPI_Info_set(hostinfo, "file", "machinefile");
error = MPI_Comm_spawn(command, arg, spawn_size, hostinfo, 0, 
MPI_COMM_SELF, &slaveworld, MPI_ERRCODES_IGNORE);
-------
-------
MPI_Comm_free(&slaveworld);
MPI_Finalize();


Slave program (get the parent communicator):
MPI_Comm_get_parent(&parentcomm);
---------------
---------------
MPI_Comm_free(&parentcomm);
MPI_Finalize();

I get following errors at the end...

rank 0 in job 2674  master_4268   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9
0,1: [cli_0]: aborting job:
0,1: Fatal error in MPI_Finalize: Other MPI error, error stack:
0,1: MPI_Finalize(220).........................: MPI_Finalize failed
0,1: MPI_Finalize(146).........................:
0,1: MPID_Finalize(206)........................: an error occurred while the device was waiting for all open connections to close
0,1: MPIDI_CH3I_Progress(161)..................: handle_sock_op failed
0,1: MPIDI_CH3I_Progress_handle_sock_event(175):
0,1: MPIDU_Socki_handle_read(649)..............: connection failure 
(set=0,sock=1,errno=104:(strerror() not found))
rank 0 in job 9  node2_32773   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9
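
The error stack above reports errno=104 without a description ("strerror() not found"). A minimal sketch for decoding that number locally is given here, assuming a Linux system with a standard C library. The helper names describe_errno and print_errno_from_log are hypothetical and are not part of MPICH.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of MPICH): return the C library's
 * description of an errno value, or a fallback if none is available. */
const char *describe_errno(int err)
{
    const char *msg = strerror(err);
    return (msg != NULL && msg[0] != '\0') ? msg : "(no description available)";
}

/* Print the code copied from the log next to this platform's own
 * ECONNRESET value, so the two can be compared. */
void print_errno_from_log(void)
{
    const int logged = 104; /* value copied from the error stack above */
    printf("errno %d: %s\n", logged, describe_errno(logged));
    printf("ECONNRESET on this system is %d: %s\n",
           ECONNRESET, describe_errno(ECONNRESET));
}
```

On Linux, errno 104 is ECONNRESET ("Connection reset by peer"), which suggests the process at the other end of the socket had already closed it or exited while MPI_Finalize was still waiting for the connection to shut down. The numeric value of ECONNRESET differs on other platforms, so the comparison printed by print_errno_from_log only applies to the system where the log was produced.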

mpich2version:
Version:           1.0.3
Device:            ch3:ssm
Configure Options: --with-device=ch3:ssm --enable-f77 --enable-f90 
--enable-cxx --prefix=/usr/local/mpich2-1.0.3-pathscale-k8

Am I doing something wrong in releasing the communicators? I even tried 
using MPI_Comm_disconnect in place of MPI_Comm_free in both the parent and 
slave codes, but I get the same error. Any pointers to the problem would be 
of great help.

Thanks,
Jitu



