[mpich-discuss] spawning processes with MPI_Comm_spawn: Error in MPI_Finalize
Jitendra Kumar
jkumar at ncsu.edu
Mon Feb 9 09:29:23 CST 2009
Hi,
I am trying to start a slave process from my code using MPI_Comm_spawn. The
process starts and runs fine, except that it fails at the MPI_Finalize stage. I
am including excerpts from the parent and slave codes that launch
and terminate the process. Several MPI_Send/MPI_Recv calls over the
intercommunicator complete properly before that point.
Parent program:
MPI_Info_create(&hostinfo);
MPI_Info_set(hostinfo, "file", "machinefile");
error = MPI_Comm_spawn(command, arg, spawn_size, hostinfo, 0,
                       MPI_COMM_SELF, &slaveworld, MPI_ERRCODES_IGNORE);
-------
-------
MPI_Comm_free(&slaveworld);
MPI_Finalize();
Slave program (get the parent communicator):
MPI_Comm_get_parent(&parentcomm);
---------------
---------------
MPI_Comm_free(&parentcomm);
MPI_Finalize();
I get the following errors at the end:
rank 0 in job 2674 master_4268 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
0,1: [cli_0]: aborting job:
0,1: Fatal error in MPI_Finalize: Other MPI error, error stack:
0,1: MPI_Finalize(220).........................: MPI_Finalize failed
0,1: MPI_Finalize(146).........................:
0,1: MPID_Finalize(206)........................: an error occurred while the device was waiting for all open connections to close
0,1: MPIDI_CH3I_Progress(161)..................: handle_sock_op failed
0,1: MPIDI_CH3I_Progress_handle_sock_event(175):
0,1: MPIDU_Socki_handle_read(649)..............: connection failure (set=0,sock=1,errno=104:(strerror() not found))
rank 0 in job 9 node2_32773 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
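Since the log reports errno=104 without a strerror() description, the value can be decoded by hand. The following is only a minimal sketch, assuming a Linux host, where 104 is ECONNRESET; the helper names describe_errno and is_connection_reset are hypothetical and not part of MPICH.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: returns the C library's text for an errno value. */
const char *describe_errno(int err)
{
    return strerror(err);
}

/* Hypothetical helper: returns 1 if err is this platform's ECONNRESET,
   otherwise 0. On Linux, ECONNRESET has the numeric value 104. */
int is_connection_reset(int err)
{
    return err == ECONNRESET;
}
```

On Linux with glibc, describe_errno(104) should report that the connection was reset by the peer. That would be consistent with one side of the intercommunicator closing its socket, or being killed, while the other side was still waiting in MPI_Finalize.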
mpich2version:
Version: 1.0.3
Device: ch3:ssm
Configure Options: --with-device=ch3:ssm --enable-f77 --enable-f90
--enable-cxx --prefix=/usr/local/mpich2-1.0.3-pathscale-k8
Am I doing something wrong in releasing the communicators? I even tried
using MPI_Comm_disconnect in place of MPI_Comm_free in both the parent and
slave codes, but I get the same error. Any pointers to the problem would be of
great help.
Thanks,
Jitu