[mpich-discuss] spawning processes with MPI_Comm_spawn: Error in MPI_Finalize

Jayesh Krishna jayesh at mcs.anl.gov
Mon Feb 9 09:53:11 CST 2009


 Hi,
  Try out the latest stable release of MPICH2 (1.0.8 - available at
http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads).
The default channel in 1.0.8 is nemesis, which will give you better
performance than ssm (like ssm, it uses shared memory for communication
between local processes and TCP for communication between non-local
processes).

Regards,
Jayesh

-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Jitendra Kumar
Sent: Monday, February 09, 2009 9:29 AM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] spawning processes with MPI_Comm_spawn: Error in
MPI_Finalize

Hi,
I am trying to start a slave process from my code using MPI_Comm_spawn. The
process starts and runs fine, except that it fails at the MPI_Finalize
stage. I am including excerpts from the parent and slave codes showing how
the process is launched and terminated. I make several MPI_Send/Recv calls
over the intercommunicator, and they all complete properly.

Parent program:
MPI_Info_create(&hostinfo);
MPI_Info_set(hostinfo, "file", "machinefile");
error = MPI_Comm_spawn(command, arg, spawn_size, hostinfo, 0, MPI_COMM_SELF,
                       &slaveworld, MPI_ERRCODES_IGNORE);
-------
-------
MPI_Comm_free(&slaveworld);
MPI_Finalize();

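For reference, a stripped-down, self-contained version of the parent along
these lines would look roughly like the code below. This is a sketch, not
the actual application code: the child executable name "./slave", the slave
count, and the payload are made up for illustration.

/* Sketch only: stripped-down parent with a hypothetical child executable
   "./slave" and a made-up payload. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm slaveworld;
    MPI_Info hostinfo;
    int spawn_size = 2;              /* number of slaves to spawn */
    int data = 42;                   /* made-up payload */

    MPI_Init(&argc, &argv);

    MPI_Info_create(&hostinfo);
    MPI_Info_set(hostinfo, "file", "machinefile");

    MPI_Comm_spawn("./slave", MPI_ARGV_NULL, spawn_size, hostinfo, 0,
                   MPI_COMM_SELF, &slaveworld, MPI_ERRCODES_IGNORE);
    MPI_Info_free(&hostinfo);

    /* several MPI_Send/Recv calls over the intercommunicator; these all
       complete fine */
    MPI_Send(&data, 1, MPI_INT, 0, 0, slaveworld);

    MPI_Comm_free(&slaveworld);
    MPI_Finalize();                  /* the error is reported here */
    return 0;
}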

Slave program (get the parent communicator):
MPI_Comm_get_parent(&parentcomm);
---------------
---------------
MPI_Comm_free(&parentcomm);
MPI_Finalize();
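
The matching stripped-down slave (again a sketch, not the real code) would
be roughly:

/* Sketch only: the matching stripped-down slave. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm parentcomm;
    int rank, data;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parentcomm);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (parentcomm == MPI_COMM_NULL) {
        fprintf(stderr, "not started with MPI_Comm_spawn\n");
        MPI_Finalize();
        return 1;
    }

    /* slave rank 0 receives from parent rank 0 over the intercommunicator */
    if (rank == 0)
        MPI_Recv(&data, 1, MPI_INT, 0, 0, parentcomm, MPI_STATUS_IGNORE);

    MPI_Comm_free(&parentcomm);
    MPI_Finalize();                  /* error reported here as well */
    return 0;
}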

I get the following errors at the end...

rank 0 in job 2674  master_4268   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9
0,1: [cli_0]: aborting job:
0,1: Fatal error in MPI_Finalize: Other MPI error, error stack:
0,1: MPI_Finalize(220).........................: MPI_Finalize failed
0,1: MPI_Finalize(146).........................:
0,1: MPID_Finalize(206)........................: an error occurred while 
the device was waiting for all open connections to close
0,1: MPIDI_CH3I_Progress(161)..................: handle_sock_op failed
0,1: MPIDI_CH3I_Progress_handle_sock_event(175):
0,1: MPIDU_Socki_handle_read(649)..............: connection failure 
(set=0,sock=1,errno=104:(strerror() not found))
rank 0 in job 9  node2_32773   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9

mpich2version:
Version:           1.0.3
Device:            ch3:ssm
Configure Options: --with-device=ch3:ssm --enable-f77 --enable-f90 
--enable-cxx --prefix=/usr/local/mpich2-1.0.3-pathscale-k8

Am I doing something wrong in releasing the communicators? I even tried 
using MPI_Comm_disconnect in place of MPI_Comm_free in both the parent and 
slave codes (roughly as sketched below), but got the same error. Any 
pointers to the problem would be of great help.
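
Concretely, the MPI_Comm_disconnect variant changes only the teardown. On
the parent side it looks roughly like the sketch below (the slave side does
the same substitution for parentcomm); the child name, slave count, and
MPI_INFO_NULL are placeholders, not the actual code:

/* Sketch of the teardown variant: MPI_Comm_disconnect instead of
   MPI_Comm_free (only the parent side shown). */
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm slaveworld;

    MPI_Init(&argc, &argv);
    MPI_Comm_spawn("./slave", MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                   MPI_COMM_SELF, &slaveworld, MPI_ERRCODES_IGNORE);

    /* ... same MPI_Send/Recv traffic over slaveworld ... */

    MPI_Comm_disconnect(&slaveworld);   /* collective; waits for pending
                                           communication before freeing */
    MPI_Finalize();                     /* same error as with MPI_Comm_free */
    return 0;
}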

Thanks,
Jitu


