[MPICH] Accept failure from forked process
John Robinson
jr at vertica.com
Thu Nov 3 09:22:21 CST 2005
Okay, fixed the subject line.
Also, see additional comment below about the forked processes...
John Robinson wrote:
> Good morning everyone,
>
> New to this list, so please forgive me if this is not quite the right forum
> (and redirect me) - thanks!
>
> I have a setup where the cluster is running a long-lived server process
> that uses MPI_Comm_accept to receive new client connections, and
> single-process clients that call MPI_Comm_connect to submit work.
>
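> For context, the server's port is opened roughly like this (a sketch, not
> the exact code; impl->serverPort is a char buffer of length
> MPI::MAX_PORT_NAME):
>
>     // On the server's root rank: open a port for clients to connect to.
>     // The resulting port name is handed to clients out of band.
>     MPI::Open_port(MPI::INFO_NULL, impl->serverPort);
>     ...
>     // At shutdown, release the port.
>     MPI::Close_port(impl->serverPort);
>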
> The problem happens after the first client process has completed and a
> new one tries to connect. I get the following fatal error from the
> server processes:
>
> MPI_Comm_accept(116): MPI_Comm_accept(port="port#35267$description#jr$",
> MPI_INFO_NULL, root=0, comm=0x84000001, newcomm=0xbf89b370) failed
> MPID_Comm_accept(29):
> MPIDI_CH3_Comm_accept(598):
> MPIDI_CH3I_Add_to_bizcard_cache(58): business card in cache:
> port#35268$description#jr$, business card passed:
> port#35269$description#jr$
>
> The accept code looks like:
>
>     // Intracomm used for new connections
>     impl->acceptComm = MPI::COMM_WORLD.Dup();
>
>     // Result of the Accept() call
>     MPI::Intercomm clientComm;
>     clientComm = impl->acceptComm.Accept(impl->serverPort, MPI_INFO_NULL, 0);
>
> The connect is:
>
>     // Result of the Connect() call
>     MPI::Intercomm clientComm;
>     clientComm = MPI::COMM_SELF.Connect(impl->serverPort, MPI_INFO_NULL, 0);
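>
> For completeness, the client-side teardown is roughly this (a sketch of
> what the client is meant to do; error handling omitted):
>
>     // Sever the client/server intercomm before finalizing. Disconnect()
>     // (rather than Free()) waits for pending communication to complete.
>     clientComm.Disconnect();
>     MPI::Finalize();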
>
> One detail that may be relevant is that the client process is started on
> demand from a master process that forks the process that eventually does
> the connect(). So the original mpd job (the master process) is
> long-lived, but more than one forked process ends up calling
> MPI_Init/MPI_Finalize. Does each of these need to be a separate MPI job?
If I stop and restart the master process between connects, the problem
does not happen.
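The master/fork structure is essentially the skeleton below (simplified;
work_available() is an illustrative placeholder, not the real code):

    #include <mpi.h>
    #include <unistd.h>
    #include <sys/wait.h>

    bool work_available();  // hypothetical: blocks until a request arrives

    // Long-lived master: each unit of work is handled by a freshly forked
    // child that runs a complete MPI lifetime of its own.
    int main() {
        for (;;) {
            if (!work_available())
                continue;
            pid_t pid = fork();
            if (pid == 0) {
                MPI::Init();            // child: its own Init/Finalize
                // ... Connect() to the server, submit work, Disconnect() ...
                MPI::Finalize();
                _exit(0);
            }
            waitpid(pid, NULL, 0);      // reap the finished child
        }
    }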
/jr
---
> MPICH version: mpich2-1.0.2p1
>
> thanks,
> /jr
>