[mpich-discuss] Problem with dynamic spawn
Einar Sørheim
einar at ife.no
Tue Sep 13 04:09:51 CDT 2011
Hi
We have a FEM model where we have started using Trilinos' ML multigrid
solver as an alternative to our CG solver and Intel MKL PARDISO.
Trilinos uses MPI for running in parallel; the rest of our program uses
OpenMP, and the whole structure of our software is based on the OpenMP
paradigm, where new threads are created when needed.
(Our platform is Windows XP 64, MS C++, Intel Fortran.)
MPI also has such a mechanism, through MPI_Comm_spawn.
This seems to work well for 4 processes, but going down to 2 it fails.
We have tried updating to the latest MPICH version (1.4.1p1), but we
get the same error.
To be more specific, the mother process together with the newly spawned
processes makes up the pool available to Trilinos. When spawning only
one additional process, to make up a pool of 2, we get the following
error:
Fatal error in PMPI_Waitall: Other MPI error, error stack:
PMPI_Waitall(274)................................: MPI_Waitall(count=1,
req_array=0000000003354100, status_array=00000000005FF9C0) failed
MPIR_Waitall_impl(121)...........................:
MPIDI_CH3I_Progress(402).........................:
MPID_nem_mpich2_test_recv(747)...................:
MPID_nem_newtcp_module_poll(37)..................:
MPID_nem_newtcp_module_connpoll(2656)............:
MPID_nem_newtcp_module_recv_success_handler(2339):
MPID_nem_newtcp_module_post_readv_ex(330)........:
MPIU_SOCKW_Readv_ex(392).........................: read from socket
failed, An operation on a socket could not be performed because the
system lacked sufficient buffer space or because a queue was full.
(errno 10055)
unable to read the cmd header on the pmi context, Error = -1
.
Error posting readv, An existing connection was forcibly closed by the
remote host.(10054)
The spawning and merging are done by the following code.
Mother process:
MPI_Initialized(&initialized);
if (initialized && (numparalells > 1))
{
    numworkers = numparalells - 1;
    // spawn the workers from a duplicate of MPI_COMM_SELF
    MPI_Comm_dup(MPI_COMM_SELF, &selfcomm);
    MPI_Comm_spawn("trilin_slave", MPI_ARGV_NULL, numworkers,
                   MPI_INFO_NULL, 0, selfcomm, &workercomm,
                   MPI_ERRCODES_IGNORE);
    // merge the intercommunicator so mother + workers form one pool
    MPI_Intercomm_merge(workercomm, 0, &fullcomm);
    Epetra_MpiComm Comm(fullcomm);
    trilin_solve(Nnodes, diag_el, offdiag_el, nu_rows, rowstart,
                 column_index,
                 X, B, solvertype, maxit, relerror,
                 smoothertype, smoothersweeps, mgsolvertype, coarsegridmax,
                 aggregationtype, smootherdampingfactor,
                 r_xcoord_, r_ycoord_, r_zcoord_, Comm);
}
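For reference, here is a stripped-down sketch of the same spawn call
with the per-process error codes collected instead of ignored (the
errcodes vector and the MPI_ERRORS_RETURN handler are only for
illustration, not part of our production code):

    // Sketch only: collect the spawn error codes instead of passing
    // MPI_ERRCODES_IGNORE; assumes numworkers has been computed as above.
    std::vector<int> errcodes(numworkers, MPI_SUCCESS);   // needs #include <vector>
    MPI_Comm_dup(MPI_COMM_SELF, &selfcomm);
    MPI_Comm_set_errhandler(selfcomm, MPI_ERRORS_RETURN); // return errors instead of aborting
    int rc = MPI_Comm_spawn("trilin_slave", MPI_ARGV_NULL, numworkers,
                            MPI_INFO_NULL, 0, selfcomm, &workercomm,
                            errcodes.data());
    if (rc != MPI_SUCCESS) {
        // at least one worker failed to start; errcodes[i] says which one
    }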
The slave process:

int main(int argc, char *argv[])
{
    int initialized, iget;
    MPI_Comm boss_comm, comm;
    MPI_Initialized(&initialized);
    if (!initialized) {
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &iget);
        MPI_Comm_get_parent(&boss_comm);
        if (boss_comm != MPI_COMM_NULL) {
            // inside the spawned tasks
            MPI_Intercomm_merge(boss_comm, 1, &comm);
            Epetra_MpiComm Comm(comm);
            trilin_solve(0, NULL, NULL, 0, NULL, NULL,
                         NULL, NULL, 0, 0, 0.0,
                         0, 0, 0, 0,
                         0, 0.0,
                         NULL, NULL, NULL, Comm);
        }
    }
    MPI_Comm_disconnect(&boss_comm);
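On the mother side, the matching teardown after trilin_solve returns
would look roughly like this (a sketch of the calls we believe are
needed, not copied verbatim from our code):

    // Sketch: release the merged communicator and disconnect from the workers.
    MPI_Comm_free(&fullcomm);          // intracommunicator from MPI_Intercomm_merge
    MPI_Comm_disconnect(&workercomm);  // matches MPI_Comm_disconnect(&boss_comm) in the slaves
    MPI_Comm_free(&selfcomm);          // the duplicate of MPI_COMM_SELF used for the spawn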