[mpich-discuss] Problem with dynamic spawn

Einar Sørheim einar at ife.no
Tue Sep 13 04:09:51 CDT 2011


Hi
We have an FEM model in which we have started using Trilinos' ML multigrid 
solver as an alternative to our CG solver and Intel MKL Pardiso. Trilinos 
uses MPI for running in parallel, while the rest of our program uses OpenMP; 
the whole structure of our software is based on the OpenMP paradigm, where 
new threads are created when needed. (Our platform is Windows XP 64, 
MS C++, Intel Fortran.) MPI offers a similar mechanism through 
MPI_Comm_spawn (see the sketch below). This works well with 4 processes, 
but fails when we go down to 2. We have tried updating to the latest MPICH 
version (1.4.1p1), but we get the same error.
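
For reference, the basic spawn pattern we are talking about looks roughly 
like this (only an illustrative sketch, not our actual code; the executable 
name "worker" is a placeholder):

   #include <mpi.h>

   int main(int argc, char *argv[])
   {
     MPI_Init(&argc, &argv);

     /* Launch one additional copy of a worker executable; the parent
        keeps running and talks to it over an intercommunicator. */
     MPI_Comm workers;
     MPI_Comm_spawn("worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                    0, MPI_COMM_SELF, &workers, MPI_ERRCODES_IGNORE);

     MPI_Comm_disconnect(&workers);
     MPI_Finalize();
     return 0;
   }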

To be more specific, the mother process together with the newly spawned 
processes makes up the pool available to Trilinos. When we spawn only one 
additional process, to make up a pool of 2, we get the following error:

Fatal error in PMPI_Waitall: Other MPI error, error stack:
PMPI_Waitall(274)................................: MPI_Waitall(count=1, 
req_array=0000000003354100, status_array=00000000005FF9C0) failed
MPIR_Waitall_impl(121)...........................:
MPIDI_CH3I_Progress(402).........................:
MPID_nem_mpich2_test_recv(747)...................:
MPID_nem_newtcp_module_poll(37)..................:
MPID_nem_newtcp_module_connpoll(2656)............:
MPID_nem_newtcp_module_recv_success_handler(2339):
MPID_nem_newtcp_module_post_readv_ex(330)........:
MPIU_SOCKW_Readv_ex(392).........................: read from socket 
failed, An operation on a socket could not be performed because the 
system lacked sufficient buffer space or because a queue was full.

  (errno 10055)
unable to read the cmd header on the pmi context, Error = -1
.
Error posting readv, An existing connection was forcibly closed by the 
remote host.(10054)

The spawning and merging are done by the following code:
mother process:
   MPI_Initialized(&initialized);
   if (initialized && (numparalells > 1))
   {
     numworkers = numparalells - 1;
     MPI_Comm_dup(MPI_COMM_SELF, &selfcomm);
     /* spawn numworkers copies of the slave executable */
     MPI_Comm_spawn("trilin_slave", MPI_ARGV_NULL, numworkers, MPI_INFO_NULL,
                    0, selfcomm, &workercomm, MPI_ERRCODES_IGNORE);
     /* merge mother and workers into one intracommunicator for Trilinos */
     MPI_Intercomm_merge(workercomm, 0, &fullcomm);
     Epetra_MpiComm Comm(fullcomm);
     trilin_solve(Nnodes, diag_el, offdiag_el, nu_rows, rowstart, column_index,
                  X, B, solvertype, maxit, relerror,
                  smoothertype, smoothersweeps, mgsolvertype, coarsegridmax,
                  aggregationtype, smootherdampingfactor,
                  r_xcoord_, r_ycoord_, r_zcoord_, Comm);
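
As a sanity check one could query the size of the merged pool right after 
the merge; this is a hypothetical diagnostic, not something present in the 
code above (it uses fullcomm and numparalells from the snippet):

   int poolsize, poolrank;
   MPI_Comm_size(fullcomm, &poolsize);  /* expected: numparalells, i.e. 2 in the failing case */
   MPI_Comm_rank(fullcomm, &poolrank);  /* mother ends up as rank 0 (merged with high = 0) */
   printf("merged pool: rank %d of %d\n", poolrank, poolsize);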

the slave process:

int main(int argc, char *argv[])
{
  int initialized, iget;
  MPI_Comm boss_comm, comm;

  MPI_Initialized(&initialized);
  if (!initialized) {
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &iget);
    MPI_Comm_get_parent(&boss_comm);
    if (boss_comm != MPI_COMM_NULL) {
      // inside the spawned tasks
      MPI_Intercomm_merge(boss_comm, 1, &comm);
      Epetra_MpiComm Comm(comm);
      trilin_solve(0, NULL, NULL, 0, NULL, NULL,
                   NULL, NULL, 0, 0, 0.0,
                   0, 0, 0, 0,
                   0, 0.0,
                   NULL, NULL, NULL, Comm);
    }
  }
  MPI_Comm_disconnect(&boss_comm);
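
For completeness, the matching cleanup on the mother side is not shown in 
the snippet above; it would be roughly the following (a sketch, using the 
communicator names from our code):

   MPI_Comm_free(&fullcomm);          /* release the merged intracommunicator */
   MPI_Comm_disconnect(&workercomm);  /* drop the intercommunicator to the spawned workers */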

