[MPICH2-dev] MPICH2 spawn problem

Rajeev Thakur thakur at mcs.anl.gov
Tue Feb 13 14:49:32 CST 2007


I am able to reproduce this error. We will look into it.
 
Rajeev
 


  _____  

From: owner-mpich2-dev at mcs.anl.gov [mailto:owner-mpich2-dev at mcs.anl.gov] On
Behalf Of Florin Isaila
Sent: Tuesday, February 13, 2007 7:37 AM
To: mpich2-dev at mcs.anl.gov
Subject: [MPICH2-dev] MPICH2 spawn problem




Hi,

Could someone help me with a dynamic process problem?

The problem occurs with the parent/child example provided in the MPICH2 distribution. The example works fine when the communication is local, but it blocks when I run the program on two machines with

mpiexec -n 4 parent

The "child" processes are actually launched (can be seen with ps command on
both machines), but they block all in MPI_Init.   

I further simplified the parent/child program in order to isolate the error, using just one parent and one child on different machines (the two programs are attached at the end). A strange error occurs:

When the program is modified so that the child is the first one to send data to the parent, it works! (See the reordered excerpt after the two programs below.)
But if the parent sends first with MPI_Send, it gets:

  [florin at compute-0-0 simple_parent_child]$ mpiexec -n 1  parent 
[cli_0]: aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)...............................: MPI_Send(buf=0x808d647,
count=4, MPI_CHAR, dest=0, tag=0, comm=0x84000000) failed
MPIDI_CH3_Progress_wait(212)................: an error occurred while
handling an event returned by MPIDU_Sock_Wait() 
MPIDI_CH3I_Progress_handle_sock_event(772)..:
MPIDI_CH3_Sockconn_handle_connect_event(589): [ch3:sock] failed to connnect
to remote process
MPIDU_Socki_handle_connect(791).............: connection failure
(set=0,sock=1,errno=111:Connection refused) 
rank 0 in job 20  compute-0-0.local_41161   caused collective abort of all
ranks
  exit status of rank 0: return code 1
 
I am using mpich2-1.0.5p2. The cpi example runs OK with several nodes.

My cluster has 8 nodes, all Pentium 4 machines running CentOS, with Python 2.4 and gcc 3.4.

Here are the programs (they fail with the message above when run with mpiexec -n 1 parent, and work when the communication order is changed):

PARENT:

#include <stdio.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
    char str[10];
    int err=0, errcodes[256], rank, nprocs;
    MPI_Comm intercomm;
    int    namelen;
    char   processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD,&nprocs); 
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name,&namelen); 

    //printf("I am parent nr %d\n", rank);

    err = MPI_Comm_spawn("child", MPI_ARGV_NULL, nprocs,
                         MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                         &intercomm, errcodes);  
    if (err) printf("Error in MPI_Comm_spawn\n");

    // Parent rank sends to and receives from the child with the same rank
    err = MPI_Send("bye", 4, MPI_CHAR, rank, 0, intercomm);
    err = MPI_Recv(str, 3, MPI_CHAR, rank, 0, intercomm, MPI_STATUS_IGNORE);

    printf("Parent %d on %s received from child: %s\n", rank, processor_name, str);
    fflush(stdout);

    MPI_Finalize();

    return 0;
}

CHILD: 
#include <stdio.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
    MPI_Comm intercomm;
    char str[10];
    int err, rank;
    int    namelen;
    char   processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv); 

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name,&namelen);
    //printf("I am child nr %d\n", rank);
    
    MPI_Comm_get_parent(&intercomm); 

    // Child rank receives from and sends to the parent rank
    err = MPI_Recv(str, 4, MPI_CHAR, rank, 0, intercomm, MPI_STATUS_IGNORE);
    printf("Child %d on %s received from parent: %s\n", rank, processor_name, str);
    fflush(stdout);
    err = MPI_Send("hi", 3, MPI_CHAR, rank, 0, intercomm);

    MPI_Finalize();
    return 0;
}
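
For reference, the working reordering looks like this (a minimal sketch; the assumed change is just swapping the Send/Recv pair in both programs above, with the buffer sizes unchanged):

    // PARENT, working order: receive from the child first, then send
    err = MPI_Recv(str, 3, MPI_CHAR, rank, 0, intercomm, MPI_STATUS_IGNORE);
    err = MPI_Send("bye", 4, MPI_CHAR, rank, 0, intercomm);

    // CHILD, working order: send to the parent first, then receive
    err = MPI_Send("hi", 3, MPI_CHAR, rank, 0, intercomm);
    err = MPI_Recv(str, 4, MPI_CHAR, rank, 0, intercomm, MPI_STATUS_IGNORE);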


Thanks a lot
Florin




