[mpich-discuss] The problem with MPICH2

王睿 wangraying at gmail.com
Sat Dec 4 22:07:18 CST 2010


Hi,



I am a second-year graduate student, and I am currently learning MPI.

I learned from the 'CHANGES' file in the mpich2-1.3.1 package that the newest
version of MPICH2 will not abort the whole job if a process failure
occurs.

The following is quoted from the 'CHANGES' file:



OVERALL: Improved tolerance to process and communication failures
when error handler is set to MPI_ERRORS_RETURN. If a communication
operation fails (e.g., due to a process failure) MPICH2 will return
an error, and further communication to that process is not
possible. However, communication with other processes will still
proceed normally. Note, however, that the behavior collective
operations on communicators containing the failed process is
undefined, and may give incorrect results or hang some processes.



I have done some simple tests, but the results confused me.



Here is my source code:



#include ...

...



void recover(MPI_Comm *comm, int *err_code,...)
{
 printf("Pid%d: in recovery...\n",getpid());
}

int main(int argc,char* argv[])
{
 int rank,size;
 int tag = 99;
 char buf[20]="";
 MPI_Comm comm;
 MPI_Status status;
 MPI_Errhandler errh;
 MPI_Init(&argc,&argv);
 MPI_Comm_size(MPI_COMM_WORLD,&size);
 MPI_Comm_rank(MPI_COMM_WORLD,&rank);
 MPI_Comm_dup(MPI_COMM_WORLD, &comm);

 MPI_Errhandler_create(recover, &errh);
 MPI_Errhandler_set(comm,errh);

 printf("P%d: pid = %d\n", rank, getpid());
 if(rank == 0)
 {
  strcpy( buf, "haha!\n");
  MPI_Send(buf, 10, MPI_CHAR, 1, tag , comm);
  strcpy(buf, "hehe\n");
  MPI_Send(buf, 10, MPI_CHAR, 2, tag, comm);

  /* Wait for the reply from P2. */
  MPI_Recv(buf, 20, MPI_CHAR, 2, tag , comm, &status);
 }
 else
 {
  /* Sleep long enough for a process to be killed by hand. */
  sleep(40);
  MPI_Recv(buf, 10, MPI_CHAR, 0, tag , comm, &status);

  if(rank == 2)
  {
   strcat(buf, " by P2!\n");
   MPI_Send(buf, 20, MPI_CHAR, 0, tag, comm);
  }
 }
 printf("P%d: %s\n",rank, buf);
 MPI_Finalize();
 return 0;
}

I killed process P2 with the kill command, using its pid, but the output of
the job was:

mpirun -np 3 ./hello
P0: pid = 17157
P2: pid = 17159
P1: pid = 17158
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
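
For context on the "(signal N)" part of that last line, here is a small
POSIX-only sketch of how a parent process can observe that a child was
terminated by a signal. It is not how mpirun is implemented and involves no
MPI at all; kill_child_and_report is a hypothetical helper name used only for
illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that just waits, send it `sig`,
 * and return the signal number the parent sees through waitpid().
 * Returns -1 on error or if the child did not die from a signal. */
int kill_child_and_report(int sig)
{
    assert(sig > 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        pause();   /* child: block until a signal arrives */
        _exit(0);
    }

    /* parent: deliver the signal, then collect the child's status */
    if (kill(pid, sig) != 0)
        return -1;

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    if (WIFSIGNALED(status)) {
        int term = WTERMSIG(status);
        printf("child %d terminated: %s (signal %d)\n",
               (int)pid, strsignal(term), term);
        return term;
    }
    return -1;
}
```

Calling kill_child_and_report(SIGTERM) mimics running kill on a pid with the
default signal; the number reported is the signal that actually ended the
process.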



Where does the problem lie? I look forward to your reply.



Yours sincerely,

Rui


------

Rui Wang

Institute of Computing Technology, Chinese Academy of Sciences, Beijing,
P.R. China