[mpich-discuss] MPI_Comm_Connect bug?

Biddiscombe, John A. biddisco at cscs.ch
Fri Oct 15 04:10:19 CDT 2010


I don't think I explained what I meant very well, so let me try again ...

I call

  int error_code = MPI_Comm_connect(this->DsmMasterHostName, MPI_INFO_NULL, 0, this->Comm, &this->InterComm);
but internally, MPI detects an error (there is no process to connect to) and aborts the operation, returning via the error handler. However, only rank 0 aborts the operation; the other ranks wait forever for something that will never happen. Because I want to handle the failure gracefully, I have set the error handler to MPI_ERRORS_RETURN, but ranks 1 to N never return from the call.

I would like MPI to detect the error on all ranks, not just rank 0. I don't want the app to exit at all. If I instead use the error handler MPI_ERRORS_ARE_FATAL, all is fine: the app terminates as expected.

I am trying to determine whether there is a bug in the MPI_Comm_connect routine: rank 0 detects that there is nobody to connect to, but the other ranks do not, and the app hangs forever. I wondered whether, inside mpich (I'm using 1.3rc2 on win32), rank 0 should somehow tell ranks 1 to N that the connect has failed, so that they could also abort and return to the user code.
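
For reference, here is a minimal standalone reproducer that shows the behaviour I see (a sketch only; "dummy-port" is just a placeholder port name that no server is listening on):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank, err;
      MPI_Comm intercomm;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      /* ask MPI to return errors to the caller instead of aborting */
      MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
      /* nothing is listening on this port, so the connect must fail;
         I would expect every rank to reach the printf below, but only
         rank 0 does - the others block inside MPI_Comm_connect */
      err = MPI_Comm_connect("dummy-port", MPI_INFO_NULL, 0,
                             MPI_COMM_WORLD, &intercomm);
      printf("rank %d: MPI_Comm_connect returned %d\n", rank, err);
      MPI_Finalize();
      return 0;
  }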

Hope I explained it better this time.

JB

-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Darius Buntinas
Sent: 14 October 2010 20:23
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] MPI_Comm_Connect bug?


What version of MPICH2 are you using?  What command-line parameters did you use for mpiexec?

Normally if one process exits without calling MPI_Finalize, the process manager will abort all processes.  There is an option for (the Hydra version of) mpiexec that disables this behavior.

If you want all processes to abort, call MPI_Abort(MPI_COMM_WORLD, errorcode).
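
Something along these lines in the failure path (an untested sketch; connect_or_abort is just an illustrative helper, and the port name and communicator arguments are placeholders):

  #include <mpi.h>
  #include <stdio.h>

  /* connect, and take the whole job down explicitly on failure instead
     of relying on the process manager to notice a dying process */
  static MPI_Comm connect_or_abort(const char *port_name, MPI_Comm comm)
  {
      MPI_Comm intercomm;
      int err = MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, comm, &intercomm);
      if (err != MPI_SUCCESS) {
          char msg[MPI_MAX_ERROR_STRING];
          int len;
          MPI_Error_string(err, msg, &len);
          fprintf(stderr, "MPI_Comm_connect failed: %s\n", msg);
          MPI_Abort(MPI_COMM_WORLD, err);   /* terminates all processes */
      }
      return intercomm;
  }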

-d

On Oct 14, 2010, at 1:11 PM, Biddiscombe, John A. wrote:

> To catch a problem that occurs when MPI_Comm_connect fails, I wrapped the call in an error handler, with the aim of exiting gracefully.
> 
> Rank 0 detects an error, aborts, and displays the message, but the other ranks hang waiting for something to happen. I think that when rank 0 aborts, it should first signal the other ranks to abort as well.
> 
> Am I doing it wrong, or is this a bug?
> 
> Thanks. Snippet below.
> 
> JB
> 
>   MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>   int error_code = MPI_Comm_connect(this->DsmMasterHostName, MPI_INFO_NULL, 0, this->Comm, &this->InterComm);
>   if (error_code == MPI_SUCCESS) {
>     H5FDdsmDebug("Id = " << this->Id << " MPI_Comm_connect returned SUCCESS");
>     isConnected = H5FD_DSM_SUCCESS;
>   } else {
>     char error_string[1024];
>     int length_of_error_string;
>     MPI_Error_string(error_code, error_string, &length_of_error_string);
>     H5FDdsmError("\nMPI_Comm_connect failed with error : \n" << error_string << "\n\n");
>   }
>   // reset to MPI_ERRORS_ARE_FATAL for normal debug purposes
>   MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_ARE_FATAL);
> 
> 
> -- 
> John Biddiscombe,                            email:biddisco @ cscs.ch
> http://www.cscs.ch/
> CSCS, Swiss National Supercomputing Centre  | Tel:  +41 (91) 610.82.07
> Via Cantonale, 6928 Manno, Switzerland      | Fax:  +41 (91) 610.82.82
> 
> 

_______________________________________________
mpich-discuss mailing list
mpich-discuss at mcs.anl.gov
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

