[mpich-discuss] MPI_Comm_Connect bug?

Darius Buntinas buntinas at mcs.anl.gov
Fri Oct 15 10:43:53 CDT 2010


OK, I think I understand now.  Although the standard doesn't specify what to do in an error case, I agree that having all processes return an error would be a good thing to do.

I've created a ticket for this issue.  You can add yourself to the cc list if you want to be notified of progress.

https://trac.mcs.anl.gov/projects/mpich2/ticket/1119

Thanks for reporting this.

-d


On Oct 15, 2010, at 4:10 AM, Biddiscombe, John A. wrote:

> I don't think I explained very well what I meant ....
> 
> I call 
>> int error_code = MPI_Comm_connect(this->DsmMasterHostName, MPI_INFO_NULL, 0, this->Comm, &this->InterComm);
> but internally MPI detects an error (there is no process to connect to) and aborts the operation, returning via the error handler. However, only rank 0 aborts the operation; the other ranks wait forever for something that will never happen. Because I want to handle the failure gracefully, I have set the error handler to MPI_ERRORS_RETURN, but ranks 1 -> N never return.
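> 
> Stripped down, the pattern is roughly this (a minimal sketch, not my actual code; the port name is a placeholder with nothing listening on it):
> 
> #include <mpi.h>
> #include <stdio.h>
> 
> int main(int argc, char **argv)
> {
>   MPI_Comm intercomm;
>   int rc, rank;
> 
>   MPI_Init(&argc, &argv);
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>   /* ask for error codes instead of aborting */
>   MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> 
>   /* placeholder port name -- nothing is listening on it */
>   rc = MPI_Comm_connect("dummy-port-name", MPI_INFO_NULL, 0,
>                         MPI_COMM_WORLD, &intercomm);
> 
>   /* as reported above: rank 0 gets here with an error code,
>      while ranks 1 -> N never return from MPI_Comm_connect */
>   printf("rank %d: connect returned %d\n", rank, rc);
> 
>   MPI_Finalize();
>   return 0;
> }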
> 
> I would like MPI to detect the error on all ranks, not just rank 0 - I don't want the app to exit at all. (If I use the error handler MPI_ERRORS_ARE_FATAL instead, all is fine: the app terminates as expected.)
> 
> I am trying to determine whether there is a bug in the MPI_Comm_connect routine: rank 0 detects that there is nobody to connect to, but the other ranks do not, and the app hangs forever. I wondered whether, inside MPICH2 (I'm using 1.3rc2 on win32), rank 0 should somehow tell ranks 1-N that the connect has failed, so that they could also abort and return to the user code.
> 
> Hope I explained it better this time
> 
> JB
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Darius Buntinas
> Sent: 14 October 2010 20:23
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] MPI_Comm_Connect bug?
> 
> 
> What version of MPICH2 are you using?  What command-line parameters did you use for mpiexec?
> 
> Normally if one process exits without calling MPI_Finalize, the process manager will abort all processes.  There is an option for (the Hydra version of) mpiexec that disables this behavior.
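> 
> (For reference, I believe the Hydra flag in question is -disable-auto-cleanup, e.g. "mpiexec -disable-auto-cleanup -n 4 ./a.out"; check mpiexec -help to be sure.)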
> 
> If you want all processes to abort, call MPI_Abort(MPI_COMM_WORLD, errorcode).
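> 
> Concretely, something along these lines (a sketch only; port_name and intercomm stand in for your own variables):
> 
> int rc = MPI_Comm_connect(port_name, MPI_INFO_NULL, 0,
>                           MPI_COMM_WORLD, &intercomm);
> if (rc != MPI_SUCCESS) {
>   /* Only the root currently sees the failure, so have it take the
>      whole job down rather than leaving the other ranks hung. */
>   MPI_Abort(MPI_COMM_WORLD, rc);
> }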
> 
> -d
> 
> On Oct 14, 2010, at 1:11 PM, Biddiscombe, John A. wrote:
> 
>> To try to catch a problem that occurs when MPI_Comm_connect fails, I wrapped the call with an error handler, with the aim of exiting gracefully.
>> 
>> Rank 0 detects an error, aborts, and displays the message, but the other ranks hang waiting for something to happen. I think that when rank 0 aborts, it should first signal the other ranks to abort as well.
>> 
>> Am I doing it wrong, or is this a bug?
>> 
>> Thanks. Snippet below.
>> 
>> JB
>> 
>> MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>> int error_code = MPI_Comm_connect(this->DsmMasterHostName, MPI_INFO_NULL, 0, this->Comm, &this->InterComm);
>> if (error_code == MPI_SUCCESS) {
>>   H5FDdsmDebug("Id = " << this->Id << " MPI_Comm_connect returned SUCCESS");
>>   isConnected = H5FD_DSM_SUCCESS;
>> } else {
>>   char error_string[1024];
>>   int length_of_error_string;
>>   MPI_Error_string(error_code, error_string, &length_of_error_string);
>>   H5FDdsmError("\nMPI_Comm_connect failed with error : \n" << error_string << "\n\n");
>> }
>> // reset to MPI_ERRORS_ARE_FATAL for normal debug purposes
>> MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_ARE_FATAL);
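>> 
>> (Aside: MPI_Errhandler_set is the MPI-1 spelling, deprecated since MPI-2; the equivalent modern call would be MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);)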
>> 
>> 
>> -- 
>> John Biddiscombe,                            email:biddisco @ cscs.ch
>> http://www.cscs.ch/
>> CSCS, Swiss National Supercomputing Centre  | Tel:  +41 (91) 610.82.07
>> Via Cantonale, 6928 Manno, Switzerland      | Fax:  +41 (91) 610.82.82
>> 
>> 
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


