[MPICH] Possible Race condition between Test() and Cancel
William Gropp
gropp at mcs.anl.gov
Thu Feb 2 08:49:15 CST 2006
At 01:30 PM 2/1/2006, Rajeev Thakur wrote:
>This fix should probably be ok.
Actually, I think that this is not the right answer, and the real problem
is more painful. The MPI standard says that it is the responsibility of
the user to ensure that if multiple threads are using the same MPI object
(such as a request), there are no race conditions. The user is supposed to
use their own synchronization mechanisms if they need to guarantee a
particular order of execution.
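As a rough illustration of what user-provided synchronization could look like, here is a minimal sketch in plain C with pthreads. It is a model of the locking pattern only and not MPI code. The names guarded_request, guarded_complete, guarded_cancel and REQ_NULL are hypothetical, and an int stands in for the request handle. The waiter and the canceller both touch the handle only while holding the same user-owned mutex, so the null check and the cancel cannot be separated by a completion.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a request handle; REQ_NULL plays the role of
 * MPI_REQUEST_NULL. This models the locking pattern only; it is not MPI. */
#define REQ_NULL 0

typedef struct {
    pthread_mutex_t lock;  /* user-owned mutex guarding the fields below */
    int handle;            /* REQ_NULL once completed or cancelled */
    bool cancelled;        /* set only by a successful guarded_cancel() */
} guarded_request;

void guarded_init(guarded_request *r, int handle) {
    assert(r != NULL);
    pthread_mutex_init(&r->lock, NULL);
    r->handle = handle;
    r->cancelled = false;
}

/* Waiter side: call when a non-blocking test reports the request done.
 * Returns true if this call retired a still-active request. */
bool guarded_complete(guarded_request *r) {
    assert(r != NULL);
    pthread_mutex_lock(&r->lock);
    bool retired = (r->handle != REQ_NULL);
    r->handle = REQ_NULL;
    pthread_mutex_unlock(&r->lock);
    return retired;
}

/* Canceller side: the null check and the cancel happen under one lock,
 * so the waiter cannot retire the request in between. */
bool guarded_cancel(guarded_request *r) {
    assert(r != NULL);
    pthread_mutex_lock(&r->lock);
    bool did_cancel = (r->handle != REQ_NULL);
    if (did_cancel) {
        /* Real code would call MPI_Cancel here; MPI still requires the
         * request to be completed afterwards (MPI_Wait or MPI_Test).
         * The model simply marks it cancelled and retires it. */
        r->cancelled = true;
        r->handle = REQ_NULL;
    }
    pthread_mutex_unlock(&r->lock);
    return did_cancel;
}
```

In a real program this pattern presumably means the waiting thread cannot sit in a blocking MPI_Waitall while holding the lock; it would have to poll (for example with MPI_Testall) and take the mutex around each poll. That is an assumption of this sketch, not something the standard prescribes.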
In this case, MPI_Cancel is required to signal an error if it is presented
with an MPI_REQUEST_NULL handle (it's our bug if it SEGVs instead), as the
MPI standard generally requires returning an error on any null handle,
except where specifically allowed (as it is for MPI_Waitall etc. and error
handlers on MPI_FILE_NULL). This is the same requirement made by POSIX.
Now, I'm not sure how to implement the action David describes: a waitall
with a separate thread allowed to cancel requests. It may require using
generalized requests to provide a hook to implement the required thread
synchronization. Alternatively, setting the error handler to
MPI_ERRORS_RETURN on MPI_COMM_WORLD (and any necessary fix to our code to
correctly invoke the error handler on MPI_REQUEST_NULL) would allow you to
try to cancel the request and ignore any error.
Bill
>Rajeev
>
> > -----Original Message-----
> > From: David Minor [mailto:david-m at orbotech.com]
> > Sent: Wednesday, February 01, 2006 12:00 AM
> > To: Rajeev Thakur; mpich-discuss at mcs.anl.gov
> > Subject: RE: [MPICH] Possible Race condition between Test() and Cancel
> >
> > Hi Rajeev,
> > I tried that earlier, but it didn't help: between the time you check for
> > MPI_REQUEST_NULL and the time you call cancel, the wait can still
> > complete. The check needs to be under the general mutex for wait and
> > test. I went into the source code for cancel.c and added the following
> > lines, and the problem is fixed for my application. Of course, I didn't
> > fix it properly and probably caused other bugs; I just wanted to see
> > whether that was indeed the problem.
> >
> > 79:
> >     if (*request == MPI_REQUEST_NULL) {
> >         mpi_errno = MPI_SUCCESS;
> >         goto fn_exit;
> >     }
> >     if (*request_ptr->cc_ptr == 0) {
> >         mpi_errno = MPI_SUCCESS;
> >         goto fn_exit;
> >     }
> >
> > Is there any reason why this could or should not be fixed?
> >
> > David
> >
> >
> >
> >
> > -----Original Message-----
> > From: Rajeev Thakur [mailto:thakur at mcs.anl.gov]
> > Sent: Tuesday, January 31, 2006 7:30 PM
> > To: David Minor; mpich-discuss at mcs.anl.gov
> > Subject: RE: [MPICH] Possible Race condition between Test() and Cancel
> >
> > If the request is completed by a test or wait, it is set to
> > MPI_REQUEST_NULL. See if adding an "if (request != MPI_REQUEST_NULL)"
> > around the MPI_Cancel helps.
> >
> > Rajeev
> >
> >
> > > -----Original Message-----
> > > From: owner-mpich-discuss at mcs.anl.gov
> > > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of David Minor
> > > Sent: Tuesday, January 31, 2006 1:22 AM
> > > To: mpich-discuss at mcs.anl.gov
> > > Subject: [MPICH] Possible Race condition between Test() and Cancel
> > >
> > > There appears to be a problem with MPI_Cancel, at least under
> > > Red Hat 9 with the g++ 3.4.3 compiler.
> > >
> > > If you Cancel a completed receive request, you will get an MPI abort
> > > or seg fault. But even if you Test() the request before calling
> > > Cancel() on it, there is always the possibility that the request
> > > completes between the Test() and the Cancel(), thus causing an abort.
> > > What is the solution? Shouldn't Cancel() simply return an error if
> > > the request is already completed?
> > >
> > > My specific problem is:
> > >
> > > I'm waiting with WaitAll on a set of receive requests. I want to wait
> > > until either 1) they all complete or 2) another thread decides to
> > > cancel the requests.
> > >
> > > The problem is that the thread that cancels the requests has no way
> > > of assuring that it doesn't call Cancel() on an already completed
> > > request.
> > >
> > > Please advise,
> > >
> > > Regards,
> > > David Minor Orbotech
> > >
> > >
> >
> >
William Gropp
http://www.mcs.anl.gov/~gropp