[MPICH] Possible Race condition between Test() and Cancel

Sun Feb 5 02:54:40 CST 2006

Hi Bill,
Yes, I also read the standard on this, and it seems to preclude
race-condition-free waitall/cancel operations. Essentially the problem
is that the test has to be under the same mutex as the cancel, and users
have no way of doing this. The easiest solution is a Cancel() that
returns an error code but doesn't generate an abort. A CancelAll
function that implements the test/cancel protocol under the general
mutex would also work, but is, of course, not standard MPI. I didn't
check changing the error handler; you seem to imply that it won't work
in the current version. Is this correct? At any rate, it's a complicated
answer to a simple problem that, frankly, seems to have been an
oversight in the design of the standard for multi-threaded operations.
Regards,
David 
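
P.S. To make the test/cancel protocol concrete, here is a minimal sketch
in C of what "test and cancel under the same mutex" could look like at
user level. It is a model only, not MPICH code: model_request,
guarded_testall, guarded_cancel and req_lock are hypothetical names, and
the two flags stand in for the real request state. In a real program the
marked lines would be MPI_Testall and MPI_Cancel, and the waiting thread
would have to poll instead of blocking in MPI_Waitall, since a waiter
blocked while holding the lock would keep the cancelling thread out.

```c
#include <pthread.h>

/* Hypothetical stand-in for an MPI request; not an MPICH type. */
typedef struct {
    int completed;  /* set once the receive has finished */
    int cancelled;  /* set once a cancel has been accepted */
} model_request;

enum { CANCEL_DONE = 0, CANCEL_TOO_LATE = 1 };

/* One lock shared by every thread that tests or cancels requests. */
pthread_mutex_t req_lock = PTHREAD_MUTEX_INITIALIZER;

/* Waiter side: poll once under the lock.
 * Returns 1 when every request is either completed or cancelled. */
int guarded_testall(model_request *reqs, int n)
{
    int all_done = 1;
    pthread_mutex_lock(&req_lock);
    for (int i = 0; i < n; i++) {
        /* Real code would call MPI_Testall here, still inside the lock. */
        if (!reqs[i].completed && !reqs[i].cancelled)
            all_done = 0;
    }
    pthread_mutex_unlock(&req_lock);
    return all_done;
}

/* Canceller side: the completion check and the cancel happen as one
 * step under the lock, and a late cancel is reported, not fatal. */
int guarded_cancel(model_request *req)
{
    int rc;
    pthread_mutex_lock(&req_lock);
    if (req->completed) {
        rc = CANCEL_TOO_LATE;
    } else {
        /* Real code would call MPI_Cancel here, still inside the lock. */
        req->cancelled = 1;
        rc = CANCEL_DONE;
    }
    pthread_mutex_unlock(&req_lock);
    return rc;
}
```

The waiting thread would call guarded_testall in a loop until it returns
1, and the other thread would call guarded_cancel on each request and
simply ignore CANCEL_TOO_LATE. Whether polling is acceptable in place of
a blocking wait depends on the application.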

-----Original Message-----
From: William Gropp [mailto:gropp at mcs.anl.gov] 
Sent: Thursday, February 02, 2006 4:49 PM
To: Rajeev Thakur
Cc: David Minor; mpich-discuss at mcs.anl.gov
Subject: RE: [MPICH] Possible Race condition between Test() and Cancel

At 01:30 PM 2/1/2006, Rajeev Thakur wrote:
>This fix should probably be ok.

Actually, I think that this is not the right answer, and the real problem
is more painful.  The MPI standard says that it is the responsibility of
the user to ensure that if multiple threads are using the same MPI object
(such as a request), there are no race conditions.  The user is supposed
to use their own synchronization mechanisms if they need to guarantee a
particular order of execution.

In this case, MPI_Cancel is required to signal an error if it is presented
with an MPI_REQUEST_NULL handle (it's our bug if it SEGVs instead), as the
MPI standard generally requires returning an error on any null handle,
except where specifically allowed (as it is for MPI_Waitall etc. and error
handlers on MPI_FILE_NULL).  This is the same requirement made by POSIX.

Now, I'm not sure how to implement the action David describes - a waitall
with a separate thread allowed to cancel requests.  It may require using
generalized requests to provide a hook to implement the required thread
synchronization.  Alternately, setting the error handler to
MPI_ERRORS_RETURN on MPI_COMM_WORLD (and any necessary fix to our code to
correctly invoke the error handler on MPI_REQUEST_NULL) would allow you to
try to cancel the request and ignore any error.

Bill

>Rajeev
>
> > -----Original Message-----
> > From: David Minor [mailto:david-m at orbotech.com]
> > Sent: Wednesday, February 01, 2006 12:00 AM
> > To: Rajeev Thakur; mpich-discuss at mcs.anl.gov
> > Subject: RE: [MPICH] Possible Race condition between Test() and Cancel
> >
> > Hi Rajeev,
> > I tried that earlier but it didn't help, because you still have the
> > problem that between the time you check for MPI_REQUEST_NULL and the
> > time you do the cancel, the wait can complete. The check needs to be
> > under the general mutex for wait and test. I snuck into the source code
> > for cancel.c and added the following lines, and the problem is fixed for
> > my application. But of course I didn't fix it properly and probably
> > caused other bugs. I just wanted to see if indeed that was the problem.
> >
> > 79:
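> >     /* Request already completed and reset by a test/wait: nothing to cancel. */
> >     if (*request == MPI_REQUEST_NULL) {
> >         mpi_errno = MPI_SUCCESS;
> >         goto fn_exit;
> >     }
> >     /* Completion counter already at zero: treat the cancel as a no-op. */
> >     if (*request_ptr->cc_ptr == 0) {
> >         mpi_errno = MPI_SUCCESS;
> >         goto fn_exit;
> >     }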
> >
> > Is there any reason why this could or should not be fixed?
> >
> > David
> >
> >
> >
> >
> > -----Original Message-----
> > From: Rajeev Thakur [mailto:thakur at mcs.anl.gov]
> > Sent: Tuesday, January 31, 2006 7:30 PM
> > To: David Minor; mpich-discuss at mcs.anl.gov
> > Subject: RE: [MPICH] Possible Race condition between Test() and Cancel
> >
> > If the request is completed by a test or wait, it is set to
> > MPI_REQUEST_NULL. See if adding an "if (request != MPI_REQUEST_NULL)"
> > around the MPI_Cancel helps.
> >
> > Rajeev
> >
> >
> > > -----Original Message-----
> > > From: owner-mpich-discuss at mcs.anl.gov
> > > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of David Minor
> > > Sent: Tuesday, January 31, 2006 1:22 AM
> > > To: mpich-discuss at mcs.anl.gov
> > > Subject: [MPICH] Possible Race condition between Test() and Cancel
> > >
> > > There appears to be a problem with MPI_Cancel, at least under
> > > Red Hat 9 with the g++ 3.4.3 compiler.
> > >
> > > If you Cancel a completed receive request, you will get an
> > > MPI abort or seg fault.
> > > But if you Test() the request before calling cancel on it, there is
> > > always the possibility that between the Test() and the Cancel() the
> > > request will be completed, thus causing an abort.  What is the
> > > solution?
> > > Shouldn't Cancel() simply return an error if the request is already
> > > completed?
> > >
> > > My specific problem is:
> > >
> > > I'm waiting with WaitAll on a set of receive requests. I want to wait
> > > until either 1) they all complete or 2) another thread decides to
> > > cancel the requests.
> > >
> > > The problem is that the thread that cancels the requests has no way of
> > > assuring that it doesn't call Cancel() on an already completed request.
> > >
> > > Please advise,
> > >
> > > Regards,
> > > David Minor Orbotech
> > >
> > >
> >
> >

William Gropp
http://www.mcs.anl.gov/~gropp