FW: [MPICH2 Req #3217] [MPICH] Multi-threaded cancel problems

David Minor david-m at orbotech.com
Tue Mar 13 10:13:30 CDT 2007



Hi Bill,
The situation is this. A process issues a set of Irecv commands and saves the requests. It then starts a thread that does a WaitAll on those requests. Now how can it cancel the transaction before the WaitAll has completed? If it goes through the list of requests and cancels each one, it's in danger of cancelling an already completed request. If it tests each one first, the request could complete between the Test() and the Cancel(). The user cannot manage a mutex over this because he has no access to the underlying mutex that allows messages to complete (mutexes aren't composable!).

It seems to me there is a problem here in the standard. What is really needed is a CancelAll() command that would hold a mutex over the completions. Barring that, I'm not sure what a possible solution is. I admit my solution violates the standard because it allows Cancel() on a completed request, but it also allows my application to work, which is necessary. :-) I'm preparing a comprehensive test of all these problems between WaitAll, Test and Cancel that I'll post as soon as it's done.
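The Test()/Cancel() window described above can be shown with a small model in plain C11. This is only a sketch under stated assumptions: `model_request`, `racy_cancel`, `try_cancel`, `try_complete` and `cancel_all` are hypothetical names invented for illustration, not MPI or MPICH2 API, and a single atomic state word stands in for whatever the library really tracks per request.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for an MPI request; none of these names are MPI API. */
enum req_state { REQ_PENDING, REQ_COMPLETED, REQ_CANCELLED };

typedef struct {
    _Atomic int state;
} model_request;

/* Racy pattern: the state can change between the load and the store. */
int racy_cancel(model_request *r)
{
    if (atomic_load(&r->state) != REQ_PENDING)   /* models Test() */
        return 0;
    /* A completion arriving at this point is overwritten by the store below. */
    atomic_store(&r->state, REQ_CANCELLED);      /* models Cancel() */
    return 1;
}

/* Check and act in one indivisible step: only a PENDING request is cancelled. */
int try_cancel(model_request *r)
{
    int expected = REQ_PENDING;
    return atomic_compare_exchange_strong(&r->state, &expected, REQ_CANCELLED);
}

/* Completion follows the same rule, so exactly one of the two transitions wins. */
int try_complete(model_request *r)
{
    int expected = REQ_PENDING;
    return atomic_compare_exchange_strong(&r->state, &expected, REQ_COMPLETED);
}

/* A "CancelAll" in this model; returns how many requests were actually cancelled. */
size_t cancel_all(model_request *reqs, size_t n)
{
    size_t cancelled = 0;
    for (size_t i = 0; i < n; i++)
        cancelled += (size_t)try_cancel(&reqs[i]);
    return cancelled;
}
```

In this model, cancelling an already completed request is a harmless no-op that reports failure instead of an error. That is the behaviour the patch aims for, though the model says nothing about whether the standard permits it.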
Regards,
David

-----Original Message-----
From: William Gropp [mailto:gropp at mcs.anl.gov] 
Sent: Thursday, February 22, 2007 10:58 PM
To: David Minor
Cc: mpich2-maint at mcs.anl.gov
Subject: Re: [MPICH2 Req #3217] [MPICH] Ooops... forgot to include cancel.c in previous post...

> 
> The current version of MPICH2 has a race condition. If you try to
> cancel a set of outstanding receive requests, it's possible that one
> of them will complete in the middle of the cancellation. Cancelling a
> completed request results in an abort-level failure. Checking for
> completion before cancelling doesn't help, because the request could
> complete between the time you check and the time you cancel.
> It seems the standard didn't really think about this problem;
> otherwise it would have added a cancelAll operation that would work
> on a set of requests and be able to do the cancellation inside an
> internal mutex. I've done a patch on cancel.c that corrects this by
> not generating an error on cancelling an already completed request.
> Does the standard allow this (in letter if not in spirit)? Enclosed
> is my fix. Search for "dminor" in the file to see the patch. Let's
> reopen the discussion and fix this problem in the next release.
> 


Hmm.  The test on *request == MPI_REQUEST_NULL isn't correct (with respect to
the standard) because the standard makes null handles invalid inputs except for
a few situations.

I'm not sure what the exact situation is that you are seeing.  If this is a
multithreaded one, where one thread may be completing the request and another
might be canceling it, then that is a user error (the user is required to
manage the mutex).  If it is a single-threaded situation (by that, I mean a
single user thread, even if the implementation contains several internal
threads), then can you send a sample program so that I can see what the
operations are?  Thanks!
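One possible reading of "the user is required to manage the mutex" is sketched below as a plain C model. It assumes the waiting thread polls in short non-blocking steps instead of sitting in a blocking WaitAll, so that a user-level lock can be held across each step. `recv_batch`, `batch_poll` and `batch_cancel` are hypothetical names for illustration only; the comments mark where a real program would presumably call MPI_Testall and MPI_Cancel.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model of a batch of outstanding receives; not MPI API. */
enum { SLOT_PENDING, SLOT_DONE, SLOT_CANCELLED };

typedef struct {
    pthread_mutex_t lock;   /* the user-managed mutex */
    int *state;             /* one entry per outstanding receive */
    size_t count;
    int cancel_requested;   /* tells the waiting thread to stop polling */
} recv_batch;

/* Waiting thread: one non-blocking poll per call, in place of a blocking
 * WaitAll. Returns 1 when the thread should stop (all done, or cancelled). */
int batch_poll(recv_batch *b)
{
    int stop = 1;
    pthread_mutex_lock(&b->lock);
    if (!b->cancel_requested) {
        /* A real program would call a non-blocking test such as MPI_Testall here. */
        for (size_t i = 0; i < b->count; i++)
            if (b->state[i] == SLOT_PENDING)
                stop = 0;
    }
    pthread_mutex_unlock(&b->lock);
    return stop;
}

/* Cancelling thread: holds the same lock, so no poll can run in between. */
size_t batch_cancel(recv_batch *b)
{
    size_t cancelled = 0;
    pthread_mutex_lock(&b->lock);
    b->cancel_requested = 1;
    for (size_t i = 0; i < b->count; i++) {
        if (b->state[i] == SLOT_PENDING) {
            /* A real program would call MPI_Cancel here and then complete the request. */
            b->state[i] = SLOT_CANCELLED;
            cancelled++;
        }
    }
    pthread_mutex_unlock(&b->lock);
    return cancelled;
}
```

The trade-off in this sketch is that the waiting thread spins or sleeps between polls rather than blocking, which may or may not be acceptable for a given application.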

Bill



