[MPICH2 Req #3217] [MPICH] Multi-threaded cancel problems
David Minor
david-m at orbotech.com
Wed Mar 14 02:58:44 CDT 2007
Hi Bill,
I'll see if I can bring this up in the 2.1 discussion. I think a CancelAll() instruction that could take a mixed set of completed/uncompleted requests and return with all of them completed would be a better idea, but there may be things in the code that make this difficult. Is there someone who would be willing to work with me on creating a specification and prototype implementation?
Regards,
David
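
The CancelAll() semantics proposed above can be modelled outside of MPI. What follows is only a toy sketch using pthreads, not MPICH2 code: `toy_request`, `toy_complete`, `toy_cancel_all` and `progress_lock` are hypothetical names invented for illustration, and the single global lock is an assumption standing in for whatever internal mutex guards completion in a real implementation. It shows a mixed set of completed and pending requests being handled without a Test()/Cancel() window.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical toy model: none of these names exist in MPI or MPICH2. */
typedef enum { REQ_PENDING, REQ_DONE, REQ_CANCELLED } req_state;

typedef struct {
    req_state state;
} toy_request;

/* Stands in for the library-internal lock that guards completion. */
static pthread_mutex_t progress_lock = PTHREAD_MUTEX_INITIALIZER;

/* What a progress engine would do when a matching message arrives. */
void toy_complete(toy_request *r)
{
    assert(r != NULL);
    pthread_mutex_lock(&progress_lock);
    if (r->state == REQ_PENDING)
        r->state = REQ_DONE;      /* a cancelled request stays cancelled */
    pthread_mutex_unlock(&progress_lock);
}

/* Cancel whatever is still pending. Already-completed requests are
 * left alone rather than treated as an error. Returns the number of
 * requests actually cancelled; on return no request is pending. */
int toy_cancel_all(toy_request *reqs, size_t n)
{
    int cancelled = 0;
    assert(reqs != NULL || n == 0);
    pthread_mutex_lock(&progress_lock);
    for (size_t i = 0; i < n; i++) {
        if (reqs[i].state == REQ_PENDING) {
            reqs[i].state = REQ_CANCELLED;
            cancelled++;
        }
    }
    pthread_mutex_unlock(&progress_lock);
    return cancelled;
}
```

Because the state check and the state change happen under the same lock that completion takes, a request cannot complete between the two, which is the property a user-level Test() followed by Cancel() cannot provide.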
-----Original Message-----
From: William Gropp [mailto:gropp at mcs.anl.gov]
Sent: Tuesday, March 13, 2007 6:14 PM
To: David Minor
Cc: mpich2-maint at mcs.anl.gov
Subject: Re: [MPICH2 Req #3217] [MPICH] Multi-threaded cancel problems
Ah. That is an interesting case, but as you note, it violates the
standard. Since the MPI 2.1 process is getting started, it might be
best to raise the issue there; we can try prototyping solutions (such
as an MPIX_WaitallWithCancel) in MPICH2.
Bill
On Mar 13, 2007, at 2:15 AM, David Minor wrote:
> Hi Bill,
> The situation is this. A process issues a set of Irecv commands and
> then saves the requests. It starts a thread that does a WaitAll on
> those requests. Now how can it cancel the transaction before the
> WaitAll has completed? If it goes through the list of requests and
> cancels each one, it's in danger of cancelling an already completed
> request. If it tests each one first, between the Test() and the
> Cancel() the request could complete. The user cannot manage a mutex
> over this because he has no access to the underlying mutex that
> allows messages to complete (mutexes aren't composable!). It seems
> to me there is a problem here in the standard. What is really
> needed is a CancelAll() command which would mutex the completions.
> Barring that I'm not sure what a possible solution is. I admit my
> solution violates the standard because it allows for Cancel() on a
> completed request but it also allows my application to work, which
> is necessary. :-) I'm preparing a comprehensive test of all these
> problems between WaitAll, Test and Cancel that I'll post as soon as
> it's done.
> Regards,
> David
>
> -----Original Message-----
> From: William Gropp [mailto:gropp at mcs.anl.gov]
> Sent: Thursday, February 22, 2007 10:58 PM
> To: David Minor
> Cc: mpich2-maint at mcs.anl.gov
> Subject: Re: [MPICH2 Req #3217] [MPICH] Ooops... forgot to include
> cancel.c in previous post...
>
>>
>> The current version of MPICH2 has a race condition. If you try to
>> cancel a set of outstanding receive requests, it's possible that in
>> the middle of cancelling, one of them will complete. Cancelling a
>> completed request results in an abort-level failure. Checking for
>> completion before cancelling doesn't help because between the time
>> you checked and the time you cancel, the request could have completed.
>> It seems the standard didn't really think about this problem,
>> otherwise it would have added a cancelAll operation that would work
>> on a set of requests and be able to do the cancellation inside an
>> internal mutex. I've done a patch on cancel.c that corrects this by
>> not generating an error on canceling an already completed request.
>> Does the standard allow this (in letter if not in spirit)? Enclosed
>> is my fix. Search for "dminor" in the file to see the patch. Let's
>> reopen the discussion and fix this problem in the next release.
>>
>
>
> Hmm. The test on *request == MPI_REQUEST_NULL isn't correct (with
> respect to the standard) because the standard makes null handles
> invalid inputs except for a few situations.
>
> I'm not sure what the exact situation is that you are seeing. If
> this is a multithreaded one, where one thread may be completing the
> request and another might be canceling it, then that is a user-error
> (the user is required to manage the mutex). If it is a
> single-threaded situation (by that, I mean a single user thread,
> even if the implementation contains several internal threads), then
> can you send a sample program so that I can see what the operations
> are? Thanks!
>
> Bill
>