[MPICH2 Req #3217] [MPICH] Multi-threaded cancel problems
David Minor
david-m at orbotech.com
Tue Mar 20 02:00:18 CDT 2007
Hello again,
I would think that you could change the error handler to MPI::ERRORS_RETURN before entering the cancel loop and then change it back to MPI::ERRORS_ARE_FATAL after the loop. The result could/should be that cancel returns with an error harmlessly (although cancel() returns void) and the program continues. The problem is that MPICH only checks for valid request handles if the HAVE_ERROR_CHECKING variable is set at compile time. In addition, even if you compile with it, an exception is still thrown for some reason instead of cancel returning. Of course the standard doesn't guarantee continuation after an error, so setting the error handler wouldn't be a perfect solution even if it worked. Enclosed is a program that demonstrates this problem; you may have to run it a couple of times before it fails, depending on your environment. We're moving over to mvapich2 because of our InfiniBand needs, so I'm going to continue this thread over there and perhaps in the MPI 2.1 forum.
Thanks,
David
-----Original Message-----
From: owner-mpich-discuss at mcs.anl.gov [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of David Minor
Sent: Wednesday, March 14, 2007 9:59 AM
To: William Gropp
Cc: mpich-discuss at mcs.anl.gov
Subject: RE: [MPICH2 Req #3217] [MPICH] Multi-threaded cancel problems
Hi Bill,
I'll see if I can bring this up in the 2.1 discussion. I think a CancelAll() instruction that could take a mixed set of completed/uncompleted requests and return with all of them completed would be a better idea, but there may be things in the code that make this difficult. Is there someone who would be willing to work with me on creating a specification and prototype implementation?
Regards,
David
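The CancelAll() idea above can be illustrated with a small toy model. This is only a sketch under stated assumptions: it does not use MPI at all, and ToyRequestSet, complete(), cancel_all() and state() are hypothetical names invented for the illustration, not part of MPICH2 or of any proposal to the standard. It only shows why doing the cancellation under the same lock that guards completion removes the window between a Test() and a Cancel().

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Toy stand-in for a request's lifecycle; NOT an MPI type.
enum class State { Pending, Completed, Cancelled };

// Hypothetical model of a set of outstanding receives whose completion
// and cancellation are serialized by one internal mutex.
class ToyRequestSet {
public:
    explicit ToyRequestSet(std::size_t n) : states_(n, State::Pending) {}

    // Called by the "progress" side when a message arrives.
    // Returns false if the request had already reached a final state.
    bool complete(std::size_t i) {
        std::lock_guard<std::mutex> lock(mu_);
        if (states_[i] != State::Pending) return false;
        states_[i] = State::Completed;
        return true;
    }

    // Hypothetical CancelAll: accepts a mix of completed and pending
    // requests. Under the same lock that guards completion, it cancels
    // whatever is still pending and leaves completed requests untouched,
    // so no request can complete between the check and the cancel.
    // Returns the number of requests actually cancelled.
    std::size_t cancel_all() {
        std::lock_guard<std::mutex> lock(mu_);
        std::size_t cancelled = 0;
        for (State& s : states_) {
            if (s == State::Pending) {
                s = State::Cancelled;
                ++cancelled;
            }
        }
        return cancelled;
    }

    // Snapshot of one request's state, taken under the lock.
    State state(std::size_t i) const {
        std::lock_guard<std::mutex> lock(mu_);
        return states_[i];
    }

private:
    mutable std::mutex mu_;
    std::vector<State> states_;
};
```

In this model a thread may call cancel_all() at any time while another thread calls complete(); every request ends up either Completed or Cancelled, and cancelling an already completed request is a harmless no-op rather than an error. A user-level loop of Test() followed by Cancel() cannot give that guarantee, because the user has no access to the lock that completion takes.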
-----Original Message-----
From: William Gropp [mailto:gropp at mcs.anl.gov]
Sent: Tuesday, March 13, 2007 6:14 PM
To: David Minor
Cc: mpich2-maint at mcs.anl.gov
Subject: Re: [MPICH2 Req #3217] [MPICH] Multi-threaded cancel problems
Ah. That is an interesting case, but as you note, it violates the
standard. Since the MPI 2.1 process is getting started, it might be
best to raise the issue there; we can try prototyping solutions (such
as a MPIX_WaitallWithCancel) in MPICH2.
Bill
On Mar 13, 2007, at 2:15 AM, David Minor wrote:
> Hi Bill,
> The situation is this. A process issues a set of Irecv commands and
> then saves the requests. It starts a thread that does a WaitAll on
> those requests. Now how can it cancel the transaction before the
> WaitAll has completed? If it goes through the list of requests and
> cancels each one, it's in danger of cancelling an already completed
> request. If it tests each one first, between the Test() and the
> Cancel() the request could complete. The user cannot manage a mutex
> over this because he has no access to the underlying mutex that
> allows messages to complete (mutexes aren't composable!). It seems
> to me there is a problem here in the standard. What is really
> needed is a CancelAll() command which would mutex the completions.
> Barring that I'm not sure what a possible solution is. I admit my
> solution violates the standard because it allows for Cancel() on a
> completed request but it also allows my application to work, which
> is necessary. :-) I'm preparing a comprehensive test of all these
> problems between WaitAll, Test and Cancel that I'll post as soon as
> it's done.
> Regards,
> David
>
> -----Original Message-----
> From: William Gropp [mailto:gropp at mcs.anl.gov]
> Sent: Thursday, February 22, 2007 10:58 PM
> To: David Minor
> Cc: mpich2-maint at mcs.anl.gov
> Subject: Re: [MPICH2 Req #3217] [MPICH] Ooops... forgot to include
> cancel.c in previous post...
>
>>
>> The current version of MPICH2 has a race condition. If you try to
>> cancel a set of outstanding receive requests, it's possible that in
>> the middle of cancelling, one of them will complete. Cancelling a
>> completed request results in an abort-level failure. Checking for
>> completion before cancelling doesn't help because between the time
>> you check and the time you cancel, the request could have completed.
>> It seems the standard didn't really think about this problem,
>> otherwise it would have added a cancelAll operation that would work
>> on a set of requests and be able to do the cancellation inside an
>> internal mutex. I've done a patch on cancel.c that corrects this by
>> not generating an error on canceling an already completed request.
>> Does the standard allow this (in letter if not in spirit)? Enclosed
>> is my fix. Search for "dminor" in the file to see the patch. Let's
>> re-open the discussion and fix this problem in the next release.
>>
>
>
> Hmm. The test on *request == MPI_REQUEST_NULL isn't correct (with
> respect to the standard) because the standard makes null handles
> invalid inputs except for a few situations.
>
> I'm not sure what the exact situation is that you are seeing. If this
> is a multithreaded one, where one thread may be completing the request
> and another might be canceling it, then that is a user error (the user
> is required to manage the mutex). If it is a single-threaded situation
> (by that, I mean a single user thread, even if the implementation
> contains several internal threads), then can you send a sample program
> so that I can see what the operations are? Thanks!
>
> Bill
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: CancelBug.cpp
Type: application/octet-stream
Size: 3463 bytes
Desc: CancelBug.cpp
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20070320/839fa24f/attachment.obj>