[MPICH] outstanding multi-threaded race condition

David Minor david-m at orbotech.com
Wed Feb 21 01:47:17 CST 2007


The current version of MPICH2 has a race condition. If you try to cancel a set of outstanding receive requests, one of them may complete while the cancellation is still in progress. Cancelling a completed request results in an abort-level failure. Checking for completion before cancelling doesn't help, because the request could complete between the check and the cancel. It seems the standard didn't really consider this problem; otherwise it would have added a cancelAll operation that works on a set of requests and performs the cancellation inside an internal mutex. I've patched cancel.c to correct this by not generating an error when cancelling an already completed request. Does the standard allow this (in letter if not in spirit)? Enclosed is my fix; search for "dminor" in the file to see the patch. Let's re-open the discussion and fix this problem in the next release.
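
To make the proposed semantics concrete, here is a minimal sketch. It is not the patch to cancel.c and does not use MPICH internals: mock_request, mock_complete, mock_cancel and mock_cancel_all are hypothetical names invented for illustration, and an atomic compare-and-swap stands in for the internal mutex. The idea is that a cancel and a completion race for the same request, exactly one of them wins, and losing the race is reported through a flag rather than as an error.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a receive request; not an MPICH type. */
enum { REQ_PENDING, REQ_COMPLETED, REQ_CANCELLED };

typedef struct {
    atomic_int state;
} mock_request;

void mock_request_init(mock_request *r) {
    atomic_init(&r->state, REQ_PENDING);
}

/* Models message arrival. Returns 1 if the request was still pending
 * and is now completed, 0 if a cancel (or earlier completion) got there first. */
int mock_complete(mock_request *r) {
    int expected = REQ_PENDING;
    return atomic_compare_exchange_strong(&r->state, &expected, REQ_COMPLETED);
}

/* Tolerant cancel: always returns 0 (success). *cancelled is set to 1 only
 * if the request was still pending; an already completed request is left
 * untouched and *cancelled is set to 0 instead of raising an error. */
int mock_cancel(mock_request *r, int *cancelled) {
    int expected = REQ_PENDING;
    *cancelled = atomic_compare_exchange_strong(&r->state, &expected, REQ_CANCELLED);
    return 0;
}

/* Sketch of a "cancelAll": cancels every request in the set and returns
 * how many were actually cancelled. Per-request results go in cancelled[]. */
size_t mock_cancel_all(mock_request *reqs, size_t n, int *cancelled) {
    assert(reqs != NULL && cancelled != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        mock_cancel(&reqs[i], &cancelled[i]);
        count += (size_t)cancelled[i];
    }
    return count;
}
```

With three requests where the second completes before the cancel, mock_cancel_all reports two cancellations and leaves the second request in the completed state, so the caller can tell afterwards which outcome each request had without any check-then-cancel window.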

Regards,
David Minor



