[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules

Thu Jun 28 17:42:55 CDT 2007

STOP.  DO NOT reply to this email.

Reply instead via a comment in Bugzilla.

(do I sound like Ben yet? ;)

Ioan,

My understanding is that Mihael pointed out 2 clear race conditions, caused by 
unsynchronized access, from his review of the Falkon provider code.

Do you agree or disagree?  If you agree, have you fixed the races?  If not, do we 
need to discuss it further among more experts to get to a decision we believe 
is correct?

I don't want to sermonize, but I will anyway:

<soapbox>

- mutex/synchronization problems are devilishly subtle

- to make mutex code work right, you need code review *and* extensive testing, 
and ideally a lot of code asserts to make sure you are (locked) where you think 
you are.

- if we are arguing about the obvious, it's probably not obvious to everyone
(so f2f tabletop code review is helpful here, for both education and verification)

- to get mutex code right you need to make sure you have the tasks and shared 
data structures (and hence access patterns) clearly identified

- then you need tons of testing: not just live tests, but carefully contrived 
artificial tests to stress test various mutex situations and potential race and 
deadlock conditions.

</soapbox>
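
To make the points above concrete, here is a minimal Java sketch. It is not the 
Falkon provider code; the names TaskRegistry, enqueueLocked and stress are 
hypothetical. It shows a map and a list guarded by one lock, a check that the 
lock is really held where the code assumes it is, and a contrived test in which 
many threads call submit() at once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a provider that tracks submitted tasks. */
class TaskRegistry {
    private final Object lock = new Object();
    // Shared data structures: every access must happen while holding 'lock'.
    private final Map<String, String> statusById = new HashMap<>();
    private final List<String> queue = new ArrayList<>();

    /** May be called from any thread. */
    void submit(String taskId) {
        synchronized (lock) {
            enqueueLocked(taskId);
        }
    }

    /** Callers must hold 'lock'; the check makes that assumption explicit. */
    private void enqueueLocked(String taskId) {
        if (!Thread.holdsLock(lock)) {
            throw new IllegalStateException("enqueueLocked called without the lock");
        }
        statusById.put(taskId, "SUBMITTED");
        queue.add(taskId);
    }

    int size() {
        synchronized (lock) {
            return queue.size();
        }
    }

    /** True when the map and the list agree on how many tasks exist. */
    boolean consistent() {
        synchronized (lock) {
            return statusById.size() == queue.size();
        }
    }

    /** Contrived stress test: 'threads' threads each submit 'perThread' unique tasks. */
    static int stress(int threads, int perThread) {
        TaskRegistry registry = new TaskRegistry();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    registry.submit(id + "-" + i);
                }
            });
            workers.add(worker);
            worker.start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("stress test interrupted", e);
        }
        if (!registry.consistent()) {
            throw new IllegalStateException("map and list disagree");
        }
        return registry.size();
    }

    public static void main(String[] args) {
        System.out.println(TaskRegistry.stress(8, 1000));
    }
}
```

With the synchronized block in place, stress(8, 1000) should always report 8000 
tasks. If the block is removed, the holdsLock check fails on the first call, 
which is the kind of early, loud failure the asserts are for; a light-load live 
run might never show the problem at all.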

I don't think we should stop testing to do a code review, but we certainly will 
need to do one before we can expect very high reliability.

I'd like to ask you, Ioan, since it's your code and project, to work 
out a schedule that works for everyone, and organize a review.  I understand 
that the core Falkon code needs some simple cosmetic cleanup (mainly removing 
fossil code) and then posting in SVN.

:) Mike

Mihael Hategan wrote, On 6/28/2007 4:41 PM:
> On Thu, 2007-06-28 at 16:36 -0500, Ioan Raicu wrote:
>> There is an option to have a pool of threads work on these data
>> structures, but the pool size is set to 1.
> 
> Right, but the submit() method was called from different threads. Can we
> stop arguing about the obvious?
> 
>>   Point is well taken, we have fixed this, but I am not convinced this
>> is where the problem was.  We'll see after we do another run with all
>> the extra logging.
> 
> Can you commit the updates to svn?
> 
>> Ioan
>>
>> Mihael Hategan wrote: 
>>>>> - did Mihael discover an error in Falkon mutex code?
>>>>>
>>>> We are not sure, but we are adding extra synchronization in several 
>>>> parts of the Falkon provider.  The reason we are saying that we are not 
>>>> sure is that we stress tested (pushing 50~100 tasks/sec) both the Falkon 
>>>> provider and Falkon itself over and over again, and we never encountered 
>>>> this.  Now, we have a workflow that has an average of 1 task/sec, and I find 
>>>> it hard to believe that a synchronization issue that never surfaced 
>>>> before under stress testing is surfacing now under such a light load.
>>>>     
>>> ?!?
>>> You are mutating maps and lists from concurrent threads without
>>> synchronization. That is a problem regardless of any other
>>> considerations.
>>>
>>> Mihael
>> -- 
>> ============================================
>> Ioan Raicu
>> Ph.D. Student
>> ============================================
>> Distributed Systems Laboratory
>> Computer Science Department
>> University of Chicago
>> 1100 E. 58th Street, Ryerson Hall
>> Chicago, IL 60637
>> ============================================
>> Email: iraicu at cs.uchicago.edu
>> Web:   http://www.cs.uchicago.edu/~iraicu
>>        http://dsl.cs.uchicago.edu/
>> ============================================
>> ============================================

-- 
Mike Wilde
Computation Institute, University of Chicago
Math & Computer Science Division
Argonne National Laboratory
Argonne, IL   60439    USA
tel 630-252-7497 fax 630-252-1997