[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules

Thu Jun 28 16:32:35 CDT 2007

No, just the Falkon provider (~500 lines of code), as far as I know. 

The Falkon service is around 10K lines of code, and the Falkon executor 
is another 3K, so a code review of everything in Falkon will likely take 
longer than a few days.

Ioan

Ian Foster wrote:
> Did we do a complete code review?
>
> -----Original Message-----
> From: Ioan Raicu <iraicu at cs.uchicago.edu>
> Date: Thu, 28 Jun 2007 16:27:04
> To: bugzilla-daemon at mcs.anl.gov
> Cc: swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
>
>
>
>
> bugzilla-daemon at mcs.anl.gov wrote:
>   
>> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>>
>>
>>
>>
>>
>> ------- Comment #1 from wilde at mcs.anl.gov  2007-06-28 16:12 -------
>> I've reviewed the email thread on this bug and am moving this discussion to
>> Bugzilla.
>>
>> I am uncertain about the following; can the people involved (Nika, Ioan,
>> Mihael) clarify:
>>
>> - did Mihael discover an error in the Falkon mutex code?
>>
>>   
>>     
> We are not sure, but we are adding extra synchronization in several 
> parts of the Falkon provider.  We are not sure because we stress tested 
> both the Falkon provider and Falkon itself (pushing 50-100 tasks/sec) 
> over and over again and never encountered this.  Now we have a workflow 
> that averages 1 task/sec, and I find it hard to believe that a 
> synchronization issue that never surfaced under stress testing is 
> surfacing now under such a light load.  We are also verifying that we 
> handle all exceptions correctly within the Falkon provider.
>   
>> - if so, was it fixed, and did it correct the problem of missed completion
>> notifications?
>>   
>>     
> We don't know.  The problems are not reproducible over short runs and 
> only seem to pop up with longer runs.  For example, we completed the 
> 100 mol run, which had 10K jobs, just fine.  We have to rerun the 
> 244 mol run to verify things.
>   
>> - what's the state of the "unable to write output file" problem?
>>
>> - do we still have a bad node in UC-Teraport w/ NFS stale file handles? If so,
>> was that reported? (This raises interesting issues in troubleshooting and
>> trouble workaround)
>>   
>>     
> I reported it, but help at tg claims the node is working fine.  They claim 
> that once in a while, it is normal for this to happen, and my argument 
> that all other nodes behaved perfectly with the exception of this one 
> isn't enough for them.  For now, if we get this node again, we can 
> manually kill the Falkon worker there so Falkon won't use it anymore.
>   
>> - do we have a plan for how to run this WF at scale? Meaning how to get 244
>> nodes for several days, whether we can scale up beyond
>> 1-processor-per-molecule, what the expected runtime is, how to deal with
>> errors/restarts, etc? (Should detail this here in bugz).
>>   
>>     
> There is still work I need to do to ensure that a task that is running 
> when the resource lease expires is correctly handled and Swift is 
> notified that it failed.  I have the code written and in Falkon already, 
> but I have yet to test it.  We need to make sure this works before we 
> try to get, say, 24-hour resource allocations when we know the experiment 
> will likely take several days.  Also, I think the larger part of the 
> workflow could benefit from more than 1 node per molecule, so if we 
> could get more, it should improve the end-to-end time. 
>
> Ioan
>   
>>   
>>     
>
>   
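
The lease-expiry handling described in the quoted reply (a task still running when the resource lease expires must be marked as failed and the submitter notified) could be sketched roughly as below. This is only an illustration in Python and not Falkon's actual code: the class `LeasedWorker`, its methods, the `notify` callback, and the task IDs are all hypothetical names, and the lease is assumed to be a single fixed expiry timestamp.

```python
from typing import Callable, Dict, List


class LeasedWorker:
    """Hypothetical worker tied to a resource lease with a fixed expiry time."""

    def __init__(self, lease_expires_at: float,
                 notify: Callable[[str, str], None]) -> None:
        self.lease_expires_at = lease_expires_at
        self.notify = notify               # stands in for the callback to the submitter
        self.running: Dict[str, str] = {}  # task_id -> state

    def start(self, task_id: str) -> None:
        self.running[task_id] = "RUNNING"

    def complete(self, task_id: str) -> None:
        # Normal path: the task finished before the lease ran out.
        del self.running[task_id]
        self.notify(task_id, "COMPLETED")

    def check_lease(self, now: float) -> List[str]:
        """Fail every still-running task once the lease has expired."""
        if now < self.lease_expires_at:
            return []
        failed = list(self.running)
        for task_id in failed:
            del self.running[task_id]
            self.notify(task_id, "FAILED")  # submitter can then retry or restart
        return failed


# Usage: one task finishes in time, one is cut off by the lease expiry.
events = []
worker = LeasedWorker(lease_expires_at=100.0,
                      notify=lambda tid, state: events.append((tid, state)))
worker.start("mol-001")
worker.start("mol-002")
worker.complete("mol-001")
worker.check_lease(now=50.0)    # lease still valid, nothing happens
worker.check_lease(now=100.0)   # lease expired, mol-002 is reported as failed
```

The point of the sketch is that every task gets exactly one terminal notification, either completed or failed, so the submitter never waits on a task that was silently dropped when the allocation ended.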

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
