[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
Ioan Raicu
iraicu at cs.uchicago.edu
Thu Jun 28 16:32:35 CDT 2007
No, just the Falkon provider (~500 lines of code), as far as I know.
The Falkon service is around 10K lines of code, and the Falkon executor
is another 3K, so a complete code review of everything in Falkon would
likely take more than a few days.
Ioan
Ian Foster wrote:
> Did we do a complete code review?
>
> Sent via BlackBerry from T-Mobile
>
> -----Original Message-----
> From: Ioan Raicu <iraicu at cs.uchicago.edu>
>
> Date: Thu, 28 Jun 2007 16:27:04
> To: bugzilla-daemon at mcs.anl.gov
> Cc: swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
>
>
>
>
> bugzilla-daemon at mcs.anl.gov wrote:
>
>> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>>
>>
>>
>>
>>
>> ------- Comment #1 from wilde at mcs.anl.gov 2007-06-28 16:12 -------
>> I've reviewed the email thread on this bug and am moving the discussion to
>> bugzilla.
>>
>> I am uncertain about the following - can the people involved (Nika, Ioan,
>> Mihael) clarify:
>>
>> - did Mihael discover an error in Falkon mutex code?
>>
>>
>>
> We are not sure, but we are adding extra synchronization in several
> parts of the Falkon provider. The reason we are not sure is that we
> stress tested (pushing 50~100 tasks/sec) both the Falkon provider and
> Falkon itself over and over again, and we never encountered this
> problem. Now that we have a workflow averaging 1 task/sec, I find it
> hard to believe that a synchronization issue that never surfaced under
> stress testing would surface under such a light load. We are also
> verifying that we handle all exceptions correctly within the Falkon
> provider.
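[Editorial note: a minimal sketch of the kind of "extra synchronization" described above - serializing access to a shared task table so a completion notification on one thread cannot race a submission registering the same task on another. All class and method names here are illustrative assumptions, not actual Falkon provider code.]

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a provider-side task table whose methods are all
// synchronized, so submission and completion notifications cannot
// interleave on the shared map. Illustrative only; not Falkon source.
class TaskTable {
    private final Map<String, String> status = new HashMap<>();

    // Register a task before it is dispatched to a worker.
    synchronized void submitted(String taskId) {
        status.put(taskId, "SUBMITTED");
    }

    // Record a completion notification. Returns false if the task is
    // unknown - the symptom of a missed or duplicated notification.
    synchronized boolean completed(String taskId) {
        if (!status.containsKey(taskId)) {
            return false;
        }
        status.put(taskId, "COMPLETED");
        return true;
    }

    // Read the current status of a task (null if never submitted).
    synchronized String get(String taskId) {
        return status.get(taskId);
    }
}
```

The point of returning false for an unknown task is that a race which drops or duplicates a notification becomes visible and loggable, rather than silently leaving the workflow waiting.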
>
>> - if so was it fixed, and did it correct the problem of missed completion
>> notifications?
>>
>>
> We don't know; the problems are not reproducible over short runs, and
> only seem to pop up in longer runs. For example, the 100-molecule run,
> which had 10K jobs, completed just fine. We have to rerun the
> 244-molecule run to verify things.
>
>> - what's the state of the "unable to write output file" problem?
>>
>> - do we still have a bad node in UC-Teraport with NFS stale file handles? If
>> so, was that reported? (This raises interesting issues in troubleshooting and
>> trouble workaround.)
>>
>>
> I reported it, but help at tg claims the node is working fine. They
> claim it is normal for this to happen once in a while, and my argument
> that all the other nodes behaved perfectly, with the exception of this
> one, isn't enough for them. For now, if we get this node again, we can
> manually kill the Falkon worker there so Falkon won't use it anymore.
>
>> - do we have a plan for how to run this WF at scale? Meaning how to get 244
>> nodes for several days, whether we can scale up beyond
>> 1-processor-per-molecule, what the expected runtime is, how to deal with
>> errors/restarts, etc? (Should detail this here in bugz).
>>
>>
> There is still work I need to do to ensure that a task that is running
> when the resource lease expires is handled correctly and Swift is
> notified that it failed. I have the code written and already in
> Falkon, but I have yet to test it. We need to make sure this works
> before we try to get, say, 24-hour resource allocations when we know
> the experiment will likely take several days. Also, I think the larger
> part of the workflow could benefit from more than one node per
> molecule, so if we could get more nodes, that should improve the
> end-to-end time.
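[Editorial note: a minimal sketch of the lease-expiry handling described above - when a worker's allocation runs out, any task still assigned to it is marked failed so the submitter is notified instead of waiting forever. All names are illustrative assumptions, not actual Falkon code.]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: track which tasks run on which worker, and on
// lease expiry fail every task still assigned to that worker so the
// client side (e.g. Swift) can retry or report an error. Illustrative
// only; not Falkon source.
class LeaseTracker {
    private final Map<String, List<String>> runningByWorker = new HashMap<>();
    private final List<String> failed = new ArrayList<>();

    // Record that a task has started on a worker.
    synchronized void start(String worker, String taskId) {
        runningByWorker.computeIfAbsent(worker, w -> new ArrayList<>())
                       .add(taskId);
    }

    // A task finished normally; forget its assignment.
    synchronized void finish(String worker, String taskId) {
        List<String> tasks = runningByWorker.get(worker);
        if (tasks != null) {
            tasks.remove(taskId);
        }
    }

    // Called when the allocation for `worker` expires: every task still
    // assigned to it is moved to the failed list (in a real system this
    // is where failure notifications would be sent upstream).
    synchronized List<String> leaseExpired(String worker) {
        List<String> orphans = runningByWorker.remove(worker);
        if (orphans == null) {
            return new ArrayList<>();
        }
        failed.addAll(orphans);
        return orphans;
    }

    // Snapshot of tasks failed due to expired leases.
    synchronized List<String> failedTasks() {
        return new ArrayList<>(failed);
    }
}
```

The design choice worth noting is that expiry produces an explicit failure event per orphaned task, which is what lets the submitter distinguish "task lost with the lease" from "task still running slowly".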
>
> Ioan
>
>>
>>
>
>
--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dsl.cs.uchicago.edu/
============================================