<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
No, just the Falkon provider (~500 lines of code), as far as I know. <br>
<br>
The Falkon service is around 10K lines of code, and the Falkon executor
is another 3K, so a code review of everything in Falkon would likely
take more than a few days.<br>
<br>
Ioan<br>
<br>
Ian Foster wrote:
<blockquote
cite="mid:1895472852-1183066171-cardhu_decombobulator_blackberry.rim.net-826324017-@bxe006.bisx.prod.on.blackberry"
type="cite">
<pre wrap="">Did we do a complete code review?
Sent via BlackBerry from T-Mobile
-----Original Message-----
From: Ioan Raicu <a class="moz-txt-link-rfc2396E" href="mailto:iraicu@cs.uchicago.edu"><iraicu@cs.uchicago.edu></a>
Date: Thu, 28 Jun 2007 16:27:04
<a class="moz-txt-link-abbreviated" href="mailto:To:bugzilla-daemon@mcs.anl.gov">To:bugzilla-daemon@mcs.anl.gov</a>
<a class="moz-txt-link-abbreviated" href="mailto:Cc:swift-devel@ci.uchicago.edu">Cc:swift-devel@ci.uchicago.edu</a>
Subject: Re: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
<a class="moz-txt-link-abbreviated" href="mailto:bugzilla-daemon@mcs.anl.gov">bugzilla-daemon@mcs.anl.gov</a> wrote:
</pre>
<blockquote type="cite">
<pre wrap=""><a class="moz-txt-link-freetext" href="http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72">http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72</a>
------- Comment #1 from <a class="moz-txt-link-abbreviated" href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a> 2007-06-28 16:12 -------
I've reviewed the email thread on this bug, and am moving this discussion to
bugzilla.
I am uncertain about the following - can the people involved (Nika, Ioan,
Mihael) clarify:
- did Mihael discover an error in the Falkon mutex code?
</pre>
</blockquote>
<pre wrap=""><!---->We are not sure, but we are adding extra synchronization in several
parts of the Falkon provider. The reason we say we are not sure is that
we stress tested (pushing 50~100 tasks/sec) both the Falkon provider and
Falkon itself over and over again, and we never encountered this. Now we
have a workflow that averages 1 task/sec, and I find it hard to believe
that a synchronization issue that never surfaced under stress testing is
surfacing under such a light load. We are also verifying that we handle
all exceptions correctly within the Falkon provider.
</pre>
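To make the kind of change concrete: a minimal, hypothetical sketch (class and method names are invented, not the actual Falkon provider code) of the extra synchronization being added — a task-status table touched by both the submit path and the notification callback, guarded by a single lock so a completion notification cannot race with submission bookkeeping and be silently dropped:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the actual Falkon provider code. Illustrates
// the guard being added: one lock protects a shared task-status table
// mutated from two threads (submit path and notification callback).
public class TaskTable {
    private final Map status = new HashMap(); // taskId -> status string
    private final Object lock = new Object();

    // Called on the submit path.
    public void submitted(String taskId) {
        synchronized (lock) {
            status.put(taskId, "ACTIVE");
        }
    }

    // Called from the notification thread; without the lock, a
    // notification arriving mid-update could be lost.
    public void completed(String taskId) {
        synchronized (lock) {
            status.put(taskId, "COMPLETED");
        }
    }

    public String statusOf(String taskId) {
        synchronized (lock) {
            return (String) status.get(taskId);
        }
    }
}
```

Under a ~1 task/sec load the unsynchronized window is tiny, which is consistent with the race never showing up in stress tests yet still being possible.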
<blockquote type="cite">
<pre wrap="">- if so was it fixed, and did it correct the problem of missed completion
notifications?
</pre>
</blockquote>
<pre wrap=""><!---->We don't know; the problems are not reproducible over short runs, and
only seem to pop up with longer runs. For example, we completed the 100
molecule run, which had 10K jobs, just fine. We have to rerun the 244
molecule run to verify things.
</pre>
<blockquote type="cite">
<pre wrap="">- what's the state of the "unable to write output file" problem?
- do we still have a bad node in UC-Teraport w/ NFS stale file handles? If so,
was that reported? (This raises interesting issues in troubleshooting and
trouble workaround)
</pre>
</blockquote>
<pre wrap=""><!---->I reported it, but help@tg claims the node is working fine. They say it
is normal for this to happen once in a while, and my argument that every
other node behaved perfectly, with the exception of this one, isn't
enough for them. For now, if we get this node again, we can manually
kill the Falkon worker there so Falkon won't use it anymore.
</pre>
<blockquote type="cite">
<pre wrap="">- do we have a plan for how to run this WF at scale? Meaning how to get 244
nodes for several days, whether we can scale up beyond
1-processor-per-molecule, what the expected runtime is, how to deal with
errors/restarts, etc? (Should detail this here in bugz).
</pre>
</blockquote>
<pre wrap=""><!---->There is still work I need to do to ensure that a task that is running
when the resource lease expires is handled correctly and Swift is
notified that it failed. I have the code written and in Falkon already,
but I have yet to test it. We need to make sure this works before we
try to get, say, 24-hour resource allocations when we know the
experiment will likely take several days. Also, I think the larger part
of the workflow could benefit from more than 1 node per molecule, so if
we could get more nodes, it should improve the end-to-end time.
Ioan
</pre>
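The lease-expiry behavior described above can be sketched as follows. This is a hypothetical illustration (the `LeaseWatcher` and `Notifier` names are invented, and the real code already sitting in Falkon surely differs): when the allocation's wall time runs out, every task still marked running is explicitly reported to the client as failed, instead of its completion notification being silently lost.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the intended lease-expiry handling, not the
// code already in Falkon: fail every still-running task explicitly
// when the resource lease ends.
public class LeaseWatcher {

    // Callback standing in for the real notification channel to Swift.
    public interface Notifier {
        void taskFailed(String taskId, String reason);
    }

    private final List running = new ArrayList(); // taskIds still executing
    private final Notifier notifier;

    public LeaseWatcher(Notifier notifier) {
        this.notifier = notifier;
    }

    public synchronized void taskStarted(String taskId) {
        running.add(taskId);
    }

    public synchronized void taskFinished(String taskId) {
        running.remove(taskId);
    }

    // Invoked when the allocation's wall time runs out: everything that
    // never got a chance to finish is reported as failed.
    public synchronized void leaseExpired() {
        for (Object id : running) {
            notifier.taskFailed((String) id, "resource lease expired");
        }
        running.clear();
    }
}
```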
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: <a class="moz-txt-link-abbreviated" href="mailto:iraicu@cs.uchicago.edu">iraicu@cs.uchicago.edu</a>
Web: <a class="moz-txt-link-freetext" href="http://www.cs.uchicago.edu/~iraicu">http://www.cs.uchicago.edu/~iraicu</a>
<a class="moz-txt-link-freetext" href="http://dsl.cs.uchicago.edu/">http://dsl.cs.uchicago.edu/</a>
============================================
============================================
</pre>
</body>
</html>