<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
No, just the Falkon provider (~500 lines of code), as far as I know. <br>
<br>
The Falkon service is around 10K lines of code, and the Falkon executor
is another 3K, so a code review of everything in Falkon would likely
take more than a few days.<br>
<br>
Ioan<br>
<br>
Ian Foster wrote:
<blockquote
cite="mid:1895472852-1183066171-cardhu_decombobulator_blackberry.rim.net-826324017-@bxe006.bisx.prod.on.blackberry"
type="cite">
<pre wrap="">Did we do a complete code review?
Sent via BlackBerry from T-Mobile
-----Original Message-----
From: Ioan Raicu <a class="moz-txt-link-rfc2396E" href="mailto:iraicu@cs.uchicago.edu"><iraicu@cs.uchicago.edu></a>
Date: Thu, 28 Jun 2007 16:27:04
<a class="moz-txt-link-abbreviated" href="mailto:To:bugzilla-daemon@mcs.anl.gov">To:bugzilla-daemon@mcs.anl.gov</a>
<a class="moz-txt-link-abbreviated" href="mailto:Cc:swift-devel@ci.uchicago.edu">Cc:swift-devel@ci.uchicago.edu</a>
Subject: Re: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
<a class="moz-txt-link-abbreviated" href="mailto:bugzilla-daemon@mcs.anl.gov">bugzilla-daemon@mcs.anl.gov</a> wrote:
</pre>
<blockquote type="cite">
<pre wrap=""><a class="moz-txt-link-freetext" href="http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72">http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72</a>
------- Comment #1 from <a class="moz-txt-link-abbreviated" href="mailto:wilde@mcs.anl.gov">wilde@mcs.anl.gov</a> 2007-06-28 16:12 -------
I've reviewed the email thread on this bug, and am moving this discussion to
bugzilla.
I am uncertain about the following - can the people involved (Nika, Ioan,
Mihael) clarify:
- did Mihael discover an error in the Falkon mutex code?
</pre>
</blockquote>
<pre wrap=""><!---->We are not sure, but we are adding extra synchronization in several
parts of the Falkon provider. The reason we say we are not sure is that
we stress tested (pushing 50~100 tasks/sec) both the Falkon provider and
Falkon itself over and over again, and we never encountered this. Now we
have a workflow that averages 1 task/sec, and I find it hard to believe
that a synchronization issue that never surfaced under stress testing is
surfacing under such a light load. We are also verifying that we handle
all exceptions correctly within the Falkon provider.
</pre>
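To make the kind of change concrete: a minimal, hypothetical sketch (class and method names are invented, not the actual Falkon provider code) of the extra synchronization being added — a task-status table touched by both the submit path and the notification callback, guarded by a single lock so a completion notification cannot race with submission bookkeeping and be silently dropped:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the actual Falkon provider code. Illustrates
// the guard being added: one lock protects a shared task-status table
// mutated from two threads (submit path and notification callback).
public class TaskTable {
    private final Map status = new HashMap(); // taskId -> status string
    private final Object lock = new Object();

    // Called on the submit path.
    public void submitted(String taskId) {
        synchronized (lock) {
            status.put(taskId, "ACTIVE");
        }
    }

    // Called from the notification thread; without the lock, a
    // notification arriving mid-update could be lost.
    public void completed(String taskId) {
        synchronized (lock) {
            status.put(taskId, "COMPLETED");
        }
    }

    public String statusOf(String taskId) {
        synchronized (lock) {
            return (String) status.get(taskId);
        }
    }
}
```

Under a ~1 task/sec load the unsynchronized window is tiny, which is consistent with the race never showing up in stress tests yet still being possible.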
<blockquote type="cite">
<pre wrap="">- if so was it fixed, and did it correct the problem of missed completion
notifications?
</pre>
</blockquote>
<pre wrap=""><!---->We don't know; the problems are not reproducible over short runs, and
only seem to pop up with longer runs. For example, we completed the 100
molecule run, which had 10K jobs, just fine. We have to rerun the 244
molecule run to verify things.
</pre>
<blockquote type="cite">
<pre wrap="">- what's the state of the "unable to write output file" problem?
- do we still have a bad node in UC-Teraport w/ NFS stale file handles? If so,
was that reported? (This raises interesting issues in troubleshooting and
trouble workaround)
</pre>
</blockquote>
<pre wrap=""><!---->I reported it, but help@tg claims the node is working fine. They say it
is normal for this to happen once in a while, and my argument that every
other node behaved perfectly, with the exception of this one, isn't
enough for them. For now, if we get this node again, we can manually
kill the Falkon worker there so Falkon won't use it anymore.
</pre>
<blockquote type="cite">
<pre wrap="">- do we have a plan for how to run this WF at scale? Meaning how to get 244
nodes for several days, whether we can scale up beyond
1-processor-per-molecule, what the expected runtime is, how to deal with
errors/restarts, etc? (Should detail this here in bugz).
</pre>
</blockquote>
<pre wrap=""><!---->There is still work I need to do to ensure that a task that is running
when the resource lease expires is handled correctly and Swift is
notified that it failed. I have the code written and in Falkon already,
but I have yet to test it. We need to make sure this works before we
try to get, say, 24-hour resource allocations when we know the
experiment will likely take several days. Also, I think the larger part
of the workflow could benefit from more than 1 node per molecule, so if
we could get more nodes, it should improve the end-to-end time.
Ioan
</pre>
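The lease-expiry behavior described above can be sketched as follows. This is a hypothetical illustration (the `LeaseWatcher` and `Notifier` names are invented, and the real code already sitting in Falkon surely differs): when the allocation's wall time runs out, every task still marked running is explicitly reported to the client as failed, instead of its completion notification being silently lost.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the intended lease-expiry handling, not the
// code already in Falkon: fail every still-running task explicitly
// when the resource lease ends.
public class LeaseWatcher {

    // Callback standing in for the real notification channel to Swift.
    public interface Notifier {
        void taskFailed(String taskId, String reason);
    }

    private final List running = new ArrayList(); // taskIds still executing
    private final Notifier notifier;

    public LeaseWatcher(Notifier notifier) {
        this.notifier = notifier;
    }

    public synchronized void taskStarted(String taskId) {
        running.add(taskId);
    }

    public synchronized void taskFinished(String taskId) {
        running.remove(taskId);
    }

    // Invoked when the allocation's wall time runs out: everything that
    // never got a chance to finish is reported as failed.
    public synchronized void leaseExpired() {
        for (Object id : running) {
            notifier.taskFailed((String) id, "resource lease expired");
        }
        running.clear();
    }
}
```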
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: <a class="moz-txt-link-abbreviated" href="mailto:iraicu@cs.uchicago.edu">iraicu@cs.uchicago.edu</a>
Web: <a class="moz-txt-link-freetext" href="http://www.cs.uchicago.edu/~iraicu">http://www.cs.uchicago.edu/~iraicu</a>
<a class="moz-txt-link-freetext" href="http://dsl.cs.uchicago.edu/">http://dsl.cs.uchicago.edu/</a>
============================================
============================================
</pre>
</body>
</html>