[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
Ioan Raicu
iraicu at cs.uchicago.edu
Thu Jun 28 16:27:04 CDT 2007
bugzilla-daemon at mcs.anl.gov wrote:
> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>
>
>
>
>
> ------- Comment #1 from wilde at mcs.anl.gov 2007-06-28 16:12 -------
> I've reviewed this email thread on this bug, and am moving this discussion to
> bugzilla.
>
> I am uncertain about the following - can the people involved (Nika, Ioan,
> Mihael) clarify:
>
> - did Mihael discover an error in Falkon mutex code?
>
>
We are not sure, but we are adding extra synchronization in several
parts of the Falkon provider. The reason we say we are not sure is that
we stress tested (pushing 50-100 tasks/sec) both the Falkon provider
and Falkon itself over and over again, and we never encountered this.
Now we have a workflow that averages 1 task/sec, and I find it hard to
believe that a synchronization issue that never surfaced before under
stress testing is surfacing now under such a light load. We are also
verifying that we handle all exceptions correctly within the Falkon
provider.
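To make the race we are worried about concrete, here is a minimal sketch of
the kind of extra synchronization we are adding. This is not the actual
provider code; the class name (ActiveTaskTable) and the idea of a submission
thread and a notification thread sharing one table are assumptions for
illustration only.

import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only, not Falkon provider code: the submitting
// thread registers a task and the notification thread marks it complete.
// Without a common lock, a completion that races the registration can be
// dropped, which looks exactly like a "missed completion notification".
public class ActiveTaskTable {
    private final Map<String, String> status = new HashMap<String, String>();
    private final Object lock = new Object();

    // Called by the submission path when a task is handed to Falkon.
    public void register(String taskId) {
        synchronized (lock) {
            if (!status.containsKey(taskId)) {
                status.put(taskId, "ACTIVE");
            }
        }
    }

    // Called by the notification thread when Falkon reports completion.
    public void complete(String taskId) {
        synchronized (lock) {
            // Record the completion even if it arrives before register(),
            // so the submitting side can still observe it later.
            status.put(taskId, "COMPLETED");
        }
    }

    public boolean isCompleted(String taskId) {
        synchronized (lock) {
            return "COMPLETED".equals(status.get(taskId));
        }
    }
}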
> - if so was it fixed, and did it correct the problem of missed completion
> notifications?
>
We don't know; the problems aren't reproducible over short runs and only
seem to pop up with longer runs. For example, we completed the 100 molecule
run just fine, which had 10K jobs. We have to rerun the 244 molecule run to
verify things.
> - what's the state of the "unable to write output file" problem?
>
> - do we still have a bad node in UC-Teraport w/ NFS stale file handles? If so,
> was that reported? (This raises interesting issues in troubleshooting and
> working around trouble)
>
I reported it, but help at tg claims the node is working fine. They claim
that it is normal for this to happen once in a while, and my argument
that all other nodes behaved perfectly with the exception of this one
isn't enough for them. For now, if we get this node again, we can
manually kill the Falkon worker there so Falkon won't use it anymore.
> - do we have a plan for how to run this WF at scale? Meaning how to get 244
> nodes for several days, whether we can scale up beyond
> 1-processor-per-molecule, what the expected runtime is, how to deal with
> errors/restarts, etc? (Should detail this here in bugz).
>
There is still work I need to do to ensure that a task that is running
when the resource lease expires is handled correctly and that Swift is
notified that it failed. I have the code written and in Falkon already,
but I have yet to test it. We need to make sure this works before we
try to get, say, 24-hour resource allocations when we know the experiment
will likely take several days. Also, I think the larger part of the
workflow could benefit from more than 1 node per molecule, so if we
could get more nodes, that should improve the end-to-end time.
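For reference, here is a rough sketch of what the lease-expiry handling is
meant to do; the class and method names (LeaseExpiryHandler, TaskListener,
leaseExpired) are made up for illustration and are not the code that is
sitting in Falkon now.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch, not Falkon code: when a worker allocation (resource
// lease) ends, every task still marked as running on it is explicitly
// failed, so the client (Swift) gets a notification instead of waiting
// forever on a task that can no longer finish.
public class LeaseExpiryHandler {

    public interface TaskListener {
        void taskFailed(String taskId, String reason);
    }

    private final List<String> runningTasks = new CopyOnWriteArrayList<String>();
    private final TaskListener listener;

    public LeaseExpiryHandler(TaskListener listener) {
        this.listener = listener;
    }

    public void taskStarted(String taskId) {
        runningTasks.add(taskId);
    }

    public void taskFinished(String taskId) {
        runningTasks.remove(taskId);
    }

    // Invoked when the resource lease (e.g. a batch allocation) expires.
    public void leaseExpired(String resourceName) {
        for (String taskId : runningTasks) {
            // Report the task as failed so it can be retried or restarted,
            // rather than silently disappearing along with the lease.
            listener.taskFailed(taskId, "resource lease expired on " + resourceName);
        }
        runningTasks.clear();
    }
}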
Ioan
>
>
--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dsl.cs.uchicago.edu/
============================================