[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
Ioan Raicu
iraicu at cs.uchicago.edu
Thu Jun 28 16:27:04 CDT 2007
bugzilla-daemon at mcs.anl.gov wrote:
> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
>
>
>
>
>
> ------- Comment #1 from wilde at mcs.anl.gov 2007-06-28 16:12 -------
> I've reviewed this email thread on this bug, and am moving this discussion to
> bugzilla.
>
> I am uncertain about the following - can the people involved (Nika, Ioan,
> Mihael) clarify:
>
> - did Mihael discover an error in Falkon mutex code?
>
>
We are not sure, but we are adding extra synchronization in several
parts of the Falkon provider. The reason we say we are not sure is that
we stress tested (pushing 50-100 tasks/sec) both the Falkon provider
and Falkon itself over and over again, and we never encountered this.
Now we have a workflow that averages 1 task/sec, and I find it hard to
believe that a synchronization issue that never surfaced before under
stress testing is surfacing now under such a light load. We are also
verifying that we handle all exceptions correctly within the Falkon
provider.
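To make the race we are worried about concrete, here is a minimal sketch of
the kind of extra synchronization we are adding. This is not the actual
provider code; the class name (ActiveTaskTable) and the idea of a submission
thread and a notification thread sharing one table are assumptions for
illustration only.

import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only, not Falkon provider code: the submitting
// thread registers a task and the notification thread marks it complete.
// Without a common lock, a completion that races the registration can be
// dropped, which looks exactly like a "missed completion notification".
public class ActiveTaskTable {
    private final Map<String, String> status = new HashMap<String, String>();
    private final Object lock = new Object();

    // Called by the submission path when a task is handed to Falkon.
    public void register(String taskId) {
        synchronized (lock) {
            if (!status.containsKey(taskId)) {
                status.put(taskId, "ACTIVE");
            }
        }
    }

    // Called by the notification thread when Falkon reports completion.
    public void complete(String taskId) {
        synchronized (lock) {
            // Record the completion even if it arrives before register(),
            // so the submitting side can still observe it later.
            status.put(taskId, "COMPLETED");
        }
    }

    public boolean isCompleted(String taskId) {
        synchronized (lock) {
            return "COMPLETED".equals(status.get(taskId));
        }
    }
}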
> - if so was it fixed, and did it correct the problem of missed completion
> notifications?
>
We don't know; the problems aren't reproducible over short runs and only
seem to pop up with longer runs. For example, we completed the 100 molecule
run just fine, which had 10K jobs. We have to rerun the 244 molecule run to
verify things.
> - what's the state of the "unable to write output file" problem?
>
> - do we still have a bad node in UC-Teraport w/ NFS stale file handles? If so,
> was that reported? (This raises interesting issues in troubleshooting and
> working around trouble)
>
I reported it, but help at tg claims the node is working fine. They claim
that it is normal for this to happen once in a while, and my argument
that all other nodes behaved perfectly with the exception of this one
isn't enough for them. For now, if we get this node again, we can
manually kill the Falkon worker there so Falkon won't use it anymore.
> - do we have a plan for how to run this WF at scale? Meaning how to get 244
> nodes for several days, whether we can scale up beyond
> 1-processor-per-molecule, what the expected runtime is, how to deal with
> errors/restarts, etc? (Should detail this here in bugz).
>
There is still work I need to do to ensure that a task that is running
when the resource lease expires is handled correctly and that Swift is
notified that it failed. I have the code written and in Falkon already,
but I have yet to test it. We need to make sure this works before we
try to get, say, 24-hour resource allocations when we know the experiment
will likely take several days. Also, I think the larger part of the
workflow could benefit from more than 1 node per molecule, so if we
could get more nodes, that should improve the end-to-end time.
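For reference, here is a rough sketch of what the lease-expiry handling is
meant to do; the class and method names (LeaseExpiryHandler, TaskListener,
leaseExpired) are made up for illustration and are not the code that is
sitting in Falkon now.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch, not Falkon code: when a worker allocation (resource
// lease) ends, every task still marked as running on it is explicitly
// failed, so the client (Swift) gets a notification instead of waiting
// forever on a task that can no longer finish.
public class LeaseExpiryHandler {

    public interface TaskListener {
        void taskFailed(String taskId, String reason);
    }

    private final List<String> runningTasks = new CopyOnWriteArrayList<String>();
    private final TaskListener listener;

    public LeaseExpiryHandler(TaskListener listener) {
        this.listener = listener;
    }

    public void taskStarted(String taskId) {
        runningTasks.add(taskId);
    }

    public void taskFinished(String taskId) {
        runningTasks.remove(taskId);
    }

    // Invoked when the resource lease (e.g. a batch allocation) expires.
    public void leaseExpired(String resourceName) {
        for (String taskId : runningTasks) {
            // Report the task as failed so it can be retried or restarted,
            // rather than silently disappearing along with the lease.
            listener.taskFailed(taskId, "resource lease expired on " + resourceName);
        }
        runningTasks.clear();
    }
}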
Ioan
>
>
--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dsl.cs.uchicago.edu/
============================================