[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules

Mihael Hategan hategan at mcs.anl.gov
Thu Jun 28 16:35:33 CDT 2007


I was waiting for it to be cleaned up and put into SVN, as we agreed.

On Thu, 2007-06-28 at 16:32 -0500, Ioan Raicu wrote:
> No, just the Falkon provider (~500 lines of code), as far as I know.  
> 
> The Falkon service is around 10K lines of code, and the Falkon
> executor is another 3K, so they will likely take longer than a few
> days for a code review of everything in Falkon.
> 
> Ioan
> 
> Ian Foster wrote: 
> > Did we do a complete code review?
> >  
> > Sent via BlackBerry from T-Mobile
> > 
> > -----Original Message-----
> > From: Ioan Raicu <iraicu at cs.uchicago.edu>
> > 
> > Date: Thu, 28 Jun 2007 16:27:04 
> > To: bugzilla-daemon at mcs.anl.gov
> > Cc: swift-devel at ci.uchicago.edu
> > Subject: Re: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
> > 
> > 
> > 
> > 
> > bugzilla-daemon at mcs.anl.gov wrote:
> >   
> > > http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
> > > 
> > > 
> > > 
> > > 
> > > 
> > > ------- Comment #1 from wilde at mcs.anl.gov  2007-06-28 16:12 -------
> > > I've reviewed this email thread on this bug, and am moving this discussion to
> > > bugzilla. 
> > > 
> > > I am uncertain about the following - can the people involved (Nika, Ioan,
> > > Mihael) clarify:
> > > 
> > > - did Mihael discover an error in Falkon mutex code?
> > > 
> > >   
> > >     
> > We are not sure, but we are adding extra synchronization in several 
> > parts of the Falkon provider.  The reason we say we are not sure is 
> > that we stress tested (pushing 50~100 tasks/sec) both the Falkon 
> > provider and Falkon itself over and over again, and we never 
> > encountered this.  Now that we have a workflow averaging 1 task/sec, 
> > I find it hard to believe that a synchronization issue that never 
> > surfaced under stress testing is surfacing under such a light load.  
> > We are also verifying that we handle all exceptions correctly 
> > within the Falkon provider.
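[Editor's note: the lost-notification race being guarded against can be illustrated with a minimal sketch. The class and method names below are hypothetical, and this is not Falkon's actual provider code; it only shows the general pattern of guarding shared task state with intrinsic locks, assuming completion notifications may arrive on multiple threads.]

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: completion notifications arriving on multiple
// threads must update shared task state atomically, or a notification
// can be lost. Names are illustrative, not taken from Falkon.
public class NotificationSketch {
    private final Map<String, String> taskStatus = new HashMap<>();

    // Without synchronized, two threads could interleave the
    // containsKey check and the put, corrupting the map or
    // silently dropping a completion.
    public synchronized void notifyCompletion(String taskId) {
        if (!taskStatus.containsKey(taskId)) {
            taskStatus.put(taskId, "COMPLETED");
        }
    }

    public synchronized int completedCount() {
        return taskStatus.size();
    }

    public static void main(String[] args) throws InterruptedException {
        NotificationSketch s = new NotificationSketch();
        Thread[] threads = new Thread[8];
        for (int t = 0; t < threads.length; t++) {
            final int offset = t;
            threads[t] = new Thread(() -> {
                // Each thread reports completions for 100 distinct tasks.
                for (int i = 0; i < 100; i++) {
                    s.notifyCompletion("task-" + offset + "-" + i);
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(s.completedCount()); // 800
    }
}
```

Such races are load-sensitive, which is consistent with the point above: a narrow unsynchronized window may never be hit in a short stress test yet still trip over a multi-day run.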
> >   
> > > - if so was it fixed, and did it correct the problem of missed completion
> > > notifications?
> > >   
> > >     
> > We don't know; the problems are not reproducible over short runs, and 
> > only seem to pop up in longer runs.  For example, we completed the 100 
> > mol run, which had 10K jobs, just fine.  We have to rerun the 244 mol 
> > run to verify things.
> >   
> > > - what's the state of the "unable to write output file" problem?
> > > 
> > > - do we still have a bad node in UC-Teraport w/ NFS stale file handles? If so,
> > > was that reported? (This raises interesting issues in troubleshooting and
> > > trouble workaround)
> > >   
> > >     
> > I reported it, but help at tg claims the node is working fine.  They 
> > claim it is normal for this to happen once in a while, and my argument 
> > that all the other nodes behaved perfectly, with the exception of this 
> > one, isn't enough for them.  For now, if we get this node again, we can 
> > manually kill the Falkon worker there so Falkon won't use it anymore.
> >   
> > > - do we have a plan for how to run this WF at scale? Meaning how to get 244
> > > nodes for several days, whether we can scale up beyond
> > > 1-processor-per-molecule, what the expected runtime is, how to deal with
> > > errors/restarts, etc? (Should detail this here in bugz).
> > >   
> > >     
> > There is still work I need to do to ensure that a task that is running 
> > when the resource lease expires is correctly handled and that Swift is 
> > notified that it failed.  I have the code written and in Falkon already, 
> > but I have yet to test it.  We need to make sure this works before we 
> > try to get, say, 24-hour resource allocations when we know the experiment 
> > will likely take several days.  Also, I think the larger part of the 
> > workflow could benefit from more than 1 node per molecule, so if we 
> > could get more nodes, that should improve the end-to-end time. 
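[Editor's note: the lease-expiry handling described above can be sketched roughly as follows. All names here are hypothetical, not actual Falkon code; the sketch only captures the stated behavior, namely that when a resource lease expires, any task still running on that resource must be marked failed and reported back to the submitter (Swift) rather than silently lost.]

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the lease-expiry behavior described above
// (hypothetical names, not Falkon code): when a worker's resource
// lease runs out, tasks still executing there are marked failed
// and their ids are collected so failure notifications can be sent.
public class LeaseExpirySketch {

    enum State { RUNNING, COMPLETED, FAILED }

    static class Task {
        final String id;
        State state = State.RUNNING;
        Task(String id) { this.id = id; }
    }

    /** Marks every still-running task failed; returns the ids to notify. */
    static List<String> onLeaseExpired(List<Task> tasksOnWorker) {
        List<String> toNotify = new ArrayList<>();
        for (Task t : tasksOnWorker) {
            if (t.state == State.RUNNING) {
                t.state = State.FAILED; // the task dies with the lease
                toNotify.add(t.id);     // submitter must be told it failed
            }
        }
        return toNotify;
    }

    public static void main(String[] args) {
        Task a = new Task("mol-1");
        Task b = new Task("mol-2");
        b.state = State.COMPLETED;  // finished before the lease expired
        System.out.println(onLeaseExpired(List.of(a, b))); // [mol-1]
    }
}
```

The key design point is that an expired lease produces an explicit failure notification, which lets the client side (Swift's restart/retry machinery) distinguish a killed task from one that is merely slow.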
> > 
> > Ioan
> >   
> > 
> >   
> 
> -- 
> ============================================
> Ioan Raicu
> Ph.D. Student
> ============================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu
>        http://dsl.cs.uchicago.edu/
> ============================================
> ============================================
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel



