[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
Mihael Hategan
hategan at mcs.anl.gov
Thu Jun 28 16:35:33 CDT 2007
I was waiting for it to be cleaned up and put into SVN, as we agreed.
On Thu, 2007-06-28 at 16:32 -0500, Ioan Raicu wrote:
> No, just the Falkon provider (~500 lines of code), as far as I know.
>
> The Falkon service is around 10K lines of code, and the Falkon
> executor is another 3K, so they will likely take longer than a few
> days for a code review of everything in Falkon.
>
> Ioan
>
> Ian Foster wrote:
> > Did we do a complete code review?
> >
> > Sent via BlackBerry from T-Mobile
> >
> > -----Original Message-----
> > From: Ioan Raicu <iraicu at cs.uchicago.edu>
> >
> > Date: Thu, 28 Jun 2007 16:27:04
> > To: bugzilla-daemon at mcs.anl.gov
> > Cc: swift-devel at ci.uchicago.edu
> > Subject: Re: [Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules
> >
> > bugzilla-daemon at mcs.anl.gov wrote:
> >
> > > http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72
> > >
> > > ------- Comment #1 from wilde at mcs.anl.gov 2007-06-28 16:12 -------
> > > I've reviewed the email thread on this bug, and am moving this discussion to
> > > bugzilla.
> > >
> > > I am uncertain about the following; can the people involved (Nika, Ioan,
> > > Mihael) clarify:
> > >
> > > - did Mihael discover an error in Falkon mutex code?
> > >
> > >
> > >
> > We are not sure, but we are adding extra synchronization in several
> > parts of the Falkon provider. The reason we say we are not sure is
> > that we stress tested (pushing 50~100 tasks/sec) both the Falkon
> > provider and Falkon itself over and over again, and we never
> > encountered this. Now that we have a workflow averaging 1 task/sec, I
> > find it hard to believe that a synchronization issue that never
> > surfaced under stress testing is surfacing under such a light load.
> > We are also verifying that we handle all exceptions correctly within
> > the Falkon provider.
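[The kind of race extra synchronization guards against can be illustrated with a minimal, self-contained sketch. This is not Falkon's actual code; the class and method names below are hypothetical. Without `synchronized`, two threads can read the same counter value and both write back value+1, silently losing a completion; with it, every completion is counted exactly once regardless of load.]

```java
// Minimal illustration of a lost-update race; names are hypothetical,
// not Falkon's actual code.
public class NotificationCounter {
    private int completed = 0;

    // Without 'synchronized', concurrent callers can interleave the
    // read-increment-write of 'completed' and drop notifications.
    public synchronized void taskCompleted() {
        completed++;
    }

    public synchronized int getCompleted() {
        return completed;
    }

    public static void main(String[] args) throws InterruptedException {
        NotificationCounter c = new NotificationCounter();
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 10000; j++) {
                    c.taskCompleted();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        // With synchronization, all 8 * 10000 completions are counted.
        System.out.println(c.getCompleted()); // prints 80000
    }
}
```

[Note that such races are timing-dependent: they can stay hidden under heavy, regular load and surface under a sparse 1 task/sec arrival pattern, which is why "it survived stress testing" is not conclusive.]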
> >
> > > - if so was it fixed, and did it correct the problem of missed completion
> > > notifications?
> > >
> > >
> > We don't know; the problems are not reproducible over short runs, and
> > only seem to pop up with longer runs. For example, we completed the
> > 100 mol run, which had 10K jobs, just fine. We have to rerun the 244
> > mol run to verify things.
> >
> > > - what's the state of the "unable to write output file" problem?
> > >
> > > - do we still have a bad node in UC-Teraport with NFS stale file handles? If
> > > so, was that reported? (This raises interesting issues in troubleshooting and
> > > trouble workaround.)
> > >
> > >
> > I reported it, but help at tg claims the node is working fine. They
> > claim that it is normal for this to happen once in a while, and my
> > argument that all the other nodes behaved perfectly, with the
> > exception of this one, isn't enough for them. For now, if we get this
> > node again, we can manually kill the Falkon worker there so Falkon
> > won't use it anymore.
> >
> > > - do we have a plan for how to run this WF at scale? Meaning how to get 244
> > > nodes for several days, whether we can scale up beyond
> > > 1-processor-per-molecule, what the expected runtime is, how to deal with
> > > errors/restarts, etc? (Should detail this here in bugz).
> > >
> > >
> > There is still work I need to do to ensure that a task that is running
> > when the resource lease expires is handled correctly and that Swift is
> > notified that it failed. I have the code written and in Falkon
> > already, but I have yet to test it. We need to make sure this works
> > before we try to get, say, 24-hour resource allocations when we know
> > the experiment will likely take several days. Also, I think the larger
> > part of the workflow could benefit from more than 1 node per molecule,
> > so if we could get more nodes, that should improve the end-to-end time.
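[The lease-expiry handling described above can be sketched roughly as follows. This is an illustrative, self-contained example under assumed names (`TaskState`, `LeaseMonitor`), not Falkon's actual implementation: if a task is still running when the resource lease deadline passes, it is marked failed so the submitter (Swift) gets a failure notification instead of waiting forever for a completion that will never arrive.]

```java
// Hypothetical sketch of lease-expiry handling; the names below are
// illustrative, not Falkon's actual API.
public class LeaseMonitor {
    public enum TaskState { RUNNING, COMPLETED, FAILED }

    private TaskState state = TaskState.RUNNING;
    private final long leaseExpiryMillis;

    public LeaseMonitor(long leaseExpiryMillis) {
        this.leaseExpiryMillis = leaseExpiryMillis;
    }

    // Called periodically, or when the resource manager reclaims the
    // node: a task still RUNNING past the lease deadline transitions to
    // FAILED, which triggers a failure notification to the submitter.
    public synchronized TaskState checkLease(long nowMillis) {
        if (state == TaskState.RUNNING && nowMillis >= leaseExpiryMillis) {
            state = TaskState.FAILED;
        }
        return state;
    }

    // Normal completion before the lease expires.
    public synchronized void complete() {
        if (state == TaskState.RUNNING) {
            state = TaskState.COMPLETED;
        }
    }

    public static void main(String[] args) {
        LeaseMonitor m = new LeaseMonitor(1000);
        System.out.println(m.checkLease(500));   // prints RUNNING
        System.out.println(m.checkLease(1500));  // prints FAILED
    }
}
```

[The synchronized state transitions matter here: the expiry check and a racing completion notification must not both win, or Swift could receive contradictory reports for the same task.]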
> >
> > Ioan
> >
> >
> >
>
> --
> ============================================
> Ioan Raicu
> Ph.D. Student
> ============================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ============================================
> Email: iraicu at cs.uchicago.edu
> Web: http://www.cs.uchicago.edu/~iraicu
> http://dsl.cs.uchicago.edu/
> ============================================
> ============================================
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel