[Swift-devel] recent error on beagle

Tim Armstrong tim.g.armstrong at gmail.com
Thu Jun 2 15:24:04 CDT 2011


Any word on this bug?  I have a nice use-case for SwiftR where it would be
very handy to take advantage of Swift's dynamic resource procurement.

- Tim
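
For reference, the timing in the report quoted below can be checked with a short sketch. It only reproduces the arithmetic behind the "after 5 minutes" observation and does not model how the coaster service schedules jobs. The helper `parse_hms` and the variable names are hypothetical, and it assumes maxTime in sites.xml is exactly 4 hours.

```python
from datetime import timedelta


def parse_hms(text: str) -> timedelta:
    """Parse an HH:MM:SS string such as "03:55:00" into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Values taken from the report: GLOBUS::maxWallTime in tc and maxTime in
# sites.xml (assumed here to be exactly 4 hours).
job_walltime = parse_hms("03:55:00")
site_max_time = timedelta(hours=4)

# Difference between the two settings; the report says the exception shows
# up after roughly this long.
gap = site_max_time - job_walltime
print(gap)
```

With other settings, the same subtraction gives the delay to expect before the message appears, if the behaviour matches the report.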

On Thu, May 26, 2011 at 3:41 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> Given that this has now been reported a number of times, it may make
> sense to backport the fix from trunk and make a patch release for 0.92.
>
> Objections?
>
> On Thu, 2011-05-26 at 14:59 -0500, Tim Armstrong wrote:
> > Hi,
> >   I've encountered this issue with SwiftR, running release 0.92 from
> > the svn repository.  The issue occurs when
> > GLOBUS::maxWallTime="03:55:00" in tc and maxTime is 4 hours in
> > sites.xml.  After 5 minutes (or whatever the difference is between the
> > two times), I get the exception copied below.  A tarball is attached
> > with the logs, script, etc.  replicate.sh shows how to replicate the
> > issue on PADS.
> >
> > Assuming that my problem is the same as the others, it would be good
> > if the fix could be merged to release 0.92, as I'm trying to bundle
> > stable swift releases with SwiftR.
> >
> > - Tim
> >
> >
> > Swift svn swift-r4336 cog-r3096 (cog modified locally)
> >
> > RunID: 20110526-1317-2c8ybi10
> > Progress:
> > SwiftScript trace: top of loop: rserver waiting for input on, /tmp/nbest/SwiftR/swift.0827/requestpipe
> > Progress:  Active:1
> > Progress:  Finished successfully:1
> > SwiftScript trace: rserver: got dir, /tmp/nbest/SwiftR/requests.P09626/R0000007
> > Progress:  uninitialized:1  Finished successfully:1
> > Progress:  Submitted:1  Finished successfully:1
> > Progress:  Active:1  Finished successfully:1
> > Progress:  Active:1  Finished successfully:1
> > Progress:  Active:1  Finished successfully:1
> > Progress:  Active:1  Finished successfully:1
> > Progress:  Active:1  Finished successfully:1
> > Progress:  Active:1  Finished successfully:1
> > Progress:  Active:1  Finished successfully:1
> > Progress:  Active:1  Finished successfully:1
> > Progress:  Active:1  Finished successfully:1
> > queuedsize > 0 but no job dequeued. Queued: {}
> > java.lang.Throwable
> >         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:252)
> >         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:520)
> >         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:109)
> > queuedsize > 0 but no job dequeued. Queued: {}
> > java.lang.Throwable
> >         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:252)
> >         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:520)
> >         at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:109)
> > Progress:  Finished successfully:1 Failed but can retry:1
> >
> >
> > On Sun, May 22, 2011 at 1:51 PM, Mihael Hategan <hategan at mcs.anl.gov>
> > wrote:
> >         The second one looks to me like a coaster problem. Can't say
> >         much about the first issue.
> >
> >         Can you try with plain pbs if you want to test the pbs provider?
> >
> >         Mihael
> >
> >
> >         On Sun, 2011-05-22 at 08:39 -0500, ketan wrote:
> >         > I can confirm that the trunk is not usable for the pbs
> >         > provider. I am using trunk for submitting jobs on beagle and
> >         > I see a few unexpected things:
> >         >
> >         > 1. The stderr is showing inconsistent messages: The results
> >         > are getting written to the output even though stderr doesn't
> >         > report any.
> >         > 2. qsub jobs being cancelled inadvertently: I submitted 40 of
> >         > them yesterday, however, only 2 survived today. The log is here:
> >         >
> >         > http://www.ci.uchicago.edu/~ketan/files/ftdock-20110521-0337-pokpgg89.log
> >         >
> >         > In addition, the ssh-pbs provider does not seem to be working
> >         > for large runs (it worked for a small number of test runs):
> >         > Getting unexpected stdouts. Following is the stdout:
> >         >
> >         > http://www.ci.uchicago.edu/~ketan/files/ssh-pbs.stdout
> >         >
> >         > Following is the log file for the above run:
> >         >
> >         > http://www.ci.uchicago.edu/~ketan/files/ftdock-20110521-1750-b0cot9sa.log
> >         >
> >         >
> >         > Ketan
> >         >
> >         > On 5/21/11 5:12 PM, Michael Wilde wrote:
> >         > >
> >         > > ----- Original Message -----
> >         > >> On Sat, 2011-05-21 at 17:06 -0400, Glen Hocky wrote:
> >         > >>> as I mentioned, I've been running with Mike's swift which
> >         > >>> was patched for beagle. are all the things that make
> >         > >>> running on beagle work in trunk?
> >         > >> No idea.
> >         > >>
> >         > >> Mike?
> >         > > Justin, working with Ketan, just applied changes to trunk
> >         > > which should make it work now on Beagle (or any Cray XT5+ or
> >         > > XE).  This uses a different set of sites.xml tags than the
> >         > > prototype in the current Beagle swift 0.92.1 module. Justin
> >         > > has a note on this at:
> >         > >    https://sites.google.com/site/swiftdevel/sites/pbs/cray
> >         > >
> >         > > It was working before for one-node worker jobs; now it
> >         > > should work for multi-node worker jobs as well.
> >         > >
> >         > > Justin and Ketan should comment on the state of testing
> >         > > and readiness of this trunk feature.  Don't try trunk on
> >         > > Beagle till they give the go-ahead.
> >         > >
> >         > > - Mike
> >         > >
> >         > >>>   If so i'll update to the latest and test. I don't think
> >         > >>> I'm using stable...
> >         > >> Ok
> >         > >>
> >         > >> Mihael
> >         > _______________________________________________
> >         > Swift-devel mailing list
> >         > Swift-devel at ci.uchicago.edu
> >         > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
> >
> >
> >
>
>
>