[Swift-devel] recent error on beagle
Tim Armstrong
tim.g.armstrong at gmail.com
Thu May 26 14:59:46 CDT 2011
Hi,
I've encountered this issue with SwiftR, running release 0.92 from the svn
repository. The issue occurs when GLOBUS::maxWallTime="03:55:00" in tc and
maxTime is 4 hours in sites.xml. After 5 minutes (or whatever the
difference is between the two times), I get the exception copied below. The
attached tarball contains the logs, script, etc.; its replicate.sh shows how
to reproduce the issue on PADS.
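
To make the timing concrete, here is a minimal Python sketch (not part of
Swift or SwiftR) that computes the gap between the two settings quoted
above. The helper name parse_hms is hypothetical, and how maxTime is actually
written in sites.xml is an assumption; only the values "03:55:00" and
"4 hours" come from this report. The sketch does not model the coaster
scheduler, it only shows where the 5 minutes come from.

```python
from datetime import timedelta


def parse_hms(text: str) -> timedelta:
    """Parse an HH:MM:SS string (hypothetical helper for illustration)."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# GLOBUS::maxWallTime as set in tc (value taken from the report).
job_walltime = parse_hms("03:55:00")

# maxTime as set in sites.xml (4 hours per the report; the on-disk
# representation is assumed, not shown in the original message).
block_max_time = timedelta(hours=4)

# Gap between the two limits; the exception showed up after roughly this long.
slack = block_max_time - job_walltime
print(slack)  # 0:05:00
```

With other settings, the same subtraction gives the delay to expect before
the exception appears, if the "difference between the two times" observation
holds.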
Assuming that my problem is the same as the others, it would be good if the
fix could be merged to release 0.92, as I'm trying to bundle stable swift
releases with SwiftR.
- Tim
Swift svn swift-r4336 cog-r3096 (cog modified locally)
RunID: 20110526-1317-2c8ybi10
Progress:
SwiftScript trace: top of loop: rserver waiting for input on, /tmp/nbest/SwiftR/swift.0827/requestpipe
Progress: Active:1
Progress: Finished successfully:1
SwiftScript trace: rserver: got dir, /tmp/nbest/SwiftR/requests.P09626/R0000007
Progress: uninitialized:1 Finished successfully:1
Progress: Submitted:1 Finished successfully:1
Progress: Active:1 Finished successfully:1
Progress: Active:1 Finished successfully:1
Progress: Active:1 Finished successfully:1
Progress: Active:1 Finished successfully:1
Progress: Active:1 Finished successfully:1
Progress: Active:1 Finished successfully:1
Progress: Active:1 Finished successfully:1
Progress: Active:1 Finished successfully:1
Progress: Active:1 Finished successfully:1
queuedsize > 0 but no job dequeued. Queued: {}
java.lang.Throwable
    at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:252)
    at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:520)
    at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:109)
queuedsize > 0 but no job dequeued. Queued: {}
java.lang.Throwable
    at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:252)
    at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:520)
    at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:109)
Progress: Finished successfully:1 Failed but can retry:1
On Sun, May 22, 2011 at 1:51 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> The second one looks to me like a coaster problem. Can't say much about
> the first issue.
>
> Can you try with plain pbs if you want to test the pbs provider?
>
> Mihael
>
> On Sun, 2011-05-22 at 08:39 -0500, ketan wrote:
> > I can confirm that the trunk is not usable for pbs provider. I am using
> > trunk for submitting jobs on beagle and I see a few unexpected things:
> >
> > 1. The stderr is showing inconsistent messages: The results are getting
> > written to the output even though stderr doesn't report any.
> > 2. qsub jobs are being cancelled inadvertently: I submitted 40 of them
> > yesterday, however, only 2 survived today. The log is here:
> >
> > http://www.ci.uchicago.edu/~ketan/files/ftdock-20110521-0337-pokpgg89.log
> >
> > In addition, the ssh-pbs provider does not seem to be working for large
> > runs (it worked for a small number of test runs): Getting unexpected
> > stdouts. Following is the stdout:
> >
> > http://www.ci.uchicago.edu/~ketan/files/ssh-pbs.stdout
> >
> > Following is the log file for the above run:
> >
> > http://www.ci.uchicago.edu/~ketan/files/ftdock-20110521-1750-b0cot9sa.log
> >
> >
> > Ketan
> >
> > On 5/21/11 5:12 PM, Michael Wilde wrote:
> > >
> > > ----- Original Message -----
> > >> On Sat, 2011-05-21 at 17:06 -0400, Glen Hocky wrote:
> > >>> as I mentioned, I've been running with Mike's swift which was
> > >>> patched
> > >>> for beagle. are all the things that make running on beagle work in
> > >>> trunk?
> > >> No idea.
> > >>
> > >> Mike?
> > > Justin, working with Ketan, just applied changes to trunk which should
> > > make it work now on Beagle (or any Cray XT5+ or XE). This uses a different
> > > set of sites.xml tags than the prototype in the current Beagle swift 0.92.1
> > > module. Justin has a note on this at:
> > > https://sites.google.com/site/swiftdevel/sites/pbs/cray
> > >
> > > It was working before for one-node worker jobs; now it should work for
> > > multi-node worker jobs as well.
> > >
> > > Justin and Ketan should comment on the state of testing and readiness
> > > of this trunk feature. Don't try trunk on Beagle till they give the
> > > go-ahead.
> > >
> > > - Mike
> > >
> > >>> If so i'll update to the latest and test. I don't think I'm
> > >>> using stable...
> > >> Ok
> > >>
> > >> Mihael
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: swiftR-fail.tgz
Type: application/x-gzip
Size: 23917 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20110526/88d7f17e/attachment.bin>