[Swift-devel] queuedsize > 0 but no job dequeued
Mihael Hategan
hategan at mcs.anl.gov
Fri Sep 2 15:35:25 CDT 2011
I added some code to better deal with the situation (cog r3254). It now
issues warnings in the log for jobs that exceed their walltime.
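
A minimal sketch of that kind of check, for illustration only. This is not
the r3254 change itself: the class, field, and method names below are
hypothetical stand-ins, not the coaster service's own classes, and the
warning goes to stderr rather than through log4j so the example stays
self-contained.

```java
import java.util.ArrayList;
import java.util.List;

public class WalltimeCheck {
    /** Hypothetical stand-in for a running job; not the coaster Job class. */
    static final class RunningJob {
        final String id;
        final long startMillis;      // when the job started running
        final long walltimeSeconds;  // walltime requested for the job

        RunningJob(String id, long startMillis, long walltimeSeconds) {
            this.id = id;
            this.startMillis = startMillis;
            this.walltimeSeconds = walltimeSeconds;
        }
    }

    /** Returns one warning line for each job that has run past its requested walltime. */
    static List<String> findOverrunWarnings(List<RunningJob> jobs, long nowMillis) {
        List<String> warnings = new ArrayList<>();
        for (RunningJob job : jobs) {
            long elapsedSeconds = (nowMillis - job.startMillis) / 1000;
            if (elapsedSeconds > job.walltimeSeconds) {
                warnings.add("Job " + job.id + " exceeded its walltime: ran "
                        + elapsedSeconds + "s, requested " + job.walltimeSeconds + "s");
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        List<RunningJob> jobs = new ArrayList<>();
        // A 20 minute walltime, as in the run discussed in this thread.
        jobs.add(new RunningJob("job-1", now - 25 * 60 * 1000L, 20 * 60)); // outlier
        jobs.add(new RunningJob("job-2", now - 8 * 60 * 1000L, 20 * 60));  // typical
        for (String warning : findOverrunWarnings(jobs, now)) {
            System.err.println("WARN " + warning);
        }
    }
}
```

With the two sample jobs above, only job-1 (25 minutes against a 20 minute
walltime) produces a warning.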
On Thu, 2011-09-01 at 16:16 -0500, Ketan Maheshwari wrote:
> Mihael,
>
>
> That is likely. The walltime is 20 mins and most jobs, as far as I know,
> take less than 10 mins. However, there could be outliers. There are
> about 120k jobs.
>
>
> Ketan
>
> On Thu, Sep 1, 2011 at 1:43 PM, Mihael Hategan <hategan at mcs.anl.gov>
> wrote:
> Is there any chance that some of your jobs run longer than their
> requested walltime?
>
>
> On Wed, 2011-08-31 at 09:04 -0500, Ketan Maheshwari wrote:
> > Mihael,
> >
> >
> > I did the run with debug enabled on coasters. Please find the logs,
> > etc., for this run here:
> >
> >
> > http://www.ci.uchicago.edu/~ketan/run25.tgz
> >
> >
> >
> >
> > Note that the run went well and ran up to 20k jobs without issues.
> > After that I did not get nodes, so I stopped it and resumed it this
> > morning. It ran for about 1000+ jobs and crashed with the same error
> > message.
> >
> >
> >
> >
> > Regards,
> > Ketan
> >
> > On Tue, Aug 30, 2011 at 3:05 PM, Mihael Hategan <hategan at mcs.anl.gov>
> > wrote:
> > Any chance you can re-run this with debug enabled on coasters
> > (log4j.logger.org.globus.cog.abstraction.coaster=DEBUG)?
> >
> >
> > On Mon, 2011-08-29 at 20:55 -0700, Mihael Hategan wrote:
> > > My bad. The info is in the swift log.
> > >
> > > On Mon, 2011-08-29 at 20:59 -0500, Ketan Maheshwari wrote:
> > > > This is on Beagle. I am running local:pbs from /lustre.
> > > >
> > > > On Mon, Aug 29, 2011 at 8:30 PM, Mihael Hategan <hategan at mcs.anl.gov>
> > > > wrote:
> > > > On Mon, 2011-08-29 at 19:52 -0500, Ketan Maheshwari wrote:
> > > > > Mihael,
> > > > >
> > > > >
> > > > > This run was with automatic coasters. I do not see any specific
> > > > > coasters.log file written during this run in .globus/coaster nor in
> > > > > the run's work dir.
> > > >
> > > >
> > > > It's on the remote site in .globus/coasters.
> > > >
> > > > >
> > > > >
> > > > > Ketan
> > > > >
> > > > > On Mon, Aug 29, 2011 at 7:16 PM, Mihael Hategan <hategan at mcs.anl.gov>
> > > > > wrote:
> > > > > Can I have the coasters log please?
> > > > >
> > > > >
> > > > > On Sun, 2011-08-28 at 16:47 -0500, Ketan Maheshwari wrote:
> > > > > > Hello,
> > > > > >
> > > > > >
> > > > > > I remember this error happened in the past with Glen's and Sheri's
> > > > > > runs. I saw this today again on Beagle with 0.93 while running the
> > > > > > DSSAT run.
> > > > > >
> > > > > >
> > > > > > The run stops with the following complete message:
> > > > > >
> > > > > >
> > > > > > queuedsize > 0 but no job dequeued. Queued: {}
> > > > > > java.lang.Throwable
> > > > > >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:269)
> > > > > >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:539)
> > > > > >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:110)
> > > > > > queuedsize > 0 but no job dequeued. Queued: {}
> > > > > > java.lang.Throwable
> > > > > >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:269)
> > > > > >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:539)
> > > > > >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:110)
> > > > > > Progress: time: Sun, 28 Aug 2011 13:34:26 -0600 Submitted:76 Active:23 Checking status:1 Finished successfully:597
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > The logs, properties and sources for this run are:
> > > > > > http://www.ci.uchicago.edu/~ketan/run23.tgz
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > > --
> > > > > > Ketan
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > > >
> > > > > > _______________________________________________
> > > > > > Swift-devel mailing list
> > > > > > Swift-devel at ci.uchicago.edu
> > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ketan
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ketan
> > > >
> > > >
> > >
> > >
> >
> >
> >
> >
> >
> >
> >
> > --
> > Ketan
> >
> >
> >
>
>
>
>
>
>
>
> --
> Ketan
>
>
>