[Swift-devel] queuedsize > 0 but no job dequeued

Mihael Hategan hategan at mcs.anl.gov
Fri Sep 2 15:35:25 CDT 2011


I added some code to better deal with the situation (cog r3254). It now
issues warnings in the log for jobs that exceed their walltime.
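
For anyone following along, here is a minimal sketch of the kind of check
described above: compare a job's elapsed run time with its requested walltime
and produce a warning when it runs over. This is not the code committed in
cog r3254. The class name, method names, and message format are all
hypothetical and only illustrate the idea.

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical helper for illustration only; not the code from cog r3254.
final class WalltimeCheck {
    private WalltimeCheck() {}

    /** True when a job has run strictly longer than the walltime it requested. */
    static boolean exceedsWalltime(Duration elapsed, Duration requested) {
        return elapsed.compareTo(requested) > 0;
    }

    /**
     * Builds a warning message for an overrunning job, or returns an empty
     * Optional when the job is still within its requested walltime.
     */
    static Optional<String> warningFor(String jobId, Duration elapsed, Duration requested) {
        if (!exceedsWalltime(elapsed, requested)) {
            return Optional.empty();
        }
        long overSeconds = elapsed.minus(requested).getSeconds();
        return Optional.of("Job " + jobId + " exceeded its walltime of "
                + requested.getSeconds() + "s by " + overSeconds + "s");
    }
}
```

With the 20-minute walltime mentioned below, a job that has been running for
25 minutes would yield a warning, while a typical 10-minute job would not.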

On Thu, 2011-09-01 at 16:16 -0500, Ketan Maheshwari wrote:
> Mihael,
> 
> 
> That is likely. The walltime is 20 mins and most jobs as far as I know
> are less than 10 mins. However, there could be outliers. These are
> about 120k jobs.
> 
> 
> Ketan
> 
> On Thu, Sep 1, 2011 at 1:43 PM, Mihael Hategan <hategan at mcs.anl.gov>
> wrote:
>         Is there any chance that some of your jobs run longer than their
>         requested walltime?
>         
>         
>         On Wed, 2011-08-31 at 09:04 -0500, Ketan Maheshwari wrote:
>         > Mihael,
>         >
>         >
>         > I did the run with debug enabled on coasters. Please find the logs
>         > etc. for this run here:
>         >
>         >
>         > http://www.ci.uchicago.edu/~ketan/run25.tgz
>         >
>         >
>         >
>         >
>         > Note that the run went well and ran up to 20k jobs without issues.
>         > After that I did not get nodes, so I stopped it and resumed it this
>         > morning. It ran for about 1000+ jobs and crashed with the same error
>         > message.
>         >
>         >
>         >
>         >
>         > Regards,
>         > Ketan
>         >
>         > On Tue, Aug 30, 2011 at 3:05 PM, Mihael Hategan <hategan at mcs.anl.gov>
>         > wrote:
>         >         Any chance you can re-run this with debug enabled on coasters
>         >         (log4j.logger.org.globus.cog.abstraction.coaster=DEBUG)?
>         >
>         >
>         >         On Mon, 2011-08-29 at 20:55 -0700, Mihael Hategan wrote:
>         >         > My bad. The info is in the swift log.
>         >         >
>         >         > On Mon, 2011-08-29 at 20:59 -0500, Ketan Maheshwari wrote:
>         >         > > This is on Beagle. I am running local:pbs from /lustre.
>         >         > >
>         >         > > On Mon, Aug 29, 2011 at 8:30 PM, Mihael Hategan <hategan at mcs.anl.gov>
>         >         > > wrote:
>         >         > >         On Mon, 2011-08-29 at 19:52 -0500, Ketan Maheshwari wrote:
>         >         > >         > Mihael,
>         >         > >         >
>         >         > >         >
>         >         > >         > This run was with automatic coasters. I do not see any specific
>         >         > >         > coasters.log file written during this run in .globus/coaster nor in
>         >         > >         > the run's work dir.
>         >         > >
>         >         > >
>         >         > >         It's on the remote site in .globus/coasters.
>         >         > >
>         >         > >         >
>         >         > >         >
>         >         > >         > Ketan
>         >         > >         >
>         >         > >         > On Mon, Aug 29, 2011 at 7:16 PM, Mihael Hategan <hategan at mcs.anl.gov>
>         >         > >         > wrote:
>         >         > >         >         Can I have the coasters log please?
>         >         > >         >
>         >         > >         >
>         >         > >         >         On Sun, 2011-08-28 at 16:47 -0500, Ketan Maheshwari wrote:
>         >         > >         >         > Hello,
>         >         > >         >         >
>         >         > >         >         >
>         >         > >         >         > I remember this error happened in the past with Glen's and Sheri's
>         >         > >         >         > runs. I saw this today again on Beagle with 0.93 while running the
>         >         > >         >         > DSSAT run.
>         >         > >         >         >
>         >         > >         >         >
>         >         > >         >         > The run stops with the following complete message:
>         >         > >         >         >
>         >         > >         >         >
>         >         > >         >         > queuedsize > 0 but no job dequeued. Queued: {}
>         >         > >         >         > java.lang.Throwable
>         >         > >         >         >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:269)
>         >         > >         >         >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:539)
>         >         > >         >         >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:110)
>         >         > >         >         > queuedsize > 0 but no job dequeued. Queued: {}
>         >         > >         >         > java.lang.Throwable
>         >         > >         >         >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.requeueNonFitting(BlockQueueProcessor.java:269)
>         >         > >         >         >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.updatePlan(BlockQueueProcessor.java:539)
>         >         > >         >         >     at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.run(BlockQueueProcessor.java:110)
>         >         > >         >         > Progress:  time: Sun, 28 Aug 2011 13:34:26 -0600  Submitted:76  Active:23  Checking status:1  Finished successfully:597
>         >         > >         >         >
>         >         > >         >         >
>         >         > >         >         >
>         >         > >         >         >
>         >         > >         >         > The logs, properties and sources for this run are:
>         >         > >         >         > http://www.ci.uchicago.edu/~ketan/run23.tgz
>         >         > >         >         >
>         >         > >         >         >
>         >         > >         >         > Regards,
>         >         > >         >         > --
>         >         > >         >         > Ketan
>         >         > >         >         >
>         >         > >         >         >
>         >         > >         >         >
>         >         > >         >
>         >         > >         >         >
>         >         > >         >         > _______________________________________________
>         >         > >         >         > Swift-devel mailing list
>         >         > >         >         > Swift-devel at ci.uchicago.edu
>         >         > >         >         > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>         >         > >         >
>         >         > >         >
>         >         > >         >
>         >         > >         >
>         >         > >         >
>         >         > >         >
>         >         > >         > --
>         >         > >         > Ketan
>         >         > >         >
>         >         > >         >
>         >         > >         >
>         >         > >
>         >         > >
>         >         > >
>         >         > >
>         >         > >
>         >         > >
>         >         > >
>         >         > > --
>         >         > > Ketan
>         >         > >
>         >         > >
>         >         >
>         >         >
>         >
>         >
>         >
>         >
>         >
>         >
>         >
>         > --
>         > Ketan
>         >
>         >
>         >
>         
>         
>         
> 
> 
> 
> 
> -- 
> Ketan
> 
> 
> 




