[Swift-devel] Re: provider-condor submit file generation

Mihael Hategan hategan at mcs.anl.gov
Mon May 17 21:24:21 CDT 2010


On Mon, 2010-05-17 at 20:31 -0500, Allan Espinosa wrote:
> Ah looking at the provider-localscheduler tree, everything makes sense now :)
> 
> I wonder how long before swift starts to remove the completed jobs now?

As soon as the queue is polled and the poller sees that the job is done.
So anywhere between zero and the poll interval (5 seconds by default)
plus however long it takes to run condor_q. If it takes longer than
that, something ain't right.
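
To put numbers on that bound, here is a minimal Java sketch. It is not
code from the provider: the class and method names are made up for
illustration, and the 2 second condor_q runtime is an assumption. The two
timestamps are taken from the logs quoted below (GM_DONE_COMMIT at
20:09:49, poller still reporting "Active: 2" at 20:25:47).

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper, not part of provider-condor.
class PollDelayCheck {
    // Upper bound on how long a finished job should go unnoticed:
    // one full poll interval plus the time one condor_q run takes.
    static Duration maxExpectedDelay(Duration pollInterval, Duration condorQRuntime) {
        return pollInterval.plus(condorQRuntime);
    }

    // True if the job was still reported as active later than the bound allows.
    static boolean looksStuck(LocalTime jobDone, LocalTime stillActiveAt, Duration bound) {
        return Duration.between(jobDone, stillActiveAt).compareTo(bound) > 0;
    }

    public static void main(String[] args) {
        // 5 s is the default poll interval; 2 s for condor_q is an assumed value.
        Duration bound = maxExpectedDelay(Duration.ofSeconds(5), Duration.ofSeconds(2));
        LocalTime done = LocalTime.of(20, 9, 49);     // GM_DONE_COMMIT in the gridmanager log
        LocalTime active = LocalTime.of(20, 25, 47);  // last "Active: 2" line in the swift log
        System.out.println("bound    = " + bound.getSeconds() + " s");
        System.out.println("observed = " + Duration.between(done, active).getSeconds() + " s");
        System.out.println("stuck    = " + looksStuck(done, active, bound));
    }
}
```

With any plausible condor_q runtime the observed gap is far outside the
bound, which is what "something ain't right" refers to.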

The way it removes jobs is to set LeaveJobInQueue to "FALSE".
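
As a rough illustration of what flipping that attribute can look like
from Java, the sketch below builds a condor_qedit command line for one
job. This is an assumption about the mechanism, not a copy of
QueuePoller.removeDoneJob; the class and method names are hypothetical,
and the command is only built, never executed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the provider's actual code.
class LeaveInQueueSketch {
    // Builds (does not run) a condor_qedit invocation that sets the
    // LeaveJobInQueue attribute of the given job to FALSE, so that the
    // schedd is free to drop the completed job from the queue.
    static List<String> releaseCommand(String jobId) {
        return Arrays.asList("condor_qedit", jobId, "LeaveJobInQueue", "FALSE");
    }

    public static void main(String[] args) {
        // "11.0" is one of the completed job IDs in the condor_q output below.
        System.out.println(String.join(" ", releaseCommand("11.0")));
    }
}
```

To actually run it, the list could be handed to java.lang.ProcessBuilder
on a machine where condor_qedit is on the PATH.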

> 
> $ condor_q
> 
> -- Submitter: communicado.ci.uchicago.edu : <128.135.125.17:44838> :
> communicado.ci.uchicago.edu
>  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>   11.0   aespinosa       5/17 20:05   0+00:04:14 C  0   1.0  bash /panfs/panasa
>   12.0   aespinosa       5/17 20:05   0+00:03:57 C  0   1.0  bash /panfs/panasa
> 
> 0 jobs; 0 idle, 0 running, 0 held
> 
> According to my logs, swift has been polling for around 20 minutes now:
> 
> 2010-05-17 20:05:13,841-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:0-12-1-6-1-1
> 2010-05-17 20:05:14,092-0500 INFO  AbstractQueuePoller Active: 0, New:
> 2, Done: 0
> 2010-05-17 20:05:19,163-0500 INFO  AbstractQueuePoller Active: 2, New:
> 0, Done: 0
> ...
> ...
> 2010-05-17 20:25:42,473-0500 INFO  AbstractQueuePoller Active: 2, New:
> 0, Done: 0
> 2010-05-17 20:25:47,548-0500 INFO  AbstractQueuePoller Active: 2, New:
> 0, Done: 0
> 
> 
> A snippet from the *info logs suggests that the jobs have finished
> much much earlier:
> ...
> Progress  2010-05-17 20:05:36.239957000-0500  EXECUTE
> Moving back to workflow directory
> /panfs/panasas/CMS/data/engage-aespinosa/swift/postproc-fireflyg_small
> Progress  2010-05-17 20:07:09.642855000-0500  EXECUTE_DONE
> Job ran successfully
> Progress  2010-05-17 20:07:09.655935000-0500  MOVING_OUTPUTS
> panfs/panasas/CMS/data/engage-aespinosa/swift/158/0/TEST_158_0_subfy.sgt|panfs/panasas/CMS/data/engage-aespinosa/swift/158/0/TEST_158_0_subfx.sgt
> ...
> 
> The gridmanager finished around two minutes after.  This is probably
> the time condor_q reported a 'DONE' status on the jobs:
> ...
> 5/17 20:09:47 [18799] (11.0) doEvaluateState called: gmState
> GM_CHECK_OUTPUT, globusState 8
> 05/17 20:09:49 [18799] (11.0) doEvaluateState called: gmState
> GM_DONE_SAVE, globusState 8
> 05/17 20:09:49 [18799] (11.0) doEvaluateState called: gmState
> GM_DONE_COMMIT, globusState 8
> 05/17 20:09:54 [18799] No jobs left, shutting down
> 05/17 20:09:54 [18799] Got SIGTERM. Performing graceful shutdown.
> 05/17 20:09:54 [18799] **** condor_gridmanager (condor_GRIDMANAGER)
> pid 18799 EXITING WITH STATUS 0
> 
> 
> 
> 2010/5/17 Mihael Hategan <hategan at mcs.anl.gov>:
> > On Mon, 2010-05-17 at 19:40 -0500, Allan Espinosa wrote:
> >> Just to confirm, the provider does the job removal itself?
> >>
> >> > leave_in_queue = TRUE
> >
> > There's a bunch of relevant stuff in QueuePoller.removeDoneJob.
> >
> > There's something else in CondorExecutor:
> > if ("true".equals(spec.getAttribute("holdIsFailure"))) {
> >        wr.write("periodic_remove = JobStatus == 5\n");
> > }
> >
> > Which may perhaps be extended. The thing with letting condor remove the
> > job automatically is that the exit code may not be detected. On the
> > other hand there may have been some attempts to use condor log files to
> > process job information rather than polling the queue. I'm not sure to
> > what extent those are in SVN.
> >
> >
> >>
> >> -Allan
> >>
> >> 2010/5/17 Allan Espinosa <aespinosa at cs.uchicago.edu>:
> >> > I was poking around the provider-condor source tree today.
> >> >
> >> > provider-condor/src/org/globus/cog/abstraction/impl/execution/condor/DescriptionFileGenerator.java:33+:
> >> > ...
> >>
> >> /TEST_158_0_subfx.sgt
> >> > extract_sgt_yfile=panfs/panasas/CMS/data/engage-aespinosa/swift/158/0/TEST_158_0_subfy.sgt
> >> > notification = Never
> >> > leave_in_queue = TRUE
> >> > queue
> >> >
> >> > I was at least expecting the line to start with "##### ...\n #
> >> > Task : ..." .  Is there another place I should poke around to figure
> >> > out how the jobspec becomes a condor submit file? Like where does
> >> > "jobType=grid" get translated to "Universe=grid"?
> >> >



