[Swift-devel] Re: provider-condor submit file generation

Allan Espinosa aespinosa at cs.uchicago.edu
Mon May 17 20:31:32 CDT 2010


Ah, looking at the provider-localscheduler tree, everything makes sense now :)

I wonder how long before swift starts to remove the completed jobs now?

$ condor_q

-- Submitter: communicado.ci.uchicago.edu : <128.135.125.17:44838> :
communicado.ci.uchicago.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  11.0   aespinosa       5/17 20:05   0+00:04:14 C  0   1.0  bash /panfs/panasa
  12.0   aespinosa       5/17 20:05   0+00:03:57 C  0   1.0  bash /panfs/panasa

0 jobs; 0 idle, 0 running, 0 held
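
To make the question concrete, here is a minimal sketch of how a poller could pick the
completed jobs (ST column "C") out of condor_q output like the listing above. This is
not the actual QueuePoller code from provider-condor; the class and method names are
hypothetical, and the column layout is assumed to be exactly the one shown above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of provider-condor.
class CondorQSketch {
    /** Returns the IDs of jobs whose ST column is "C" (completed). */
    static List<String> completedJobIds(String condorQOutput) {
        List<String> ids = new ArrayList<>();
        for (String line : condorQOutput.split("\n")) {
            String[] f = line.trim().split("\\s+");
            // Assumed job row layout: ID OWNER DATE TIME RUN_TIME ST PRI SIZE CMD...
            // Header, submitter and summary lines do not start with a "cluster.proc" ID.
            if (f.length >= 6 && f[0].matches("\\d+\\.\\d+") && f[5].equals("C")) {
                ids.add(f[0]);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String sample =
              " ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD\n"
            + "  11.0   aespinosa       5/17 20:05   0+00:04:14 C  0   1.0  bash /panfs/panasa\n"
            + "  12.0   aespinosa       5/17 20:05   0+00:03:57 C  0   1.0  bash /panfs/panasa\n"
            + "\n"
            + "0 jobs; 0 idle, 0 running, 0 held\n";
        System.out.println(completedJobIds(sample)); // prints [11.0, 12.0]
    }
}
```

The IDs returned this way could then be handed to condor_rm, which is presumably what
a job left behind by leave_in_queue = TRUE needs before it disappears from the queue.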

According to my logs, swift has been polling for around 20 minutes now:

2010-05-17 20:05:13,841-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:0-12-1-6-1-1
2010-05-17 20:05:14,092-0500 INFO  AbstractQueuePoller Active: 0, New:
2, Done: 0
2010-05-17 20:05:19,163-0500 INFO  AbstractQueuePoller Active: 2, New:
0, Done: 0
...
...
2010-05-17 20:25:42,473-0500 INFO  AbstractQueuePoller Active: 2, New:
0, Done: 0
2010-05-17 20:25:47,548-0500 INFO  AbstractQueuePoller Active: 2, New:
0, Done: 0
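
As a quick check of the "around 20 minutes" figure, the sketch below computes the time
between the first and last AbstractQueuePoller lines quoted above. The class and method
names are made up for illustration; only the timestamp layout is taken from the log.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration only.
class PollingWindowSketch {
    // Timestamp layout as it appears in the log, e.g. 2010-05-17 20:05:14,092-0500
    static final DateTimeFormatter LOG_TIME =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSSZ");

    /** Elapsed time between two log timestamps. */
    static Duration between(String first, String last) {
        return Duration.between(OffsetDateTime.parse(first, LOG_TIME),
                                OffsetDateTime.parse(last, LOG_TIME));
    }

    public static void main(String[] args) {
        Duration d = between("2010-05-17 20:05:14,092-0500",
                             "2010-05-17 20:25:47,548-0500");
        // prints "20 min 33 s"
        System.out.println(d.toMinutes() + " min " + (d.getSeconds() % 60) + " s");
    }
}
```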


A snippet from the *info logs suggests that the jobs finished much
earlier:
...
Progress  2010-05-17 20:05:36.239957000-0500  EXECUTE
Moving back to workflow directory
/panfs/panasas/CMS/data/engage-aespinosa/swift/postproc-fireflyg_small
Progress  2010-05-17 20:07:09.642855000-0500  EXECUTE_DONE
Job ran successfully
Progress  2010-05-17 20:07:09.655935000-0500  MOVING_OUTPUTS
panfs/panasas/CMS/data/engage-aespinosa/swift/158/0/TEST_158_0_subfy.sgt|panfs/panasas/CMS/data/engage-aespinosa/swift/158/0/TEST_158_0_subfx.sgt
...

The gridmanager finished around two minutes after.  This is probably
the time condor_q reported a 'DONE' status on the jobs:
...
5/17 20:09:47 [18799] (11.0) doEvaluateState called: gmState
GM_CHECK_OUTPUT, globusState 8
05/17 20:09:49 [18799] (11.0) doEvaluateState called: gmState
GM_DONE_SAVE, globusState 8
05/17 20:09:49 [18799] (11.0) doEvaluateState called: gmState
GM_DONE_COMMIT, globusState 8
05/17 20:09:54 [18799] No jobs left, shutting down
05/17 20:09:54 [18799] Got SIGTERM. Performing graceful shutdown.
05/17 20:09:54 [18799] **** condor_gridmanager (condor_GRIDMANAGER)
pid 18799 EXITING WITH STATUS 0



2010/5/17 Mihael Hategan <hategan at mcs.anl.gov>:
> On Mon, 2010-05-17 at 19:40 -0500, Allan Espinosa wrote:
>> Just to confirm, the provider does the job removal itself?
>>
>> > leave_in_queue = TRUE
>
> There's a bunch of relevant stuff in QueuePoller.removeDoneJob.
>
> There's something else in CondorExecutor:
> if ("true".equals(spec.getAttribute("holdIsFailure"))) {
>        wr.write("periodic_remove = JobStatus == 5\n");
> }
>
> Which may perhaps be extended. The thing with letting condor remove the
> job automatically is that the exit code may not be detected. On the
> other hand, there may have been some attempts to use condor log files to
> process job information rather than polling the queue. I'm not sure to
> what extent those are in SVN.
>
>
>>
>> -Allan
>>
>> 2010/5/17 Allan Espinosa <aespinosa at cs.uchicago.edu>:
>> > I was poking around the provider-condor source tree today.
>> >
>> > provider-condor/src/org/globus/cog/abstraction/impl/execution/condor/DescriptionFileGenerator.java:33+:
>> > ...
>>
>> /TEST_158_0_subfx.sgt
>> > extract_sgt_yfile=panfs/panasas/CMS/data/engage-aespinosa/swift/158/0/TEST_158_0_subfy.sgt
>> > notification = Never
>> > leave_in_queue = TRUE
>> > queue
>> >
>> > I was at least expecting the line to start with '##### ...\n #
>> > Task : ...'. Is there another place I should poke around to figure
>> > out how the jobspec becomes the condor submit file? Like where does
>> > "jobType=grid" get translated to "Universe=grid"?
>> >


