[Swift-devel] Re: provider-condor submit file generation
Allan Espinosa
aespinosa at cs.uchicago.edu
Mon May 17 20:31:32 CDT 2010
Ah, after looking at the provider-localscheduler tree, everything makes sense now :)
I wonder how long it will be before Swift starts removing the completed jobs, though:
$ condor_q
-- Submitter: communicado.ci.uchicago.edu : <128.135.125.17:44838> :
communicado.ci.uchicago.edu
 ID     OWNER      SUBMITTED    RUN_TIME   ST PRI SIZE CMD
 11.0   aespinosa  5/17 20:05   0+00:04:14 C  0   1.0  bash /panfs/panasa
 12.0   aespinosa  5/17 20:05   0+00:03:57 C  0   1.0  bash /panfs/panasa
0 jobs; 0 idle, 0 running, 0 held
According to my logs, Swift has been polling for around 20 minutes now:
2010-05-17 20:05:13,841-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:0-12-1-6-1-1
2010-05-17 20:05:14,092-0500 INFO AbstractQueuePoller Active: 0, New:
2, Done: 0
2010-05-17 20:05:19,163-0500 INFO AbstractQueuePoller Active: 2, New:
0, Done: 0
...
...
2010-05-17 20:25:42,473-0500 INFO AbstractQueuePoller Active: 2, New:
0, Done: 0
2010-05-17 20:25:47,548-0500 INFO AbstractQueuePoller Active: 2, New:
0, Done: 0
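
As a rough check on that 20-minute figure, here is a minimal Java sketch that measures how long the poller kept reporting active jobs. It assumes each log entry has been rejoined onto a single line (they are wrapped in this mail), and the class and method names are made up for illustration; none of this is part of Swift or CoG.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Swift or CoG.
final class PollerLogSpan {
    // Timestamp layout used in the log, e.g. 2010-05-17 20:05:19,163-0500
    private static final DateTimeFormatter TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSSZ");

    // Assumes the whole entry sits on one line (unwrapped).
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+ \\S+) INFO AbstractQueuePoller Active: (\\d+), New: (\\d+), Done: (\\d+)$");

    // Time between the first and last entries that report at least one active job.
    static Duration activeSpan(List<String> lines) {
        OffsetDateTime first = null;
        OffsetDateTime last = null;
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (!m.matches() || Integer.parseInt(m.group(2)) == 0) {
                continue; // not a poller entry, or nothing active yet
            }
            OffsetDateTime t = OffsetDateTime.parse(m.group(1), TS);
            if (first == null) {
                first = t;
            }
            last = t;
        }
        return first == null ? Duration.ZERO : Duration.between(first, last);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2010-05-17 20:05:14,092-0500 INFO AbstractQueuePoller Active: 0, New: 2, Done: 0",
            "2010-05-17 20:05:19,163-0500 INFO AbstractQueuePoller Active: 2, New: 0, Done: 0",
            "2010-05-17 20:25:47,548-0500 INFO AbstractQueuePoller Active: 2, New: 0, Done: 0");
        System.out.println("Active for " + activeSpan(lines).toMinutes() + " minutes");
    }
}
```

Only entries with a non-zero Active count are considered, so the initial "New: 2" entry does not count toward the span.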
A snippet from the *info logs suggests that the jobs finished much
earlier:
...
Progress 2010-05-17 20:05:36.239957000-0500 EXECUTE
Moving back to workflow directory
/panfs/panasas/CMS/data/engage-aespinosa/swift/postproc-fireflyg_small
Progress 2010-05-17 20:07:09.642855000-0500 EXECUTE_DONE
Job ran successfully
Progress 2010-05-17 20:07:09.655935000-0500 MOVING_OUTPUTS
panfs/panasas/CMS/data/engage-aespinosa/swift/158/0/TEST_158_0_subfy.sgt|panfs/panasas/CMS/data/engage-aespinosa/swift/158/0/TEST_158_0_subfx.sgt
...
The gridmanager finished a couple of minutes after that (20:09:49 in
the log below). This is probably when condor_q started reporting the
jobs as completed (the 'C' state above):
...
5/17 20:09:47 [18799] (11.0) doEvaluateState called: gmState
GM_CHECK_OUTPUT, globusState 8
05/17 20:09:49 [18799] (11.0) doEvaluateState called: gmState
GM_DONE_SAVE, globusState 8
05/17 20:09:49 [18799] (11.0) doEvaluateState called: gmState
GM_DONE_COMMIT, globusState 8
05/17 20:09:54 [18799] No jobs left, shutting down
05/17 20:09:54 [18799] Got SIGTERM. Performing graceful shutdown.
05/17 20:09:54 [18799] **** condor_gridmanager (condor_GRIDMANAGER)
pid 18799 EXITING WITH STATUS 0
2010/5/17 Mihael Hategan <hategan at mcs.anl.gov>:
> On Mon, 2010-05-17 at 19:40 -0500, Allan Espinosa wrote:
>> Just to confirm, the provider does the job removal itself?
>>
>> > leave_in_queue = TRUE
>
> There's a bunch of relevant stuff in QueuePoller.removeDoneJob.
>
> There's something else in CondorExecutor:
> if ("true".equals(spec.getAttribute("holdIsFailure"))) {
> wr.write("periodic_remove = JobStatus == 5\n");
> }
>
> Which may perhaps be extended. The thing with letting condor remove the
> job automatically is that the exit code may not be detected. On the
> other hand, there may have been some attempts to use condor log files to
> process job information rather than polling the queue. I'm not sure to
> what extent those are in SVN.
>
>
>>
>> -Allan
>>
>> 2010/5/17 Allan Espinosa <aespinosa at cs.uchicago.edu>:
>> > I was poking around the provider-condor source tree today.
>> >
>> > provider-condor/src/org/globus/cog/abstraction/impl/execution/condor/DescriptionFileGenerator.java:33+:
>> > ...
>>
>> /TEST_158_0_subfx.sgt
>> > extract_sgt_yfile=panfs/panasas/CMS/data/engage-aespinosa/swift/158/0/TEST_158_0_subfy.sgt
>> > notification = Never
>> > leave_in_queue = TRUE
>> > queue
>> >
>> > I was at least expecting the line to start with "##### ...\n #
>> > Task : ...". Is there another place I should poke around to figure
>> > out the jobspec to condor submit file? Like where does "jobType=grid"
>> > get translated to "Universe=grid"?
>> >
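
To make the submit-file pieces from this thread concrete, here is a minimal Java sketch of how those lines could be emitted. The holdIsFailure check and the periodic_remove line are taken from the CondorExecutor snippet quoted above, and notification, leave_in_queue and queue come from the generated file. The jobType-to-universe mapping is only a guess at what such a translation might look like, and the class, the method names and the Map-based job spec are hypothetical, not the provider's actual API.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the real CondorExecutor or DescriptionFileGenerator.
final class SubmitFileSketch {

    // Writes the queue-related tail of a submit description.
    static void writeQueuePolicy(Writer wr, Map<String, String> attrs) throws IOException {
        // Assumed mapping: the thread asks where jobType=grid becomes Universe=grid.
        if ("grid".equals(attrs.get("jobType"))) {
            wr.write("universe = grid\n");
        }
        wr.write("notification = Never\n");
        // Keeps finished jobs visible in condor_q so a poller can still see them.
        wr.write("leave_in_queue = TRUE\n");
        // From the quoted snippet: JobStatus 5 is the Held state.
        if ("true".equals(attrs.get("holdIsFailure"))) {
            wr.write("periodic_remove = JobStatus == 5\n");
        }
        wr.write("queue\n");
    }

    // Convenience wrapper that returns the generated text.
    static String render(Map<String, String> attrs) {
        StringWriter sw = new StringWriter();
        try {
            writeQueuePolicy(sw, attrs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sw.toString();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put("jobType", "grid");
        attrs.put("holdIsFailure", "true");
        System.out.print(render(attrs));
    }
}
```

With leave_in_queue = TRUE, completed jobs stay listed in state 'C' until something removes them, which matches the condor_q output at the top of this mail.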