[Swift-user] Coasters with gt2 and localhost file provider

Andriy Fedorov fedorov at bwh.harvard.edu
Fri Aug 28 13:56:27 CDT 2009


Hi,

I have a gt2:gt2:pbs coaster provider on NCSA Abe with a local
filesystem provider:

<pool handle="Abe-GT2-coasters">
  <gridftp  url="local://localhost" />
  <execution provider="coaster" jobmanager="gt2:gt2:pbs"
  url="grid-abe.ncsa.teragrid.org"/>
  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
</pool>

I have been submitting jobs that seemed to be stuck in the scheduler
queue for too long. Here is the output of swift -v:

Unregistering Command(6, SUBMITJOB)
Progress:  Submitted:2  Finished successfully:3
Progress:  Submitted:2  Finished successfully:3
Progress:  Submitted:2  Finished successfully:3
Progress:  Submitted:2  Finished successfully:3
Progress:  Submitted:2  Finished successfully:3
Progress:  Submitted:2  Finished successfully:3
Progress:  Submitted:2  Finished successfully:3
Progress:  Submitted:2  Finished successfully:3
GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
Progress:  Submitted:2  Finished successfully:3
Progress:  Submitted:2  Finished successfully:3
Progress:  Submitted:2  Finished successfully:3
Progress:  Submitted:2  Finished successfully:3
Progress:  Submitted:2  Finished successfully:3
Progress:  Submitted:2  Finished successfully:3
Progress:  Submitted:2  Finished successfully:3
Progress:  Submitted:2  Finished successfully:3
Progress:  Submitted:2  Finished successfully:3
Progress:  Submitted:2  Finished successfully:3
GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
..... many many times ......

Upon investigation, it turns out that scheduler delay is not
the source of the problem. Looking at the output of "qstat", I see
a 1-hour job scheduled; it enters the queue, waits,
runs, and completes, and immediately a new 1-hour job is
scheduled. This repeats over and over.

Nothing in the output of "swift -v" explains what is going on.

Looking at the log, I see this:

2009-08-28 08:23:31,703-0500 INFO  AbstractKarajanChannel
GSSSChannel-null(1) REPL: Command(6, SUBMITJOB)
2009-08-28 08:23:31,704-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
identity=urn:0-4-1-1251465736537) setting status to Submitted
2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler
Submission time for Task(type=JOB_SUBMISSION,
identity=urn:0-4-1-1251465736537): 56ms. Score delta:
0.002276923076923077
2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler
multiplyScore(Abe-GT2-coasters:1.606(2.487):2/1 overload: 1,
0.002276923076923077)
2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler Old
score: 1.606, new score: 1.608
2009-08-28 08:23:31,704-0500 INFO  JobSubmissionTaskHandler Submitted
task Task(type=JOB_SUBMISSION, identity=urn:0-4-1-1251465736537). Job
id: urn:1251465736537-1251465756817-1251465756818
2009-08-28 08:23:31,704-0500 INFO  AbstractKarajanChannel
Unregistering Command(6, SUBMITJOB)
2009-08-28 08:27:34,210-0500 INFO  AbstractKarajanChannel
GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
2009-08-28 08:32:34,232-0500 INFO  AbstractKarajanChannel
GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
.... many many times ....
2009-08-28 13:22:34,354-0500 INFO  AbstractKarajanChannel
GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
2009-08-28 13:27:34,359-0500 INFO  AbstractKarajanChannel
GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
2009-08-28 13:32:34,358-0500 INFO  AbstractKarajanChannel
GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
2009-08-28 13:37:34,363-0500 INFO  AbstractKarajanChannel
GSSSChannel-null(1) REQ: Handler(HEARTBEAT)

Look at the timestamps!
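
To put a number on that gap, here is a minimal Python sketch that reads
the timestamps from lines like the ones quoted above and reports the time
between the first and last timestamped entries. It is not part of Swift or
CoG; the helper names (parse_log_timestamp, time_span) are made up for
illustration, and it assumes every log entry starts with a timestamp in
the "YYYY-MM-DD HH:MM:SS,mmm-ZZZZ" form shown in the excerpt.

```python
from datetime import datetime

# Timestamp layout assumed from the excerpt, e.g. "2009-08-28 08:23:31,703-0500".
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f%z"


def parse_log_timestamp(line):
    """Return the timestamp at the start of a log line, or None if there is none.

    Wrapped continuation lines (e.g. "GSSSChannel-null(1) REQ: ...") carry no
    timestamp and yield None.
    """
    parts = line.split()
    if len(parts) < 2:
        return None
    try:
        return datetime.strptime(parts[0] + " " + parts[1], LOG_TS_FORMAT)
    except ValueError:
        return None


def time_span(lines):
    """Return the timedelta between the first and last timestamped lines."""
    stamps = [ts for ts in map(parse_log_timestamp, lines) if ts is not None]
    if len(stamps) < 2:
        raise ValueError("need at least two timestamped lines")
    return stamps[-1] - stamps[0]


if __name__ == "__main__":
    # A few lines copied from the log excerpt above.
    excerpt = [
        "2009-08-28 08:23:31,704-0500 INFO  AbstractKarajanChannel",
        "Unregistering Command(6, SUBMITJOB)",
        "2009-08-28 08:27:34,210-0500 INFO  AbstractKarajanChannel",
        "GSSSChannel-null(1) REQ: Handler(HEARTBEAT)",
        "2009-08-28 13:37:34,363-0500 INFO  AbstractKarajanChannel",
        "GSSSChannel-null(1) REQ: Handler(HEARTBEAT)",
    ]
    print("time between first and last entry:", time_span(excerpt))
```

On the excerpt this gives a little over five hours between the last
SUBMITJOB entry and the last heartbeat, with nothing but heartbeats in
between.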

Note that although I do see the jobs go from Q to R status, I have no idea
which jobs they are or what they are doing.

The complete log (after interruption) is attached. I also attach my
simple swift script -- there are no loops; this is a single execution of
a component of my application, before which I do "ls" and calculate
the md5 sum of the input images.

I have

Swift svn swift-r3100 cog-r2446

What am I doing wrong?

--
Andriy Fedorov, Ph.D.

Research Fellow
Brigham and Women's Hospital
Harvard Medical School
75 Francis Street
Boston, MA 02115 USA
fedorov at bwh.harvard.edu
-------------- next part --------------
A non-text attachment was scrubbed...
Name: RigidRegistration1-20090828-0822-k6o8oqd9.log
Type: text/x-log
Size: 85709 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20090828/8f989fce/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: RigidRegistration1.swift
Type: application/octet-stream
Size: 1056 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20090828/8f989fce/attachment.obj>

