[Swift-user] Re: Coasters with gt2 and localhost file provider
Andriy Fedorov
fedorov at bwh.harvard.edu
Fri Aug 28 14:57:40 CDT 2009
Hey, Mihael, we are running! Thanks for the fix! I assume we ran into
all this trouble because until now you have mostly been working with
large numbers of jobs that have very short execution times. New
applications bring new troubles :)
Oh joy -- I got my first successful coasters run with a real
application component!
--
Andriy Fedorov, Ph.D.
Research Fellow
Brigham and Women's Hospital
Harvard Medical School
75 Francis Street
Boston, MA 02115 USA
fedorov at bwh.harvard.edu
On Fri, Aug 28, 2009 at 15:36, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> Right. Workers shut down after 10 seconds of inactivity.
>
> I added an option ("maxWorkerIdleTime", in seconds) and changed the
> default to 2 minutes (cog r2455).
>
> Mihael
>
> On Fri, 2009-08-28 at 14:56 -0400, Andriy Fedorov wrote:
>> Hi,
>>
>> I have a gt2:gt2:pbs coaster provider on NCSA Abe with local
>> filesystem provider:
>>
>> <pool handle="Abe-GT2-coasters">
>> <gridftp url="local://localhost" />
>> <execution provider="coaster" jobmanager="gt2:gt2:pbs"
>> url="grid-abe.ncsa.teragrid.org"/>
>> <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>> </pool>
>>
>> I have been submitting jobs, which seemed to be stuck in the scheduler
>> queue for too long. Here's the output of swift -v:
>>
>> Unregistering Command(6, SUBMITJOB)
>> Progress: Submitted:2 Finished successfully:3
>> Progress: Submitted:2 Finished successfully:3
>> Progress: Submitted:2 Finished successfully:3
>> Progress: Submitted:2 Finished successfully:3
>> Progress: Submitted:2 Finished successfully:3
>> Progress: Submitted:2 Finished successfully:3
>> Progress: Submitted:2 Finished successfully:3
>> Progress: Submitted:2 Finished successfully:3
>> GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
>> Progress: Submitted:2 Finished successfully:3
>> Progress: Submitted:2 Finished successfully:3
>> Progress: Submitted:2 Finished successfully:3
>> Progress: Submitted:2 Finished successfully:3
>> Progress: Submitted:2 Finished successfully:3
>> Progress: Submitted:2 Finished successfully:3
>> Progress: Submitted:2 Finished successfully:3
>> Progress: Submitted:2 Finished successfully:3
>> Progress: Submitted:2 Finished successfully:3
>> Progress: Submitted:2 Finished successfully:3
>> GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
>> ..... many many times ......
>>
>> Upon investigating this, it turns out that the scheduler delay is not
>> the source of the problem. By looking at the output of "qstat", I see
>> a job of 1 hr length scheduled, then it gets into the queue, waits,
>> runs, completes, and immediately a new job of 1 hr length is
>> scheduled. This repeats over and over.
>>
>> Nothing in the output of "swift -v" explains what is going on.
>>
>> Looking at the log, I see this:
>>
>> 2009-08-28 08:23:31,703-0500 INFO AbstractKarajanChannel
>> GSSSChannel-null(1) REPL: Command(6, SUBMITJOB)
>> 2009-08-28 08:23:31,704-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
>> identity=urn:0-4-1-1251465736537) setting status to Submitted
>> 2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler
>> Submission time for Task(type=JOB_SUBMISSION,
>> identity=urn:0-4-1-1251465736537): 56ms. Score delta:
>> 0.002276923076923077
>> 2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler
>> multiplyScore(Abe-GT2-coasters:1.606(2.487):2/1 overload: 1,
>> 0.002276923076923077)
>> 2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler Old
>> score: 1.606, new score: 1.608
>> 2009-08-28 08:23:31,704-0500 INFO JobSubmissionTaskHandler Submitted
>> task Task(type=JOB_SUBMISSION, identity=urn:0-4-1-1251465736537). Job
>> id: urn:1251465736537-1251465756817-1251465756818
>> 2009-08-28 08:23:31,704-0500 INFO AbstractKarajanChannel
>> Unregistering Command(6, SUBMITJOB)
>> 2009-08-28 08:27:34,210-0500 INFO AbstractKarajanChannel
>> GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
>> 2009-08-28 08:32:34,232-0500 INFO AbstractKarajanChannel
>> GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
>> .... many many times ....
>> 2009-08-28 13:22:34,354-0500 INFO AbstractKarajanChannel
>> GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
>> 2009-08-28 13:27:34,359-0500 INFO AbstractKarajanChannel
>> GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
>> 2009-08-28 13:32:34,358-0500 INFO AbstractKarajanChannel
>> GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
>> 2009-08-28 13:37:34,363-0500 INFO AbstractKarajanChannel
>> GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
>>
>> Look at the timestamps!
>>
>> Note that although I do see the jobs go from Q to R status, I have no
>> idea which jobs they are or what they are doing.
>>
>> The complete log (after interruption) is attached. I also attach my
>> simple swift script -- there are no loops, this is a single execution
>> of a component of my application, before which I do "ls" and calculate
>> the md5 sum of the input images.
>>
>> I have
>>
>> Swift svn swift-r3100 cog-r2446
>>
>> What am I doing wrong?
>>
>> --
>> Andriy Fedorov, Ph.D.
>>
>> Research Fellow
>> Brigham and Women's Hospital
>> Harvard Medical School
>> 75 Francis Street
>> Boston, MA 02115 USA
>> fedorov at bwh.harvard.edu
>
>
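
The "Look at the timestamps!" observation in the quoted message can be checked mechanically. Below is a minimal Python sketch, not part of Swift or CoG, that reports how long a log has shown nothing but HEARTBEAT entries since the last other entry. It assumes each entry sits on one line in the actual log file (the excerpt above is wrapped by the mail client) and that every entry starts with a timestamp in the format shown above. The names SAMPLE_LOG, parse_timestamp and idle_since_last_activity are made up for this illustration.

```python
from datetime import datetime

# Entries copied from the log excerpt in this thread, with each wrapped
# entry rejoined into a single line (an assumption about the real log file).
SAMPLE_LOG = """\
2009-08-28 08:23:31,704-0500 INFO AbstractKarajanChannel Unregistering Command(6, SUBMITJOB)
2009-08-28 08:27:34,210-0500 INFO AbstractKarajanChannel GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
2009-08-28 08:32:34,232-0500 INFO AbstractKarajanChannel GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
2009-08-28 13:22:34,354-0500 INFO AbstractKarajanChannel GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
"""


def parse_timestamp(line):
    """Return the timestamp at the start of a log line, or None if absent."""
    parts = line.split()
    if len(parts) < 2:
        return None
    try:
        # Example: "2009-08-28 08:23:31,704-0500"
        return datetime.strptime(parts[0] + " " + parts[1],
                                 "%Y-%m-%d %H:%M:%S,%f%z")
    except ValueError:
        return None


def idle_since_last_activity(lines):
    """Return (last_activity, idle) or None if no usable entries are found.

    last_activity is the time of the last entry that is not a heartbeat;
    idle is the time between that entry and the last entry in the log.
    """
    last_activity = None
    last_seen = None
    for line in lines:
        ts = parse_timestamp(line)
        if ts is None:
            continue  # skip lines without a leading timestamp
        last_seen = ts
        if "HEARTBEAT" not in line:
            last_activity = ts
    if last_activity is None:
        return None
    return last_activity, last_seen - last_activity


if __name__ == "__main__":
    result = idle_since_last_activity(SAMPLE_LOG.splitlines())
    if result is not None:
        last_activity, idle = result
        print("Last non-heartbeat entry:", last_activity)
        print("Heartbeats only for:", idle)
```

To run it against a real log, pass the open file object instead of SAMPLE_LOG.splitlines(). A long heartbeat-only stretch while qstat shows jobs cycling through the queue would be consistent with the situation described in this thread, although the script itself only measures the gap and does not diagnose the cause.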