[Swift-user] Re: Coasters with gt2 and localhost file provider

Mihael Hategan hategan at mcs.anl.gov
Fri Aug 28 14:36:51 CDT 2009


Right. Workers shut down after 10 seconds of inactivity.

I added an option ("maxWorkerIdleTime", in seconds) and changed the
default to 2 minutes (cog r2455).
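
For illustration only, here is a sketch of how the option might be set in the sites file. It assumes "maxWorkerIdleTime" is read from a globus-namespace profile entry, as other coaster settings are; that placement is an assumption and is not confirmed in this message. The pool definition is the one quoted below, and 120 is simply the new default expressed in seconds.

```xml
<pool handle="Abe-GT2-coasters">
  <gridftp url="local://localhost" />
  <execution provider="coaster" jobmanager="gt2:gt2:pbs"
             url="grid-abe.ncsa.teragrid.org"/>
  <!-- Assumption: the option is picked up from a globus-namespace profile,
       like other coaster settings. Value is in seconds (120 = 2 minutes). -->
  <profile namespace="globus" key="maxWorkerIdleTime">120</profile>
  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
</pool>
```

The option requires a cog checkout at r2455 or later, since that is the revision that introduced it.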

Mihael

On Fri, 2009-08-28 at 14:56 -0400, Andriy Fedorov wrote:
> Hi,
> 
> I have a gt2:gt2:pbs coaster provider on NCSA Abe with a local
> filesystem provider:
> 
> <pool handle="Abe-GT2-coasters">
>   <gridftp  url="local://localhost" />
>   <execution provider="coaster" jobmanager="gt2:gt2:pbs"
>   url="grid-abe.ncsa.teragrid.org"/>
>   <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
> </pool>
> 
> I have been submitting jobs that seemed to be stuck in the scheduler
> queue for too long. Here's the output of swift -v:
> 
> Unregistering Command(6, SUBMITJOB)
> Progress:  Submitted:2  Finished successfully:3
> Progress:  Submitted:2  Finished successfully:3
> Progress:  Submitted:2  Finished successfully:3
> Progress:  Submitted:2  Finished successfully:3
> Progress:  Submitted:2  Finished successfully:3
> Progress:  Submitted:2  Finished successfully:3
> Progress:  Submitted:2  Finished successfully:3
> Progress:  Submitted:2  Finished successfully:3
> GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
> Progress:  Submitted:2  Finished successfully:3
> Progress:  Submitted:2  Finished successfully:3
> Progress:  Submitted:2  Finished successfully:3
> Progress:  Submitted:2  Finished successfully:3
> Progress:  Submitted:2  Finished successfully:3
> Progress:  Submitted:2  Finished successfully:3
> Progress:  Submitted:2  Finished successfully:3
> Progress:  Submitted:2  Finished successfully:3
> Progress:  Submitted:2  Finished successfully:3
> Progress:  Submitted:2  Finished successfully:3
> GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
> ..... many many times ......
> 
> Upon investigating this, it turns out that the scheduler delay is not
> the source of the problem. By looking at the output of "qstat", I see
> a job of 1 hr length scheduled, then it gets into the queue, waits,
> runs, completes, and immediately a new job of 1 hr length is
> scheduled. This repeats over and over.
> 
> Nothing in the output of "swift -v" explains what is going on.
> 
> Looking at the log, I see this:
> 
> 2009-08-28 08:23:31,703-0500 INFO  AbstractKarajanChannel
> GSSSChannel-null(1) REPL: Command(6, SUBMITJOB)
> 2009-08-28 08:23:31,704-0500 DEBUG TaskImpl Task(type=JOB_SUBMISSION,
> identity=urn:0-4-1-1251465736537) setting status to Submitted
> 2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler
> Submission time for Task(type=JOB_SUBMISSION,
> identity=urn:0-4-1-1251465736537): 56ms. Score delta:
> 0.002276923076923077
> 2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler
> multiplyScore(Abe-GT2-coasters:1.606(2.487):2/1 overload: 1,
> 0.002276923076923077)
> 2009-08-28 08:23:31,704-0500 DEBUG WeightedHostScoreScheduler Old
> score: 1.606, new score: 1.608
> 2009-08-28 08:23:31,704-0500 INFO  JobSubmissionTaskHandler Submitted
> task Task(type=JOB_SUBMISSION, identity=urn:0-4-1-1251465736537). Job
> id: urn:1251465736537-1251465756817-1251465756818
> 2009-08-28 08:23:31,704-0500 INFO  AbstractKarajanChannel
> Unregistering Command(6, SUBMITJOB)
> 2009-08-28 08:27:34,210-0500 INFO  AbstractKarajanChannel
> GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
> 2009-08-28 08:32:34,232-0500 INFO  AbstractKarajanChannel
> GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
> .... many many times ....
> 2009-08-28 13:22:34,354-0500 INFO  AbstractKarajanChannel
> GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
> 2009-08-28 13:27:34,359-0500 INFO  AbstractKarajanChannel
> GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
> 2009-08-28 13:32:34,358-0500 INFO  AbstractKarajanChannel
> GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
> 2009-08-28 13:37:34,363-0500 INFO  AbstractKarajanChannel
> GSSSChannel-null(1) REQ: Handler(HEARTBEAT)
> 
> Look at the timestamps!
> 
> Note that I do see the jobs go from Q to R status, but I have no idea
> which jobs they are or what they are doing.
> 
> The complete log (after interruption) is attached. I also attach my
> simple swift script -- there are no loops, this is a single execution of
> a component of my application, before which I do "ls" and calculate
> md5 sum of the input images.
> 
> I have
> 
> Swift svn swift-r3100 cog-r2446
> 
> What am I doing wrong?
> 
> --
> Andriy Fedorov, Ph.D.
> 
> Research Fellow
> Brigham and Women's Hospital
> Harvard Medical School
> 75 Francis Street
> Boston, MA 02115 USA
> fedorov at bwh.harvard.edu



