[Swift-user] Swift/PBS Scheduler Slow to Report "Finished"?

Michael Wilde wilde at mcs.anl.gov
Thu Mar 19 07:26:19 CDT 2009


The Swift developers will need to look into this issue, which seems to 
be in the Karajan PBS provider. I don't think we see this delay on our 
local PBS cluster here.
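
To make the behavior you describe concrete, here is a rough sketch of a status check that treats the PBS 'C' (completed) state as finished instead of waiting for the job to disappear from qstat. This is not the Karajan provider's code. The function names and the sample job id are made up for illustration, and it assumes Torque-style `qstat -f` output that contains a `job_state = C` line while keep_completed holds the job in the listing.

```python
import re

# Assumption: the scheduler reports single-letter job states, and "C" means
# the job has completed but is still listed because of keep_completed.
TERMINAL_STATES = {"C"}


def job_state(qstat_full_output):
    """Return the job_state letter found in `qstat -f <jobid>` text, or None."""
    match = re.search(r"^\s*job_state\s*=\s*(\S+)", qstat_full_output, re.MULTILINE)
    return match.group(1) if match else None


def is_finished(qstat_full_output):
    """Treat a job as finished if it is in state C or is no longer listed."""
    state = job_state(qstat_full_output)
    return state is None or state in TERMINAL_STATES


# Hypothetical sample of what the scheduler might print for a completed job.
SAMPLE = """Job Id: 1234.example
    Job_Name = first
    job_state = C
    exit_status = 0
"""

if __name__ == "__main__":
    print(is_finished(SAMPLE))
```

A poller built this way would report the job as done on the first poll that sees state C, rather than after the keep_completed interval expires.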

In the meantime, you might want to try the fairly new "coaster" provider:

http://www.ci.uchicago.edu/swift/guides/userguide.php#coasters

This starts "worker" jobs in the target cluster which stay up for the 
duration of a script, into which Swift sends jobs directly without 
involving the scheduler.

If your scheduler has this 5-minute "linger" setting, the overall script 
will still wait at the end (I think), but all the jobs in the script 
should finish very quickly.

If you're interested, preliminary notes on the design of coasters are at:
http://wiki.cogkit.org/wiki/Coasters

A few cautions:

- coasters are a new feature, the code is changing rapidly, and they are 
not yet sufficiently tested.

- we'd welcome your help in evaluating them

- you'll want to run from a source release

- I think they are a good base for a lot of interesting projects and 
studies on scheduling and resource allocation algorithms and approaches.

I tested a simple 10-echo foreach loop with this sites.xml file on our 
local PBS cluster:

--

<config>
<pool handle="teraport" >
   <execution provider="coaster" url="none" jobmanager="local:pbs" />
   <gridftp url="local://localhost" />
   <workdirectory>/home/wilde/swiftwork</workdirectory>
</pool>
</config>

-- which gave:

tp$ swift hellos.swift -sites.file sites.xml -tc.file tc.data
Swift svn swift-r2701 cog-r2332

RunID: 20090319-0658-3ejpl9xc
Progress:
Progress:  Submitting:9 Submitted:1
Progress:  Submitted:9 Active:1
Progress:  Submitted:4 Active:3 Stage out:1 Finished successfully:2
Final status:  Finished successfully:10
Cleaning up...
Shutting down service at https://128.135.125.117:50002
Got channel MetaChannel: 101224864 -> GSSSChannel-null(1)
- Done

--


On 3/19/09 2:42 AM, Andrew Boyce wrote:
> Hello,
> 
> I am currently running Swift in conjunction with the PBS scheduler. My 
> annoyance at the moment is this:
> 
> When running any script, even a simple script such as first.swift (which 
> normally finishes almost instantaneously), Swift always takes precisely 
> five minutes to tell me that my job Finished successfully and copy the 
> files back to the appropriate folder. It is always almost exactly five 
> minutes; I've checked many logs - it polls the scheduler for five 
> minutes. When I run a script (like first.swift) without using the PBS 
> scheduler, everything happens as normal; execution and "Finished 
> successfully" are nearly immediate.
> 
> I think I know what the problem is: even after the scheduler says that 
> the job is 'completed,' (which is generally right away) the scheduler 
> keeps the job up on qstat and such for 5 minutes after (this setting is 
> a PBS server attribute known as 'keep_completed', and I have checked 
> that it is indeed set to 300 seconds; unfortunately I don't have 
> permissions to change it). So when Swift polls the scheduler, the job is 
> still up on qstat, and Swift must think that the task has not yet 
> "Finished successfully."
> 
> My question is this:
> Am I indeed right that Swift does not "understand" that when the PBS 
> scheduler says a job is 'completed', the job really has "Finished 
> successfully"?
> Can this be changed so that Swift does "understand" that a 'completed' 
> job has "Finished successfully"?
> 
> I have not included any files because I think I have narrowed the 
> problem down to a question that does not require those that I would 
> usually provide, but if I am wrong, then I can provide.
> 
> Thank you and sorry for the length.
> 
> Regards,
> 
> Andrew Boyce
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


