[Swift-user] Swift/PBS Scheduler Slow to Report "Finished"?

Michael Wilde wilde at mcs.anl.gov
Thu Mar 19 07:35:56 CDT 2009


Sorry, one clarification:

 > - you'll want to run from a source release

By this I meant that the coaster code is changing frequently due to 
ongoing testing, so you'll typically want to run the latest svn 
revision. Checkout and build are simple and are described under "Building 
Swift" at: http://www.ci.uchicago.edu/swift/downloads/index.php


On 3/19/09 7:26 AM, Michael Wilde wrote:
> The Swift developers will need to look into this issue, which seems to 
> be with the Karajan PBS provider. I don't think we see this delay on our 
> local PBS cluster here.
> 
> In the meantime, you might want to try the fairly new "coaster" provider:
> 
> http://www.ci.uchicago.edu/swift/guides/userguide.php#coasters
> 
> This starts "worker" jobs in the target cluster which stay up for the 
> duration of a script, into which Swift sends jobs directly without 
> involving the scheduler.
> 
> If your scheduler has this 5-minute "linger" setting, the overall script 
> will still wait at the end (I think), but all the jobs in the script 
> should finish very quickly.
> 
> If you're interested, preliminary notes on the design of coasters are at:
> http://wiki.cogkit.org/wiki/Coasters
> 
> A few cautions:
> 
> - coasters are a new feature, the code is changing rapidly, and they are 
> not yet sufficiently tested.
> 
> - we'd welcome your help in evaluating them
> 
> - you'll want to run from a source release
> 
> - I think they are a good base for a lot of interesting projects and 
> studies on scheduling and resource allocation algorithms and approaches.
> 
> I tested a simple 10-echo foreach loop with this sites.xml file on our 
> local PBS cluster:
> 
> -- 
> 
> <config>
> <pool handle="teraport" >
>   <execution provider="coaster" url="none" jobmanager="local:pbs" />
>   <gridftp url="local://localhost" />
>   <workdirectory>/home/wilde/swiftwork</workdirectory>
> </pool>
> </config>
> 
> -- which gave:
> 
> tp$ swift hellos.swift -sites.file sites.xml -tc.file tc.data
> Swift svn swift-r2701 cog-r2332
> 
> RunID: 20090319-0658-3ejpl9xc
> Progress:
> Progress:  Submitting:9 Submitted:1
> Progress:  Submitted:9 Active:1
> Progress:  Submitted:4 Active:3 Stage out:1 Finished successfully:2
> Final status:  Finished successfully:10
> Cleaning up...
> Shutting down service at https://128.135.125.117:50002
> Got channel MetaChannel: 101224864 -> GSSSChannel-null(1)
> - Done
> 
> -- 
> 
> 
> On 3/19/09 2:42 AM, Andrew Boyce wrote:
>> Hello,
>>
>> I am currently running Swift in conjunction with the PBS scheduler. My 
>> annoyance at the moment is this:
>>
>> When running any script, even a simple script such as first.swift 
>> (which normally finishes almost instantaneously), Swift always takes 
>> precisely five minutes to tell me that my job Finished successfully 
>> and copy the files back to the appropriate folder. It is always almost 
>> exactly five minutes; I've checked many logs - it polls the scheduler 
>> for five minutes. When I run a script (like first.swift) without using 
>> the PBS scheduler, everything happens as normal; execution and 
>> "Finished successfully" are nearly immediate.
>>
>> I think I know what the problem is: even after the scheduler says that 
>> the job is 'completed' (which is generally right away), the scheduler 
>> keeps the job listed in qstat for 5 minutes afterward (this setting 
>> is a PBS server attribute known as 'keep_completed', and I have 
>> checked that it is indeed set to 300 seconds; unfortunately I don't 
>> have permissions to change it). So when Swift polls the scheduler, the 
>> job is still listed in qstat, and Swift must think that the task has not 
>> yet "Finished successfully."
>>
>> My question is this:
>> Am I indeed right that Swift does not "understand" that when the PBS 
>> scheduler says a job is 'completed', the job really has "Finished 
>> successfully"?
>> Can this be changed so that Swift does "understand" that a 'completed' 
>> job has "Finished successfully"?
>>
>> I have not included any files because I think I have narrowed the 
>> problem down to a question that does not require the ones I would 
>> usually provide, but if I am wrong, I can provide them.
>>
>> Thank you and sorry for the length.
>>
>> Regards,
>>
>> Andrew Boyce
>> _______________________________________________
>> Swift-user mailing list
>> Swift-user at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
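
The behavior Andrew asks for can be sketched in a few lines of Python.
This is an illustration only, not the Karajan PBS provider's actual
code. It assumes Torque-style default qstat output, in which a job
retained by 'keep_completed' is still listed but with state C. The
sample listing, the job ids, and the function names (job_states,
is_finished) are hypothetical.

```python
# Hypothetical sketch: NOT the actual Swift/Karajan PBS provider logic.
# Assumes Torque-style default `qstat` columns:
#   Job id, Name, User, Time Use, S (state), Queue

# Made-up sample listing: one job retained in state C, one still running.
SAMPLE_QSTAT = """\
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1234.host                 first            user1           00:00:00 C batch
1235.host                 sleepy           user1           00:00:12 R batch
"""


def job_states(qstat_text):
    """Map each job id to its one-letter state from a qstat listing."""
    states = {}
    for line in qstat_text.splitlines():
        fields = line.split()
        # Data rows split into exactly six fields; the header splits into
        # eight, and the dashed rule is six fields made only of dashes.
        if len(fields) != 6 or set(fields[0]) == {"-"}:
            continue
        states[fields[0]] = fields[4]
    return states


def is_finished(job_id, states):
    """Treat a job as finished if it is in state C or no longer listed."""
    state = states.get(job_id)
    return state is None or state == "C"


states = job_states(SAMPLE_QSTAT)
print(is_finished("1234.host", states))  # completed but still listed
print(is_finished("1235.host", states))  # still running
```

With a check like this, a poller would report job 1234 as finished on
the first poll after it completes, rather than waiting for the
keep_completed interval to expire and the job to drop out of qstat.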
