[Swift-user] Set maxtime > maxwalltime or your script will hang

wilde at mcs.anl.gov wilde at mcs.anl.gov
Tue Apr 27 06:04:48 CDT 2010


[cc'ing swift-user]

Marcin, 

Quick answer: Since you changed the maxwalltime in tc.data to 10 minutes, change the "maxtime" setting in sites.xml to N times 10 minutes *plus* 1 minute. The "plus" is important.

Long answer:

Swift sets aside a small slice of maxtime so that it can shut down the PBS job cleanly. The default for this "reserve time" (see the Users Guide) is 10 seconds. So before, Swift was happily fitting 15-second jobs (your prior maxwalltime setting) into (600-10)-second slots. Now, with maxwalltime increased to 10 minutes, it cannot find any slot into which a 10 *minute* job fits, because the job needs 600 seconds and only 590 are available. Unfortunately, at the moment, Swift just hangs: it keeps trying to find a slot until maxtime runs out and the PBS jobs shut down. (There are "good" reasons for this "bad" behavior, which we need to fix.)
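
To make the arithmetic concrete, here is a minimal Python sketch of the fit check described above. It only illustrates the reasoning and is not Swift's actual scheduler code; the function name fits_in_slot is made up for this example, and all times are in seconds.

```python
def fits_in_slot(maxtime_s, maxwalltime_s, reserve_s=10):
    """Return True if a job estimated at maxwalltime_s fits in a fresh slot.

    Assumption (illustrative model only): the usable time in a slot is
    maxtime minus the reserve time, which defaults to 10 seconds.
    """
    return maxwalltime_s <= maxtime_s - reserve_s


# Old setting: 15-second jobs in a 600-second slot.
print(fits_in_slot(600, 15))
# New setting: 10-minute jobs in the same 600-second slot.
print(fits_in_slot(600, 600))
# Suggested fix: add one minute to maxtime.
print(fits_in_slot(660, 600))
```

Under this model the first and third checks succeed and the second fails, which matches the hang you are seeing.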

So, bottom line: make maxtime some multiple of maxwalltime, and add a bit to maxtime (say 1 minute, or at least 10 seconds). Note that since maxwalltime is just an *estimate* of how long you expect the apptask to run, this division is by necessity approximate. After an apptask finishes on a coaster worker CPU, the CPU becomes free, with some varying amount of time left before the coaster worker expires. The process then repeats: Swift again uses maxwalltime to check whether there is a coaster worker with at least maxwalltime remaining that can run the next apptask.
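
As a rough sketch of that rule of thumb, the Python below computes a padded maxtime and then walks through the repeated "is there at least maxwalltime left?" check for one worker CPU. It is a simplified model under my own assumptions, not Swift code: the names recommended_maxtime and jobs_started are hypothetical, and times are in seconds.

```python
def recommended_maxtime(maxwalltime_s, n_jobs, pad_s=60):
    """maxtime = N * maxwalltime plus a pad (1 minute here)."""
    return n_jobs * maxwalltime_s + pad_s


def jobs_started(maxtime_s, maxwalltime_s, actual_runtimes_s, reserve_s=10):
    """Count how many apptasks one worker CPU starts before it expires.

    Assumed model: a task starts only if the time remaining (after the
    reserve is deducted) is at least maxwalltime; the worker then loses
    the task's actual runtime, which may differ from the estimate.
    """
    remaining = maxtime_s - reserve_s
    started = 0
    for runtime in actual_runtimes_s:
        if remaining < maxwalltime_s:
            break  # no room for another task of the estimated length
        remaining -= runtime
        started += 1
    return started


maxtime = recommended_maxtime(600, 3)             # 3 * 600 + 60
print(maxtime)
print(jobs_started(1800, 600, [600, 600, 600]))   # no pad
print(jobs_started(maxtime, 600, [600, 600, 600]))  # with pad
```

Without the pad, the reserve time eats into the last slot and the third task never starts; with the pad, all three fit. If tasks finish faster than the estimate, a worker may fit in more than N of them.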

- Mike

P.S. How about this: from now on, if you forget to cc swift-user on these questions, I will just cc the list on my replies. This has 2 benefits: other users benefit from your questions and my answers, and other Swift developers and users can contribute more advice, suggest better approaches, or correct me when I goof.

You should join the swift-user list if you have not already done so:
  http://www.ci.uchicago.edu/swift/support/index.php

Thanks!

----- "Marcin Hitczenko" <marcin at galton.uchicago.edu> wrote:

> Hi Mike,
> 
> Sorry to be such a pain, but I can't get the jobs to run again. I am
> still running environment_setup.swift and have not made any changes to
> the pbscoast.xml file other than to change walltime. I am using the
> same tc.data file. I again run into the problem where it seems to run
> for a minute, then fail and restart and it does so over and over. I
> logged into the node and it didn't seem to be running my stuff. Also,
> the output files are not being written.
> 
> I am not sure why it would not work now, though I ran it and it worked
> before.
> 
> Marcin
> 
> 
> 



