[Swift-user] Re: Set maxtime > maxwalltime or your script will hang

Marcin Hitczenko marcin at galton.uchicago.edu
Tue Apr 27 10:56:17 CDT 2010


Hi Mike,

Two unrelated questions:

1. Suppose I have 8 apptask calls, each of which takes one hour
(maxwalltime is 01:00:00). I should be able to accomplish this by
submitting one PBS job for one node/8 cores and having each core run a
different task. If I understand correctly, this should take on the order
of 1 hour plus some extra. But we would set maxtime to 60*60*8, or 8
hours. Why should it run for 8 hours? I feel I may still be
misunderstanding the distinction between the two settings.

2. I submitted a job and would like to cancel it (because it has an
error). I use qdel to cancel the job, but within a few minutes a new job
starts. I take it this is Swift retrying. How do I actually cancel the
entire job? I have run qdel quite a few times, but jobs keep popping up.

Thanks,

Marcin


> [cc'ing swift-user]
>
> Marcin,
>
> Quick answer: Since you changed the maxwalltime in tc.data to 10 minutes,
> change the "maxtime" setting in sites.xml to N times 10 minutes *plus* 1
> minute. The "plus" is important.
>
> Long answer:
>
> Swift deducts a small fraction of maxtime to use to cleanly shut down the
> PBS job. The default for this "reserve time" (see the Users Guide) is 10
> seconds. So while before it was happily fitting 15-second jobs (the prior
> setting you had for maxwalltime) into (600-10) second slots, now (with
> maxwalltime increased to 10 minutes) it could not find any slots into
> which it could fit a 10 *minute* job. Unfortunately, at the moment, Swift
> just hangs, continuing to try to find a slot until the maxtime time runs
> out and the PBS jobs shut down. (There are "good" reasons for this "bad"
> behavior, which we need to fix.)
>
> So bottom line: make maxtime some multiple of maxwalltime, and add a bit
> to maxtime (say 1 minute, or at least 10 seconds). Note that since
> maxwalltime is just an *estimate* of how long you expect the apptask to
> run for, this division is by necessity approximate. After an apptask
> finishes on a coaster worker CPU, the CPU becomes free, and has some
> varying amount of time left before the coaster worker expires. Then the
> process repeats, and Swift again uses maxwalltime to see if there is a
> coaster worker with at least maxwalltime remaining that can run the next
> apptask.
>
> - Mike
>
> P.S. How about this: from now on, if you forget to cc swift-user on these
> questions, I will just cc the list on my replies. This has 2
> benefits: other users benefit from your questions and my answers, and
> other Swift developers and users can contribute more advice, suggest
> better approaches, or correct me when I goof.
>
> You should join the swift-user list if you have not already done so:
>   http://www.ci.uchicago.edu/swift/support/index.php
>
> Thanks!
>
> ----- "Marcin Hitczenko" <marcin at galton.uchicago.edu> wrote:
>
>> Hi Mike,
>>
>> Sorry to be such a pain, but I can't get the jobs to run again. I am
>> still running environment_setup.swift and have not made any changes to
>> the pbscoast.xml file other than to change walltime. I am using the same
>> tc.data file. I again run into the problem where it seems to run for a
>> minute, then fail and restart, and it does so over and over. I logged
>> into the node and it didn't seem to be running my stuff. Also, the
>> output files are not being written.
>>
>> I am not sure why it would not work now, though I ran it and it worked
>> before.
>>
>> Marcin
>>
>>
>>
>
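
For reference, the sizing rule in Mike's quoted reply can be written out as
a little arithmetic. This is only a simplified model of the behavior he
describes, not Swift's actual scheduling code. The function names are
made up for illustration, and the 10-second reserve and 1-minute margin are
the figures mentioned in his message.

```python
# Simplified model of the slot-fitting rule described in the quoted reply.
# All function names are hypothetical; this is not Swift's implementation.

def usable_seconds(maxtime_s, reserve_s=10):
    """Time left for apptasks after the shutdown reserve is deducted."""
    return maxtime_s - reserve_s


def task_fits(maxwalltime_s, remaining_s):
    """An apptask is placed only if at least maxwalltime remains."""
    return remaining_s >= maxwalltime_s


def suggested_maxtime(maxwalltime_s, n_tasks, margin_s=60):
    """N times maxwalltime *plus* a margin that covers the reserve."""
    return n_tasks * maxwalltime_s + margin_s


# Before: maxtime = 600 s, maxwalltime = 15 s. The task fits in 590 s.
print(task_fits(15, usable_seconds(600)))    # True

# After: maxwalltime raised to 600 s with maxtime unchanged. No slot fits.
print(task_fits(600, usable_seconds(600)))   # False

# With the margin added, one 10-minute task fits again.
print(task_fits(600, usable_seconds(suggested_maxtime(600, 1))))  # True
```

The second case is the one that hangs: the usable slot (590 seconds) is
always shorter than the 600-second estimate, so no apptask is ever placed.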



