[Swift-user] Re: Set maxtime > maxwalltime or your script will hang

Marcin Hitczenko marcin at galton.uchicago.edu
Tue Apr 27 12:00:39 CDT 2010


Hi,

> Marcin,
>
> ----- "Marcin Hitczenko" <marcin at galton.uchicago.edu> wrote:
>
>> Hi Mike,
>>
>> Two unrelated questions:
>>
>> 1. It seems, then, that if I have 8 apptask calls, each of which takes
>> one hour (maxwalltime is 01:00:00), I should be able to accomplish this
>> by submitting one PBS job for one node / 8 cores and having each core
>> run a different task. If I understand correctly, this should take on
>> the order of 1 hour plus some extra. But we would set maxwalltime to
>> 60*60*8, or 8 hours. Why should it run for 8 hours? I feel I may still
>> be misunderstanding the distinctions.
>
> For this case, set maxwalltime to 01:00:00 (i.e., the estimated duration
> of each app() task) and maxtime to 3700 (1 hour with 100 secs "reserve").
>
> One caveat, though, for running on the Fusion cluster: I know that if you
> ask for the batch queue, it insists that you request at least 2 nodes; if
> you ask for one, your job is rejected. We need to test exactly how to
> configure coasters for 1 node on that system. You can try removing the
> queue element from your sites.xml, and see how it behaves. I need to do
> more testing on that system and post the results.

But even if we have 16 app() tasks and request 2 nodes, shouldn't maxtime
be more or less the same, i.e. 3600 + reserve? My question is really about
the formula maxtime = maxwalltime*N + reserve. Shouldn't N be the number of
waves, i.e. (# of app() tasks) / (8 cores/node * # of nodes), rather than
the number of app() tasks?
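
To make the arithmetic concrete, here is a minimal Python sketch of the
"waves" reading of N. The function names, the 8 cores per node, and the
100-second margin are assumptions for illustration (the margin matches the
3700 suggested above). The fit test follows the description of the reserve
time quoted further down, with its default of 10 seconds. This is only my
reading of the settings, not how Swift itself computes anything.

```python
import math


def waves(num_tasks, nodes, cores_per_node=8):
    """Rounds needed if each core runs one app() task at a time (assumption)."""
    return math.ceil(num_tasks / (nodes * cores_per_node))


def suggested_maxtime(maxwalltime_s, num_tasks, nodes,
                      cores_per_node=8, margin_s=100):
    """maxtime = maxwalltime * N + margin, with N taken as the number of waves."""
    return maxwalltime_s * waves(num_tasks, nodes, cores_per_node) + margin_s


def fits(maxwalltime_s, time_left_s, reserve_s=10):
    """True if a task of maxwalltime_s fits in the time left minus the reserve."""
    return maxwalltime_s <= time_left_s - reserve_s


print(suggested_maxtime(3600, 8, 1))    # 8 tasks, 1 node   -> 3700
print(suggested_maxtime(3600, 16, 2))   # 16 tasks, 2 nodes -> 3700
print(suggested_maxtime(3600, 16, 1))   # 16 tasks, 1 node  -> 7300
print(fits(600, 600))                   # 10-minute task, maxtime 600 -> False
print(fits(15, 600))                    # 15-second task, maxtime 600 -> True
```

With N counted as waves, 16 tasks on 2 nodes need the same maxtime as 8
tasks on 1 node, which is the point of the question.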




>> 2. I submitted a job and would like to cancel it (because it has an
>> error). I use qdel to cancel the job, but within a few minutes a new
>> job starts. I take it this is swift retrying. How do I actually cancel
>> the entire job? I have done qdel quite a few times, but jobs keep
>> popping up.

I am running swift in the background with the command
swift .... >& swift.out &
so I don't think I can kill the command directly.
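
For a swift command started in the background like this, ^C typed at the
keyboard does not reach it, but the same signal (SIGINT) can be sent to its
process ID. A minimal Python sketch, assuming the process ID is known (for
example from ps); the helper name interrupt is hypothetical. Whether swift
then removes its PBS jobs is the behavior described in the reply quoted
below, not something this sketch guarantees.

```python
import os
import signal


def interrupt(pid):
    """Send SIGINT to pid, the same signal a terminal sends on ^C."""
    os.kill(pid, signal.SIGINT)


# Usage (hypothetical process ID of the backgrounded swift command):
#   interrupt(12345)
```

From an interactive shell with job control, bringing the job to the
foreground with fg and then typing ^C, or running kill -INT on the process
ID, should have the same effect.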

> I think the best way to clean up is to interrupt/kill the swift command
> with a ^C, and then it should clean up its PBS jobs. I think this has
> been working for me; let us know if that doesn't work for you.
>
> - Mike
>
>
>> Thanks,
>>
>> Marcin
>>
>>
>> > [cc'ing swift-user]
>> >
>> > Marcin,
>> >
>> > Quick answer: Since you changed the maxwalltime in tc.data to 10
>> > minutes, change the "maxtime" setting in sites.xml to N times 10
>> > minutes *plus* 1 minute. The "plus" is important.
>> >
>> > Long answer:
>> >
>> > Swift deducts a small fraction of maxtime to use to cleanly shut down
>> > the PBS job. The default for this "reserve time" (see the Users Guide)
>> > is 10 seconds. So while before it was happily fitting 15-second jobs
>> > (the prior setting you had for maxwalltime) into (600-10) second
>> > slots, now (with maxwalltime increased to 10 minutes) it could not
>> > find any slots into which it could fit a 10 *minute* job.
>> > Unfortunately, at the moment, Swift just hangs, continuing to try to
>> > find a slot until the maxtime time runs out and the PBS jobs shut
>> > down. (There are "good" reasons for this "bad" behavior, which we
>> > need to fix.)
>> >
>> > So bottom line: make maxtime some multiple of maxwalltime, and add a
>> > bit to maxtime (say 1 minute, or at least 10 seconds). Note that since
>> > maxwalltime is just an *estimate* of how long you expect the apptask
>> > to run for, this division is by necessity approximate. After an
>> > apptask finishes on a coaster worker CPU, the CPU becomes free, and
>> > has some varying amount of time left before the coaster worker
>> > expires. Then the process repeats, and Swift again uses maxwalltime
>> > to see if there is a coaster worker with at least maxwalltime
>> > remaining that can run the next apptask.
>> >
>> > - Mike
>> >
>> > ps. How about if from now on, if you forget to cc swift-user on these
>> > questions, then I will just cc the list on my replies. This has 2
>> > benefits: other users benefit from your questions and my answers, and
>> > other swift developers and users can contribute more advice, suggest
>> > better approaches, or correct me when I goof.
>> >
>> > You should join the swift-user list if you have not already done so:
>> >   http://www.ci.uchicago.edu/swift/support/index.php
>> >
>> > Thanks!
>> >
>> > ----- "Marcin Hitczenko" <marcin at galton.uchicago.edu> wrote:
>> >
>> >> Hi Mike,
>> >>
>> >> Sorry to be such a pain, but I can't get the jobs to run again. I am
>> >> still running environment_setup.swift and have not made any changes
>> >> to the pbscoast.xml file other than to change walltime. I am using
>> >> the same tc.data file. I again run into the problem where it seems to
>> >> run for a minute, then fail and restart, and it does so over and
>> >> over. I logged into the node and it didn't seem to be running my
>> >> stuff. Also, the output files are not being written.
>> >>
>> >> I am not sure why it would not work now, though I ran it and it
>> >> worked before.
>> >>
>> >> Marcin
>> >>
>> >>
>> >>
>> >
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>



