[Swift-user] Re: Set maxtime > maxwalltime or your script will hang

Michael Wilde wilde at mcs.anl.gov
Tue Apr 27 11:06:09 CDT 2010


Marcin,

----- "Marcin Hitczenko" <marcin at galton.uchicago.edu> wrote:

> Hi Mike,
> 
> Two unrelated questions:
> 
> 1. It seems then that if I have 8 apptask calls, each of which is one
> hour (maxwalltime is 01:00:00), I should be able to accomplish this by
> submitting one PBS job for one node / 8 cores and having each core run
> a different task. If I understand correctly, this should take on the
> order of 1 hour plus some extra. But, we would set maxwalltime to
> 60*60*8, or 8 hours. Why should it run for 8 hours? I feel as if I may
> still be misunderstanding the distinctions.

For this case, set maxwalltime to 01:00:00 (i.e., the estimated duration of each app() task) and maxtime to 3700 (1 hour plus 100 seconds of "reserve").
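
To make the arithmetic concrete, here is a rough sketch in Python of the fit check as I understand it from the discussion quoted below. This is not Swift's actual code: the function names are made up for illustration, the 10-second default reserve is the figure from the earlier message, and whether the real comparison is "<=" or "<" is an assumption.

```python
def parse_walltime(hms: str) -> int:
    """Convert an HH:MM:SS maxwalltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def task_fits(maxwalltime: str, maxtime_s: int, reserve_s: int = 10) -> bool:
    """Assumed rule: a task fits only if its estimated duration is no
    longer than maxtime minus the reserve kept back for shutdown."""
    return parse_walltime(maxwalltime) <= maxtime_s - reserve_s


def suggested_maxtime(maxwalltime: str, tasks_per_worker: int = 1,
                      margin_s: int = 100) -> int:
    """N times maxwalltime *plus* a margin (hypothetical helper)."""
    return tasks_per_worker * parse_walltime(maxwalltime) + margin_s


if __name__ == "__main__":
    # The hanging case: a 10-minute task never fits a (600-10) second slot.
    print(task_fits("00:10:00", 600))
    # The setting suggested above: one 1-hour task per worker, 100 s margin.
    print(suggested_maxtime("01:00:00"))
    print(task_fits("01:00:00", 3700))
```

The point of the margin is only that maxtime must exceed N times maxwalltime by at least the reserve; 100 seconds is comfortable, not magic.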

One caveat, though, for running on the Fusion cluster: I know that if you ask for the batch queue, it insists that you request at least 2 nodes; if you ask for one, your job is rejected. We need to test exactly how to configure coasters for 1 node on that system. You can try removing the queue element from your sites.xml, and see how it behaves. I need to do more testing on that system and post the results.

> 
> 2. I submitted a job and would like to cancel it (because it has an
> error). I use qdel to cancel the job, but within a few minutes a new
> job restarts. I take it this is swift retrying. How do I actually
> cancel the entire job? I have done qdel quite a few times, but jobs
> keep popping up?

I think the best way to clean up is to interrupt/kill the swift command with a ^C,
and then it should clean up its PBS jobs. I think this has been working for me; let us know if that doesn't work for you.

- Mike


> Thanks,
> 
> Marcin
> 
> 
> > [cc'ing swift-user]
> >
> > Marcin,
> >
> > Quick answer: Since you changed the maxwalltime in tc.data to 10
> > minutes, change the "maxtime" setting in sites.xml to N times 10
> > minutes *plus* 1 minute. The "plus" is important.
> >
> > Long answer:
> >
> > Swift deducts a small fraction of maxtime to use to cleanly shut
> > down the PBS job. The default for this "reserve time" (see the Users
> > Guide) is 10 seconds. So while before it was happily fitting
> > 15-second jobs (the prior setting you had for maxwalltime) into
> > (600-10) second slots, now (with maxwalltime increased to 10
> > minutes) it could not find any slots into which it could fit a 10
> > *minute* job. Unfortunately, at the moment, Swift just hangs,
> > continuing to try to find a slot until the maxtime time runs out and
> > the PBS jobs shut down. (There are "good" reasons for this "bad"
> > behavior, which we need to fix)
> >
> > So bottom line: make maxtime some multiple of maxwalltime, and add a
> > bit to maxtime (say 1 minute, or at least 10 seconds). Note that
> > since maxwalltime is just an *estimate* of how long you expect the
> > apptask to run for, this division is by necessity approximate. After
> > an apptask finishes on a coaster worker CPU, the CPU becomes free,
> > and has some varying amount of time left before the coaster worker
> > expires. Then the process repeats, and Swift again uses maxwalltime
> > to see if there is a coaster worker with at least maxwalltime
> > remaining that can run the next apptask.
> >
> > - Mike
> >
> > ps. How about if from now on, if you forget to cc swift-user on
> > these questions, then I will just cc the list on my replies. This
> > has 2 benefits: other users benefit from your questions and my
> > answers, and other swift developers and users can contribute more
> > advice, suggest better approaches, or correct me when I goof.
> >
> > You should join the swift-user list if you have not already done so:
> >   http://www.ci.uchicago.edu/swift/support/index.php
> >
> > Thanks!
> >
> > ----- "Marcin Hitczenko" <marcin at galton.uchicago.edu> wrote:
> >
> >> Hi Mike,
> >>
> >> Sorry to be such a pain, but I can't get the jobs to run again. I
> >> am still running environment_setup.swift and have not made any
> >> changes to the pbscoast.xml file other than to change walltime. I
> >> am using the same tc.data file. I again run into the problem where
> >> it seems to run for a minute, then fail and restart and it does so
> >> over and over. I logged into the node and it didn't seem to be
> >> running my stuff. Also, the output files are not being written.
> >>
> >> I am not sure why it would not work now, though I ran it and it
> >> worked before.
> >>
> >> Marcin
> >>
> >>
> >>
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



