[Swift-user] Question on coaster job time calculations

wilde at mcs.anl.gov wilde at mcs.anl.gov
Wed May 5 20:48:35 CDT 2010


Your points below do not address the main problem I am pointing out: that my script ran the initial 300+ shorter jobs fine, and then *hung*, not running the last 100 longer jobs even though there was a coaster pool defined that should have accepted them.

Why was that?

I'll try a different approach though.

More followup below on your other replies.

----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:

> Given that we are talking about coasters whose purpose is to bunch
> multiple jobs into one big job, I find that expectation odd :)

PADS frequently has a huge queue of jobs in the 1+ hour category, but jobs in the fast (<1 hour) queue start right away. So in this case it's better not to batch these jobs. When I wound up in the short queue, I waited almost an hour without a single job starting.

I realize that on PADS, I can move these jobs to an ordinary PBS queue (and will do that). But on almost all TeraGrid systems, which we're trying to use as well, we *must* use coasters in order to use all cores of multi-core nodes. So we should still address the problem.

> 
> What you want to play with is (low|high)Overallocation and
> overallocationDecayFactor. They are documented in the swift user
> guide.

I have read this many times and still don't understand it well enough to apply it. I get the gist of it, but what's missing is a fundamental understanding of what the coaster scheduling and "packing" approach is. And one really needs to plot the exponential decay function you describe in the user guide text on "low overallocation". Lastly, I'm not sure that the formula as it's described there is correct. Some initial evaluations are confusing me; I need to re-run and plot the curve.
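
For anyone who wants to tabulate or plot that curve, here is a minimal Python sketch of one possible reading of the overallocation function. It is not taken from the coaster source. The form of the formula (an overallocation factor that decays exponentially from a low-endpoint value toward a high-endpoint value, multiplied by the job walltime) is an assumption, as are the values 10 and 1 used for lowOverallocation and highOverallocation. The decay factor 0.001 is the value quoted in this thread. The function names are hypothetical. Under these assumptions a 30-minute job gives a block walltime of roughly 75 minutes, which agrees with the figure Mihael quotes below, but that agreement does not confirm the formula.

```python
import math

# Assumed values; check them against the Swift user guide before relying on them.
LOW_OVERALLOCATION = 10.0   # assumed factor for a very short job
HIGH_OVERALLOCATION = 1.0   # assumed factor for a very long job
DECAY_FACTOR = 0.001        # value quoted in this thread


def overallocation_factor(walltime_s, low=LOW_OVERALLOCATION,
                          high=HIGH_OVERALLOCATION, decay=DECAY_FACTOR):
    """Assumed form: decays exponentially from `low` toward `high`."""
    return (low - high) * math.exp(-walltime_s * decay) + high


def block_walltime_s(walltime_s, low=LOW_OVERALLOCATION,
                     high=HIGH_OVERALLOCATION, decay=DECAY_FACTOR):
    """Assumed block walltime: job walltime times the overallocation factor."""
    return walltime_s * overallocation_factor(walltime_s, low, high, decay)


if __name__ == "__main__":
    # Compare the decay factor quoted as current (0.001) with the suggested 0.003.
    print("job(min)  block(min) @0.001  block(min) @0.003")
    for minutes in (1, 5, 15, 30, 60, 120):
        seconds = minutes * 60
        a = block_walltime_s(seconds, decay=0.001) / 60
        b = block_walltime_s(seconds, decay=0.003) / 60
        print(f"{minutes:8d}  {a:17.1f}  {b:17.1f}")
```

Running it prints block walltimes for a few job lengths under both decay factors. In this model a larger decay factor shrinks the block requested for longer jobs, which is the effect the suggested change to 0.003 is after.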

> I believe that you want a slightly larger decay factor (say 0.003
> instead of 0.001)
> 
> >  but the PBS walltime was set to 90 mins (causing the job to wait in
> > the short queue rather than start right away in the fast queue; I have
> > the sites "queue" element set to "route" which selects the best queue
> > based on PBS walltime).
> 
> Though for 30 minutes and the default settings, you should get a block
> walltime of about 75 minutes, not 90. Are you sure about the 90?

Pretty sure, but I will check next chance I get to experiment; probably not for a while.

- Mike

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



