[Swift-user] Question on coaster job time calculations

Mihael Hategan hategan at mcs.anl.gov
Wed May 5 21:23:57 CDT 2010


On Wed, 2010-05-05 at 20:48 -0500, wilde at mcs.anl.gov wrote:
> Your points below do not address the main problem I am pointing out:
> that my script ran the initial 300+ shorter jobs fine, and then
> *hung*, not running the last 100 longer jobs even though there was a
> coaster pool defined that should have accepted them.
> 
> Why was that?

No clue. I was responding to the previous email.
I will try to look at the logs tonight.

> 
> I'll try a different approach though.
> 
> More followup below on your other replies.
> 
> ----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
> 
> > Given that we are talking about coasters whose purpose is to bunch
> > multiple jobs into one big job, I find that expectation odd :)
> 
> PADS frequently has a huge queue of jobs in the 1+ hour category. But
> jobs in the fast (<1 hour) queue start right away. So in this case it's
> better not to batch these jobs. When my jobs wound up in the long
> queue, I waited almost an hour without a single job starting.
> 
> I realize that on PADS, I can move these jobs to an ordinary PBS queue
> (and will do that). But on almost all TeraGrid systems, which we're
> trying to use as well, we *must* use coasters in order to use all
> cores of multi-core nodes. So we should still address the problem.

Maybe there is a misunderstanding here. Coasters/glideins are supposed
to increase performance by packing multiple rounds of jobs into a
single LRM job. As such, it is not entirely unexpected for the LRM job
to have a larger walltime than any individual job it runs. I understood
your problem, but I am not sure why the behavior is surprising.

> 
> > 
> > What you want to play with is (low|high)Overallocation and
> > overallocationDecayFactor. They are documented in the Swift user
> > guide.
> 
> I have read this many times and still don't understand it well enough
> to apply it. I get the gist of it, but what's missing is a
> fundamental understanding of what the coaster scheduling and "packing"
> approach is.

It sorts the jobs and divides them into bunches based on the spread of
walltimes. For each bunch, it then computes the overallocation (using
the longest job in the bunch); that becomes the walltime of the block.
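
For illustration, here is a rough Java sketch of that sort-and-bunch
idea. It is not the actual coaster code: the bunch count, the job
walltimes, and the overallocation constants (10, 1, 0.001) are all
assumed values.

import java.util.Arrays;

public class BunchSketch {
    public static void main(String[] args) {
        // hypothetical job walltimes, in seconds
        double[] walltimes = {90, 60, 600, 3600, 120, 660};
        Arrays.sort(walltimes);           // sort jobs by walltime
        int bunches = 2;                  // assumed fixed bunch count
        int per = (int) Math.ceil(walltimes.length / (double) bunches);
        for (int b = 0; b < bunches; b++) {
            int from = b * per;
            int to = Math.min(from + per, walltimes.length);
            double longest = walltimes[to - 1]; // longest job in this bunch
            // block walltime = overallocation applied to the longest job
            double block = longest
                    * ((10.0 - 1.0) * Math.exp(-longest * 0.001) + 1.0);
            System.out.printf("bunch %d: %s -> block walltime %.0f s%n",
                    b, Arrays.toString(Arrays.copyOfRange(walltimes, from, to)),
                    block);
        }
    }
}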

>  And one really needs to plot the exponential decay function you
> describe in the user guide text on "low overallocation". Lastly, I'm not
> sure that the formula as it's described there is correct.

Could be. Here's the code, verbatim:
(wt * ((settings.getLowOverallocation() - settings.getHighOverallocation())
        * Math.exp(-wt * settings.getOverallocationDecayFactor())
        + settings.getHighOverallocation()));

That's jobwalltime * ((lo - hi) * e^(-jobwalltime * decay) + hi).
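
To make that concrete, here is a small sketch that evaluates the formula
for a few job walltimes. The parameter values used (lo = 10, hi = 1,
decay = 0.001) are illustrative assumptions, not necessarily the
defaults.

public class OverallocationDemo {
    // wt * ((lo - hi) * e^(-wt * decay) + hi), per the snippet above
    static double blockWalltime(double wt, double lo, double hi,
            double decay) {
        return wt * ((lo - hi) * Math.exp(-wt * decay) + hi);
    }

    public static void main(String[] args) {
        // With lo = 10, hi = 1, decay = 0.001, a 60 s job is padded
        // ~9.5x while a 3600 s job is padded only ~1.2x.
        for (double wt : new double[] {60, 600, 3600}) {
            double block = blockWalltime(wt, 10, 1, 0.001);
            System.out.printf("wt = %4.0f s -> block = %5.0f s (%.1fx)%n",
                    wt, block, block / wt);
        }
    }
}

Plotted over the job walltime, the multiplier decays from near lo for
very short jobs toward hi for long ones, which is the exponential decay
the user guide describes.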
