[Swift-devel] Re: replication vs site score

Mihael Hategan hategan at mcs.anl.gov
Wed Apr 8 10:03:07 CDT 2009


On Wed, 2009-04-08 at 16:03 +0800, Qin Zheng wrote:
> Dear Ben,
> 
> Thanks for your detailed reply; it helps me understand scheduling in Swift better.
> 
> I wrote from a researcher's perspective, and I understand that
> development involves many more practical issues, which are more
> challenging. I agree with you that scheduling a task after its parents
> complete is cost effective. It is the best "time" given all the
> updated info on the completion times of its parents. Also, it makes
> DAG submission easy (without dependency description) and minimizes the
> number of job instances in queues.

The main reasoning was that it can be dealt with efficiently and that
planning the whole workflow buys us little in a (very) dynamic
environment in which submitting a job one minute later may mean the
difference between 1 minute of queue time and one hour of queue time
(though that's statistically a rare occurrence).

> The concern is that at this time, the task still needs to be
> submitted to a queue and wait. This may not be sufficient for workflows
> with deadlines, where some guarantee on response time is
> necessary.

You need some SLA/QoS to address that. Guessing the average queue time
does not reduce its variation, and hence does not reduce the risk of not
finishing by the time promised. You can use replication (i.e. racing
competing jobs) to reduce that variation (assuming that it follows some
reasonable distribution), but I don't see how there could be a guarantee.
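
To illustrate the replication point, here is a rough sketch, not anything Swift
actually runs. It assumes queue times are independent and exponentially
distributed with a made-up mean of 600 seconds; the function name and all
parameters are hypothetical. Racing k replicas means the effective queue time
is the minimum of k draws, which shrinks both the mean and the spread, but the
tail never goes away, so there is still no guarantee.

```python
import random
import statistics


def simulate_queue_times(replicas, trials=10000, mean_queue=600.0, seed=0):
    """Return (mean, stdev) of the effective queue time when `replicas`
    copies of a job race and the first one to leave the queue wins.

    Assumption: queue times are i.i.d. exponential with mean `mean_queue`
    seconds. Real queue-time distributions may look quite different.
    """
    rng = random.Random(seed)
    samples = [
        min(rng.expovariate(1.0 / mean_queue) for _ in range(replicas))
        for _ in range(trials)
    ]
    return statistics.mean(samples), statistics.pstdev(samples)


single = simulate_queue_times(1)
triple = simulate_queue_times(3)
print("1 replica : mean=%.0fs stdev=%.0fs" % single)
print("3 replicas: mean=%.0fs stdev=%.0fs" % triple)
```

Under this assumed distribution the spread with three replicas is noticeably
smaller than with one, at the cost of occupying more queue slots.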

>  The same applies for other remaining tasks in the workflow.
> 
> I feel that, besides offline planning, runtime adaptation is necessary
> considering task duration variation (overrun) and faults. But the
> number of updates should be kept to a minimum and only for the very near
> future as the workflow proceeds. I am writing a paper on this and
> hopefully I can share it with you guys in a few weeks. This implies
> that the Swift code could be submitted a little bit more eagerly with
> a short-sighted look ahead.

I remember somebody mentioning (or having implemented) a similar scheme.
If we have two dependent jobs A and B (B depends on A), in Swift the
total time would go something like:
Qa + Ra + Qb + Rb (where Qx is the queuing time and Rx the run time of job x)

But there's also the possibility of submitting B earlier, by the average
queue time or less, and then having it wait until A produces its results.
But then again, that's pretty much what glide-ins/coasters do.
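
A back-of-the-envelope sketch of the two timelines, for what it's worth. The
function names, the `lead` parameter and the numbers are made up for
illustration, and it assumes we somehow know when A will finish and that B can
sit on its slot idle until A's output shows up.

```python
def serial_makespan(qa, ra, qb, rb):
    """B is submitted only after A completes: Qa + Ra + Qb + Rb."""
    return qa + ra + qb + rb


def eager_makespan(qa, ra, qb, rb, lead):
    """B is submitted `lead` seconds before A is expected to finish.

    Assumption: A's completion time is known in advance, and B may hold
    its slot idle while waiting for A's results.
    """
    a_done = qa + ra                    # time at which A's results exist
    b_submit = max(0.0, a_done - lead)  # cannot submit before time zero
    b_out_of_queue = b_submit + qb      # B reaches the front of its queue
    b_start = max(a_done, b_out_of_queue)  # B still has to wait for A
    return b_start + rb


# Hypothetical numbers, in seconds.
qa, ra, qb, rb = 300, 600, 300, 600
print(serial_makespan(qa, ra, qb, rb))           # 1800
print(eager_makespan(qa, ra, qb, rb, lead=300))  # 1500
```

With a lead equal to B's queue time, Qb is hidden entirely behind A. A larger
lead buys nothing further and just leaves B idling on a node.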

Mihael



