[Swift-devel] Re: replication vs site score

Qin Zheng qinz at ihpc.a-star.edu.sg
Wed Apr 8 03:03:34 CDT 2009


Dear Ben,

Thanks for your detailed reply; it helps me understand scheduling in Swift better.

I wrote from a researcher's perspective, and I understand that development involves many more practical issues, which are also more challenging. I agree with you that scheduling a task after its parents complete is cost effective: it is the best time to decide, given up-to-date information on the completion times of its parents. It also makes DAG submission easy (no dependency description is needed) and minimizes the number of job instances in queues. The concern is that at this point the task still has to be submitted to a queue and wait there. This may not be sufficient for workflows with deadlines, where some guarantee on response time is necessary. The same applies to the remaining tasks in the workflow.

I feel that, besides offline planning, runtime adaptation is necessary to handle variation in task duration (overruns) and faults. But the number of updates should be kept to a minimum and limited to the very near future as the workflow proceeds. I am writing a paper on this and hope to share it with you guys in a few weeks. This implies that the Swift code could submit tasks a little more eagerly, with a short-sighted look-ahead.


Yes, your points on the differences are valid: the replica in my case is used for fault tolerance (FT), while in Swift it can enable a task to run earlier (by submitting a replica to a short queue). You mentioned queue time; could you share more about it, for example how accurate it is, and also about the planned change to some other job prioritization for coasters?

I will be on a star cruise to Malaysia in a few hours :). If I cannot access email there, I will reply to you guys on Friday when I return to Singapore.

Qin Zheng

-----Original Message-----
From: Ben Clifford [mailto:benc at hawaga.org.uk]
Sent: Tuesday, April 07, 2009 7:31 PM
To: Qin Zheng
Cc: Ian Foster; swift-devel
Subject: RE: [Swift-devel] Re: replication vs site score


Hi.

Most/all of the work that we've done with Swift works with fairly
opportunistic use of resources - we submit work into job queues on one or
more sites, where those job queues are shared with many other users, and
where the runtimes for both our jobs and other users' jobs are not well
defined ahead of time.

So whilst we use the word 'scheduling' sometimes in Swift, it's more a case
of "what do we think is the best site to queue a job on right now?" rather
than making an execution plan that we think will be valid for a long
period of time.

Our replication mechanism sounds fairly similar to your pre-scheduled
backups, but I think there are these important differences:

  * we don't launch a replica until we think there is a reasonable chance
that the replica will run instead of the original (based on queue time)

  * as soon as one of the jobs *starts* running, we cancel all the others.
from what I understand, you do that when one of the jobs *ends*
successfully.
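
The two points above can be sketched in a few lines of Python. This is only an illustration, not Swift's actual code: the names (JobInstance, maybe_replicate, on_start), the fixed wait threshold, and the pick_site callback are all assumptions made for the example. In particular, "reasonable chance that the replica will run instead of the original" is approximated here by "the original has been queued longer than a threshold".

```python
from dataclasses import dataclass, field


@dataclass
class JobInstance:
    site: str
    state: str = "queued"      # "queued", "running", or "cancelled"
    queued_for: float = 0.0    # seconds spent waiting in the site queue


@dataclass
class Task:
    name: str
    instances: list = field(default_factory=list)


def maybe_replicate(task, threshold, pick_site):
    """Launch a replica only when no instance is running and the longest
    queue wait exceeds `threshold` (a hypothetical tuning parameter)."""
    if any(i.state == "running" for i in task.instances):
        return None
    queued = [i for i in task.instances if i.state == "queued"]
    if queued and max(i.queued_for for i in queued) > threshold:
        replica = JobInstance(site=pick_site(task))
        task.instances.append(replica)
        return replica
    return None


def on_start(task, started):
    """As soon as one instance *starts*, cancel every other queued one."""
    started.state = "running"
    for inst in task.instances:
        if inst is not started and inst.state == "queued":
            inst.state = "cancelled"
```

A pre-scheduled backup scheme would instead cancel the siblings in a handler that fires when an instance *ends* successfully, so the backup stays available for failover while the primary runs.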

We do have one situation where we have some pre-allocation of resources,
and that is when coasters are being used. These use the above
opportunistic queuing methods to acquire a worker node for a long period
of time, and then run Swift-level jobs in there, at present on a
first-come, first-served basis. It's likely that we'll change that to have
some other job prioritisation, but still pre-scheduling the jobs.

Where Swift would have trouble working with an ahead-of-time
planner/scheduler is that the module that generates file transfer and
execution tasks from high level SwiftScripts does not submit a dependent
task for scheduling and execution until its predecessors have been
successfully executed.

What the scheduler sees is a stream, over time, of file transfer and
execution tasks that are safe to run immediately.
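
As a rough illustration of that behaviour (not the real Swift module), the Python sketch below releases a task only after all of its predecessors have run, so the consumer never sees a dependency description. The function name ready_stream, the deps mapping, and the execute callback are assumptions made for the example, and tasks are run one at a time for simplicity.

```python
def ready_stream(deps, execute):
    """Yield tasks in the order they become safe to run.

    `deps` maps each task name to the set of its predecessors.
    `execute` stands in for the scheduler: it is handed one task name
    at a time and is never told about dependencies.
    """
    done = set()
    pending = dict(deps)
    while pending:
        # A task is released only once all its predecessors have finished.
        ready = sorted(t for t, pre in pending.items() if pre <= done)
        if not ready:
            raise ValueError("cycle or unknown predecessor in task graph")
        for task in ready:
            del pending[task]
            execute(task)
            done.add(task)
            yield task


if __name__ == "__main__":
    graph = {"stage_in": set(), "sim": {"stage_in"}, "stage_out": {"sim"}}
    for name in ready_stream(graph, lambda t: None):
        print("released", name)
```

A pre-planner would need the whole `deps` mapping up front; here it stays private to the generator, which is the difficulty described above.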

It might be easy, or it might be hard, to make the Swift code submit more
eagerly, with description of task dependencies, which would allow you to
plug in a pre-planner underneath.


On Tue, 7 Apr 2009, Qin Zheng wrote:

> Prof Foster, thanks for introducing me to the team.
>
> My research interest is on scheduling workflows (DAGs). Ben, we decided
> not to use resubmission, on the grounds that a DAG cannot be
> completed when any of its tasks fails, and each failure would trigger
> resubmission/retry of the DAG. Instead, we provide fault tolerance by
> pre-scheduling a replica (backup) for each task (see enclosure for
> details). The objective is to guarantee that this DAG can be completed
> (in a preplanned manner with fast failover to the backup upon failure)
> before its deadline.
>
> Currently I am also working on workflow scheduling under uncertainties
> of task running times. This work includes prioritizing tasks based on the
> impact of the variation of their running times on the overall response time,
> offline planning for high-priority tasks, and runtime
> adaptation for all tasks once up-to-date information is available.
>
> I am looking forward to talking to you guys and learning about your research!
>
> Regards,
> Qin Zheng
> ________________________________
> From: Ian Foster [mailto:foster at anl.gov]
> Sent: Monday, April 06, 2009 10:46 PM
> To: Ben Clifford
> Cc: swift-devel; Qin Zheng
> Subject: Re: [Swift-devel] Re: replication vs site score
>
> Ben:
>
> You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing jobs that enable new jobs to run. Those ideas seem relevant here.
>
> I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who has been working on the scheduling of replicant jobs. His interest is in doing this for jobs that have failed, while I think your interest is in scheduling for jobs that may have failed--a somewhat different thing. But there may be a connection.
>
> Ian.
>
>
> On Apr 6, 2009, at 9:39 AM, Ben Clifford wrote:
>
>
> even more rambling... in the context of a scheduler that is doing things
> like prioritising jobs based on more than the order that Swift happened to
> submit them (hopefully I will have a student for this in the summer), I
> think a replicant job should be pushed toward later execution rather than
> earlier execution to reduce the number of replicant jobs in the system at
> any one time.
>
> This is because I suspect (though I have gathered no numerical evidence)
> that given the choice between submitting a fresh job and a replicant job
> (making up terminology here too... mmm), it is almost always better to
> submit the fresh job. Either we end up submitting the replicant job
> eventually (in which case we are no worse off than if we submitted the
> replicant first and then a fresh job); or by delaying the replicant job we
> give that replicant's original a chance to start running and thus do not
> discard our precious time-and-load-dollars that we have already spent on
> queueing that replicant's original.
>
> --
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
>
> ________________________________
> This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. Thank you.
>
