[Swift-devel] Re: replication vs site score

Qin Zheng qinz at ihpc.a-star.edu.sg
Tue Apr 7 04:07:20 CDT 2009


Prof Foster, thanks for introducing me to the team.

My research interest is in scheduling workflows (DAGs). Ben, we decided not to use resubmission because a DAG cannot complete when any of its tasks fails, so each failure would trigger a resubmission/retry of the DAG. Instead, we provide fault tolerance by pre-scheduling a replica (backup) for each task (see enclosure for details). The objective is to guarantee that the DAG completes before its deadline, in a preplanned manner with fast failover to the backup upon failure.

Currently I am also working on workflow scheduling under uncertainty in task running times. This work includes prioritizing tasks based on the impact that variation in their running times has on the overall response time, offline planning for high-priority tasks, and runtime adaptation for all tasks once up-to-date information is available.
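
As a rough illustration of what ranking tasks by the impact of running-time variation could look like, here is a minimal sketch. It is not the method from the enclosure: it assumes "impact" means the growth in the DAG's longest-path length when a single task runs longer by a given amount, and the names `makespan`, `rank_by_impact`, `durations`, `preds` and `variation` are made up for this example.

```python
def makespan(durations, preds):
    """Length of the longest path through the DAG (its response time
    when every task starts as soon as its predecessors finish)."""
    finish = {}

    def finish_time(task):
        # Memoised recursion over predecessors; assumes the graph is acyclic.
        if task not in finish:
            start = max((finish_time(p) for p in preds.get(task, ())), default=0.0)
            finish[task] = start + durations[task]
        return finish[task]

    return max(finish_time(t) for t in durations)


def rank_by_impact(durations, preds, variation):
    """Return (tasks sorted by impact, impact per task).

    variation[task] is a hypothetical worst-case extra running time.
    Impact is how much the makespan grows if only that task is stretched.
    """
    base = makespan(durations, preds)
    impact = {}
    for task, delta in variation.items():
        stretched = {**durations, task: durations[task] + delta}
        impact[task] = makespan(stretched, preds) - base
    ranking = sorted(impact, key=lambda t: impact[t], reverse=True)
    return ranking, impact
```

Under these assumptions, a task that sits off the critical path with enough slack gets an impact of zero even if its own running time varies a lot, which is one reason to plan offline only for the high-impact tasks.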

I am looking forward to talking to you guys and learning about your research!

Regards,
Qin Zheng
________________________________
From: Ian Foster [mailto:foster at anl.gov]
Sent: Monday, April 06, 2009 10:46 PM
To: Ben Clifford
Cc: swift-devel; Qin Zheng
Subject: Re: [Swift-devel] Re: replication vs site score

Ben:

You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing jobs that enable new jobs to run. Those ideas seem relevant here.

I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who has been working on the scheduling of replicant jobs. His interest is in doing this for jobs that have failed, while I think your interest is in scheduling for jobs that may have failed--a somewhat different thing. But there may be a connection.

Ian.


On Apr 6, 2009, at 9:39 AM, Ben Clifford wrote:


even more rambling... in the context of a scheduler that is doing things
like prioritising jobs based on more than the order that Swift happened to
submit them (hopefully I will have a student for this in the summer), I
think a replicant job should be pushed toward later execution rather than
earlier execution to reduce the number of replicant jobs in the system at
any one time.

This is because I suspect (though I have gathered no numerical evidence)
that given the choice between submitting a fresh job and a replicant job
(making up terminology here too... mmm), it is almost always better to
submit the fresh job. Either we end up submitting the replicant job
eventually (in which case we are no worse off than if we submitted the
replicant first and then a fresh job); or by delaying the replicant job we
give that replicant's original a chance to start running and thus do not
discard our precious time-and-load-dollars that we have already spent on
queueing that replicant's original.
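
A minimal sketch of the ordering argued for above, assuming a scheduler that holds jobs in its own queue before submitting them. This is not Swift code; `ReplicaAwareQueue`, `push`, `original_started` and `pop` are names invented for the illustration. Fresh jobs always leave the queue before replicant jobs, and a replicant whose original starts running in the meantime is dropped without ever being submitted.

```python
import heapq
import itertools


class ReplicaAwareQueue:
    """Toy submit queue: fresh jobs are released before replicant jobs."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: keeps arrival order
        self._cancelled = set()

    def push(self, job_id, is_replicant=False):
        # Tuples compare element by element and False < True, so every
        # fresh job sorts ahead of every replicant job.
        heapq.heappush(self._heap, (is_replicant, next(self._order), job_id))

    def original_started(self, replicant_id):
        # The original began running while its replicant was still waiting,
        # so the replicant is no longer needed.
        self._cancelled.add(replicant_id)

    def pop(self):
        """Return the next job id to submit, or None if nothing is left."""
        while self._heap:
            _, _, job_id = heapq.heappop(self._heap)
            if job_id in self._cancelled:
                self._cancelled.discard(job_id)
                continue  # skip replicants that were made redundant
            return job_id
        return None
```

The sketch only captures the "fresh before replicant" rule and the cancellation case; how long a replicant should wait in practice, and how that interacts with site scores, is left open here just as it is in the message.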

--

_______________________________________________
Swift-devel mailing list
Swift-devel at ci.uchicago.edu
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel


-------------- next part --------------
A non-text attachment was scrubbed...
Name: Fault Tolerance_TC_Mar09.pdf
Type: application/pdf
Size: 2142133 bytes
Desc: Fault Tolerance_TC_Mar09.pdf
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20090407/0371cacf/attachment.pdf>

