[Swift-devel] Re: replication vs site score
Qin Zheng
qinz at ihpc.a-star.edu.sg
Sat Apr 11 10:10:49 CDT 2009
Hi,
I read the paper and found that (a) yes, it is a bound on future queuing time; (b) it gives only an upper bound on the 0.95 quantile (with 95% confidence). To me, a bound should be relatively tight to the actual wait time to be useful when deciding where to queue, while on the other hand it is much safer to set a higher bound to guarantee a certain confidence level. Mihael, the case you mentioned is very likely, as seen from their Fig. 1: for the majority of actual wait times (in black), which are near 0 units, the bounds (in red) can be 50 thousand units or more. Such a prediction still counts as a success (not a failure), but it is not what I want as an expected queuing time for my work.
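As a rough illustration of point (b), here is a minimal sketch of one standard non-parametric way to get an upper confidence bound on a quantile from past wait times, using order statistics and the binomial distribution. It is not taken from the paper's implementation. The function names and the sample data are hypothetical, and the sketch assumes the observed wait times are independent draws from a single continuous distribution.

```python
import math
import statistics


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def quantile_upper_bound(waits, q=0.95, confidence=0.95):
    """Smallest observed wait that bounds the q-quantile from above
    with at least the given confidence, or None if the history is too short.

    The k-th smallest observation is at or above the true q-quantile
    whenever at most k-1 observations fall below that quantile, which
    happens with probability P(Binomial(n, q) <= k-1).
    """
    xs = sorted(waits)
    n = len(xs)
    for k in range(1, n + 1):
        if binom_cdf(k - 1, n, q) >= confidence:
            return xs[k - 1]
    return None  # not enough samples for this q and confidence


# Hypothetical history: most jobs start at once, a few wait very long.
skewed = [0] * 55 + [50000] * 4
print("median wait:", statistics.median(skewed))
print("bound on 0.95 quantile:", quantile_upper_bound(skewed))
```

With this skewed sample the median wait is 0 while the bound is 50000, the same kind of gap as in Fig. 1: the bound holds, but it says little about the typical wait. Under these assumptions at least 59 observations are needed before any bound exists for q = 0.95 at 95% confidence.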
Let me know if you guys have thoughts on the above.
Qin Zheng
________________________________
From: Qin Zheng
Sent: Saturday, April 11, 2009 12:48 AM
To: iraicu at cs.uchicago.edu; Ben Clifford
Cc: Mihael Hategan; swift-devel; Ian Foster
Subject: RE: [Swift-devel] Re: replication vs site score
Dear all,
I am coming from the angle of application requirements in an SLA (such as enterprise applications or, in an extreme case, disaster recovery). Reservation can give some idea of response time (let's talk separately about failures and inaccurate execution time estimates), while queuing time prediction can give some probability of (the mean and upper bound of) the expected start time. Knowing the queuing time is important according to feedback from users of our in-house supercomputers, whereas they can bear errors in their execution time estimates. However, even for a single task, only dynamically queuing it (or a number of its replicas) does not provide time-related information (as has been mentioned).
Ioan, I was also thinking along the lines of queue time estimation, which may be sufficient for what I am doing now. I considered reservation (so no queuing time) in my previous fault tolerance work because of its strict timing sequence requirement. I will read the paper soon to clarify a few points, especially the two points made by Mihael. To me, a prediction is only useful if it can tell (a) a queuing time that is not valid only for the current state and does not change immediately when new jobs are queued; (b) a mean and an upper bound on queuing time, or, if only the upper bound is given, one that is tight in some sense (at most 20 minutes for the 2-minute job example). Finally, when a node can fail, it can also affect the jobs queuing for it, and this paper briefly mentions detecting such failures using queuing time data.
I will share my findings regarding queuing time with you guys soon.
Cheers,
Qin Zheng
________________________________
From: Ioan Raicu [iraicu at cs.uchicago.edu]
Sent: Thursday, April 09, 2009 4:38 AM
To: Ben Clifford
Cc: Mihael Hategan; Qin Zheng; swift-devel; Ian Foster
Subject: Re: [Swift-devel] Re: replication vs site score
Does a batch-queue prediction service help things in any way?
https://portal.teragrid.org/gridsphere/gridsphere?cid=queue-prediction
I've always wondered how the Swift scheduler would behave differently if it had statistical information about queue times. Qin, have you compared your job replication strategy with one that is cognizant of the expected queue wait time, in order to meet deadlines? On the surface, assuming that the batch queue prediction is accurate, it would seem that scheduling with known queue times might solve the same deadline-cognizant scheduling problem, but without wasting resources on unnecessary replication. Where queue prediction doesn't help is when a bad node causes an application to be slow or to fail. In this case, replication is probably the better recourse to guarantee meeting deadlines.
Here is their latest paper on this: http://www.springerlink.com/content/7552901360631246/fulltext.pdf. The system is deployed on the TeraGrid, and has been for a few years now. As far as I have heard, it is quite robust and accurate.
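The trade-off between queue-time-aware scheduling and replication can be made concrete with a small sketch. It assumes, hypothetically, that a prediction service can supply for each site an estimated probability that a job submitted now meets its deadline, and that sites behave independently. The function name, site names, and numbers are made up for illustration and do not come from Swift or the TeraGrid service.

```python
def sites_to_submit(p_meet, target=0.95):
    """Pick the fewest sites so that at least one copy of the job
    meets the deadline with probability >= target.

    p_meet maps a site name to the estimated probability that a job
    submitted there now meets its deadline (hypothetical input).
    Sites are assumed independent, so the chance that every copy
    misses is the product of the per-site miss probabilities.
    """
    chosen = []
    p_all_miss = 1.0
    # Try the most promising sites first.
    for site, p in sorted(p_meet.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(site)
        p_all_miss *= 1.0 - p
        if 1.0 - p_all_miss >= target:
            return chosen
    return None  # target not reachable even when using every site


# One site is predicted to be good enough: no replication needed.
print(sites_to_submit({"siteA": 0.97, "siteB": 0.60}))

# No single site is good enough: replicate on two sites.
print(sites_to_submit({"siteA": 0.80, "siteB": 0.80, "siteC": 0.50}))
```

When an accurate prediction shows a single site is sufficient, the job is submitted once and no resources are spent on replicas. Replication is used only when no single site reaches the target. A bad node is not captured by queue predictions at all, which is the case where extra replicas remain the safer choice.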
Cheers,
Ioan
Ben Clifford wrote:
> On Wed, 8 Apr 2009, Mihael Hategan wrote:
>
> This:
>
> > planning the whole workflow buys us little in a (very) dynamic
> > environment in which submitting a job one minute later may mean the
> > difference between 1 minute of queue time and one hour of queue time
>
> and this:
>
> > You need some SLA/QOS to address that.
>
> seem to be significant characteristics that make the environments we run
> on not amenable to scheduling in the traditional sense. The lack of any
> meaningful guarantees about almost anything time-related makes everything
> basically opportunistic rather than scheduled.
--
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================