[Swift-devel] Re: replication vs site score

Ben Clifford benc at hawaga.org.uk
Mon Apr 6 10:00:08 CDT 2009


> You may recall the work that was done by Greg Maleciwz (sp?) on prioritizing
> jobs that enable new jobs to run. Those ideas seem relevant here.

yes, it's ongoing thoughts based on that work that led me to thinking about 
this - more generally, what are the useful things to prioritise work on 
(both at the Swift level - a SwiftScript procedure call - and at the lower 
level of file transfers and remote job submissions)?

> I met last week with a smart fellow in Singapore, Qin Zheng (CCed here), who
> has been working on the scheduling of replicant jobs. His interest is in doing
> this for jobs that have failed, while I think your interest is in scheduling
> for jobs that may have failed--a somewhat different thing. But there may be a
> connection.

Replicated jobs are jobs that the remote job submission system (e.g. GRAM) 
says are in a queue but that we think we can probably run better 
(i.e. quicker, or even run at all) by resubmitting; when doing that, we 
don't cancel the original job, and it may well be that original job 
that runs, not the replica. Sometimes that is because the remote queue is 
"infinitely long" (the site is taking jobs and losing them); sometimes it's 
because it is "very long" (e.g. teraport's 14 day queue when my laptop has a 
local CPU free and no queue).
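
To make the replication idea concrete, here is a rough sketch in Python. It 
is not Swift's actual implementation: the names (Submission, Task, 
replicate_if_stuck, max_queue_wait, pick_site) are all made up for 
illustration, and the "waited too long in the queue" test is only one 
plausible trigger.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    # One attempt to run a task at a site (hypothetical model).
    site: str
    submitted_at: float
    state: str = "queued"  # "queued", "running" or "done"


@dataclass
class Task:
    # A Swift-level job together with every submission made for it.
    name: str
    submissions: list = field(default_factory=list)


def replicate_if_stuck(task, now, max_queue_wait, pick_site):
    """Add a replica if all submissions are still queued after max_queue_wait.

    The existing submissions are NOT cancelled, so the original may still
    turn out to be the one that runs.
    """
    if not task.submissions:
        return None
    if any(s.state != "queued" for s in task.submissions):
        return None  # something is already running or finished
    newest = max(s.submitted_at for s in task.submissions)
    if now - newest < max_queue_wait:
        return None  # not queued for long enough to be suspicious
    replica = Submission(site=pick_site(task), submitted_at=now)
    task.submissions.append(replica)  # original stays in place
    return replica
```

Calling this periodically for each queued task would add a replica for a 
task stuck behind a very long queue while leaving the original submission 
alone.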

What you describe in your paragraph above sounds more like Swift's retry 
mechanism - when a Swift-level job (SwiftScript procedure call) fails, we 
submit it again, basically using the same mechanism as with replicated jobs. 
However, in that case, the original job does not exist any more.
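
For contrast, retry can be sketched as below. Again this is only an 
illustration under assumptions, not Swift's code: run_with_retries, submit 
and max_retries are hypothetical names, and a RuntimeError stands in for 
"the job failed". The point is that a new submission is only made after the 
previous one has failed, so at most one attempt exists at any time.

```python
def run_with_retries(submit, max_retries):
    """Call submit() until it succeeds or the retry budget is used up.

    Returns (result, attempts). Unlike replication, each new attempt is
    made only after the previous one has failed and gone away.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return submit(), attempts
        except RuntimeError:
            # Hypothetical failure signal; give up once retries are exhausted.
            if attempts > max_retries:
                raise
```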
