[Swift-devel] Re: scheduling

Wed Mar 31 07:28:27 CDT 2010

To think through this question it helps to define the actors that are involved:

- input mappers, which have the opportunity to do their mapping from a replica catalog

- the site selector in Swift's scheduler, which could factor into its criteria where the data needed by an application lives (and perhaps where the output must go, or where the user prefers that it go)

- the Swift execution wrapper _swiftwrap and/or the post-execution and cache-management logic in Swift, which can influence whether and for how long a file stays in the site shared/ cache

- there may also be back-end replicators which go around replicating data objects for speed or reliability, but since these are typically asynchronous they don't affect the discussion much.

One model of replication (out of many possible) is then:

- Mappers use a replica catalog to map Swift objects to logical names

- The scheduler considers available replicas when making a site selection, and then translates logical file names to physical site-specific names

- the post-job-execution logic updates a replica catalog with the cached location of results. Some aspect of or alternative to the shared/ directory persists across script executions.

- backend (and/or in-execution) logic does some cache cleanup and/or additional replication or relocation based on criteria like space and usage and other "policy".
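
To make the model above concrete, here is a rough sketch in Python (not actual Swift code). Everything in it is hypothetical: the names ReplicaCatalog, select_site, resolve and record_output, the in-memory catalog, the "most inputs already on site" rule, and the /<site>/shared/ path layout are assumptions for illustration only, not how Swift behaves today.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaCatalog:
    """Hypothetical in-memory catalog: logical name -> {site: physical path}."""
    entries: dict = field(default_factory=dict)

    def replicas(self, lfn):
        # All known replicas of a logical file name, keyed by site.
        return self.entries.get(lfn, {})

    def register(self, lfn, site, path):
        self.entries.setdefault(lfn, {})[site] = path


def select_site(catalog, sites, inputs):
    """Pick the site already holding the most inputs (ties go to the first site)."""
    def local_count(site):
        return sum(1 for lfn in inputs if site in catalog.replicas(lfn))
    return max(sites, key=local_count)


def resolve(catalog, site, lfn):
    """Translate a logical name to a site-specific name; None means stage-in is needed."""
    return catalog.replicas(lfn).get(site)


def record_output(catalog, site, lfn):
    """Post-job step: register the cached result so later runs can find it."""
    catalog.register(lfn, site, f"/{site}/shared/{lfn}")


catalog = ReplicaCatalog()
catalog.register("mydata.txt", "siteB", "/siteB/shared/mydata.txt")

site = select_site(catalog, ["siteA", "siteB"], ["mydata.txt"])
record_output(catalog, site, "result.txt")
print(site, resolve(catalog, site, "result.txt"))
```

A real version would have to weigh data locality against load and the other criteria the scheduler already uses, and the catalog would need to persist across script executions; the cleanup and relocation "policy" step is left out entirely.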

One thing that puzzles me is whether the mappers could or should do some level of site selection, but it's not clear they could get access to the info they need. That would reduce the impact on the scheduler code, but is perhaps not the ideal place to make the decision.

I suspect there are many complicating factors that would affect this analysis, and I'm not sure I got all the details of current behavior right, but this is the way I've been thinking about the problem to date.

It seems like a reasonable GSoC experiment.

- Mike

----- "Ben Clifford" <benc at hawaga.org.uk> wrote:

> there was (and I think still is) a blurriness about what swift's data
> replication handling should be.
> 
> for example, should it be for swift to keep track of where it has placed
> data on sites? (Here is "mydata.txt" locally - process it and put the
> result back locally and if you happen to do some persistent replication
> that you can use later, fine)
> 
> should it be for arbitrary actors (perhaps swift, perhaps other people)
> placing replicas of data in arbitrary locations on the network, not
> necessarily attached to 'sites' (in the sites.xml sense): (my data is
> stored at http://a.com/repo/foo.txt and also at
> http://b.com/guests/monkey/tree - I don't mind which you pull to an
> execution site as I am asserting that they are equal).
> 
> Those cases are similar but different and somehow I get the feeling that
> there is not a clearly defined consensus.
> 
> -- 
> http://www.hawaga.org.uk/ben/
> 

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory