[Swift-devel] Re: scheduling

Wed Mar 31 12:50:00 CDT 2010

On Wed, 2010-03-31 at 12:25 -0500, Michael Wilde wrote:
> Yeah, I agree with you on keeping scheduling decisions out of mappers.
> 
> But mappers could/should, I think, still be involved, if only to the
> extent that the name space they would work on in a replica-based
> environment would be the logical namespace of some (abstract) replica
> catalog (which is itself another mapping).

We've been relying on a property that greatly simplifies reasoning about
swift: local consistency (i.e. data in swift is organized as it would be
on a single filesystem, so running with all swift data on a single
machine always works).

That means that with a data catalog there would be three sets:
1. mapped atomic data (i.e. variables; e.g. a, f[1], etc.)
2. local/logical files (e.g. a.txt, f0001.txt, etc.)
3. grid layer (e.g. (gsiftp://site1/a.txt, gsiftp://site2/a.txt),
(gsiftp://site3/f0001.txt), etc.)

They are isomorphic. That is, for each element in one set there is
exactly one corresponding element in each of the other two sets:
a <-> a.txt <-> (gsiftp://site1/a.txt, gsiftp://site2/a.txt)

Here each element of set 3 is a tuple of physical files (the replicas).

The exact morphisms that go from one to the other are:
1 <-> 2: mappers (i.e. a mapper defines what each piece of mapped swift
data corresponds to in terms of logical files, as well as what each
logical file corresponds to in terms of swift data).
2 <-> 3: replica catalog (I'm using that term loosely).

Currently our replica catalog is a special form of the identity mapping
(a.txt <-> (a.txt)).
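
A minimal sketch of the three sets and the two mappings between them, as
they stand with the identity catalog. All names here (mapper,
identity_catalog, resolve) are hypothetical illustrations, not swift's
actual API:

```python
# Hypothetical names throughout; this is not swift's real API.

# 1 <-> 2: a mapper, modelled as a dict from variables to logical files.
mapper = {"a": "a.txt", "f[1]": "f0001.txt"}


def identity_catalog(logical_file):
    """2 -> 3: wrap each logical file in a 1-tuple of 'physical' files."""
    return (logical_file,)


def identity_catalog_inverse(replicas):
    """3 -> 2: unwrap the 1-tuple to recover the logical file."""
    (logical_file,) = replicas
    return logical_file


def resolve(variable):
    """1 -> 3: compose the mapper with the catalog."""
    return identity_catalog(mapper[variable])
```

Because both steps are invertible, going from a variable to its tuple
and back loses nothing, which is the isomorphism described above.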

We need to:
1. extend 2 <-> 3 to be more general (i.e. allow tuples with more than
one element in set 3)
2. pass elements from set 3 to the scheduler
3. make the scheduler decide sites based on elements from set 3.
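
The three steps above could be sketched roughly as follows. This is only
an illustration under assumptions: the catalog contents, the function
names, and the "pick the site holding the most inputs" rule are all
hypothetical, and a real scheduler would weigh other factors too.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical generalized catalog (2 -> 3): logical file -> replica tuple.
replica_catalog = {
    "a.txt": ("gsiftp://site1/a.txt", "gsiftp://site2/a.txt"),
    "f0001.txt": ("gsiftp://site3/f0001.txt",),
}


def lookup(logical_file):
    """Step 1: return every replica; uncatalogued files keep the identity mapping."""
    return replica_catalog.get(logical_file, (logical_file,))


def site_of(physical_file):
    """Host part of a replica URL, or None for a plain local path."""
    return urlparse(physical_file).hostname


def choose_site(input_files, candidate_sites):
    """Steps 2 and 3: hand the replica tuples to the scheduler and pick a site.

    Assumed heuristic: prefer the candidate site that already holds the
    most input files; ties go to the earliest entry in candidate_sites.
    """
    held = Counter()
    for logical_file in input_files:
        sites = {site_of(replica) for replica in lookup(logical_file)}
        for site in sites:
            if site in candidate_sites:
                held[site] += 1
    return max(candidate_sites, key=lambda site: held[site])
```

For example, choose_site(["f0001.txt"], ["site1", "site3"]) selects
site3, since that is the only candidate holding a replica of the input.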

It's slightly more complex in practice because set 2 also allows things
of the form "gsiftp://site/file". But let's ignore that for now.