[Swift-devel] MapReduce, doubts

Michael Wilde wilde at mcs.anl.gov
Sun Aug 28 14:33:57 CDT 2011



----- Original Message -----
> From: "Yadu Nand" <yadudoc1729 at gmail.com>
...
> Okay, I will read more on this. Do you mean to say that we currently
> can tune/tweak the scheduler to pick optimal sites ?

Yes, in the sense that one can enhance the code to do this :)

> Google's MapReduce waits till all map jobs are complete. They list
> some reasons for choosing this over running reduce in parallel.

My understanding was wrong - thanks for the correction.

...
> Again as mentioned earlier we can use a *combiner* at each site
> to pre-reduce the intermediates to lessen the bandwidth needs if
> required. (provided the functions are associative and commutative)
> The combiner is usually the same as the reducer function, but run
> locally.

Sounds like a good approach. We might need a new primitive, "foreachsite", which would operate on all members of a collection cached at a site, and do so over all sites that hold members of the collection. That would be a "funny" operator in the sense that it's based on a physical aspect of the implementation and on the state of a run, unlike the rest of the language, which has no physical site connections. But it seems useful and pragmatic, and to my mind worth at least exploring.
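
A rough sketch of the combiner idea, in Python rather than Swift, since "foreachsite" does not exist yet. The site names, the placement of inputs, and the functions map_words, combine and run are all made up for illustration; word counting stands in for any reduce function that is associative and commutative.

```python
from collections import Counter
from functools import reduce


def map_words(line):
    """Map step: emit the word counts for one input line."""
    return Counter(line.split())


def combine(counters):
    """Combiner and reducer: merge counts.

    Counter addition is associative and commutative, so partial
    results can be merged in any order and in any grouping.
    """
    return reduce(lambda a, b: a + b, counters, Counter())


# Hypothetical placement of input lines on sites (assumption for the sketch).
site_inputs = {
    "siteA": ["a b a", "b c"],
    "siteB": ["a c c"],
}


def run(site_inputs):
    # Emulates "foreachsite": pre-reduce the map outputs held at each site.
    per_site = {
        site: combine(map_words(line) for line in lines)
        for site, lines in site_inputs.items()
    }
    # Only one partial result per site has to move to the final reduce.
    return combine(per_site.values()), per_site


total, per_site = run(site_inputs)
print(total)
```

The final reduce sees one intermediate per site instead of one per map job, which is where the bandwidth saving would come from.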

...
> I haven't read the paper yet. With execution.retries, lazy.errors
> don't we have the required behavior ? Which is, if a job fails retry
> a limited number of times and if there is no progress ignore the job.
> I think replication.enabled can also be useful here. MapReduce uses a
> similar idea of spawning multiple-redundant jobs to handle cases where
> jobs run too slowly. Can we expect similar behavior here as well ?

I think so, at least to a first approximation.
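
To make the behavior in question concrete, here is a small Python model of "retry a limited number of times, then ignore the job and carry on". It is only an illustration of the semantics described above, not how Swift implements execution.retries or lazy.errors; run_with_retries, run_all and the sample jobs are hypothetical, and treating the retry count as extra attempts after the first is an assumption.

```python
def run_with_retries(job, retries):
    """Run job up to 1 + retries times; return (ok, value_or_last_error)."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return True, job()
        except Exception as exc:  # any failure counts as one used attempt
            last_error = exc
    return False, last_error


def run_all(jobs, retries=2):
    """Run every job; record failures instead of stopping at the first one."""
    results, failed = {}, {}
    for name, job in jobs.items():
        ok, value = run_with_retries(job, retries)
        (results if ok else failed)[name] = value
    return results, failed


def make_flaky(failures, value):
    """Build a job that fails a fixed number of times, then succeeds."""
    state = {"left": failures}

    def job():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("transient failure")
        return value

    return job


def always_fails():
    raise RuntimeError("permanent failure")


jobs = {"ok": lambda: 1, "flaky": make_flaky(2, 2), "bad": always_fails}
results, failed = run_all(jobs, retries=2)
print(sorted(results), sorted(failed))
```

Replication of slow jobs is not modeled here; that would need concurrent copies of a job and taking whichever finishes first.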

- Mike


