[Swift-devel] MapReduce, doubts

Yadu Nand yadudoc1729 at gmail.com
Sun Aug 28 13:55:51 CDT 2011


>> I was going through some materials ([1], [2] , [3]) to understand
>> Google's MapReduce system and I have a couple of queries :
>>
>> 1. How do we address the issue of data locality ?
>> When we run a map job, it is a priority to run it such that least
>> network overhead is incurred, so preferably on the same system
>> holding the data (or one which is nearest , I don't know how this
>> works).
>
> Currently, we don't. We have discussed a new feature to do this (it's listed as a GSoC project, and I can probably locate a discussion with a 2010 GSoC candidate in which I detailed a possible strategy).
>
> We can currently implement a similar scheme using an external mapper to select input files from multiple sites and map them to gsiftp URIs.  Then an enhancement in the scheduler could select a site based on the URIs of some or all of the input files.

Okay, I will read more on this. Do you mean to say that we can
currently tune/tweak the scheduler to pick optimal sites?

>> 2. Is it possible to somehow force the reduce tasks to wait till all
>> map jobs are done ?
>
> Isn't that just normal Swift semantics? If we coded a simple-minded reduce job whose input was the array of outputs from the map() stage, the reduce (assuming it's an app function) would wait for all the map() ops to finish, right?
>
> I would ask instead "do we want to?". Do the distributed reduce ops in map-reduce really wait? Doesn't MR do distributed reduction in batches, asynchronously to the completion of the map() operations? Isn't this a key property that is made possible by the name/value pair-based nature of the MR data model?  I thought MR reduce ops take place at any location, in any input chunk size, in a tree-based manner, and that this is possible because the reduction operator is "distributed" in the mathematical sense.

Google's MapReduce waits till all map jobs are complete. They list
some reasons for choosing this over running reduce in parallel:
* Difficulty when a site fails: both mappers and reducers would need
to restart and remember state, which adds unnecessary complexity.
* In the end, it's CPU cycles we are dealing with intelligently; we
could just use them for the map stage and then start the reduce stage.
* In the lecture ([2]) it is stated that keeping reduce towards the end
led to lower bandwidth usage.

Again, as mentioned earlier, we can use a *combiner* at each site
to pre-reduce the intermediates and lessen the bandwidth needs if
required (provided the function is associative and commutative).
The combiner is usually the same as the reducer function, but run
locally.
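As a concrete illustration, here is a minimal Python sketch of a combiner for the usual word-count example. The setup and names are mine; only the idea of reusing the associative/commutative reduce function locally comes from the papers:

```python
# Word-count sketch: the combiner is the same associative/commutative
# function as the reducer, applied to the site-local intermediates
# before they cross the network.
from collections import Counter

def map_words(chunk):
    """Map stage: emit (word, count) pairs for one input chunk."""
    return Counter(chunk.split())

def combine(counters):
    """Merge intermediate counts; doubles as the final reducer."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total

# Two "sites", each combining its own map outputs locally, so only one
# small Counter per site is shipped to the final reduce:
site1 = combine([map_words("a b a"), map_words("b c")])
site2 = combine([map_words("c c a")])
result = combine([site1, site2])
print(dict(result))
```

Because addition of counts is associative and commutative, it makes no difference whether the merge happens locally, remotely, or in a tree, which is exactly what makes the local pre-reduction safe.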

>> 3. How does swift handle failures ? Is there a facility for
>> re-execution ?
>
> Yes, Swift retries failing app invocations as controlled by the properties execution.retries and lazy.errors. You can read about these in the users guide and in the properties file.

Great, I went through the user-guide pages on Swift properties. I see
the replication.enabled option as well. With this I think a lot of the plus
points of MapReduce will be covered :)
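For reference, a minimal swift.properties fragment enabling these might look like the following (the property names are the ones discussed above and in the users guide; the values are only illustrative):

```properties
# Retry each failing app invocation a few times before giving up
execution.retries=3
# Defer errors so the rest of the run can make progress
lazy.errors=true
# Allow redundant copies of slow jobs to be dispatched
replication.enabled=true
```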

> No, we don't, but some of this would come with using a replication-based model for the input dataset, where the mapper could supply a list of possible inputs instead of one, and the scheduler could pick a replica each time it selects a site for a (retried) job.
>
> Also, we might think of a "forMostOf" statement which could implement semantics that would be suitable for runs in which you don't need every single map() to complete. I.e., the target array can be considered closed when "most of" (tbd) the input collection has been processed. The forMostOf() could complete when it enters the "tail" of the loop (see Tim Armstrong's paper on the tail phenomenon).
>
I haven't read the paper yet. With execution.retries and lazy.errors,
don't we already have the required behavior? That is: if a job fails,
retry a limited number of times, and if there is no progress, ignore
the job. I think replication.enabled can also be useful here. MapReduce
uses a similar idea of spawning redundant jobs to handle cases where
jobs run too slowly. Can we expect similar behavior here as well?
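To make the proposed forMostOf semantics concrete, here is a hedged Python sketch rather than Swift (`for_most_of`, `map_op`, and the 80% threshold are all my own illustrative assumptions, not a proposed implementation):

```python
# Sketch of "forMostOf" semantics: treat the result array as closed once
# a threshold fraction of the map() ops has finished, instead of waiting
# out the slow tail of stragglers.
from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import time

def map_op(i):
    time.sleep(random.uniform(0.0, 0.01))   # simulated uneven task times
    return i * i

def for_most_of(inputs, fraction=0.8):
    """Collect results until `fraction` of the inputs are processed,
    then close the output array and ignore the remaining tail."""
    inputs = list(inputs)
    needed = int(len(inputs) * fraction)
    results = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(map_op, i) for i in inputs]
        for fut in as_completed(futures):
            results.append(fut.result())
            if len(results) >= needed:
                break   # "most of" done; the tail is not waited for
    return results

print(len(for_most_of(range(20))))  # 16 of 20 results; tail skipped
```

The interesting design question is the one raised above: what fraction (and what notion of "no progress") should close the array, and how that interacts with retries and replicated jobs.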

>> I'm stopping here; there are more questions nagging me, but it's
>> probably best not to blurt them out all at once :)
>
> I think you are hitting the right issues here, and I encourage you to keep pushing towards something that you could readily experiment with.  This is exactly where we need to go to provide a convenient method for expressing map-reduce as an elegant high-level script.
> I also encourage you to read about what Ed Walker did for map-reduce in his parallel shell.

Okay, I will read this paper as well and post. Thanks :)

-- 
Thanks and Regards,
Yadu Nand B


