[Swift-devel] MapReduce, doubts

Sun Aug 28 08:03:29 CDT 2011

Hi,

I was going through some materials ([1], [2], [3]) to understand
Google's MapReduce system and I have a couple of questions:

1. How do we address the issue of data locality?
When we run a map job, the priority is to place it where it incurs
the least network overhead, preferably on the machine that already
holds the data (or on the nearest one; I don't know how that is
decided).

2. Is it possible to somehow force the reduce tasks to wait until all
map jobs are done?
MapReduce permits reduce to run only after all the map jobs have
finished executing. I'm not entirely sure why this is a requirement,
but it has its own issues, such as a single slow mapper holding up
the whole job. This is usually tackled by the main controller
noticing the slow one and running multiple instances of that map job
to get results faster. Does Swift use the concept of a central
controller at some level? How do we tackle this?
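
To make question 2 concrete, here is a minimal Python sketch of the
barrier as I understand it from the tutorial. It is not Swift code and
not Google's implementation: map_words, reduce_counts and word_count
are names I made up for illustration, and a local thread pool stands
in for the cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(chunk):
    # Map step: emit one (word, 1) pair per word in this chunk.
    return [(word, 1) for word in chunk.split()]


def reduce_counts(word, counts):
    # Reduce step: sums every count that was emitted for this word.
    return word, sum(counts)


def word_count(chunks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the result iterator, so this line is the barrier:
        # nothing below runs until every map task has returned.
        mapped = list(pool.map(map_words, chunks))

    # Shuffle: group values by key across all map outputs.
    grouped = defaultdict(list)
    for pairs in mapped:
        for word, one in pairs:
            grouped[word].append(one)

    # Reduce runs only now, once per key, over the complete value lists.
    return dict(reduce_counts(w, c) for w, c in grouped.items())
```

In this sketch a single slow map_words call delays every reduce,
which is the straggler issue I mention above.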

3. How does Swift handle failures? Is there a facility for re-execution?
Is this documented somewhere? Do we use any file system that
handles the loss of a particular file or input set?

I'm stopping here. There are more questions nagging me, but it's
probably best not to blurt them all out at once :)

[1] http://code.google.com/edu/parallel/mapreduce-tutorial.html
[2] http://www.youtube.com/watch?v=-vD6PUdf3Js
[3] http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html
-- 
Thanks and Regards,
Yadu Nand B