<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><br><div><div>On Aug 28, 2011, at 1:55 PM, Yadu Nand wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div><blockquote type="cite"><blockquote type="cite">I was going through some materials ([1], [2] , [3]) to understand<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Google's MapReduce system and I have a couple of queries :<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">1. How do we address the issue of data locality ?<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">When we run a map job, it is a priority to run it such that least<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">network overhead is incurred, so preferably on the same system<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">holding the data (or one which is nearest , I don't know how this<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">works).<br></blockquote></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Currently, we dont. We have discussed a new feature to do this (its listed as a GSoC project and I can probably locate a discussion with a 2010 GSoC candidate in which I detailed a possible strategy).<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">We can current implement a similar scheme using an external mapper to select input files from multiple sites and map them to gsiftp URIs.  Then an enhancement in the scheduler could select a site based on the URI of some or all or the input files.<br></blockquote><br>Okay, I will read more on this. Do you mean to say that we currently<br>can tune/tweak the scheduler to pick optimal sites ?<br></div></blockquote><div><br></div>I think he was saying that the scheduler can choose where to run the app based on where the data it needs is mapped.  If you say that data is mapped on PADS using the GSIURL for PADS then the scheduler will give preference to run on PADS.  So saying:<br><br></div><div>file input1 <“<a href="gsiftp://gridftp.pads.ci.uchicago.edu//gpfs/pads/swift/jonmon/data/input1.txt%E2%80%9D">gsiftp://gridftp.pads.ci.uchicago.edu//gpfs/pads/swift/jonmon/data/input1.txt”</a>>;</div><div>output1 = cat(input1)</div><div><br></div><div>would have the app defined for cat run on PADS since the input1 is mapped to PADS.</div><div><blockquote type="cite"><div><br><blockquote type="cite"><blockquote type="cite">2. Is it possible to somehow force the reduce tasks to wait till all<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">map jobs are done ?<br></blockquote></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Isn't that just normal swift semantics? If we coded a simple-minded reduce job whose input was the array of outputs from the map() stage, the reduce (assuming its an app function) would wait for all the map() ops to finish, right?<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">I would ask instead "do we want to?". Do the distributed reduce ops in map-reduce really wait? Doesn't MR do distributed reduction in batches, asynchronously to the completion of the map() operations? Isnt this a key property that is made possible by the name/value pair-based nature of the MR data model?  I thought MR reduce ops take place at any location, in any input chunk size, in a tree-based manner, and that this is possible because the reduction operator is "distributed" in the mathematical sense.<br></blockquote><br>Google's MapReduce waits till all map jobs are complete. They list<br>some reasons for choosing this over running reduce in parallel.<br>* Difficulty when a site fails (both mappers and reducers will need<br>to restart and will need to remember states. This adds unnecessary<br>complexity)<br>* In the end, its CPU cycles we are intelligently dealing with. We<br>could just use it for map and then start the reduce stage.<br>* In the lecture ([2]) it is stated that keeping reduce towards the end<br>led to lesser bandwidth usage.<br><br>Again as mentioned earlier we can use a *combiner* at each site<br>to pre-reduce the intermediates to lessen the bandwidth needs if<br>required. (provided the functions are associative and commutative)<br>The combiner is usually the same as the reducer function, but run<br>locally.<br><br><blockquote type="cite"><blockquote type="cite">3. How does swift handle failures ? Is there a facility for<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">re-execution ?<br></blockquote></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Yes, Swift retries failing app invocations as controlled by the properties execution.retries and lazy.errors. You can read on these in the users guide and in the properties file.<br></blockquote>Great, I went through the user-guide pages on Swift properties. I see<br>the relication.enabled option as well. With this I think a lot of plus points<br>of MapReduce will be covered :)<br><br><blockquote type="cite">No, we dont, but some of this would come with using a replication-based model for the input dataset where the mapper could supply a list of possible inputs instead of one, and the scheduler could pick a replica each time it selects a site for a (retried) job.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Also, we might think of a "forMostOf" statement which could implement semantics that would be suitable for runs in which you dont need every single map() to complete. I.e. the target array can be considered closed when "most of" (tbd) the input collection had been processed. The formost() could complete when it enters the "tail" of the loop (see Tim Armstrong's paper on the tail phenomenon).<br></blockquote><blockquote type="cite"><br></blockquote>I haven't read the paper yet. With execution.retries, lazy.errors don't<br>we have the required behavior ? Which is, if a job fails retry a limited<br>number of times and if there is no progress ignore the job. I think<br>replication.enabled can also be useful here. MapReduce uses a similar<br>idea of spawning multiple-redundant jobs to handle cases where jobs<br>run too slowly.  Can we expect similar behavior here as well ?<br><br><blockquote type="cite"><blockquote type="cite">I'm stopping here, there are more questions nagging me, but its<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">probably best to not blurt it out all at once :)<br></blockquote></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">I think you are hitting the right issues here, and I encourage you to keep pushing towards something that you could readily experiment with.  This si exactly where we need to go to provide a convenient method for expressing map-reduce as an elegant high-level script.<br></blockquote><blockquote type="cite">I also encourage you to read on what Ed Walker did for map-reduce in his parallel shell.<br></blockquote><br>Okay, I will read this paper as well and post. Thanks :)<br><br>-- <br>Thanks and Regards,<br>Yadu Nand B<br>_______________________________________________<br>Swift-devel mailing list<br><a href="mailto:Swift-devel@ci.uchicago.edu">Swift-devel@ci.uchicago.edu</a><br>https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel<br></div></blockquote></div><br></body></html>