[Swift-devel] MapReduce, doubts

Jonathan Monette jonmon at mcs.anl.gov
Sun Aug 28 13:37:33 CDT 2011


I'll add it to the cookbook, because I am not sure which mappers this works for and which options work. I only know that it works for those two mappers, so testing is needed to see exactly what works. Is the cookbook in asciidoc, or is it the one on the CI Swift wiki? 

----- Reply message -----
From: "Michael Wilde" <wilde at mcs.anl.gov>
Date: Sun, Aug 28, 2011 12:11 pm
Subject: [Swift-devel] MapReduce, doubts
To: "Jonathan Monette" <jonmon at mcs.anl.gov>
Cc: "Yadu Nand" <yadudoc1729 at gmail.com>, "swift-devel" <swift-devel at ci.uchicago.edu>


Jon, yes, that's right. Can you add the info below to the User Guide, or, if time is pressing, to the cookbook?

Thanks,

- Mike

----- Original Message -----
> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Yadu Nand" <yadudoc1729 at gmail.com>, "swift-devel" <swift-devel at ci.uchicago.edu>
> Sent: Sunday, August 28, 2011 11:32:44 AM
> Subject: Re: [Swift-devel] MapReduce, doubts
> On Aug 28, 2011, at 10:15 AM, Michael Wilde wrote:
> 
> > ----- Original Message -----
> >> From: "Yadu Nand" <yadudoc1729 at gmail.com>
> >> To: "swift-devel" <swift-devel at ci.uchicago.edu>, "Justin M Wozniak"
> >> <wozniak at mcs.anl.gov>, "Mihael Hategan"
> >> <hategan at mcs.anl.gov>, "Michael Wilde" <wilde at mcs.anl.gov>
> >> Sent: Sunday, August 28, 2011 8:03:29 AM
> >> Subject: MapReduce, doubts
> >> Hi,
> >>
> >> I was going through some materials ([1], [2] , [3]) to understand
> >> Google's MapReduce system and I have a couple of queries :
> >>
> >> 1. How do we address the issue of data locality?
> >> When we run a map job, it is a priority to run it so that the least
> >> network overhead is incurred, preferably on the same system holding
> >> the data (or the nearest one; I don't know exactly how this works).
> >
> > Currently, we don't. We have discussed a new feature to do this (it's
> > listed as a GSoC project, and I can probably locate a discussion with
> > a 2010 GSoC candidate in which I detailed a possible strategy).
> >
> > We can currently implement a similar scheme using an external mapper
> > to select input files from multiple sites and map them to gsiftp
> > URIs. Then an enhancement to the scheduler could select a site based
> > on the URI of some or all of the input files.
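> >
> > As a rough, untested sketch (the script name, hosts, and paths below are only
> > placeholders): an external mapper is just a program that prints field/path
> > pairs, so, if I have the ext output format right, it could pick a replica per
> > file and emit gsiftp URIs, e.g.:
> >
> > $ ./pick_replicas.sh
> > [0] gsiftp://gridftp.siteA.example//data/part0.txt
> > [1] gsiftp://gridftp.siteB.example//data/part1.txt
> >
> > file inputs[] <ext; exec="pick_replicas.sh">;
> >
> > The scheduler enhancement would then rank candidate sites by looking at the
> > hosts in those URIs.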
> 
> In my work with GOSwift I found that mappers will follow a gsiftp URI
> path.
> 
> For a single file:
> 
> file input1
>     <"gsiftp://gridftp.pads.ci.uchicago.edu//gpfs/pads/swift/jonmon/data/input1.txt">;
> 
> The above maps the file input1.txt that resides on PADS.
> 
> For a group of files:
> 
> file inputs[] <filesys_mapper;
>     location="gsiftp://gridftp.pads.ci.uchicago.edu//gpfs/pads/swift/jonmon/data",
>     suffix=".txt">;
> 
> The above maps all files with a ".txt" extension in the data directory
> on PADS.
> 
> I think this is what you were talking about having the external mapper
> do.
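> 
> For completeness, a rough and untested sketch of consuming those mapped files
> (the "myapp" executable and the output names are only placeholders):
> 
> type file;
> 
> app (file o) process (file i)
> {
>     myapp @i stdout=@o;    // "myapp" stands in for the real executable
> }
> 
> file inputs[] <filesys_mapper;
>     location="gsiftp://gridftp.pads.ci.uchicago.edu//gpfs/pads/swift/jonmon/data",
>     suffix=".txt">;
> file outputs[] <simple_mapper; prefix="out.">;
> 
> foreach f, i in inputs {
>     outputs[i] = process(f);
> }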
> 
> >
> >> 2. Is it possible to somehow force the reduce tasks to wait until
> >> all map jobs are done?
> >
> > Isn't that just normal Swift semantics? If we coded a simple-minded
> > reduce job whose input was the array of outputs from the map()
> > stage, the reduce (assuming it's an app function) would wait for all
> > the map() ops to finish, right?
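> >
> > Something like this (an untested sketch; mymap/myreduce and the file names are
> > only placeholders) gets that barrier for free, because reduceStep() cannot
> > start until every element of mapped[] exists:
> >
> > type file;
> >
> > app (file o) mapStep (file i)        { mymap @i stdout=@o; }
> > app (file o) reduceStep (file ins[]) { myreduce @filenames(ins) stdout=@o; }
> >
> > file inputs[] <filesys_mapper; location="data", suffix=".txt">;
> > file mapped[] <simple_mapper; prefix="map.">;
> > file result   <"result.txt">;
> >
> > foreach f, i in inputs {
> >     mapped[i] = mapStep(f);
> > }
> >
> > result = reduceStep(mapped);    // implicit barrier: waits for all of mapped[]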
> >
> > I would ask instead "do we want to?". Do the distributed reduce ops
> > in map-reduce really wait? Doesn't MR do distributed reduction in
> > batches, asynchronously with respect to the completion of the map()
> > operations? Isn't this a key property made possible by the name/value
> > pair-based nature of the MR data model? I thought MR reduce ops take
> > place at any location, on inputs of any chunk size, in a tree-based
> > manner, and that this is possible because the reduction operator is
> > associative (and typically commutative), so partial results can be
> > combined in any order.
> >
> >> MapReduce uses a mechanism that permits reduce to run only after
> >> all the map jobs have finished executing. I'm not entirely sure why
> >> this is a requirement, but it has its own issues, such as a single
> >> slow mapper. This is usually tackled by the main controller noticing
> >> the slow one and running extra instances of the map job to get
> >> results faster. Does Swift at some level use the concept of a
> >> central controller? How do we tackle this?
> >>
> >> 3. How does Swift handle failures? Is there a facility for
> >> re-execution?
> >
> > Yes, Swift retries failing app invocations, as controlled by the
> > properties execution.retries and lazy.errors. You can read about
> > these in the User Guide and in the properties file.
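> >
> > For reference, these are set in swift.properties; the values below are only
> > examples:
> >
> > # swift.properties (example values)
> > execution.retries=2    # re-submit a failing app invocation up to 2 more times
> > lazy.errors=true       # keep running independent work after a failure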
> >
> >> Is this documented somewhere? Do we use any filesystem that
> >> handles the loss of a particular file or input set?
> >
> > No, we don't, but some of this would come with using a
> > replication-based model for the input dataset, where the mapper could
> > supply a list of possible inputs instead of one, and the scheduler
> > could pick a replica each time it selects a site for a (retried)
> > job.
> >
> > Also, we might think of a "forMostOf" statement, which could
> > implement semantics suitable for runs in which you don't need every
> > single map() to complete. That is, the target array could be
> > considered closed when "most of" (TBD) the input collection has been
> > processed. The forMostOf() could complete when it enters the "tail"
> > of the loop (see Tim Armstrong's paper on the tail phenomenon).
> 
> I think this has been discussed before. In the Montage application
> there is a step where I map more filenames than will be created, so I
> don't need all the maps to complete for the workflow to keep
> progressing. I made a workaround, but I think this "forMostOf" feature
> would be useful. I will locate the thread in which Mihael and I had
> this discussion.
> >
> >> I'm stopping here; there are more questions nagging me, but it's
> >> probably best not to blurt them out all at once :)
> >
> > I think you are hitting the right issues here, and I encourage you
> > to keep pushing towards something that you could readily experiment
> > with. This is exactly where we need to go to provide a convenient
> > method for expressing map-reduce as an elegant high-level script.
> >
> > I also encourage you to read about what Ed Walker did for map-reduce
> > in his parallel shell.
> >
> > - Mike
> >
> >> [1] http://code.google.com/edu/parallel/mapreduce-tutorial.html
> >> [2] http://www.youtube.com/watch?v=-vD6PUdf3Js
> >> [3]
> >> http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html
> >> --
> >> Thanks and Regards,
> >> Yadu Nand B
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory
