[Swift-devel] MapReduce, doubts

Ketan Maheshwari ketancmaheshwari at gmail.com
Sun Aug 28 13:38:47 CDT 2011


Jon,

The cookbook is asciidoc located at:
http://www.ci.uchicago.edu/swift/cookbook/cookbook-asciidoc.html


On Sun, Aug 28, 2011 at 1:37 PM, Jonathan Monette <jonmon at mcs.anl.gov>wrote:

> I'll add it to the cookbook because I am not sure what mappers it works for
> and what options work. I only know that it works for those two mappers so
> testing to see exactly what works is needed. Is the cookbook in asciidoc or
> is it the ci swift wiki one?
>
> ----- Reply message -----
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> Date: Sun, Aug 28, 2011 12:11 pm
> Subject: [Swift-devel] MapReduce, doubts
> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
> Cc: "Yadu Nand" <yadudoc1729 at gmail.com>, "swift-devel" <
> swift-devel at ci.uchicago.edu>
>
>
> Jon, yes, thats right. Can you add the info below to the User Guide, or if
> time is pressing, to the cookbook?
>
> Thanks,
>
> - Mike
>
> ----- Original Message -----
> > From: "Jonathan Monette" <jonmon at mcs.anl.gov>
> > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > Cc: "Yadu Nand" <yadudoc1729 at gmail.com>, "swift-devel" <
> swift-devel at ci.uchicago.edu>
> > Sent: Sunday, August 28, 2011 11:32:44 AM
> > Subject: Re: [Swift-devel] MapReduce, doubts
> > On Aug 28, 2011, at 10:15 AM, Michael Wilde wrote:
> >
> > > ----- Original Message -----
> > >> From: "Yadu Nand" <yadudoc1729 at gmail.com>
> > >> To: "swift-devel" <swift-devel at ci.uchicago.edu>, "Justin M Wozniak"
> > >> <wozniak at mcs.anl.gov>, "Mihael Hategan"
> > >> <hategan at mcs.anl.gov>, "Michael Wilde" <wilde at mcs.anl.gov>
> > >> Sent: Sunday, August 28, 2011 8:03:29 AM
> > >> Subject: MapReduce, doubts
> > >> Hi,
> > >>
> > >> I was going through some materials ([1], [2] , [3]) to understand
> > >> Google's MapReduce system and I have a couple of queries :
> > >>
> > >> 1. How do we address the issue of data locality ?
> > >> When we run a map job, it is a priority to run it such that least
> > >> network overhead is incurred, so preferably on the same system
> > >> holding the data (or one which is nearest , I don't know how this
> > >> works).
> > >
> > > Currently, we dont. We have discussed a new feature to do this (its
> > > listed as a GSoC project and I can probably locate a discussion with
> > > a 2010 GSoC candidate in which I detailed a possible strategy).
> > >
> > > We can current implement a similar scheme using an external mapper
> > > to select input files from multiple sites and map them to gsiftp
> > > URIs. Then an enhancement in the scheduler could select a site based
> > > on the URI of some or all or the input files.
> >
> > In my work with GOSwift I found that mappers will follow the GSIURI
> > path.
> >
> > For a single file:
> > file input1
> > <“gsiftp://
> gridftp.pads.ci.uchicago.edu//gpfs/pads/swift/jonmon/data/input1.txt”>;
> >
> > The above will map the file input1.txt that resides on PADS.
> >
> > For a group of files:
> > file inputs[] <filesys_mapper; location =
> > "gsiftp://gridftp.pads.ci.uchicago.edu//gpfs/pads/swift/jonmon/data”,
> > suffix=“.txt”>;
> >
> > The above will map all files with a ".txt" extension in the directory
> > data on PADS.
> >
> > I think this is what you were talking about having the external mapper
> > do.
> >
> > >
> > >> 2. Is it possible to somehow force the reduce tasks to wait till
> > >> all
> > >> map jobs are done ?
> > >
> > > Isn't that just normal swift semantics? If we coded a simple-minded
> > > reduce job whose input was the array of outputs from the map()
> > > stage, the reduce (assuming its an app function) would wait for all
> > > the map() ops to finish, right?
> > >
> > > I would ask instead "do we want to?". Do the distributed reduce ops
> > > in map-reduce really wait? Doesn't MR do distributed reduction in
> > > batches, asynchronously to the completion of the map() operations?
> > > Isnt this a key property that is made possible by the name/value
> > > pair-based nature of the MR data model? I thought MR reduce ops take
> > > place at any location, in any input chunk size, in a tree-based
> > > manner, and that this is possible because the reduction operator is
> > > "distributed" in the mathematical sense.
> > >
> > >> The MapReduce uses a system which permits reduce to run only
> > >> after all the map jobs are done executing. I'm not entirely sure
> > >> why
> > >> this is a requirement but this has its own issues, such as a single
> > >> slow mapper. This is usually tackled by the main-controller
> > >> noticing
> > >> the slow one and running multiple instances of the map job to get
> > >> results faster. Does swift at some level use the concept of a
> > >> central
> > >> controller ? How do we tackle this ?
> > >>
> > >> 3. How does swift handle failures ? Is there a facility for
> > >> re-execution ?
> > >
> > > Yes, Swift retries failing app invocations as controlled by the
> > > properties execution.retries and lazy.errors. You can read on these
> > > in the users guide and in the properties file.
> > >
> > >> Is this documented somewhere ? Do we use any file-system that
> > >> handles loss of a particular file /input-set ?
> > >
> > > No, we dont, but some of this would come with using a
> > > replication-based model for the input dataset where the mapper could
> > > supply a list of possible inputs instead of one, and the scheduler
> > > could pick a replica each time it selects a site for a (retried)
> > > job.
> > >
> > > Also, we might think of a "forMostOf" statement which could
> > > implement semantics that would be suitable for runs in which you
> > > dont need every single map() to complete. I.e. the target array can
> > > be considered closed when "most of" (tbd) the input collection had
> > > been processed. The formost() could complete when it enters the
> > > "tail" of the loop (see Tim Armstrong's paper on the tail
> > > phenomenon).
> >
> > I think this has been discussed before. In the Montage application
> > there is a step where I map more filenames than will be created. So I
> > don't need all the maps to complete for the workflow to keep
> > progressing. I made a workaround but I think this "forMostOf" feature
> > would be useful. I will locate the thread in which Mihael and I had
> > this discussion.
> > >
> > >> I'm stopping here, there are more questions nagging me, but its
> > >> probably best to not blurt it out all at once :)
> > >
> > > I think you are hitting the right issues here, and I encourage you
> > > to keep pushing towards something that you could readily experiment
> > > with. This si exactly where we need to go to provide a convenient
> > > method for expressing map-reduce as an elegant high-level script.
> > >
> > > I also encourage you to read on what Ed Walker did for map-reduce in
> > > his parallel shell.
> > >
> > > - Mike
> > >
> > >> [1] http://code.google.com/edu/parallel/mapreduce-tutorial.html
> > >> [2] http://www.youtube.com/watch?v=-vD6PUdf3Js
> > >> [3]
> > >>
> http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html
> > >> --
> > >> Thanks and Regards,
> > >> Yadu Nand B
> > >
> > > --
> > > Michael Wilde
> > > Computation Institute, University of Chicago
> > > Mathematics and Computer Science Division
> > > Argonne National Laboratory
> > >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
>
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
>


-- 
Ketan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20110828/11863ac1/attachment.html>


More information about the Swift-devel mailing list