[Swift-devel] Comparing Swift and Hadoop

Mon May 14 17:49:22 CDT 2012

I don't quite see how Hadoop is tightly coupled - it's data parallel and
there isn't any message passing...
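
To make that concrete, here is a toy sketch in plain Python of the
map/shuffle/reduce pattern. It is only an illustration under my own
assumptions, not Hadoop's actual API: the names map_task, shuffle,
reduce_task and run_job are made up, and everything runs in one process.
The point is that each map or reduce call only sees its own input; the
only "communication" is the framework grouping and sorting the
intermediate pairs by key.

```python
from collections import defaultdict


def map_task(record):
    """Process one input record in isolation and emit (key, value) pairs."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Framework step (hypothetical stand-in): group values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_task(key, values):
    """Process all values for one key in isolation."""
    return key, sum(values)


def run_job(records):
    """Run every map task, shuffle the intermediate pairs, then run every reduce task."""
    intermediate = [pair for record in records for pair in map_task(record)]
    return [reduce_task(key, values) for key, values in shuffle(intermediate)]


if __name__ == "__main__":
    print(run_job(["swift and hadoop", "hadoop and mpi"]))
```

No task ever sends a message to another task, which is why I would call
the model data parallel rather than tightly coupled in the MPI sense.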

On Mon, May 14, 2012 at 5:15 PM, Allan Espinosa
<aespinosa at cs.uchicago.edu> wrote:

> Also, the point of the workflow systems was to make loose coupling of
> apps to be parallelized easier.  If one is thinking of investing in
> tight coupling (be it MPI or Hadoop), then it can be assumed that one
> has specific optimizations in mind for the app.
>
> Allan
>
> 2012/5/14 Tim Armstrong <tim.g.armstrong at gmail.com>:
> > To be clear, I'm not making the case that it's impossible to implement
> > things in Swift that are implemented in MapReduce, just that Swift isn't
> > well suited to them, because it wasn't designed with them in mind.  I've
> > seen the argument before that MapReduce is a particular data flow DAG,
> > and that you can express arbitrary data flow DAGs in other systems, but
> > I think that somewhat misses the point of what MapReduce is trying to
> > provide to application developers.  By treating all tasks and data
> > dependencies as equivalent, it ignores all of the runtime infrastructure
> > that MapReduce inserts into the processes, and ignores, for example,
> > some of the details of how data is moved between mappers and reducers.
> >
> > For example, a substantial amount of code in the Hadoop MapReduce code
> > base has to do with a) file formats b) compression c) checksums d)
> > serialization e) buffering input and output data and f)
> > bucketing/sorting the data.  This is all difficult to implement well and
> > important for many big data applications.  I think that scientific
> > workflow systems don't take any of these things seriously since they
> > aren't important for most canonical scientific workflow applications.
> >
> > I think one of the other big differences is that Hadoop assumes that all
> > you have is a bunch of unreliable machines on a network, so that it
> > must provide its own job scheduler and replicated distributed file
> > system.  Swift, in contrast, seems mostly designed for systems where
> > there is a reliable shared file system, and where it acquires compute
> > resources for fixed blocks of time from some existing cluster manager.
> > I know there are ways you can have Swift/Coaster/Falkon run on networks
> > of unreliable machines, but it's not quite like Hadoop's job scheduler,
> > which is designed to actually be the primary submission mechanism for a
> > multi-user cluster.
> >
> > I don't think it would make much sense to run Swift on a network of
> > unreliable machines and then to just leave your data on those machines
> > (you would normally stage the final data to some backed-up file system),
> > but it would make perfect sense for Hadoop, especially if the data is so
> > big that it's difficult to find someplace else to put it.  In contrast,
> > you can certainly stand up a Hadoop instance on a shared cluster for a
> > few hours to run your jobs, and stage data in and out of HDFS, but that
> > use case isn't what Hadoop was designed or optimized for.  Most of the
> > core developers on Hadoop are working in environments where they have
> > dedicated Hadoop clusters, where they can't afford much cluster
> > downtime, and where they need to reliably persist huge amounts of data
> > for years on unreliable hardware.  E.g. at the extreme end, this is the
> > kind of thing Hadoop developers are thinking about:
> >
> > https://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920
> >
> > - Tim
> >
> >
> >
> > On Sun, May 13, 2012 at 3:57 PM, Ioan Raicu <iraicu at cs.iit.edu> wrote:
> >>
> >> Hi Tim,
> >> I always thought of MapReduce as being a subset of workflow systems.
> >> Can you give me an example of an application that can be implemented in
> >> MapReduce, but not in a workflow system such as Swift? I can't think of
> >> any off the top of my head.
> >>
> >>
> >> Ioan
> >>
> >> --
> >> =================================================================
> >> Ioan Raicu, Ph.D.
> >> Assistant Professor
> >> =================================================================
> >> Computer Science Department
> >> Illinois Institute of Technology
> >> 10 W. 31st Street Chicago, IL 60616
> >> =================================================================
> >> Cel:   1-847-722-0876
> >> Email: iraicu at cs.iit.edu
> >> Web:   http://www.cs.iit.edu/~iraicu/
> >> =================================================================
> >> =================================================================
> >>
> >>
> >>
> >> On May 13, 2012, at 1:09 PM, Tim Armstrong <tim.g.armstrong at gmail.com>
> >> wrote:
> >>
> >> I've worked on both Swift and Hadoop implementations and my tendency is
> >> to say that there isn't actually any deep similarity beyond them both
> >> supporting distributed data processing/computation.  They both make
> >> fundamentally different assumptions about the clusters they run on and
> >> the applications they're supporting.
> >>
> >> Swift is mainly designed for time-shared clusters with reliable shared
> >> file systems.  Hadoop assumes that it will be running on unreliable
> >> commodity machines with no shared file system, and will be running
> >> continuously on all machines on the cluster.  Swift is designed for
> >> orchestrating existing executables with their own file formats, so it
> >> mostly remains agnostic to the contents of the files it is processing.
> >> Hadoop needs to have some understanding of the contents of the files it
> >> is processing, to be able to segment them into records and perform key
> >> comparisons so it can do a distributed sort, etc.  It provides its own
> >> file formats (including compression, serialization, etc.) that users
> >> can use, although it is extensible to custom file formats.
> >>
> >> - Hadoop implements its own distributed file system with software
> >>   redundancy; Swift uses an existing cluster file system or node-local
> >>   file systems.  For bulk data processing, this means Hadoop will
> >>   generally be able to deliver more disk bandwidth, and it has a bunch
> >>   of other implications.
> >> - Hadoop has a record-oriented view of the world, i.e. it is built
> >>   around the idea that you are processing a record at a time, rather
> >>   than a file at a time as in Swift.
> >> - As a result, Hadoop includes a bunch of functionality to do with file
> >>   formats, compression, serialization, etc.; Swift is B.Y.O. file
> >>   format.
> >> - Hadoop's distributed sort is a core part of MapReduce (and something
> >>   that a lot of effort has gone into implementing and optimizing); Swift
> >>   doesn't have built-in support for anything similar.
> >> - Swift lets you construct arbitrary dataflow graphs between tasks, so
> >>   in some ways it is less restrictive than the map-reduce pattern
> >>   (although it doesn't directly support some things that the map-reduce
> >>   pattern does, so I wouldn't say that it is strictly more general).
> >>
> >> I'd say that some applications might fit in both paradigms, but that
> >> neither supports a superset of the applications that the other supports.
> >> Performance would depend to a large extent on the application.  Swift
> >> might actually be quicker to start up a job and dispatch tasks (Hadoop
> >> is notoriously slow on that front), but otherwise I'd say it just
> >> depends on the application, how you implement the application, the
> >> cluster, etc.  I'm not sure that there is a fair comparison between the
> >> two systems since they're just very different: most of the results
> >> would be predictable just by looking at the design of the system (e.g.
> >> if the application needs to do a big distributed sort, Hadoop is much
> >> better).  If the application is embarrassingly parallel (as it sounds
> >> like yours is), then you could probably implement it in either, but I'm
> >> not sure that it would actually stress the differences between the
> >> systems if data sizes are small and runtime is mostly dominated by
> >> computation.
> >>
> >> I think the Cloudera Hadoop distribution is well documented and
> >> reasonably easy to set up and run, provided that you're not on a
> >> time-shared cluster.  Apache Hadoop is more of a pain to get working.
> >>
> >> - Tim
> >>
> >>
> >> On Sun, May 13, 2012 at 9:27 AM, Ketan Maheshwari
> >> <ketancmaheshwari at gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> We are working on a project from GE Energy corporation which runs
> >>> independent Monte Carlo simulations in order to find device
> >>> reliability, leading to power-grid-wise device replacement decisions.
> >>> The computation is repeated MC simulations done in parallel.
> >>>
> >>> Currently, this is running under a Hadoop setup on Cornell Redcloud
> >>> and EC2 (10 nodes).  Looking at the computation, it struck me that this
> >>> is a good Swift candidate.  And since the performance numbers etc. are
> >>> already extracted for Hadoop, it might also be nice to have a
> >>> comparison between Swift and Hadoop.
> >>>
> >>> However, some reality check before diving in: has it been done before?
> >>> Do we know how Swift fares against map-reduce? Are they even
> >>> comparable? I have faced this question twice here: Why use Swift when
> >>> you have Hadoop?
> >>>
> >>> I could see that Hadoop needs quite a bit of setup effort before
> >>> getting it to run.  Could we quantify usability and compare the two?
> >>>
> >>> Any ideas and inputs are welcome.
> >>>
> >>> Regards,
> >>> --
> >>> Ketan
>