[Swift-devel] Comparing Swift and Hadoop

Tim Armstrong tim.g.armstrong at gmail.com
Sun May 13 13:09:48 CDT 2012


I've worked on both Swift and Hadoop implementations, and my tendency is to
say that there isn't actually any deep similarity beyond the fact that both
support distributed data processing/computation.  They make fundamentally
different assumptions about the clusters they run on and the applications
they're supporting.

Swift is mainly designed for time-shared clusters with reliable shared file
systems.  Hadoop assumes that it will be running on unreliable commodity
machines with no shared file system, and that it will run continuously on
all machines in the cluster.  Swift is designed for orchestrating existing
executables with their own file formats, so it mostly remains agnostic to
the contents of the files it is processing.  Hadoop needs to have some
understanding of the contents of the files it is processing, to be able to
segment them into records and perform key comparisons so it can do a
distributed sort, etc.  It provides its own file formats (including
compression, serialization, etc.) that users can use, although it is
extensible to custom file formats.
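
To make the "agnostic" point concrete, here is a rough sketch of what a
Swift wrapper around an existing executable looks like (the app name,
executable, and file names below are made up): Swift only stages the files
in and out and never interprets their contents.

  type file;

  // Wrap an existing binary.  "analyze" is a hypothetical entry in
  // tc.data pointing at the real executable; Swift just stages the
  // input file in and captures stdout as the output file.
  app (file out) analyze (file inp) {
    analyze "--input" @inp stdout=@out;
  }

  file raw <"sample.dat">;
  file result <"sample.out">;
  result = analyze(raw);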

   - Hadoop implements its own distributed file system with software
   redundancy; Swift uses an existing cluster file system or node-local file
   systems.  For bulk data processing, this means Hadoop will generally be
   able to deliver more disk bandwidth, and it has a bunch of other
   implications.
   - Hadoop has a record-oriented view of the world, i.e. it is built
   around the idea that you are processing a record at a time, rather than a
   file at a time as in Swift.
   - As a result, Hadoop includes a bunch of functionality to do with file
   formats, compression, serialization, etc.; Swift is B.Y.O. file format.
   - Hadoop's distributed sort is a core part of MapReduce (and something
   that a lot of effort has gone into implementing and optimizing); Swift
   doesn't have built-in support for anything similar.
   - Swift lets you construct arbitrary dataflow graphs between tasks (see
   the sketch after this list), so in some ways it is less restrictive than
   the map-reduce pattern (although it doesn't directly support some things
   that the map-reduce pattern does, so I wouldn't say that it is strictly
   more general).
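
To illustrate that last point, here is a sketch (with made-up app and file
names) of how dataflow between tasks is expressed: Swift infers the task
graph from the file dependencies, so the two stageA calls below run in
parallel and stageB waits for both, with no map/reduce structure imposed.

  type file;

  // Hypothetical apps, each wrapping some existing executable.
  app (file o) stageA (file i) {
    stage_a @i stdout=@o;
  }
  app (file o) stageB (file x, file y) {
    stage_b @x @y stdout=@o;
  }

  file in1 <"a.dat">;
  file in2 <"b.dat">;
  file mid1 <"a.mid">;
  file mid2 <"b.mid">;
  file combined <"combined.out">;

  // The task graph follows from the data dependencies: the two
  // stageA calls are independent and run concurrently; stageB
  // starts only once both of its inputs exist.
  mid1 = stageA(in1);
  mid2 = stageA(in2);
  combined = stageB(mid1, mid2);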

I'd say that some applications might fit in both paradigms, but that
neither supports a superset of the applications that the other supports.
Performance would depend to a large extent on the application.  Swift might
actually be quicker to start up a job and dispatch tasks (Hadoop is
notoriously slow on that front), but otherwise I'd say it just depends on
the application, how you implement the application, the cluster, etc.  I'm
not sure that there is a fair comparison between the two systems since
they're just very different: most of the results would be predictable just
by looking at the design of the systems (e.g. if the application needs to
do a big distributed sort, Hadoop is much better).  If the application is
embarrassingly parallel (as it sounds like yours is), then you could
probably implement it in either, but I'm not sure that it would actually
stress the differences between the systems if data sizes are small and
runtime is mostly dominated by computation.
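
For an embarrassingly parallel Monte Carlo run like yours, the Swift script
would be roughly the following (assuming a hypothetical mc_sim executable
that takes a seed and writes its result to stdout); every iteration is
independent, so Swift dispatches them all in parallel:

  type file;

  // Hypothetical wrapper around the Monte Carlo binary; each run
  // takes a seed and writes its result to stdout.
  app (file out) mc_sim (int seed) {
    mc_sim seed stdout=@out;
  }

  // Map the output files to run_0001.out, run_0002.out, ...
  file results[] <simple_mapper; prefix="run_", suffix=".out">;

  // All iterations are independent, so they are dispatched in parallel.
  foreach i in [1:1000] {
    results[i] = mc_sim(i);
  }

Aggregating the per-run outputs would then just be one more app call at the
end, which is roughly where Hadoop's built-in sort/reduce machinery would
otherwise be doing the equivalent work.
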
I think the Cloudera Hadoop distribution is well documented and reasonably
easy to set up and run, provided that you're not on a time-shared cluster.
Apache Hadoop is more of a pain to get working.

- Tim


On Sun, May 13, 2012 at 9:27 AM, Ketan Maheshwari <
ketancmaheshwari at gmail.com> wrote:

> Hi,
>
> We are working on a project from GE Energy corporation which runs
> independent Monte Carlo simulations in order to find device reliability,
> leading to device replacement decisions across the power grid. The
> computation is repeated MC simulations done in parallel.
>
> Currently, this is running under a Hadoop setup on Cornell Redcloud and EC2
> (10 nodes). Looking at the computation, it struck me that this is a good
> Swift candidate. And since the performance numbers etc. have already been
> collected for Hadoop, it might also be nice to have a comparison between
> Swift and Hadoop.
>
> However, some reality check before diving in: has it been done before? Do
> we know how Swift fares against map-reduce? Are they even comparable? I
> have faced this question twice here: Why use Swift when you have Hadoop?
>
> I could see that Hadoop needs quite a bit of setup effort before getting
> it to run. Could we quantify usability and compare the two?
>
> Any ideas and inputs are welcome.
>
> Regards,
> --
> Ketan
>
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
>