[Swift-devel] Comparing Swift and Hadoop

Allan Espinosa aespinosa at cs.uchicago.edu
Mon May 14 17:15:42 CDT 2012


Also, the point of workflow systems is to make it easier to parallelize
loosely-coupled apps.  If one is thinking of investing in tight coupling
(be it MPI or Hadoop), then it can be assumed that you have specific
optimizations in mind for your app.

Allan

2012/5/14 Tim Armstrong <tim.g.armstrong at gmail.com>:
> To be clear, I'm not making the case that it's impossible to implement
> things in Swift that are implemented in MapReduce, just that Swift isn't
> well suited to them, because it wasn't designed with them in mind.  I've
> seen the argument before that MapReduce is a particular data flow DAG, and
> that you can express arbitrary data flow DAGs in other systems, but I think
> that somewhat misses the point of what MapReduce is trying to provide to
> application developers.  Treating all tasks and data dependencies as
> equivalent ignores all of the runtime infrastructure that MapReduce
> inserts into the process, including, for example, the details of how
> data is moved between mappers and reducers.
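>
> To make this concrete, here is a minimal word-count-style sketch against
> the standard org.apache.hadoop.mapreduce API (the class names are my
> own).  Note that nothing in user code touches the shuffle: the buffering,
> sorting and movement of map output to the reducers is exactly the runtime
> infrastructure I mean.
>
>   import java.io.IOException;
>   import org.apache.hadoop.io.IntWritable;
>   import org.apache.hadoop.io.LongWritable;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.mapreduce.Mapper;
>   import org.apache.hadoop.mapreduce.Reducer;
>
>   public class WordCountSketch {
>     public static class TokenMapper
>         extends Mapper<LongWritable, Text, Text, IntWritable> {
>       private static final IntWritable ONE = new IntWritable(1);
>       private final Text word = new Text();
>       @Override
>       protected void map(LongWritable key, Text value, Context ctx)
>           throws IOException, InterruptedException {
>         for (String tok : value.toString().split("\\s+")) {
>           word.set(tok);
>           ctx.write(word, ONE); // handed to the framework's shuffle
>         }
>       }
>     }
>
>     public static class SumReducer
>         extends Reducer<Text, IntWritable, Text, IntWritable> {
>       @Override
>       protected void reduce(Text key, Iterable<IntWritable> values,
>           Context ctx) throws IOException, InterruptedException {
>         int sum = 0;
>         for (IntWritable v : values) sum += v.get(); // grouped by key for us
>         ctx.write(key, new IntWritable(sum));
>       }
>     }
>   }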
>
> For example, a substantial amount of code in the Hadoop MapReduce code base
> has to do with a) file formats, b) compression, c) checksums, d)
> serialization, e) buffering input and output data, and f)
> bucketing/sorting the data.  This is all difficult to implement well and
> important for many big data applications.  I think that scientific
> workflow systems don't take any of these things seriously, since they
> aren't important for most canonical scientific workflow applications.
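>
> For instance, most of (a)-(f) show up as framework-provided knobs when
> configuring a job.  A sketch, reusing the mapper and reducer from the
> previous sketch (input and output paths come from the command line):
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.IntWritable;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.io.compress.GzipCodec;
>   import org.apache.hadoop.mapreduce.Job;
>   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>   import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
>
>   public class JobSetupSketch {
>     public static void main(String[] args) throws Exception {
>       Job job = Job.getInstance(new Configuration(), "wordcount");
>       job.setJarByClass(JobSetupSketch.class);
>       job.setMapperClass(WordCountSketch.TokenMapper.class);
>       job.setReducerClass(WordCountSketch.SumReducer.class);
>       job.setOutputKeyClass(Text.class);
>       job.setOutputValueClass(IntWritable.class);
>       // (a) file format and (b) compression are framework concerns:
>       job.setOutputFormatClass(SequenceFileOutputFormat.class);
>       FileOutputFormat.setCompressOutput(job, true);
>       FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>       // (c) checksums and (d) serialization happen transparently in
>       // HDFS and the Writable machinery.
>       FileInputFormat.addInputPath(job, new Path(args[0]));
>       FileOutputFormat.setOutputPath(job, new Path(args[1]));
>       System.exit(job.waitForCompletion(true) ? 0 : 1);
>     }
>   }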
>
> I think one of the other big differences is that Hadoop assumes that all you
> have are a bunch of unreliable machines on a network, so it must provide
> its own job scheduler and replicated distributed file system.  Swift, in
> contrast, seems mostly designed for systems where there is a reliable
> shared file system, and where it acquires compute resources for fixed
> blocks of time from some existing cluster manager.  I know there are
> ways you can have Swift/Coaster/Falkon run on networks of unreliable
> machines, but it's not quite like Hadoop's job scheduler which is designed
> to actually be the primary submission mechanism for a multi-user cluster.
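>
> (This difference shows up even at the client level: with no shared file
> system, all data access goes through the HDFS client API, roughly like
> the sketch below, where the namenode address is made up:)
>
>   import java.io.BufferedReader;
>   import java.io.InputStreamReader;
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>
>   public class HdfsReadSketch {
>     public static void main(String[] args) throws Exception {
>       Configuration conf = new Configuration();
>       conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
>       try (FileSystem fs = FileSystem.get(conf);
>            BufferedReader in = new BufferedReader(
>                new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
>         // blocks behind this stream are replicated across datanodes
>         // (dfs.replication, 3 by default), which is how HDFS survives
>         // the loss of individual machines
>         System.out.println(in.readLine());
>       }
>     }
>   }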
>
> I don't think it would make much sense to run Swift on a network of
> unreliable machines and then to just leave your data on those machines (you
> would normally stage the final data to some backed-up file system), but it
> would make perfect sense for Hadoop, especially if the data is so big that
> it's difficult to find someplace else to put it.  In contrast, you can
> certainly stand up a Hadoop instance on a shared cluster for a few hours to
> run your jobs, and stage data in and out of HDFS, but that use case isn't
> what Hadoop was designed or optimized for.  Most of the core Hadoop
> developers work in environments with dedicated Hadoop clusters,
> where they can't afford much cluster downtime and where they need to
> reliably persist huge amounts of data for years on unreliable hardware.
> E.g. at the extreme end, this is the kind of thing Hadoop developers are
> thinking about:
> https://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920
>
> - Tim
>
>
>
> On Sun, May 13, 2012 at 3:57 PM, Ioan Raicu <iraicu at cs.iit.edu> wrote:
>>
>> Hi Tim,
>> I always thought of MapReduce as a subset of workflow systems. Can you
>> give me an example of an application that can be implemented in MapReduce,
>> but not a workflow system such as Swift? I can't think of any off the top of
>> my head.
>>
>>
>> Ioan
>>
>> --
>> =================================================================
>> Ioan Raicu, Ph.D.
>> Assistant Professor
>> =================================================================
>> Computer Science Department
>> Illinois Institute of Technology
>> 10 W. 31st Street Chicago, IL 60616
>> =================================================================
>> Cel:   1-847-722-0876
>> Email: iraicu at cs.iit.edu
>> Web:   http://www.cs.iit.edu/~iraicu/
>> =================================================================
>> =================================================================
>>
>>
>>
>> On May 13, 2012, at 1:09 PM, Tim Armstrong <tim.g.armstrong at gmail.com>
>> wrote:
>>
>> I've worked on both Swift and Hadoop implementations and my tendency is to
>> say that there isn't actually any deep similarity beyond them both
>> supporting distributed data processing/computation.  They both make
>> fundamentally different assumptions about the clusters they run on and the
>> applications they're supporting.
>>
>> Swift is mainly designed for time-shared clusters with reliable shared
>> file systems. Hadoop assumes that it will be running on unreliable commodity
>> machines with no shared file system, and will be running continuously on all
>> machines on the cluster.  Swift is designed for orchestrating existing
>> executables with their own file formats, so it mostly remains agnostic to the
>> contents of the files it is processing.  Hadoop needs to have some
>> understanding of the contents of the files it is processing, to be able to
>> segment them into records and perform key comparisons so it can do a
>> distributed sort, etc.  It provides its own file formats (including
>> compression, serialization, etc.) that users can use, although it is
>> extensible to custom file formats.
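>>
>> For instance, a custom key type has to implement Hadoop's
>> WritableComparable interface so that the framework can serialize records
>> and compare keys during the sort.  A minimal sketch (the field is
>> hypothetical):
>>
>>   import java.io.DataInput;
>>   import java.io.DataOutput;
>>   import java.io.IOException;
>>   import org.apache.hadoop.io.WritableComparable;
>>
>>   public class DeviceKey implements WritableComparable<DeviceKey> {
>>     private long deviceId; // hypothetical record key
>>
>>     @Override public void write(DataOutput out) throws IOException {
>>       out.writeLong(deviceId);          // serialization
>>     }
>>     @Override public void readFields(DataInput in) throws IOException {
>>       deviceId = in.readLong();         // deserialization
>>     }
>>     @Override public int compareTo(DeviceKey o) {
>>       return Long.compare(deviceId, o.deviceId); // used by the sort
>>     }
>>   }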
>>
>> A few other key differences:
>>
>> - Hadoop implements its own distributed file system with software
>> redundancy; Swift uses an existing cluster filesystem or node-local file
>> systems.  For bulk data processing, this means Hadoop will generally be
>> able to deliver more disk bandwidth, and it has a bunch of other
>> implications.
>> - Hadoop has a record-oriented view of the world, i.e. it is built
>> around the idea that you are processing a record at a time, rather than
>> a file at a time as in Swift (sketched after this list).
>> - As a result, Hadoop includes a bunch of functionality to do with file
>> formats, compression, serialization, etc.; Swift is B.Y.O. file format.
>> - Hadoop's distributed sort is a core part of the MapReduce model (and
>> something that a lot of effort has gone into implementing and
>> optimizing); Swift doesn't have built-in support for anything similar.
>> - Swift lets you construct arbitrary dataflow graphs between tasks, so
>> in some ways it is less restrictive than the map-reduce pattern
>> (although it doesn't directly support some things that the map-reduce
>> pattern does, so I wouldn't say that it is strictly more general).
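>>
>> By contrast, the file-at-a-time model needs none of that record
>> machinery.  A rough sketch of its shape (the executable name and input
>> directory are made up): the "workflow" layer just launches an opaque
>> program per file and never interprets the bytes, which is why Swift can
>> stay format-agnostic.
>>
>>   import java.io.File;
>>   import java.io.IOException;
>>
>>   public class FileAtATimeSketch {
>>     public static void main(String[] args) throws Exception {
>>       for (File input : new File("input_dir").listFiles()) {
>>         // each task is independent, so a real system would dispatch
>>         // these in parallel; the file contents stay opaque throughout
>>         Process p = new ProcessBuilder("./my_app", input.getPath())
>>             .redirectOutput(new File(input.getName() + ".out"))
>>             .start();
>>         if (p.waitFor() != 0)
>>           throw new IOException("task failed: " + input);
>>       }
>>     }
>>   }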
>>
>> I'd say that some applications might fit in both paradigms, but that
>> neither supports a superset of the applications that the other supports.
>> Performance would depend to a large extent on the application.  Swift might
>> actually be quicker to start up a job and dispatch tasks (Hadoop is
>> notoriously slow on that front), but otherwise I'd say it just depends on
>> the application, how you implement the application, the cluster, etc. I'm
>> not sure that there is a fair comparison between the two systems since
>> they're just very different: most of the results would be predictable just
>> be looking at the design of the system (e.g. if the application needs to do
>> a big distributed sort, Hadoop is much better) .  If the application is
>> embarrassingly parallel (like it sounds like your application is), then you
>> could probably implement it in either, but I'm not sure that it would
>> actually stress the differences between the systems if data sizes are small
>> and runtime is mostly dominated by computation.
>>
>> I think the Cloudera Hadoop distribution is well documented and
>> reasonably easy to set up and run, provided that you're not on a
>> time-shared cluster.  Apache Hadoop is more of a pain to get working.
>>
>> - Tim
>>
>>
>> On Sun, May 13, 2012 at 9:27 AM, Ketan Maheshwari
>> <ketancmaheshwari at gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> We are working on a project from GE Energy which runs independent
>>> Monte Carlo simulations to estimate device reliability, which in turn
>>> drives grid-wide device replacement decisions.  The computation is
>>> repeated MC simulations run in parallel.
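>>>
>>> (For reference, the parallel structure is as simple as the following
>>> plain-Java sketch; the simulation body, failure rate and trial count
>>> are placeholders:)
>>>
>>>   import java.util.ArrayList;
>>>   import java.util.List;
>>>   import java.util.SplittableRandom;
>>>   import java.util.concurrent.ExecutorService;
>>>   import java.util.concurrent.Executors;
>>>   import java.util.concurrent.Future;
>>>
>>>   public class MonteCarloSketch {
>>>     // placeholder for one independent simulation of device failure
>>>     static double trial(long seed) {
>>>       SplittableRandom rng = new SplittableRandom(seed);
>>>       return rng.nextDouble() < 0.02 ? 1.0 : 0.0; // made-up rate
>>>     }
>>>
>>>     public static void main(String[] args) throws Exception {
>>>       int trials = 100_000;
>>>       ExecutorService pool = Executors.newFixedThreadPool(
>>>           Runtime.getRuntime().availableProcessors());
>>>       List<Future<Double>> results = new ArrayList<>();
>>>       for (long i = 0; i < trials; i++) {
>>>         final long seed = i;
>>>         results.add(pool.submit(() -> trial(seed))); // no shared state
>>>       }
>>>       double failures = 0;
>>>       for (Future<Double> f : results) failures += f.get();
>>>       pool.shutdown();
>>>       System.out.printf("estimated failure rate: %.4f%n",
>>>           failures / trials);
>>>     }
>>>   }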
>>>
>>> Currently, this is running under a Hadoop setup on Cornell Redcloud
>>> and EC2 (10 nodes).  Looking at the computation, it struck me that this
>>> is a good candidate for Swift.  And since performance numbers etc. have
>>> already been collected for Hadoop, it might also be nice to have a
>>> comparison between Swift and Hadoop.
>>>
>>> However, a reality check before diving in: has this been done before?
>>> Do we know how Swift fares against map-reduce?  Are they even
>>> comparable?  I have faced this question twice here: why use Swift when
>>> you have Hadoop?
>>>
>>> I can see that Hadoop needs quite a bit of setup effort before it will
>>> run.  Could we quantify usability and compare the two?
>>>
>>> Any ideas and inputs are welcome.
>>>
>>> Regards,
>>> --
>>> Ketan


