I don't quite see how Hadoop is tightly coupled - it's data parallel and there isn't any message passing...<br><br><div class="gmail_quote">On Mon, May 14, 2012 at 5:15 PM, Allan Espinosa <span dir="ltr"><<a href="mailto:aespinosa@cs.uchicago.edu" target="_blank">aespinosa@cs.uchicago.edu</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Also, the point of the workflow systems was to make loose-coupling of<br>

apps to be parallelized easier.  If one is thinking of investing in<br>

tight-coupling (be it MPI or Hadoop), then it can be assumed that you<br>

have specific optimizations in mind for your app.<br>

<br>

Allan<br>

<br>

2012/5/14 Tim Armstrong <<a href="mailto:tim.g.armstrong@gmail.com">tim.g.armstrong@gmail.com</a>>:<br>

<div class="HOEnZb"><div class="h5">> To be clear, I'm not making the case that it's impossible to implement<br>

> things in Swift that are implemented in MapReduce, just that Swift isn't<br>

> well suited to them, because it wasn't designed with them in mind.  I've<br>

> seen the argument before that MapReduce is a particular data flow DAG, and<br>

> that you can express arbitrary data flow DAGs in other systems, but I think<br>

> that somewhat misses the point of what MapReduce is trying to provide to<br>

> application developers.  By treating all tasks and data dependencies as<br>

> equivalent, it ignores all of the runtime infrastructure that MapReduce<br>

> inserts into the processes, and ignores, for example, some of the details of<br>

> how data is moved between mappers and reducers.<br>

><br>

> For example, a substantial amount of code in the Hadoop MapReduce code base<br>

> has to do with a) file formats b) compression c) checksums d) serialization<br>

> e) buffering input and output data and f) bucketing/sorting the data.  This<br>

> is all difficult to implement well and important for many big data<br>

> applications.  I think that scientific workflow systems don't take any of<br>

> these things seriously since they aren't important for most canonical<br>

> scientific workflow applications.<br>
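
To make the bucketing/sorting point concrete, here is a minimal single-process sketch in Python of the map, shuffle, and reduce phases. It is an illustration only, not Hadoop's implementation: the function names (map_phase, shuffle, reduce_phase, word_count) and the use of CRC32 for bucketing are assumptions made for this example, and it leaves out the file formats, compression, checksums, serialization, and buffering listed above.<br>

<pre>
```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
import zlib


def map_phase(records, mapper):
    """Apply the mapper to every input record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs, num_reducers):
    """Bucket pairs by a stable hash of the key, then sort each bucket by key."""
    buckets = defaultdict(list)
    for key, value in pairs:
        # Assumption: CRC32 stands in for a partitioner so that bucketing is
        # repeatable across runs (Python's built-in hash() of str is salted).
        bucket = zlib.crc32(key.encode("utf-8")) % num_reducers
        buckets[bucket].append((key, value))
    return {b: sorted(kv, key=itemgetter(0)) for b, kv in buckets.items()}


def reduce_phase(buckets, reducer):
    """Group each sorted bucket by key and apply the reducer to each group."""
    results = {}
    for bucket in buckets.values():
        for key, group in groupby(bucket, key=itemgetter(0)):
            results[key] = reducer(key, [value for _, value in group])
    return results


def word_count(lines, num_reducers=3):
    """Toy word count wired through the three phases above."""
    def mapper(line):
        return [(word, 1) for word in line.split()]

    def reducer(key, values):
        return sum(values)

    pairs = map_phase(lines, mapper)
    return reduce_phase(shuffle(pairs, num_reducers), reducer)


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```
</pre>

Even in this toy form, the shuffle step is where keys get bucketed and sorted; a generic task graph would leave that work to the application.<br>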

><br>

> I think one of the other big differences is that Hadoop assumes that all you<br>

> have are a bunch of unreliable machines on a network, so that it must<br>

> provide its own job scheduler and replicated distributed file system.<br>

> Swift, in contrast, seems mostly designed for systems where there is a<br>

> reliable shared file system, and where it acquires compute resources for a<br>

> fixed block of time from some existing cluster manager.  I know there are<br>

> ways you can have Swift/Coaster/Falkon run on networks of unreliable<br>

> machines, but it's not quite like Hadoop's job scheduler which is designed<br>

> to actually be the primary submission mechanism for a multi-user cluster.<br>

><br>

> I don't think it would make much sense to run Swift on a network of<br>

> unreliable machines and then to just leave your data on those machines (you<br>

> would normally stage the final data to some backed-up file system), but it<br>

> would make perfect sense for Hadoop, especially if the data is so big that<br>

> it's difficult to find someplace else to put it.  In contrast, you can<br>

> certainly stand up a Hadoop instance on a shared cluster for a few hours to<br>

> run your jobs, and stage data in and out of HDFS, but that use case isn't<br>

> what Hadoop was designed or optimized for. Most of the core developers on<br>

> Hadoop are working in environments where they have dedicated Hadoop clusters,<br>

> where they can't afford much cluster downtime and where they need to<br>

> reliably persist huge amounts of data for years on unreliable hardware.<br>

> E.g. at the extreme end, this is the kind of thing Hadoop developers are<br>

> thinking about:<br>

> <a href="https://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920" target="_blank">https://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920</a><br>


><br>

> - Tim<br>

><br>

><br>

><br>

> On Sun, May 13, 2012 at 3:57 PM, Ioan Raicu <<a href="mailto:iraicu@cs.iit.edu">iraicu@cs.iit.edu</a>> wrote:<br>

>><br>

>> Hi Tim,<br>

>> I always thought of MapReduce as being a subset of workflow systems. Can you<br>

>> give me an example of an application that can be implemented in MapReduce,<br>

>> but not a workflow system such as Swift? I can't think of any off the top of<br>

>> my head.<br>

>><br>

>><br>

>> Ioan<br>

>><br>

>> --<br>

>> =================================================================<br>

>> Ioan Raicu, Ph.D.<br>

>> Assistant Professor<br>

>> =================================================================<br>

>> Computer Science Department<br>

>> Illinois Institute of Technology<br>

>> 10 W. 31st Street Chicago, IL 60616<br>

>> =================================================================<br>

>> Cel:   <a href="tel:1-847-722-0876" value="+18477220876">1-847-722-0876</a><br>

>> Email: <a href="mailto:iraicu@cs.iit.edu">iraicu@cs.iit.edu</a><br>

>> Web:   <a href="http://www.cs.iit.edu/~iraicu/" target="_blank">http://www.cs.iit.edu/~iraicu/</a><br>

>> =================================================================<br>


>><br>

>><br>

>><br>

>> On May 13, 2012, at 1:09 PM, Tim Armstrong <<a href="mailto:tim.g.armstrong@gmail.com">tim.g.armstrong@gmail.com</a>><br>

>> wrote:<br>

>><br>

>> I've worked on both Swift and Hadoop implementations and my tendency is to<br>

>> say that there isn't actually any deep similarity beyond them both<br>

>> supporting  distributed data processing/computation.  They both make<br>

>> fundamentally different assumptions about the clusters they run on and the<br>

>> applications they're supporting.<br>

>><br>

>> Swift is mainly designed for time-shared clusters with reliable shared<br>

>> file systems. Hadoop assumes that it will be running on unreliable commodity<br>

>> machines with no shared file system, and will be running continuously on all<br>

>> machines on the cluster.  Swift is designed for orchestrating existing<br>

>> executables with their own file formats, so mostly remains agnostic to the<br>

>> contents of the files it is processing.  Hadoop needs to have some<br>

>> understanding of the contents of the files it is processing, to be able to<br>

>> segment them into records and perform key comparisons so it can do a<br>

>> distributed sort, etc.  It provides its own file formats (including<br>

>> compression, serialization, etc.) that users can use, although it is extensible<br>

>> to custom file formats.<br>

>><br>

>> - Hadoop implements its own distributed file system with software<br>

>>   redundancy; Swift uses an existing cluster file system or node-local file<br>

>>   systems.  For bulk data processing, this means Hadoop will generally be able<br>

>>   to deliver more disk bandwidth, and it has a bunch of other implications.<br>

>> - Hadoop has a record-oriented view of the world, i.e. it is built around<br>

>>   the idea that you are processing a record at a time, rather than a file at<br>

>>   a time as in Swift.<br>

>> - As a result, Hadoop includes a bunch of functionality to do with file<br>

>>   formats, compression, serialization, etc.; Swift is B.Y.O. file format.<br>

>> - Hadoop's distributed sort is a core part of MapReduce (and something<br>

>>   that a lot of effort has gone into implementing and optimizing); Swift<br>

>>   doesn't have built-in support for anything similar.<br>

>> - Swift lets you construct arbitrary dataflow graphs between tasks, so in<br>

>>   some ways it is less restrictive than the map-reduce pattern (although it<br>

>>   doesn't directly support some things that the map-reduce pattern does, so I<br>

>>   wouldn't say that it is strictly more general).<br>
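
As a rough illustration of what "arbitrary dataflow graphs between tasks" means, here is a small Python sketch that runs tasks in dependency order. It is not Swift code and does not reflect how Swift schedules work: run_dataflow, the task names, and the diamond-shaped example graph are all hypothetical, and the tasks run sequentially in one process (graphlib requires Python 3.9 or later).<br>

<pre>
```python
from graphlib import TopologicalSorter


def run_dataflow(tasks, deps):
    """Run tasks in dependency order.

    tasks: name -> function that takes the list of its inputs' results
    deps:  name -> list of task names whose outputs it consumes
    """
    results = {}
    # TopologicalSorter expects a mapping of node -> predecessors.
    order = TopologicalSorter({name: deps.get(name, []) for name in tasks})
    for name in order.static_order():
        inputs = [results[d] for d in deps.get(name, [])]
        results[name] = tasks[name](inputs)
    return results


# Hypothetical diamond-shaped graph: one producer, two independent
# consumers, and a final task that joins both results.
tasks = {
    "load": lambda inputs: [3, 1, 2],
    "total": lambda inputs: sum(inputs[0]),
    "largest": lambda inputs: max(inputs[0]),
    "report": lambda inputs: {"total": inputs[0], "largest": inputs[1]},
}
deps = {"total": ["load"], "largest": ["load"], "report": ["total", "largest"]}

if __name__ == "__main__":
    print(run_dataflow(tasks, deps)["report"])
```
</pre>

Each task here sees only the outputs of its predecessors, which matches the file-at-a-time, content-agnostic view described above; nothing in the graph knows about records or keys.<br>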

>><br>

>> I'd say that some applications might fit in both paradigms, but that<br>

>> neither supports a superset of the applications that the other supports.<br>

>> Performance would depend to a large extent on the application.  Swift might<br>

>> actually be quicker to start up a job and dispatch tasks (Hadoop is<br>

>> notoriously slow on that front), but otherwise I'd say it just depends on<br>

>> the application, how you implement the application, the cluster, etc. I'm<br>

>> not sure that there is a fair comparison between the two systems since<br>

>> they're just very different: most of the results would be predictable just<br>

>> by looking at the design of the system (e.g. if the application needs to do<br>

>> a big distributed sort, Hadoop is much better).  If the application is<br>

>> embarrassingly parallel (like it sounds like your application is), then you<br>

>> could probably implement it in either, but I'm not sure that it would<br>

>> actually stress the differences between the systems if data sizes are small<br>

>> and runtime is mostly dominated by computation.<br>
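
For reference, an embarrassingly parallel Monte Carlo job of the kind discussed here can be sketched in a few lines of Python. Everything specific in it is an assumption made for illustration, not a detail of the actual application: the exponential time-to-failure model, the failure_rate and horizon_years parameters, and the function names. A thread pool is used only to keep the sketch self-contained; a real run would hand the independent trials to processes or cluster nodes.<br>

<pre>
```python
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_device(seed, failure_rate=0.1, horizon_years=10.0):
    """One independent trial: does a hypothetical device fail within the horizon?"""
    # A per-trial seed keeps trials independent of each other and repeatable.
    rng = random.Random(seed)
    # Assumed model: exponentially distributed time to failure.
    time_to_failure = rng.expovariate(failure_rate)
    return horizon_years > time_to_failure


def estimate_failure_probability(num_trials, max_workers=4):
    """Fan the trials out to a pool and average the outcomes."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(simulate_device, range(num_trials)))
    return sum(outcomes) / num_trials


if __name__ == "__main__":
    print(estimate_failure_probability(1000))
```
</pre>

No trial reads another trial's output, so there is no sort, shuffle, or shared data to exercise the differences between the two systems.<br>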

>><br>

>> I think the Cloudera Hadoop distribution is well documented and reasonably<br>

>> easy to set up and run, provided that you're not on a time-shared cluster.<br>

>> Apache Hadoop is more of a pain to get working.<br>

>><br>

>> - Tim<br>

>><br>

>><br>

>> On Sun, May 13, 2012 at 9:27 AM, Ketan Maheshwari<br>

>> <<a href="mailto:ketancmaheshwari@gmail.com">ketancmaheshwari@gmail.com</a>> wrote:<br>

>>><br>

>>> Hi,<br>

>>><br>

>>> We are working on a project from GE Energy corporation which runs<br>

>>> independent MonteCarlo simulations in order to find device reliability<br>

>>> leading to power-grid-wide device replacement decisions. The computation<br>

>>> is repeated MC simulations done in parallel.<br>

>>><br>

>>> Currently, this is running under a Hadoop setup on Cornell Redcloud and EC2<br>

>>> (10 nodes). Looking at the computation, it struck me this is a good Swift<br>

>>> candidate. And since the performance numbers etc are already extracted for<br>

>>> Hadoop, it might also be nice to have a comparison between Swift and Hadoop.<br>

>>><br>

>>> However, a reality check before diving in: has it been done before? Do<br>

>>> we know how Swift fares against map-reduce? Are they even comparable? I have<br>

>>> faced this question twice here: Why use Swift when you have Hadoop?<br>

>>><br>

>>> I can see that Hadoop needs quite a bit of setup effort before getting it to<br>

>>> run. Could we quantify usability and compare the two?<br>

>>><br>

>>> Any ideas and inputs are welcome.<br>

>>><br>

>>> Regards,<br>

>>> --<br>

>>> Ketan<br>

</div></div></blockquote></div><br>