[Swift-commit] r7084 - SwiftApps/Swift-MapRed/Paper

ketan at ci.uchicago.edu ketan at ci.uchicago.edu
Wed Sep 18 21:03:54 CDT 2013


Author: ketan
Date: 2013-09-18 21:03:54 -0500 (Wed, 18 Sep 2013)
New Revision: 7084

Modified:
   SwiftApps/Swift-MapRed/Paper/swifthadoop.tex
Log:
discussion with Yadu

Modified: SwiftApps/Swift-MapRed/Paper/swifthadoop.tex
===================================================================
--- SwiftApps/Swift-MapRed/Paper/swifthadoop.tex	2013-09-18 19:52:18 UTC (rev 7083)
+++ SwiftApps/Swift-MapRed/Paper/swifthadoop.tex	2013-09-19 02:03:54 UTC (rev 7084)
@@ -494,3 +494,41 @@
 
 What is HDFS?
 HDFS, inspired by Google's GFS.%TODO: citation
+
+Email discussion with Yadu on Zaharia's RDD paper:
+
+
+=== Yadu ===
+So, the idea here is to hold results between stages in memory rather than writing them to disk or a shared
+file system. With no persistence, they deal with the possibility of failure by keeping track of the lineage
+of each generated dataset. I think the assumption here is that recomputing a lost data item is much cheaper
+and faster than checkpointing every dataset to disk. RDDs can also spill over to disk when RAM is full, but
+they don't explain that mechanism well, or whether there is a possibility of thrashing.
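+
+Roughly, the persistence behavior would look like this in Spark's Scala API (a minimal sketch; the package
+names assume the Spark 0.8-era layout, and the paths and names are made up):
+
+  import org.apache.spark.SparkContext
+  import org.apache.spark.storage.StorageLevel
+
+  object LineageSketch {
+    def main(args: Array[String]) {
+      val sc = new SparkContext("local", "LineageSketch")
+
+      // Each transformation only records a lineage step; nothing is materialized yet.
+      val lines  = sc.textFile("hdfs://namenode/data/input.txt")  // hypothetical path
+      val errors = lines.filter(_.contains("ERROR"))
+
+      // MEMORY_AND_DISK keeps partitions in RAM and spills them to local disk when RAM
+      // fills; a lost partition is recomputed from the lineage (textFile -> filter)
+      // rather than restored from a checkpoint.
+      errors.persist(StorageLevel.MEMORY_AND_DISK)
+
+      println(errors.count())  // the first action forces evaluation
+    }
+  }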
+
+(As for the lineage part, the whole point of using file pointers in SwiftK was to make sure Swift kept track
+of files so that any lost file could be recomputed, so this is something we can show with little effort.)
+
+There is also a very simple description of their constructs map, join, and filter applied to RDDs in memory.
+Of course, one major pitfall I see is that the functions passed are all written in Scala (a major plus point
+for Swift).
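+
+For concreteness, a sketch of those constructs composed in memory (again Scala against the RDD API; the
+file names and input layout are assumed):
+
+  import org.apache.spark.SparkContext
+  import org.apache.spark.SparkContext._  // implicits that enable join on (key, value) RDDs
+
+  object ConstructsSketch {
+    def main(args: Array[String]) {
+      val sc = new SparkContext("local", "ConstructsSketch")
+
+      // Two keyed datasets; the closures passed to map and filter are plain Scala functions.
+      val hits = sc.textFile("hits.txt").map { line =>
+        val f = line.split("\t"); (f(0), f(1).toInt)     // (url, count)
+      }
+      val ranks = sc.textFile("ranks.txt").map { line =>
+        val f = line.split("\t"); (f(0), f(1).toDouble)  // (url, rank)
+      }
+
+      // filter and join compose entirely in memory; nothing is written to HDFS between stages.
+      val popular = hits.filter { case (_, c) => c > 100 }
+      val joined  = popular.join(ranks)                  // RDD[(url, (count, rank))]
+
+      joined.take(5).foreach(println)
+    }
+  }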
+
+The examples given are iterative processing ones, which put Hadoop at a disadvantage (due to having to write
+to HDFS) and should benefit our point as well. They give a PageRank algorithm, a clustering algorithm...
+(I don't remember the rest).
+
+So, these are the main ideas, I think. We should probably talk once you've read it. Though I can't help
+feeling that these guys beat us to our ideas.
+
+=== Ketan ===
+OK, I read the paper. I think the idea of in-memory computing is not new, but the way they manipulate data in memory is new. However, above and beyond all the speed benefits etc., there are two key limitations:
+1. Applications that deal with large amounts of intermediate data: if you spill a block of data over to disk, you have to make a trip to disk to make the full dataset available.
+2. Applications which do not have any intermediate reusable data: the model pretty much proves to be memory-hungry in that case, unless it degenerates gracefully to a disk-oriented model.
+
+Apart from this, inter-node and inter-cluster bandwidth will still account for a large part of an application's data-movement time.
+
+Do these points make sense to you?
+
+In my opinion, while the paper is relevant to our work, it does not outshine our ideas and implementation. And as you said, we are much less obscure than Scala in terms of expressiveness and programmability.
+
+Let's discuss this more, along with the DSSAT implementation, when we meet.
+We should also take a look at this HaLoop thingy.
+



