[Swift-devel] Paper: Making Sense of Performance in Data Analytics Frameworks

Tim Armstrong tim.g.armstrong at gmail.com
Mon Apr 20 10:33:37 CDT 2015


I thought I would share this summary of a paper in NSDI that's worth
reading:
http://blog.acolyer.org/2015/04/20/making-sense-of-performance-in-data-analytics-frameworks/

The overall message is that in systems like Hadoop or Spark (as used
nowadays for analytics or data warehousing), the performance bottleneck is
mainly CPU time rather than disk or network I/O.

This is a result of the widespread application of compression in the file
formats used - compression reduces I/O requirements but increases CPU
requirements.  SSDs also give you a lot more I/O bandwidth at the cost of
capacity (so you need to compress more).  Even basic stuff like gzipping
parts of files is somewhat effective, and then there are open source
projects like Parquet and closed source projects like Google's ColumnIO
that are even more effective.
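To make the trade-off concrete, here's a toy Python sketch (not from the paper, just an illustration) that gzips some repetitive data and reports the size reduction alongside the CPU time spent compressing - the bytes you save on disk/network are paid for in cycles:

```python
import gzip
import time

# Toy illustration of the compression trade-off: gzip shrinks the bytes
# we'd push to disk or over the network, but costs CPU time to do so.
# (Repetitive data compresses well, like real log/warehouse data often does.)
data = b"2015-04-20 10:33:37 INFO some repetitive log line\n" * 100_000

t0 = time.perf_counter()
compressed = gzip.compress(data, compresslevel=6)
cpu_seconds = time.perf_counter() - t0

ratio = len(data) / len(compressed)
print(f"raw: {len(data)} bytes, compressed: {len(compressed)} bytes")
print(f"~{ratio:.0f}x smaller, at a cost of {cpu_seconds * 1000:.1f} ms of CPU")
```

Columnar formats like Parquet push this further by grouping similar values together before encoding, which is exactly why the I/O bill keeps shrinking while the CPU bill grows.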

This matches my experience at Facebook - CPU was the bottleneck on their
Hive workloads and disk space (rather than I/O operations) was becoming the
bottleneck for their MySQL workloads.  Some people in industry have been
aware of this trend for a while now, but academia has mainly been thinking
about optimising I/O or network usage.

Worth thinking about, maybe not so much for Swift right now since the
architecture and workloads are different, but for future research plans.

- Tim