I’ve been working with Hadoop, MapReduce and other “scalable” frameworks for a little over three years now. One of the latest and greatest innovations in the open source space has been Apache Spark, a parallel processing framework that builds on the paradigm MapReduce introduced and adds a long list of enhancements, optimizations and features. You probably know about Spark, so I don’t need to give you the whole pitch.
You’re likely also aware of its main components:
- Spark Core: the parallel processing engine written in the Scala programming language
- Spark SQL: allows you to programmatically use SQL in a Spark pipeline for data manipulation
- Spark MLlib: machine learning algorithms ported to Spark for easy use by devs
- Spark GraphX: a graph-processing library built on the Spark Core engine
- Spark Streaming: a framework for handling data that is live-streaming at high speed
Spark Streaming is what I’ve been working on lately. Specifically, I’ve been building apps in Scala that use Spark Streaming to consume data from Kafka topics, do some on-the-fly manipulations and joins in memory, and write the newly augmented data to HDFS in “near real time”.
I’ve learned a few things along the way. Here are my tips: