What I Learned Building My First Spark Streaming App

[Photo: a stream]
Get it? Spark Streaming? Stream… it’s a stream…. sigh.

I’ve been working with Hadoop, MapReduce and other “scalable” frameworks for a little over three years now. One of the latest and greatest innovations in our open source space is Apache Spark, a parallel processing framework built on the paradigm MapReduce introduced, but packed with enhancements, improvements, optimizations and features. You probably know about Spark, so I don’t need to give you the whole pitch.

You’re likely also aware of its main components:

  • Spark Core: the parallel processing engine written in the Scala programming language
  • Spark SQL: allows you to programmatically use SQL in a Spark pipeline for data manipulation
  • Spark MLlib: machine learning algorithms ported to Spark for easy use by devs
  • Spark GraphX: a graph processing library built on the Spark Core engine
  • Spark Streaming: a framework for processing live, high-throughput streams of data

Spark Streaming is what I’ve been working on lately. Specifically, I’ve been building apps in Scala that use Spark Streaming to consume data from Kafka topics, perform on-the-fly manipulations and joins in memory, and write the newly augmented data to HDFS in “near real time.”

I’ve learned a few things along the way. Here are my tips:

1 – Determine What Version of Scala and Spark You’re Using on Your Cluster, and Use Those During Development



In my workplace, we use Spark 2.1.1 with Scala 2.11.8. It’s critical to align versions of these appropriately, or you’re likely to suffer errors when compiling and running your code.

What type of errors, you ask? They usually manifest as NoClassDefFoundError (or NoSuchMethodError) exceptions: a class or method your code calls can’t be found at runtime, typically because it moved packages or changed signatures between releases. That’s why it’s important to make sure your Scala and Spark releases are in lockstep. Here is a good reference article on StackOverflow.
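
One way to enforce that alignment is in your build definition. Here’s a minimal build.sbt sketch (assuming sbt; swap in whichever Spark modules your app actually uses):

```scala
// build.sbt -- pin the Scala and Spark versions to match your cluster.
// "provided" keeps Spark out of your assembly jar, since the cluster supplies it.
scalaVersion := "2.11.8"

val sparkVersion = "2.1.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"                 % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming"           % sparkVersion % "provided",
  // pick the Kafka connector matching your broker version
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
)
```

The `%%` operator appends the Scala binary version (here, 2.11) to each artifact name, which is exactly what keeps the Scala and Spark releases in lockstep.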

2 – Leverage Your IDE to Help You Understand Scala Quirks



Syntactically, Scala is a remarkably concise language, and most of the time a highly readable one, thanks to its high level of abstraction over complex functionality. While this yields brief code that accomplishes impressive tasks, it can be difficult, even frustrating, to comprehend at first.

You can remedy this by using the power of your modern Integrated Development Environment (I prefer and highly recommend IntelliJ Community Edition) to learn what’s happening in the background. For example:


  • Retype pieces of code you get from StackOverflow or a peer, and observe the results in the dialog windows that appear. When you hit syntax you don’t understand, type it slowly and watch as the dialogs show you what objects you’re dealing with, what a method call returns, and what methods are available on a given object.


  • Use CMD+B (or Navigate > Declaration) to view the original declaration or source code of an object, class, or method! This is super helpful for understanding what API calls expect as input, or seeing what Spark’s map function will return as output.
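
For instance, take a throwaway snippet like this one (my own contrived example, not from any particular app): every val has a type Scala infers silently, and hovering over each one in IntelliJ reveals exactly what you’re holding at each step.

```scala
// Each val here has a type Scala infers without being told; the IDE will reveal them.
val lines = Seq("a,1", "b,2", "a,3")                         // Seq[String]
val pairs = lines.map(_.split(","))                          // Seq[Array[String]]
val keyed = pairs.map(a => (a(0), a(1).toInt))               // Seq[(String, Int)]
val totals = keyed.groupBy(_._1).mapValues(_.map(_._2).sum)  // Map[String, Int]
println(totals)                                              // Map(a -> 4, b -> 2)
```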

3 – Don’t Be Afraid to Use Java Stuff



Scala isn’t Java, but it runs on the JVM, which gives it access to Java functionality. In short, you can call Java classes and libraries directly from your Scala apps. Pretty neat, huh? Pretty confusing too.

But don’t be afraid to use it if you’re up against a wall. Java has had years to build up solutions to just about any programming problem you might encounter.

Example: timestamp manipulation! In data engineering, we play with timestamps all the time: epochs, yyyy-MM-dd HH:mm:ss, you name it. There are a hundred ways to do this in Java, all very well documented in places like, you guessed it, StackOverflow.
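
Here’s one of those hundred ways, a quick sketch calling java.time straight from Scala (assuming Java 8 or later):

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// Convert an epoch-seconds timestamp into a "yyyy-MM-dd HH:mm:ss" string,
// using plain java.time from Scala -- no Scala-specific library required.
val formatter = DateTimeFormatter
  .ofPattern("yyyy-MM-dd HH:mm:ss")
  .withZone(ZoneId.of("UTC"))

val epochSeconds = 1502476363L
val formatted = formatter.format(Instant.ofEpochSecond(epochSeconds))
println(formatted)  // 2017-08-11 18:32:43
```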

If there’s a Java solution and it’s efficient enough for your program, don’t fear using it just because it isn’t “fully Scala.” You’ll be glad you used it when it just works.

4 – Take the Time to Study and Understand Spark Streaming



Get a hot cup of coffee, get comfortable, and chew through the documentation on Spark Streaming. Get more than acquainted with the terms, get friendly. Diagram process flows if that helps you (it helps me). Learn what a DStream is.
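
To make “what a DStream is” concrete, here is the canonical word-count example from the programming guide. A DStream is a sequence of RDDs, one per batch interval, and operations on it apply to every batch:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The programming guide's classic example: count words arriving on a socket,
// in 5-second batches. Each batch of the DStream is an RDD under the hood.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)  // DStream[String]
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()                                       // prints each batch's counts

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // run until stopped
```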

Let the programming guide (and other helpful blogs from those who have suffered through the hard, pioneering work) guide you to success and knowledge.

And when you’re done with that, rewrite some of the documentation for the components you use in your Spark applications: not just so your team and company can understand what you’ve written, but so you understand it better yourself. Teaching something is the best way to truly learn it.

5 – Monitor Your App Results and Performance by Building Simple Tables and Charts in Excel



I dislike spending time in Excel as much as the next guy, but this was remarkably helpful in truly understanding what was happening under the hood of my first streaming apps. In one scenario, I had a streaming app reading two Kafka topics to get two different datasets: call them event type 1 and event type 2. The goal was to join them within the streaming app on a column they have in common, a key/id. By having my app print the counts at each step of the process, I could see how successful my “join rate” was over time (using stateful streaming), with just a few sum and division formulas fed into simple line graphs.
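
As a sketch of the kind of instrumentation I mean (the names here are hypothetical, not my actual app): per batch, print the raw counts that feed the spreadsheet’s sums and join-rate formula.

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical sketch: type1 and type2 are keyed DStreams built elsewhere from
// the two Kafka topics. Printing per-batch counts gives you the raw numbers to
// paste into a spreadsheet and chart a join rate (joined / arrived) over time.
def logJoinCounts(type1: DStream[(String, String)],
                  type2: DStream[(String, String)]): Unit = {
  val joined = type1.join(type2)  // inner join on the shared key/id, per batch
  type1.count().print()           // event type 1 records in this batch
  type2.count().print()           // event type 2 records in this batch
  joined.count().print()          // successfully joined records in this batch
}
```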

Then, as I changed my code to debug and introduce improvements, I could clone the sheet (with its simple formulas and line graphs), plug in new data, and see the impact visually right away.

It’s not as critical a suggestion as the above four, but I found it immensely valuable (and the stakeholders of my application did as well). Visuals always help.

6 – Get Spark Running Locally in Your IDE


For real. Don’t use your cluster for testing (or at least not for basic, functional testing). Get Spark running in IntelliJ with this tutorial (setting up spark 2.0 with intellij). The time spent up front doing this will save you hours and headaches.
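
Once it’s set up, a local run needs nothing more than a master of local[*]. A minimal sketch, assuming Spark 2.x:

```scala
import org.apache.spark.sql.SparkSession

// A throwaway local SparkSession for functional testing inside the IDE.
// "local[*]" runs Spark in-process on all available cores -- no cluster needed.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("local-smoke-test")
  .getOrCreate()

// Trivial sanity check that the local engine works end to end.
val doubled = spark.sparkContext.parallelize(1 to 5).map(_ * 2).collect()
println(doubled.mkString(", "))  // 2, 4, 6, 8, 10

spark.stop()
```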

That’s all I have for this post. Over the next week or so, I plan to upload small snippets of Spark Streaming code that I’ve found immensely helpful; hopefully you will too. Thanks for reading!
