We are Speaking at Spark + AI Summit 2019!

Hi everyone, Landon here with some exciting news!

I, alongside my colleagues at SpotX (Ben Storrie and Jack Chapa), will be presenting at Spark + AI Summit 2019. It’s an exciting opportunity to share what we’ve learned over the last year or so – particularly in the realm of Spark Streaming applications.

We will be presenting two different sessions:

Headaches and Breakthroughs in Building Continuous Applications

At SpotX, we have built and maintained a portfolio of Spark Streaming applications — all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We’ll detail what we’ve learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.

  • Speakers: Landon Robinson & Jack Chapa
  • Topic: Streaming
  • Time: 11:00am on Thursday, April 25th
  • Room: 2007

Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring

The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application – and you can start using it in minutes. In this talk, we’ll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they’ve changed our world for the better at SpotX. If you’re looking for a “Eureka!” moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!

  • Speakers: Landon Robinson & Ben Storrie
  • Topic: Developer
  • Time: 5:30pm on Thursday, April 25th
  • Room: 3016

When I wrote the abstract submissions for this year’s summit, I had little expectation of either of them being accepted, let alone both. So for my part I’m proud to have the opportunity to share the exciting developments we’ve celebrated at SpotX with the wider Big Data community.

For those of you that have been reading this website for a time, you know that my biggest goal has been to democratize the spread of helpful, technical big data knowledge and breakthroughs to engineers around the world. Having these sessions at Spark + AI Summit 2019, and shared freely on YouTube afterward, is a huge step in that pursuit. I couldn’t be prouder.

If you’ll be there, definitely drop by the sessions! I’d love to meet you. Feel free to reach out to me on LinkedIn and we can sync up.

-Landon


How to Join Static Data with Streaming Data (DStream) in Spark

Today we’ll briefly showcase how to join a static dataset in Spark with a streaming “live” dataset, otherwise known as a DStream. This is helpful in a number of scenarios: say you have a live stream of data from Kafka (or RabbitMQ, Kinesis, etc.) that you want to join with tabular data you queried from a database (or a Hive table, a file, or anything else you can normally consume into Spark). Continue reading
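Here’s a minimal, self-contained sketch of the idea (the full post has the details). It assumes Spark 2.x is on the classpath, and uses a `queueStream` as a stand-in for a real Kafka source – the key/payload values are hypothetical:

```scala
import scala.collection.mutable
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc  = new SparkContext("local[2]", "StaticJoinSketch")
val ssc = new StreamingContext(sc, Seconds(1))

// Static data, e.g. queried once from a database or Hive, keyed by id.
val staticRdd = sc.parallelize(Seq((1, "alpha"), (2, "beta")))

// Stand-in for a Kafka stream: a queue of (id, payload) micro-batches.
val queue  = mutable.Queue(sc.parallelize(Seq((1, "event-a"), (2, "event-b"))))
val stream = ssc.queueStream(queue)

// transform() exposes each micro-batch as an RDD, so a plain RDD join works.
val joined = stream.transform(batch => batch.join(staticRdd))
joined.print()

ssc.start()
ssc.awaitTerminationOrTimeout(5000)
ssc.stop(stopSparkContext = true, stopGracefully = false)
```

The `transform` operation is the trick: it lets you apply any RDD-to-RDD function (including joins against non-streaming RDDs) to every batch in the stream.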

How to Write ORC Files and Hive Partitions in Spark


ORC, or Optimized Row Columnar, is a popular big data file storage format. Its rise in popularity is due to its high performance, strong compression, and growing support across top-level Apache projects like Hive, Crunch, Cascading, and Spark.

I recently wanted/needed to write ORC files from my Spark pipelines, and found specific documentation lacking. So, here’s a way to do it. Continue reading
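As a quick preview of the approach, here’s a hedged sketch (assumes Spark 2.x; the output path and column names are hypothetical) of writing a DataFrame as ORC with Hive-style partition directories:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder
  .appName("OrcWriteSketch")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  ("2019-04-25", "alpha", 1L),
  ("2019-04-26", "beta",  2L)
).toDF("dt", "name", "count")

df.write
  .mode(SaveMode.Overwrite)
  .partitionBy("dt")       // one directory per dt value: .../dt=2019-04-25/
  .orc("/tmp/orc_sketch")  // hypothetical output path
```

`partitionBy` produces the `key=value` directory layout Hive expects, so a Hive table pointed at that location can pick the partitions up (e.g. via `MSCK REPAIR TABLE`).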

What I Learned Building My First Spark Streaming App

IMG_20170811_183243
Get it? Spark Streaming? Stream… it’s a stream…. sigh.

I’ve been working with Hadoop, MapReduce, and other “scalable” frameworks for a little over three years now. One of the latest and greatest innovations in our open source space has been Apache Spark, a parallel processing framework built on the paradigm MapReduce introduced, but packed with enhancements, optimizations, and new features. You probably know about Spark, so I don’t need to give you the whole pitch.

You’re likely also aware of its main components:

  • Spark Core: the parallel processing engine written in the Scala programming language
  • Spark SQL: allows you to programmatically use SQL in a Spark pipeline for data manipulation
  • Spark MLlib: machine learning algorithms ported to Spark for easy use by devs
  • Spark GraphX: a graph processing library built on the Spark Core engine
  • Spark Streaming: a framework for processing live, high-speed data streams

Spark Streaming is what I’ve been working on lately. Specifically, building apps in Scala that utilize Spark Streaming to stream data from Kafka topics, do some on-the-fly manipulations and joins in memory, and write newly augmented data in “near real time” to HDFS.
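That pattern can be sketched roughly like this – a hedged skeleton, not our production code, with hypothetical broker, topic, and path names, and assuming the `spark-streaming-kafka-0-10` artifact is on the classpath alongside an existing SparkContext `sc`:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val ssc = new StreamingContext(sc, Seconds(30))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "my-streaming-app"
)

// Stream records from a Kafka topic...
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
)

// ...manipulate them on the fly, then land them in HDFS in near real time.
stream.map(record => record.value.toUpperCase)
      .foreachRDD { rdd =>
        if (!rdd.isEmpty)
          rdd.saveAsTextFile(s"hdfs:///data/events/${System.currentTimeMillis}")
      }

ssc.start()
ssc.awaitTermination()
```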

I’ve learned a few things along the way. Here are my tips: Continue reading

Spark History Server Automatic Cleanup

I wonder how much paper you’d need to print 1.5 TB of logs…

If you’ve been running Spark applications for a few months, you might start to notice some odd behavior from the history server (default port 18080). Specifically, it’ll take forever to load the page, show links to applications that no longer exist, or even crash. Three parameters take care of this once and for all. Continue reading
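For reference, the three parameters are most likely Spark’s standard history-server cleaner settings, set in `spark-defaults.conf` on the host running the history server (the values below are examples, not recommendations):

```
# Enable automatic cleanup of old event logs
spark.history.fs.cleaner.enabled   true
# How often the cleaner scans for expired logs
spark.history.fs.cleaner.interval  1d
# Event logs older than this are deleted
spark.history.fs.cleaner.maxAge    7d
```

With these in place, the history server prunes old event logs on its own instead of accumulating them until the UI slows to a crawl.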