Today we’ll briefly showcase how to join a static dataset in Spark with a streaming “live” dataset, otherwise known as a DStream. This is helpful in a number of scenarios: like when you have a live stream of data from Kafka (or RabbitMQ, Flink, etc) that you want to join with tabular data you queried from a database (or a Hive table, or a file, etc), or anything you can normally consume into Spark.
Let’s dive right in with an overview of the code that’ll do the trick. But first… let’s assume we have some transaction data streaming in from Kafka that contains a timestamp, a customer id, a transaction id (unique to this transaction), and a transaction amount. Let’s also assume we have a dataset of line-item details for each transaction in a file, Hive table, or other database that we want to join with this stream (using transaction_id).
But how? With the transform() method in the DStream class, which, by definition will “return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.”
Here’s one way to do it in a few short steps:
The result is a DStream that contains the transaction_id, the live transaction record (from Kafka), and the transaction detail record (from whatever source you used). This is because the join function used the transaction_id present in both the Kafka record (Transaction) and the static data.
It’s that simple. Thanks, transform() !