How to Write ORC Files and Hive Partitions in Spark

sporc

ORC, or Optimized Row Columnar, is a popular big data file storage format. Its rise in popularity is due to it being highly performant, very compressible, and progressively more supported by top-level Apache products, like Hive, Crunch, Cascading, Spark, and more.

I recently wanted/needed to write ORC files from my Spark pipelines, and found specific documentation lacking. So, here’s a way to do it.

First, a few assumptions:

  • You have a working Spark application
  • You know what RDDs and DataFrames are (and the difference)
  • You have a Dstream, RDD or DataFrame with data in it

If you have a DataFrame, writing ORC to HDFS could not be simpler:

Writing ORC from a DataFrame

This will simply write some good old .orc files in an HDFS directory. You can put a Hive table on top of it.

Writing ORC (Partitioned) from a DataFrame

Write some good old .orc files in the HDFS directory specified, but inside of partitions/folders just like a Hive table would be laid out. You can put a partitioned Hive table on top of it.

If you have an RDD, it’s really not much more complex.

Writing ORC from an RDD

Write some good old .orc files in an HDFS directory. Same drill as a DataFrame, really. 

Writing ORC from a DStream

Working on a Spark Streaming app? Want to write ORC files out and avoid that awful, nasty “.saveAsTextFiles()” option? Do this, friend.

And that’s really it! It’s pretty easy to benefit from the ORC file format without needing to do much in the way of altering your current pipeline.

And it beats raw text. If you have any questions or issues, put ’em below!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s