ORC, or Optimized Row Columnar, is a popular big data file storage format. Its rise in popularity is due to it being highly performant, very compressible, and progressively more supported by top-level Apache products, like Hive, Crunch, Cascading, Spark, and more.
I recently wanted/needed to write ORC files from my Spark pipelines, and found specific documentation lacking. So, here’s a way to do it.
First, a few assumptions:
- You have a working Spark application
- You know what RDDs and DataFrames are (and the difference)
- You have a Dstream, RDD or DataFrame with data in it
If you have a DataFrame, writing ORC to HDFS could not be simpler:
Writing ORC from a DataFrame
This will simply write some good old .orc files in an HDFS directory. You can put a Hive table on top of it.
Writing ORC (Partitioned) from a DataFrame
Write some good old .orc files in the HDFS directory specified, but inside of partitions/folders just like a Hive table would be laid out. You can put a partitioned Hive table on top of it.
If you have an RDD, it’s really not much more complex.
Writing ORC from an RDD
Write some good old .orc files in an HDFS directory. Same drill as a DataFrame, really.
Writing ORC from a DStream
Working on a Spark Streaming app? Want to write ORC files out and avoid that awful, nasty “.saveAsTextFiles()” option? Do this, friend.
And that’s really it! It’s pretty easy to benefit from the ORC file format without needing to do much in the way of altering your current pipeline.
And it beats raw text. If you have any questions or issues, put ’em below!