How to Write ORC Files and Hive Partitions in Spark

sporc

ORC, or Optimized Row Columnar, is a popular big data file storage format. Its rise in popularity is due to it being highly performant, very compressible, and progressively more supported by top-level Apache products, like Hive, Crunch, Cascading, Spark, and more.

I recently wanted/needed to write ORC files from my Spark pipelines, and found specific documentation lacking. So, here’s a way to do it. Continue reading

Advertisements

Apache Crunch Tutorial 9: Reading & Ingesting Orc Files

LITTLE_CRUNCH

This post is the ninth in a hopefully substantive and informative series of posts about Apache Crunch, a framework for enabling Java developers to write Map-Reduce programs more easily for Hadoop.

Continue reading