Apache Crunch Tutorial 9: Reading & Ingesting Orc Files


This post is the ninth in a hopefully substantive and informative series of posts about Apache Crunch, a framework for enabling Java developers to write Map-Reduce programs more easily for Hadoop.

In my previous and eighth tutorial on Apache Crunch, we covered leveraging the native features of Crunch to write your data in a PCollection out to ORC (Optimized Row Columnar) format. Let’s talk today about how to read the Apache Orc file format in Crunch, and ingesting it, also using built-in functionality.

Crunch can read in a few different Hadoop common formats: text, Avro, and Orc are the 3 big ones, and it can read those formats right from HDFS or local spaces easily. For those of you that learned how to write ORC format from Crunch in the last tutorial, this tutorial should be helpful in showing you how to read those types of files too! Whether that’s to create another pipeline atop that previously created data, or you just have some ORC format you need to read for a project.

Crunch has a few different methods by which you can ingest ORC data. I’ll be focusing in this tutorial on what I feel is the easiest, most scalable, and most preferred approach. It’s called the reflection approach, and really involves very few steps.

Let’s get started.


Orc, Explained

You can learn more about Orc (and my opinions on it) in this earlier post on Hadoopsters. That page will also tell you how to leverage it in Hive (which is a useful aside to this post). Go check that out for more info — for now, we’ll operate on the assumption that you want to go ahead and learn how to read/ingest Orc into a PCollection using Crunch.

Let’s do it.

Ingesting Orc into Crunch

Let’s import some stuff:

import org.apache.crunch.io.orc.OrcFileSource;
import org.apache.crunch.types.orc.Orcs;
import org.apache.hadoop.fs.Path;

Let’s create:

  • a pipeline
  • a Java class file that emulates what a record looks like in the Orc file (think of it as a Hive schema)
  • an OrcFileSource: Crunch uses a concept of “sources” to read files of a certain type from designated path. We’ll tie this source to a Java class file using “reflection”, which I’ll explain later
  • a PCollection to store the data from the Orc file(s)!
//Create a Map-Reduce pipeline
Pipeline pipeline = new MRPipeline(MyClass.class, "Crunch Pipeline", getConf());

//Create a target linked to an HDFS path 'outputPath'
OrcFileSource<LocationOpenDateRecord> locationOpenDateRecordOrcFileSource = new OrcFileSource<LocationOpenDateRecord>(new Path("/path/to/my/orc/data/"), Orcs.reflects(LocationOpenDateRecord.class));

//Read the data in!
PCollection<LocationOpenDateRecord> locationOpenDateRecordPCollection = pipeline.read(locationOpenDateRecordOrcFileSource);

Let’s talk about what’s happening here.

  1. We established our Crunch MapReduce pipeline. Standard stuff if you’re a Crunch user — it’s just the line that ties all the jobs of our workflow together.
  2. We established an OrcFileSource around an HDFS path to our ORC data we want to ingest, and tie it to a Java class.
  3. We established a PCollection to read/ingest the data from the OrcFileSource (tied to a java class), using our pipeline.

Wait, what is this weird Java class? I don’t know what that is! Relax, friend. It’s very easy. It’s just one small artifact we create to make ingesting Orc easier, more streamlined, and scalable from a code perspective. I’ve created this one myself, and it looks like this:

public class LocationOpenDateRecord {

 Integer lct_nbr;
 String start_bus_dt;

 public Integer getLct_nbr() {
 return lct_nbr;

 public void setLct_nbr(Integer lct_nbr) {
 this.lct_nbr = lct_nbr;

 public String getStart_bus_dt() {
 return start_bus_dt;

 public void setStart_bus_dt(String start_bus_dt) {
 this.start_bus_dt = start_bus_dt;

In my ORC file, I know that all my records have two columns: a lct_nbr (location number) as a smallint, and a start_bus_dt (starting business date) as a string. So this class has two attributes that match accordingly, as well as convenient setters and getters for data access and manipulation in my pipeline. Isn’t that handy?

And because Crunch supports the ingestion of Orc files through a handful of different means, I’ve focused on this approach, called “reflection”, because of its ease of use, implementation, and scalability. You can look into other forms of ORC ingestion if they suit you better.

With that said, let’s go over one more time what’s just happened.

  1. We built a pipeline object.
  2. We built an OrcFileSource object that is tied to an HDFS path and reflects a Java class we built (which “reflects” the contents of the ORC file in Java form).
  3. We read/ingested the contents of the OrcFileSource from HDFS, and stored it in a PCollection of LocationOpenDateRecord objects (that now have convenient getters and setters!)

Great! So now that we have a PCollection of these objects, we can use them in DoFns just like we do any other PCollection of any other type. Maybe you want to take each record in that ORC data and return a simple CSV-style string of its contents? What does that look like in action, you politely ask?

public class ClassOfMyDoFns {

 static DoFn<LocationOpenDateRecord, String> DoFn_Example()
 return new DoFn<LocationOpenDateRecord, String>()
 public void process(LocationOpenDateRecord input, Emitter<String> emitter)
 Integer location = input.getLct_nbr();
 String start_date = input.getStart_bus_dt();

 emitter.emit(location + "," + start_date);


And if you wanted to use that DoFn with your new PCollection…

PCollection<String> result = locationOpenDateRecordPCollection.parallelDo(DataQualityDoFns.DoFn_Example(), Writables.strings());

Now that’s what I call cool.

And that’s how you read from Orc into Crunch! Next time, we’ll cover… something else! Comments, feedback, and suggestions welcome in the comments. Thanks for reading!

<< Previous Tutorial  |  Next Tutorial >>


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s