Apache Crunch Tutorial #4: Distincts, Materialization, and Objects


This post is the fourth in a hopefully substantive and informative series of posts about Apache Crunch, a framework for enabling Java developers to write Map-Reduce programs more easily for Hadoop.

In my previous and third tutorial on Apache Crunch, we went a little deeper on DoFns and common structures in Crunch. In today’s tutorial, we’re going to touch on some cool and useful features in Crunch like materialization, distincts, and turning data into objects (java classes).


Crunch has a built in class called Distinct, which constructs a new PCollection containing only the unique elements of a given PCollection. Code looks like this:

PCollection<String> names;
PCollection<String> distinctNames = Distinct.distinct(names);

So if the PCollection names contained [‘John’, ‘Billy’, ‘John’, ‘Sally’, ‘Joey’, ‘John’, ‘Johnny’, ‘Julie’, ‘Julie’, ‘Julia’], then the PCollection distinctNames would contain [‘John’, ‘Billy’, ‘Sally’, ‘Joey’, ‘Johnny’, ‘Julie’, ‘Julia’].


In Crunch, you have access to data in a PCollection via two methods: inside of a DoFn as a Map/Reduce job, and by materializing a PCollection (writes to disk). In method one, you can’t control which instances of a DoFn will get specific data you might be after, so it’s not a reliable way to quickly find something like a particular record. Remember, DoFns are just chunks of Java code executed in Map/Reduce, so each one will get a random chunk of your PCollection’s data.

If you want a way to access ALL of the data inside a PCollection inside your main Java code, you simply need to materialize it. This will create an Interable of the data type respective to your PCollection, which you can loop through. See the sample code I’ve written below:

PCollection<String> names = pipeline.readTextFile(inputPath);
for (String record : names.materialize(){

Data as Objects (POJOs)

We’ve learned that we can read data into PCollections as certain types, like Strings, Ints, and more. But what if we wanted a little more depth and flexibility in what we store in PCollections? Turns out, we can have that easily, out of the box! You can store your data as plain old Java objects, with getters, setters, and more!

Check out the code below, which reads data from a text file into a PCollection, uses a custom DoFn to convert those strings into Java objects with getters for ease of use in the program.

PCollection<String> names = pipeline.readTextFile(inputPath);
PCollection<Name> name_records = names.parallelDo(DoFn_CreateNameRecords(), Avros.records(Name.class));

static DoFn<String, Name> DoFn_CreateNameRecords(){
 return new DoFn<String, Name>(){
 public void process(String input, Emitter<Name> emitter){
 String[] input_values = input.split(",");
 emitter.emit(new Name(inputParts[0]));

And the class file itself:

public class Name{
 private String name;

 public Name(String name){
 this.name = name;

 public String getName(){
 return name;


And with that, you now have a PCollection of Name objects as opposed to String objects. One benefit is easier usage in DoFns, since now you pass Name objects with get methods like getName() that can be used in DoFns. Cool.

That’s a Wrap!

Thanks for reading today’s tutorial! Hope you can use one or more of these cool tips in Crunch.

<< Previous Tutorial  |  Next Tutorial >>


3 thoughts on “Apache Crunch Tutorial #4: Distincts, Materialization, and Objects

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s