Apache Crunch Tutorial #6: Configurations on a Per Job Basis


This post is the sixth in a hopefully substantive and informative series of posts about Apache Crunch, a framework for enabling Java developers to write Map-Reduce programs more easily for Hadoop.

In my previous and fifth tutorial on Apache Crunch, we covered the super cool and functional topic of Hadoop configurations, and how Crunch (for the most part) honors them. But that covered how to set global settings for an entire pipeline of Map-Reduce jobs that are fired off from Crunch – today we’re going to address the more fun elephant in the room: setting custom configuration options on a per-job basis.

Job by Job Hadoop Configurations

It’s actually pretty simple if you already know how to write a DoFn. You know how a DoFn is just a class with a process method inside? To pass custom settings, much as you did for the pipeline, all you have to do is add one more method: configure().

 
    @Override
    public void configure(Configuration conf) {
        conf.set("mapred.job.queue.name", "batch");
    }

But wait, where does this cool little function go? Inside your DoFn class, and right after your process method:

 
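    // Needs: org.apache.crunch.DoFn, org.apache.crunch.Emitter, org.apache.hadoop.conf.Configuration
    static DoFn<String, CustomRecord> DoFn_CreateCustomRecords() {
        return new DoFn<String, CustomRecord>() {
            @Override
            public void process(String input, Emitter<CustomRecord> emitter) {
                // toCustomRecord is a placeholder for your own String-to-CustomRecord conversion
                emitter.emit(toCustomRecord(input));
            }

            // Applied to the Map-Reduce job that runs this DoFn
            @Override
            public void configure(Configuration conf) {
                conf.set("mapred.job.queue.name", "batch");
            }
        };
    }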

It’s important to mark this method with @Override so the compiler checks that its signature matches the configure(Configuration) method on DoFn; fittingly, it is also how we, well, override the settings established by the pipeline. With that said, what kind of settings can we set for this particular job? Refer to the previous tutorial for some additional examples, but you can set any Map-Reduce parameter you know of here, since this job is just another Map-Reduce job anyway. You can set the queue name, map-side compression codecs, or whatever!
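To make the "override" idea concrete, here is a minimal sketch in plain Java that models the layering with ordinary maps. It is not Crunch's implementation: the class name PerJobConfigSketch and the method effectiveConfig are made up for illustration, and it assumes that each job starts from its own copy of the pipeline-wide settings before a configure()-style hook gets to change them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for the pipeline/job relationship; not part of Crunch or Hadoop.
public class PerJobConfigSketch {

    // Copy the pipeline-wide settings, then let one job's hook adjust only that copy.
    public static Map<String, String> effectiveConfig(Map<String, String> pipelineConf,
                                                      Consumer<Map<String, String>> configureHook) {
        Map<String, String> jobConf = new HashMap<>(pipelineConf); // assumed per-job copy
        configureHook.accept(jobConf);                             // plays the role of configure()
        return jobConf;
    }

    public static void main(String[] args) {
        Map<String, String> pipelineConf = new HashMap<>();
        pipelineConf.put("mapred.job.queue.name", "default");
        pipelineConf.put("mapred.compress.map.output", "false");

        // One job overrides the queue name; the other leaves everything alone.
        Map<String, String> batchJob = effectiveConfig(pipelineConf,
                conf -> conf.put("mapred.job.queue.name", "batch"));
        Map<String, String> untouchedJob = effectiveConfig(pipelineConf, conf -> { });

        System.out.println(batchJob.get("mapred.job.queue.name"));     // batch
        System.out.println(untouchedJob.get("mapred.job.queue.name")); // default
        System.out.println(pipelineConf.get("mapred.job.queue.name")); // default
    }
}
```

The point of the sketch is that the override is scoped: only the job whose hook ran sees "batch", while the pipeline-wide value and the other job keep "default".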

You can do this for any and all DoFns you wish. Pretty neat, huh?

Thanks for reading today’s tutorial! Hope you can use one or more of these cool configuration options in Crunch. Let me know what you’d like to learn about next!
