Apache Crunch Tutorial 8: Writing to Orc File Format

This post is the eighth in a hopefully substantive and informative series of posts about Apache Crunch, a framework that makes it easier for Java developers to write Map-Reduce programs for Hadoop.

In my previous (seventh) tutorial on Apache Crunch, we covered leveraging scaleFactor, and how it can help when your data grows or shrinks substantially during processing. Today, let’s talk about writing to the Apache Orc file format on Hadoop in Crunch, using built-in functionality.

Crunch can write to a few common Hadoop formats: text, Avro, and Orc are the three big ones, and it can write each of them straight to HDFS pretty easily. In my experience, text is sufficient for most folks when it comes to writing out their data. However, some would like to leverage a deeper, columnar format like Orc, which has several bonuses like compression and faster reads (while taking a minor hit on write speed).
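
For reference, writing those other formats looks very similar to what we’ll do here. A quick sketch, assuming you already have a pipeline and some PCollections in hand (the paths and collection names below are placeholders, not part of this tutorial):

//Sketch only: the To factory class (org.apache.crunch.io.To) provides
//targets for the other common formats. 'someTextData' and 'someAvroRecords'
//are hypothetical PCollections.
pipeline.write(someTextData, To.textFile("/data/out/as_text"));    //plain text target
pipeline.write(someAvroRecords, To.avroFile("/data/out/as_avro")); //Avro file target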

Today, I’ll show you how I’m able to convert text data to Orc format and write it to HDFS with Crunch — no plugins needed.

Orc, Explained

You can learn more about Orc (and my opinions on it) in this earlier post on Hadoopsters. That post will also tell you how to leverage Orc in Hive (a useful aside to this post). Go check it out for more info; for now, we’ll operate on the assumption that you want to go ahead and learn how to write Orc using Crunch.

Let’s do it.

Writing Orc in Crunch

Let’s import some stuff:

import org.apache.crunch.TupleN;
import org.apache.crunch.io.orc.OrcFileTarget;
import org.apache.crunch.types.orc.Orcs;
import org.apache.crunch.types.writable.Writables;

Let’s create a pipeline and an OrcFileTarget. Crunch uses the concept of “targets” to write files of a certain type to a designated path.

//Create a Map-Reduce pipeline
Pipeline pipeline = new MRPipeline(MyClass.class, "Crunch Pipeline", getConf());

//Create a target linked to an HDFS path 'outputPath'
OrcFileTarget target = new OrcFileTarget(new Path(outputPath));

Great! Now you have a target that can write out an Orc file. But how do we form some Orc data to write out? The easiest way is to write a DoFn that converts your data into Orc records, which Crunch represents as tuples (specifically, TupleN).
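
First, though, we need our raw text as a PCollection of strings. Assuming your comma-delimited input lives at an HDFS path held in a String called inputPath (a placeholder name), Crunch can read it in one line:

//Read newline-delimited text from HDFS into a PCollection of Strings
PCollection<String> my_data = pipeline.readTextFile(inputPath);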

See below an example of a dataset composed of 9 columns (strings and ints) that’s in a PCollection called my_data, which we’ll convert to an equivalent PCollection called my_data_as_orc:

PCollection<TupleN> my_data_as_orc = my_data.parallelDo(DoFn_ProduceOrcRecords(), Orcs.tuples(
        Writables.ints(), //column1
        Writables.ints(), //column2
        Writables.strings(), //column3
        Writables.strings(), //column4
        Writables.strings(), //column5
        Writables.ints(), //column6
        Writables.ints(), //column7
        Writables.ints(), //column8        
        Writables.ints(), //column9
));

Since this calls a custom DoFn (called DoFn_ProduceOrcRecords), we need to go ahead and create that DoFn. Note that the PTypes passed to Orcs.tuples() above must match, in order, the types the DoFn emits. Here it is:

//Define a DoFn that takes in a String and returns a TupleN object
static DoFn<String, TupleN> DoFn_ProduceOrcRecords() {
    return new DoFn<String, TupleN>() {
        @Override
        public void process(String input, Emitter<TupleN> emitter) {
            //Split the comma-delimited line into its 9 fields
            String[] inputParts = input.split(",");

            String col_1 = inputParts[0];
            String col_2 = inputParts[1];
            String col_3 = inputParts[2];
            String col_4 = inputParts[3];
            String col_5 = inputParts[4];
            String col_6 = inputParts[5];
            String col_7 = inputParts[6];
            String col_8 = inputParts[7];
            String col_9 = inputParts[8];

            //Emit a TupleN whose field types line up, in order,
            //with the PTypes given to Orcs.tuples() earlier
            emitter.emit(new TupleN(
                    Integer.parseInt(col_1),
                    Integer.parseInt(col_2),
                    col_3,
                    col_4,
                    col_5,
                    Integer.parseInt(col_6),
                    Integer.parseInt(col_7),
                    Integer.parseInt(col_8),
                    Integer.parseInt(col_9)));
        }
    };
}
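
As an aside, you can sanity-check a DoFn like this locally, without a cluster, using Crunch’s in-memory pipeline. A minimal sketch, assuming MemPipeline from crunch-core and a fabricated sample row that matches our 9-column layout:

//Sketch: run the DoFn in-memory to verify the parsing logic
//(MemPipeline lives in org.apache.crunch.impl.mem)
PCollection<String> testLines = MemPipeline.typedCollectionOf(
        Writables.strings(),
        "3,102,SM0044,F,0,30,0,0,0"); //made-up test row

PCollection<TupleN> testOrc = testLines.parallelDo(DoFn_ProduceOrcRecords(),
        Orcs.tuples(
                Writables.ints(), Writables.ints(),
                Writables.strings(), Writables.strings(), Writables.strings(),
                Writables.ints(), Writables.ints(),
                Writables.ints(), Writables.ints()));

//Materializing forces evaluation and hands back the records
for (TupleN tuple : testOrc.materialize()) {
    System.out.println(tuple);
}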

Now that we have a way to convert our data into a PCollection of tuples (the form Orc records take in Crunch), let’s write it to a path! We’ll go ahead and tell it to overwrite what’s there, if anything (there are other write modes, like append):

pipeline.write(my_data_as_orc, target, Target.WriteMode.OVERWRITE);
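
One gotcha worth noting: Crunch plans its jobs lazily, so nothing actually executes until you run the pipeline. Don’t forget to finish with:

//Kick off the planned job(s) and block until they complete
pipeline.done();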

So what does the output look like in a plain file viewer? Not much, of course; Orc is a binary, columnar format, so the raw bytes aren’t human-readable.

But in Hive… we have data! Check out the tutorial on Orc in Hive here on Hadoopsters to learn why and how this works (hint: it uses an Orc-backed table).

3	102	SM0044	F	0	30	0	0	0	2014-08-01
3	102	SM0044	F	0	45	0	0	0	2014-08-01
3	102	SM0044	F	1	0	0	0	0	2014-08-01
3	102	SM0044	F	1	15	0	0	0	2014-08-01
3	102	SM0044	F	2	30	0	0	0	2014-08-01
3	102	SM0044	F	2	45	0	0	0	2014-08-01
3	102	SM0044	F	3	0	0	0	0	2014-08-01
3	102	SM0044	F	3	15	0	0	0	2014-08-01
3	102	SM0044	F	4	30	0	0	0	2014-08-01
3	102	SM0044	F	4	45	0	0	0	2014-08-01

Breakdown

Alright, so we did a lot there. Let’s break it down.

  • We imported the necessary Orc packages to support OrcFileTarget and the Orc types.
  • We set up an OrcFileTarget mapped to the HDFS path where we’ll write the file.
  • We wrote a DoFn that converts our text records to Orc tuples (TupleN), and called it via parallelDo.
  • We wrote the resulting Orc PCollection to the target we set up earlier.

And that’s how you write to Orc in Crunch! You can follow similar methods for writing Avro files, too. Next time, we’ll cover… something else! Comments, feedback, and suggestions are welcome below. Thanks for reading!

<< Previous Tutorial  |  Next Tutorial >>
