How to Run a Jar in Oozie with Java Actions


You probably know how jars work. Jars, short for Java Archives, are zipped-up packages of Java class files, with or without their dependencies included. In most cases a jar holds just your application code, while its dependencies live elsewhere and are added to the classpath. While we’ll cover that topic another day, let’s focus on the task at hand: getting your jar running in Oozie.

Wait, what’s Oozie? Oozie is a workflow scheduler built for Hadoop. It lets you define an action or series of actions as a DAG (directed acyclic graph) workflow that runs on Hadoop. You can even automate it to run at a given frequency, such as daily, hourly or even more often! But again, that’s not what we’re focusing on today. And really, this article assumes you have some basic understanding of Oozie, or at the very least XML tags (since that’s what Oozie workflows are made of), so go learn a little bit about Oozie before flying into this.

But if you’re ready to learn how to get away from this:

hadoop jar myJar com.company.myJar /tmp/input.csv /tmp/output/

…and learn to let Oozie do it for you, let’s learn, friend.

An Oozie workflow consists of a series of actions connected in a DAG: each action names the node to run next when it succeeds and the node to go to when it fails. For this example, we’ll keep it to one action, and the one we need for running jars: a Java Action. The Java Action, like Oozie’s other built-in actions, exists for an explicit use: running Java code in the form of a compiled jar, by calling the main method of a class you name. That action looks like this:

Parameterized Java Action:

<action name="java-crunch-action">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <delete path="${outputPath}"/>
        </prepare>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <main-class>${myJavaClass}</main-class>
        <arg>${inputPath}</arg>
        <arg>${outputPath}</arg>
        <file>/apps/projectName/crunch.properties</file>
        <capture-output/>
    </java>
    <ok to="end"/>
    <error to="kill"/>
</action>

Non-Parameterized Java Action:

<action name="java-crunch-action">
    <java>
        <job-tracker>cluster01.mycompany.com:8050</job-tracker>
        <name-node>hdfs://cluster01.mycompany.com:8020</name-node>
        <prepare>
            <delete path="/tmp/output/"/>
        </prepare>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>batch</value>
            </property>
        </configuration>
        <main-class>com.company.myJar</main-class>
        <arg>/tmp/input.csv</arg>
        <arg>/tmp/output/</arg>
        <file>/apps/projectName/code.properties</file>
        <capture-output/>
    </java>
    <ok to="end"/>
    <error to="kill"/>
</action>
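
The action above points at nodes named end and kill, so it only works inside a workflow that defines them. Here is a minimal sketch of a complete workflow.xml built around the parameterized action. The schema version (uri:oozie:workflow:0.4) and the workflow name java-crunch-wf are my assumptions for illustration, so check which schema versions your Oozie release supports.

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="java-crunch-wf">
    <!-- Every workflow starts at exactly one node -->
    <start to="java-crunch-action"/>

    <action name="java-crunch-action">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>${myJavaClass}</main-class>
            <arg>${inputPath}</arg>
            <arg>${outputPath}</arg>
        </java>
        <!-- Success goes to the end node, failure to the kill node -->
        <ok to="end"/>
        <error to="kill"/>
    </action>

    <!-- Target of the error transition: stops the workflow with a message -->
    <kill name="kill">
        <message>Java action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <!-- Target of the ok transition: the workflow finishes successfully -->
    <end name="end"/>
</workflow-app>
```

The optional tags (prepare, configuration, file, capture-output) are left out here only to keep the sketch short; they slot into the java element in the order shown in the examples above.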

Code Breakdown

<java> Required. Marks the start of an Oozie Java action. All the child tags below go inside it.
<job-tracker> Required. The host and port of your cluster’s job tracker. Eg: cluster01.mycompany.com:8050
<name-node> Required. The URI of your cluster’s namenode. Eg: hdfs://cluster01.mycompany.com:8020
<prepare> Optional. HDFS cleanup to run before the jar is called; it accepts <delete> and <mkdir> tags. Eg: a <delete> tag removes an HDFS path so it is clear for the jar’s output.
<configuration> Optional. Hadoop/engine properties go here (the ones you’d normally pass with -D on the command line with “hadoop jar …”). Eg: mapred.job.queue.name=batch
<main-class> Required. The fully qualified name (package plus class) of the class whose main method you want to call. Eg: com.mycompany.bigdata.myJar
<arg> Optional. Command line arguments, one per tag, passed to main in the order listed. Input, output, anything you’d normally give on the command line that your code accepts or expects.
<file> Optional. The full HDFS path of a file you want made available in the action’s working directory at runtime. Property files are a great example.
<capture-output/> Optional. Tells Oozie to capture values your program writes, in Java properties format, to the file named by the oozie.action.output.properties system property. Later workflow nodes can read those values with the wf:actionData EL function. It does not capture stdout or stderr; those end up in the Hadoop logs of the job that launches your class.
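
To show what <capture-output/> buys you, here is a rough sketch of a decision node that branches on a captured value. It assumes the Java program wrote a properties file to the path given by the oozie.action.output.properties system property, and that the file contains a key called outputReady. That key name is hypothetical, made up for this example.

```xml
<!-- Runs after java-crunch-action; point the action's <ok to="..."/> here -->
<decision name="check-output">
    <switch>
        <!-- wf:actionData returns the captured key/value pairs of the named action -->
        <case to="end">${wf:actionData('java-crunch-action')['outputReady'] eq 'true'}</case>
        <!-- Anything else is treated as a failure -->
        <default to="kill"/>
    </switch>
</decision>
```

For this to kick in, the action’s ok transition would point at check-output instead of end.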

Setup

Your jar needs to sit in HDFS next to your workflow.xml. So if your workflow.xml is in /tmp/myJob/oozie/, then your jar goes in /tmp/myJob/oozie/lib/. This is the simplest arrangement (to my knowledge, the main alternative is pointing the oozie.libpath job property at another HDFS directory of jars). When you run the workflow from that folder, Oozie automatically looks for a ‘lib’ folder in the same directory and puts the jars it finds there on the action’s classpath. Have dependencies? Put them in here too, and Oozie will ship them to the cluster along with your jar at runtime. And you don’t have to worry about a classpath! So great.
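
The parameterized action gets its values from the job configuration you submit along with the workflow. Most tutorials use a job.properties file for this; as far as I know, the Oozie command line also accepts the same settings as a Hadoop-style XML file, which is what this sketch shows. The values are the example ones from earlier in this post, and the file name job.xml is just my choice.

```xml
<configuration>
    <!-- Where the workflow.xml (and its lib/ folder) lives in HDFS -->
    <property>
        <name>oozie.wf.application.path</name>
        <value>hdfs://cluster01.mycompany.com:8020/tmp/myJob/oozie</value>
    </property>
    <!-- Values substituted into ${...} placeholders in the workflow -->
    <property>
        <name>jobTracker</name>
        <value>cluster01.mycompany.com:8050</value>
    </property>
    <property>
        <name>nameNode</name>
        <value>hdfs://cluster01.mycompany.com:8020</value>
    </property>
    <property>
        <name>queueName</name>
        <value>batch</value>
    </property>
    <property>
        <name>myJavaClass</name>
        <value>com.company.myJar</value>
    </property>
    <property>
        <name>inputPath</name>
        <value>/tmp/input.csv</value>
    </property>
    <property>
        <name>outputPath</name>
        <value>/tmp/output/</value>
    </property>
</configuration>
```

You would hand this file to the oozie job command with its -config option (plus -run to start the workflow), the same way you would a job.properties file.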

Follow any old Google tutorial to submit and run an Oozie workflow with the above Java action, and see how your results are! Let me know!

-Landon
