Guide to Apache Falcon #5: Tighter Oozie Integration


This series is designed to be the ultimate guide on Apache Falcon, a data governance pipeline for Hadoop. Falcon excels at giving you control over workflow scheduling, data retention and replication, and data lineage. This guide will (hopefully) excel at helping you understand and use Falcon effectively.

Welcome back, friends, to the fifth installment of the Hadoopsters Guide to Apache Falcon. Previously, we introduced you to Falcon Process entities. In this fifth guide, we’ll dive deeper into Processes and how they can be integrated more effectively with Oozie.

It’s actually not that scary, so let’s jump in!

Reviewing the Process Entity

Below is the basic process entity (definition) that we wrote in the previous tutorial.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="my-example-process" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="my-primary-cluster">
      <validity start="2015-05-14T23:00Z" end="2099-03-10T23:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>days(1)</frequency>
  <timezone>UTC</timezone>
  <outputs>
    <output name="my-example-feed" feed="my-example-feed" instance="now(0,0)"/>
  </outputs>
  <workflow name="my-example-workflow" version="2.0.0" engine="oozie" path="/tmp/my-example-workflow/"/>
  <retry policy="periodic" delay="minutes(15)" attempts="3"/>
  <ACL owner="someUnixUser" group="someUnixGroup" permission="0755"/>
</process>

But what if I told you we could do more? Much more!

Enhanced Process Entity

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="my-example-process" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="my-primary-cluster">
      <validity start="2015-05-14T23:00Z" end="2099-03-10T23:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>days(1)</frequency>
  <timezone>UTC</timezone>
  <outputs>
    <output name="my-example-feed" feed="my-example-feed" instance="now(0,0)"/>
  </outputs>
  <properties>
    <property name="workflowName" value="my-example-workflow"/>
    <property name="host" value="someUser@clustername002"/>
    <property name="sshCommand" value="/path/to/my/script/cool_script.sh"/>
    <property name="inputFile" value="data.csv"/>
    <property name="outputPath" value="/path/to/store/output/"/>
  </properties>
  <workflow name="my-example-workflow" version="2.0.0" engine="oozie" path="/tmp/my-example-workflow/"/>
  <retry policy="periodic" delay="minutes(15)" attempts="3"/>
  <ACL owner="someUnixUser" group="someUnixGroup" permission="0755"/>
</process>

So what exactly did we do? We added a chunk of XML that starts after the </outputs> end tag and ends before the <workflow> tag: a <properties> tag containing five (5) <property> tags, each holding a key/value pair.

  • workflowName = my-example-workflow
  • host = someUser@clustername002
  • sshCommand = /path/to/my/script/cool_script.sh
  • inputFile = data.csv
  • outputPath = /path/to/store/output/

Do these look familiar? They should: we hard-coded these exact values in our Oozie workflow in the previous tutorial on Process Entities. Now they’ve appeared in our Falcon Process Entity. Why is that?

Because we’re using the Falcon process entity, rather than the Oozie workflow, as a property file of sorts. That gives us greater control and flexibility over the properties passed to Oozie and, beyond that, over the values handed to the actions and scripts in that Oozie workflow.

But why is this better? Why is the control greater? Two reasons:

  • Oozie workflows have to be stored in HDFS before use, so changing a hard-coded value means editing a file that lives in HDFS
  • We already keep so many other pipeline properties here (e.g. frequency, start time, name, path to workflow), so why not keep them all together and change them in one place?

So, in short, putting our properties in Falcon instead of Oozie lets our workflow.xml stay in HDFS, rather than being pulled out, edited, and put back every time a value changes. It also lets us use Falcon’s update feature to change the details of an entity (including these <property> tags) without harming or halting Falcon jobs that are already scheduled (but we’ll cover that in a later tutorial).
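As a rough sketch of what such a change looks like, suppose the script should start reading a different input file. Only one value in the process entity changes, and workflow.xml is not touched. The file name data_v2.csv is made up for this illustration.

```xml
<!-- Fragment of the Falcon process entity. Only the inputFile value differs
     from the enhanced entity above; data_v2.csv is a hypothetical file name. -->
<properties>
  <property name="workflowName" value="my-example-workflow"/>
  <property name="host" value="someUser@clustername002"/>
  <property name="sshCommand" value="/path/to/my/script/cool_script.sh"/>
  <property name="inputFile" value="data_v2.csv"/>
  <property name="outputPath" value="/path/to/store/output/"/>
</properties>
```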

Now… we’ve changed our Falcon process to feature properties. What do we have to change in our Oozie workflow (this one time, anyway)? Here’s the standard workflow.xml from the last tutorial:

Reviewing the Oozie Workflow

<workflow-app name="my-example-workflow" xmlns="uri:oozie:workflow:0.1">
  <start to="example-action"/>
  <action name="example-action">
    <ssh xmlns="uri:oozie:ssh-action:0.1">
      <host>someUser@clustername002</host>
      <command>/path/to/my/script/cool_script.sh</command>
      <args>data.csv</args>
      <args>/path/to/store/output/</args>
      <capture-output/>
    </ssh>
    <ok to="end"/>
    <error to="kill"/>
  </action>

  <kill name="kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

And here it is, enhanced. 

Enhanced Oozie Workflow

<workflow-app name="${workflowName}" xmlns="uri:oozie:workflow:0.1">
  <start to="example-action"/>
  <action name="example-action">
    <ssh xmlns="uri:oozie:ssh-action:0.1">
      <host>${host}</host>
      <command>${sshCommand}</command>
      <args>${inputFile}</args>
      <args>${outputPath}</args>
      <capture-output/>
    </ssh>
    <ok to="end"/>
    <error to="kill"/>
  </action>

  <kill name="kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

See what we did? We replaced our hard-coded values with variable references. More specifically, these variables represent the properties written in Falcon’s process entity, which are passed to the Oozie workflow as named variables. So instead of writing, say, data.csv in our first <args> tag, we wrote it in Falcon as the inputFile property and referenced it as ${inputFile}.

Pretty neat, huh?

With this approach, we can keep the Oozie workflow generic and give Falcon control over the arguments, parameters, and other details of an action. Nearly everything in the Oozie workflow can be replaced by a named ${variable}, so have fun with it and try stuff out!
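For instance, here is a minimal sketch of passing one more value through the same mechanism. The property name logLevel and its value are hypothetical, and the sketch assumes cool_script.sh accepts a third argument. Neither comes from the earlier tutorials.

```xml
<!-- In the Falcon process entity: one extra (hypothetical) property
     added inside the existing <properties> block. -->
<property name="logLevel" value="INFO"/>

<!-- In the Oozie workflow.xml: one extra argument in the existing ssh
     action, placed after the two existing <args> lines. -->
<args>${logLevel}</args>
```

The name in the <property> tag and the name inside ${...} have to match exactly, the same way host, sshCommand, inputFile, and outputPath do.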

Next time we’ll talk about trigger inputs for processes. Post any questions, feedback or topic suggestions in the comments!
