Guide to Apache Falcon #4: Process Entity Definitions


This series is designed to be the ultimate guide to Apache Falcon, a data governance and pipeline management tool for Hadoop. Falcon excels at giving you control over workflow scheduling, data retention and replication, and data lineage. This guide will (hopefully) excel at helping you understand and use Falcon effectively.

Welcome back, friends, to the fourth installment of the Hadoopsters Guide to Apache Falcon. Previously, we introduced you to Falcon Feed entities. This time, we'll dive into Processes, the last of the three entity types, and showcase how they work and are defined.

This tutorial will extensively cover how to define and submit a process entity to Falcon. It's very useful if you want to ingest, process, analyze, or otherwise act on data in your cluster as part of your workflow pipeline. In fact, Falcon will do just about anything you tell it to: run a bash/shell or Pig script, execute a Hive query, whatever!

It’s all about the Process and what you put in it. In other words, you reap what you sow, farmer.

Defining & Submitting a Process Entity

Below is what a basic process entity/definition looks like. It represents a process that:

  • executes daily at 23:00 UTC on my-primary-cluster, starting on May 14, 2015 and continuing until 2099-03-10 (basically forever)
  • calls an Oozie Workflow stored in HDFS each time it executes
  • relies on that Oozie workflow's action(s), which are truly where we tell our workflow what to do
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="my-example-process" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="my-primary-cluster">
      <validity start="2015-05-14T23:00Z" end="2099-03-10T23:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>days(1)</frequency>
  <timezone>UTC</timezone>
  <outputs>
    <output name="my-example-feed" feed="my-example-feed" instance="now(0,0)"/>
  </outputs>
  <workflow name="my-example-workflow" version="2.0.0" engine="oozie" path="/tmp/my-example-workflow/"/>
  <retry policy="periodic" delay="minutes(15)" attempts="3"/>
  <ACL owner="someUnixUser" group="someUnixGroup" permission="0755"/>
</process>

Alright, so that's the code. Let's, as usual, break it all down. The first line is standard XML boilerplate that you're likely familiar with; the table below breaks down the rest of the fundamental code:

<process> Required. Indicates the start of a process definition. The parent tag.
name Required. The name of your process; it must be unique.
<frequency> Required. Tells Falcon how often to run this process. Minutes, hours, days, and months are allowed.

Example: hours(5) to run every five hours, days(7) to run once a week, and months(1) to run once a month.

<timezone> Optional. Tells Falcon which timezone the validity start and end times correspond to. Falcon defaults to UTC.
<clusters> Required. Contains all associated clusters for this process.
<cluster> Required. Determines which cluster this process should execute on.
name Required. Attribute of the cluster tag, where you give a cluster entity name. Must reference a cluster entity that is already registered in Falcon.
<validity> Required. Combined with the frequency to determine the window of time in which a Falcon process can execute.
start, end Required. Attributes of the validity tag, where you give a start time and an end time for the process. The process can start at the given start time and will continue at the given <frequency> until the end time arrives. The time of day in the start value also determines what time of day the job runs.
<order> Required. Tells Falcon, in the scenario where it can run parallel instances of this process, the order in which to execute them. Default is FIFO (First In First Out).
<parallel> Required. Tells Falcon, in the scenario where it can run parallel instances of a process, how many it is allowed to spawn. Default is 1, meaning only one instance of this process may run at a time.
<outputs> Optional. Used to link a process to one or more feeds. <output> tags go within this parent tag.
<workflow> Required. Designates the engine and workflow to use for this process. An Oozie workflow is where the actions, or "verbs," of the process are really defined. This tag has 4 attributes: name (a custom name you can give), engine (default is oozie), version (your workflow's version), and path (the HDFS path to the Oozie workflow you'll build later in this tutorial).
<retry> Required. Falcon has a built-in retry feature that will retry this process upon failure, based on a policy (default is periodic), a delay between attempts (I like 15 minutes), and a maximum number of attempts.
<ACL> Required. The ACL (Access Control List) tag is meant for implementing permission requirements, providing a way to set different permissions for specific users or named groups. It isn't well implemented right now.

There are many other tags we’ll use down the line that are very useful in Falcon, such as inputs! But for now, this is enough to digest, for sure. Let’s keep truckin’.
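For the curious, here's a quick, hedged preview of what an <inputs> block can look like. The feed name below is purely illustrative, and it assumes that feed is already registered in Falcon; the start/end expressions select which feed instances the process consumes:

<!-- A minimal sketch of a process inputs block; my-input-feed is a hypothetical, already-registered feed entity. -->
<inputs>
  <!-- Consume today's instance of the feed as this process's input. -->
  <input name="my-example-input" feed="my-input-feed" start="today(0,0)" end="today(0,0)"/>
</inputs>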

Integration with Oozie

Falcon leverages Oozie, an enterprise-grade workflow scheduler for Hadoop jobs. What does that mean? In a nutshell, Oozie is the service and framework by which you can build and schedule workflows for execution on Hadoop. These Oozie Workflows, which at the basic level are just XML (yes, more XML), allow you to execute a series of actions on your cluster, be it 1 action or 42 (I actually don't know if there's a limit or not).

Later on we’ll talk about how Falcon and Oozie integrate via the Process XML (though you might have noticed some hints in the <workflow> tag in the process entity definition above), but for now let’s go over what an Oozie workflow.xml looks like and get to running our process.

Definition of an Oozie Workflow

An Oozie workflow (the file should be named workflow.xml) looks like this. This workflow:

  • will SSH to clustername002 as someUser and execute cool_script.sh from /path/to/my/script/ on the local Linux filesystem
  • uses data.csv as its input (argument 1)
  • uses /path/to/store/output/ (argument 2) as the place to put output data
  • runs a script that is 100% imaginary; I'm just making a point
<workflow-app name="my-example-workflow" xmlns="uri:oozie:workflow:0.1">
  <start to="example-action"/>

  <action name="example-action">
    <ssh xmlns="uri:oozie:ssh-action:0.1">
      <host>someUser@clustername002</host>
      <command>/path/to/my/script/cool_script.sh</command>
      <args>data.csv</args>
      <args>/path/to/store/output/</args>
      <capture-output/>
    </ssh>
    <ok to="end"/>
    <error to="kill"/>
  </action>

  <kill name="kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>

  <end name="end"/>
</workflow-app>

Okay, now let’s break that down some:

<workflow-app> Required. Indicates the start of an Oozie workflow definition. The parent tag.
<start> Required. Its to attribute designates which action Oozie should run first. That action (which must be uniquely named) is defined later in the file.
<action> Required. Indicates the start of an action definition. Its name must be unique within the workflow.
<ssh> This is an SSH action. It takes host, command, and args tags; it will SSH to the given host as the given user and execute the command or script it's told to, applying the given arguments. There are many other action types, including but not limited to: Shell, Hive, Pig, Java, and MapReduce.
<capture-output> Optional. Captures the stdout of your command so the workflow can reference it later. Highly, highly recommended.
<ok> Required. Its to attribute determines where the action goes next (either another action or an Oozie construct) once it completes successfully. Point it at "end" if no other action is needed, or at another action if more work is required.
<error> Required. Its to attribute determines where the action goes next (either another action or an Oozie construct) if it FAILS or encounters an error. Recommended value is "kill".
<kill> Required. Kills the Oozie workflow when reached, and reports a message (typically the reason for failure).
<end> Required. Ends the Oozie workflow when reached. This is the natural endpoint for all successfully completed workflows.
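Since the table mentions other action types, here's a hedged sketch of what a Shell action could look like in place of the SSH action. It's a sketch under assumptions: ${jobTracker} and ${nameNode} are standard Oozie parameters you'd supply via job properties, and the script location is hypothetical:

<!-- A minimal Shell action sketch; ${jobTracker}, ${nameNode}, and the script path are illustrative assumptions. -->
<action name="example-shell-action">
  <shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>cool_script.sh</exec>
    <argument>data.csv</argument>
    <argument>/path/to/store/output/</argument>
    <!-- Ship the script from HDFS to the task's working directory. -->
    <file>/tmp/my-example-workflow/cool_script.sh#cool_script.sh</file>
    <capture-output/>
  </shell>
  <ok to="end"/>
  <error to="kill"/>
</action>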


So this is really cool and useful, BUT… we can make it better, on both the Falcon and Oozie side. What do I mean? I mean all this hardcoding of scripts and arguments… look at that silliness! We can actually use Falcon as a living properties file that we can change at our leisure, passing values (host, command, args, etc.) to Oozie dynamically. After all, making changes to an Oozie workflow is a tad more of a pain, since the file has to be replaced in HDFS every time you change it. We won't dig into that in today's post, but next time we will! For now, the above will do just fine to get the point across, though there's a small taste of the idea just below.
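As that quick taste: a Falcon process definition can carry a <properties> block, and those name/value pairs get passed along to the Oozie workflow, where you can reference them as ${variables}. This is just a hedged sketch; the property names are purely illustrative, and we'll treat the topic properly next time.

In the process XML:

<!-- Hypothetical properties passed from Falcon to the Oozie workflow. -->
<properties>
  <property name="sshHost" value="someUser@clustername002"/>
  <property name="scriptPath" value="/path/to/my/script/cool_script.sh"/>
</properties>

And in the workflow XML:

<host>${sshHost}</host>
<command>${scriptPath}</command>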

Now, an Oozie workflow must live in HDFS. That’s just the rule. So let’s pick a good place and put it there. For now, let’s go with /tmp/my-example-workflow/:

hadoop fs -put workflow.xml /tmp/my-example-workflow/
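If the put complains that the target directory doesn't exist, create it first and try again; verifying that the file landed never hurts either. These are standard HDFS commands:

hadoop fs -mkdir -p /tmp/my-example-workflow/
hadoop fs -ls /tmp/my-example-workflow/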

So, just to recap, here’s what we’ve done:

  • we defined a Falcon Process, which links to an Oozie workflow stored in HDFS
  • we defined an Oozie workflow, and placed it in HDFS (fulfilling the prophecy of the above command)

Alright, it's time to submit your process (drum roll please)! You can submit it just like you submitted your Cluster and Feed entities in the previous tutorials…

Submitting a Process to Falcon

Submitting entities is how you register cluster, feed, and process entities with Falcon. Submitting is simply handing the XML entity to Falcon via a command line action. That looks like this:

falcon entity -type process -submit -file my-example-process.xml

That's it. Just be in the directory where my-example-process.xml lives (most likely on an edge node of your Hadoop cluster), where you can call Falcon via the above 'falcon' command, and it should submit without issue. For a more extensive breakdown of this command, visit the previous tutorial on Cluster entities.

Pretty simple, right?

If you have Falcon already installed and humming, go ahead and try to submit your process. If everything went fine, you shouldn’t see any returns or errors. If you do get an error, post it below in a comment and we’ll see if we can work it out.

If it did work, let’s ensure it’s actually there with a simple check:

falcon entity -type process -list

You should see an output akin to this:

(PROCESS) my-example-process

If the command went through, you have a process entity in Falcon! Yay! But we still have to schedule it. Submitting it just got it into Falcon's store; now we have to tell Falcon to put it on the schedule (you did give it validity start and end times, after all)!
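If you'd like to double-check exactly what Falcon stored, the Falcon CLI can (in the versions I've used) print the registered definition back out:

falcon entity -type process -name my-example-process -definition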

Scheduling a Process in Falcon

falcon entity -type process -schedule -name my-example-process

Try the above command. Did it go great? Did it say it was scheduled successfully? Awesome! If it returned an error, post it in the comments below — we’ll try to help out.

If all went well, you've successfully set up and scheduled a process. In fact, it's probably running RIGHT NOW! You can view that a few different ways:

  • the Falcon dashboard (not very extensive right now)
  • the Hue Oozie viewer (best option)

I’d recommend using Hue to view the results and output of your Falcon process. It’s an effective tool that’s included in many stacks of Hadoop, and shows your process (and by association your oozie workflow) progress live in a sleek UI.
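You can also poll the status of process instances from the command line. A hedged sketch, using the first day of our validity window as the query range:

falcon instance -type process -name my-example-process -status -start 2015-05-14T23:00Z -end 2015-05-15T23:00Z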

Now, in practice, it's common to submit and schedule your entities at the same time, but you don't have to. Remember, submitting and scheduling feeds and processes in Falcon is how you get them set to go, but they won't actually go until their time has come (remember the validity start and end times in the definitions). When that time arrives, they'll start and do as you asked.
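For that submit-and-schedule-in-one-step case, the Falcon CLI offers a combined option; a quick sketch using our example file:

falcon entity -type process -submitAndSchedule -file my-example-process.xml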

And remember — processes act, and feeds replicate and retain. It’s because of this that a feed serves as an “output” for a process. It’s not actually linked to anything a process can physically emit, it’s just nomenclature in Falcon to indicate that you expect data from your process to arrive and live there — which allows Falcon to replicate and retain it for you! Make sense?

I say Feed you say Replicate and Retain!
“Feed!”
“Replicate and Retain!”

I say Process you say do stuff!
“Process!”
“Do Stuff!”

Awesome.

Now, that was a LOT. Let’s recap what all we did:

  • We defined a process XML
  • We defined an Oozie workflow and placed it in HDFS
  • We submitted the process to Falcon
  • We checked that the process exists in Falcon's store
  • We scheduled the process in Falcon

Tune in next time, when we'll get into the niftier elements and features of processes, like triggers and inputs, properties, and more. Post any questions, feedback, or topic suggestions in the comments!

<< Previous Tutorial | Next Tutorial >>


