This series is designed to be the ultimate guide on Apache Falcon, a data governance pipeline for Hadoop. Falcon excels at giving you control over workflow scheduling, data retention and replication, and data lineage. This guide will (hopefully) excel at helping you understand and use Falcon effectively.
Welcome back, friends, to the Hadoopsters Guide to Apache Falcon. Previously, we introduced you to Falcon. In this second entry, we’ll dive deeper into processes, clusters, and feeds, and showcase how they work. Let’s start with a high level overview of a Falcon job/pipeline in execution:
This tutorial assumes you want to create something very similar to this scenario. You might not want replication, though, for example, and I’ll show you how to do that later (though it’s really just as simple as leaving out the feed entity, but still). Right now, let’s focus on how to make the above diagram a reality in our own Falcon installation.
This tutorial will still extensively cover how to define and submit a cluster entity to Falcon, so keep reading regardless of what kind of workflows you want to make! This stuff is still the crux of Falcon, and it’s important to understand it and save yourself the headache later.
Defining & Submitting a Cluster Entity
Ladies and Gentlemen, this is what a cluster entity/definition looks like. Assume that this is our sample cluster that we work on, upon which Falcon is already installed (or will be).
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <cluster name="my-primary-cluster" description="my first cluster entity" colo="Primary Hadoop Cluster: CompanyName" xmlns="uri:falcon:cluster:0.1"> <interfaces> <interface type="readonly" endpoint="hftp://clustername001.mycompany.com:50070" version="2.2.0"/> <interface type="write" endpoint="hdfs://clustername001.mycompany.com:8020" version="2.2.0"/> <interface type="execute" endpoint="clustername003.mycompany.com:8050" version="2.2.0"/> <interface type="workflow" endpoint="http://clustername002.mycompany.com:11000/oozie/" version="4.0.0"/> <interface type="messaging" endpoint="tcp://clustername004.mycompany.com:61616?daemon=true" version="5.1.6"/> </interfaces> <locations> <location name="staging" path="/apps/falcon/my-primary-cluster/staging"/> <location name="temp" path="/tmp"/> <location name="working" path="/apps/falcon/my-primary-cluster/working"/> </locations> <ACL owner="someUnixIDGoesHere" group="someUnixGroupGoesHere" permission="0700"/> </cluster>
Alright, so that’s a good chunk of code. Let’s break it down, shall we? The first line is all standard XML stuff, you’re likely familiar with it, but the table below breaks down the rest of the code:
|<cluster>||Required. Indicates the start of a cluster definition. The parent tag.|
|name, description, colo||Required. Name (must be unique) and describe your cluster. The colo specifies the colo(cation center) to which this cluster belongs.|
|<interfaces>||Required. A cluster has many interfaces. List them as an <interface> under this tag.|
|<interface>||Required. There are six types: readonly, write, execute, workflow, registry and messaging. More info available below.|
|<locations>||Required. Location has a name and a path, name is the type of location, and path is the absolute HDFS path to it.|
|<location>||Required. Falcon uses each location specified to do processing of processes and feeds given to Falcon. These locations must exist before a cluster entity is submitted to Falcon. Some special permissions may be required (we’ll cover those later).|
|<ACL>||Required. The ACL (Access Control List)tag is useful for implementing permission requirements and provide a way to set different permissions for specific users or named groups.|
So, in a nutshell, that’s what a Cluster entity looks like. You’ll need to customize the addresses and URLs as they apply to you, but the code really remains the same. Here’s a breakdown of each of the tags in the code:
<interface type="readonly" endpoint="hftp://clustername001.mycompany.com:50070" version="2.2.0"/>
This is the path to your Namenode service, for talking to the NameNode from Falcon. Change it as needed to correspond to your cluster’s configuration.
<interface type="write" endpoint="hdfs://clustername001.mycompany.com:8020" version="2.2.0"/>
This is the path to your NameNode as well, but used for writing to HDFS when Falcon requires.
<interface type="execute" endpoint="clustername003.mycompany.com:8050" version="2.2.0"/>
This is the path to your Resource Manager service, for help in scheduling Map/Reduce jobs.
<interface type="workflow" endpoint="http://clustername002.mycompany.com:11000/oozie/" version="4.0.0"/>
This is the path to your Oozie Service installation, for help in scheduling Oozie workflows.
<interface type="messaging" endpoint="tcp://clustername004.mycompany.com:61616?daemon=true" version="5.1.6"/>
This is the path to Falcon’s JMS Messaging Service, which is where JMS alerts will be posted for your consumption (which isn’t useful to core Falcon, but is for extending it!).
Moving through the rest of the XML entity, we take note of the next code chunk, <locations>, which contains a few <location> tags. These paths are very crucial to the inner workings of Falcon, and require a little special care when it comes to creation, ownership, and permissions.
Falcon Staging & Working Directories
Falcon requires two paths: a staging path and a working path. Here are the details.
- Staging: Falcon copies the artifacts of submitted/scheduled processes and feeds to distinct child folders under this location.
- Recommended path is ‘/apps/falcon/my-primary-cluster/working’ (HDFS).
- Owner should be user ‘falcon.’
- Permissions should be ‘777’.
- Working: Falcon copies the jars it requires for execution to this location to support processes and feeds.
- Recommended path is ‘/apps/falcon/my-primary-cluster/working’ (HDFS).
- Owner should be user ‘falcon.’
- Permissions should be ‘755’.
You can achieve this with a few simple commands:
hadoop fs -mkdir /apps/falcon/my-primary-cluster/staging/ hadoop fs -chmod 777 /apps/falcon/my-primary-cluster/staging/ hadoop fs -mkdir /apps/falcon/my-primary-cluster/working/ hadoop fs -chmod 755 /apps/falcon/my-primary-cluster/working/ hadoop fs -chown -R falcon:hadoop /apps/falcon/my-primary-cluster/
*Note that falcon is the user and hadoop is the group in the last command, and this may vary based on your system setup (though this applies as written to most Hortonworks HDP clusters).
Take note that the use of ‘my-primary-cluster’ in the HDFS path names above corresponds to the name chosen for your cluster. You can certainly call them whatever you want — they don’t have to match the name of your cluster. But in my experience, it sure does make things less of a headache.
So my recommendation: if your cluster entity name is walmart-primary-cluster, then make the path(s) above match accordingly (using walmart-primary-cluster in place of my-primary-cluster).
Outside of the information above, that’s all you really need to know (for now) about Cluster entities in Falcon and what comprises them. Now, let’s submit that cluster to Falcon.
Submitting a Cluster to Falcon
Submitting entities is how you register cluster, feed, and process entities with Falcon. The process of submitting is simply that of submitting the XML entity to Falcon via a command line action. That looks like this:
falcon entity -type cluster -submit -file my-primary-cluster.xml
That’s it. Just be in the directory where my-primary-cluster.xml lives (most likely on the edge node of your Hadoop cluster) and where you can call Falcon via the above ‘falcon’ command, and it should submit without issue.
Understanding the Command & the CLI
For now, let’s break down what this command does, word by word.
Firstly, we call Falcon like any other Hadoop tool, by calling its name. Simply typing ‘falcon’ won’t get you far though, since Falcon doesn’t have a shell you drop into like Hive or Pig (typed with ‘hive’ or ‘pig’ on command line). Instead, we need to tell it what to do, which in our case, is to submit a cluster entity file to Falcon’s store (store being the service which digests and sets up entities for you).
After the word ‘falcon’, we type ‘entity’ which indicates the operation is entity related. Then we specify which type of entity we’ll be working with (cluster, feed, or process) with the ‘-type’ command. We then specify that our command is ‘cluster’ related, and that we’ll be submitting it (through use of ‘-submit’). The common way to submit an entity to Falcon is by giving it a XML file to parse, so we use ‘-file’, followed by the path/name of our file. Make sense?
You can also give the path to the entity (again, this XML file is on our local linux path, not HDFS) like this, if you’re too lazy to CD:
falcon entity -type cluster -submit -file /path/to/my-primary-cluster.xml
Pretty simple, right?
If you have Falcon already installed and humming, go ahead and try to submit your cluster. If everything went fine, you shouldn’t see any returns or errors. If you do get an error, post it below in a comment and we’ll see if we can work it out.
If the command went through, then eureka again! You have a cluster entity in Falcon by which you can now submit feeds and processes! Yay! Now before we celebrate too much, let’s ensure it’s actually there with a simple check:
falcon entity -type cluster -list
You should see an output akin to this:
If you do, then you’re all set. You’ve successfully setup a cluster.
But wait, what about my backup cluster, the one in the diagram!?
Relax, friend. That’s easy — in fact, it requires the exact effort you just put forward in setting this cluster (my-primary-cluster) up just now! All you have to do is create another cluster entity XML, change the name and values accordingly, setup the required staging and working directories the cluster will use for operating, and submit it to Falcon! Just for consistency, here’s an example of that code:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <cluster name="my-backup-cluster" description="my second cluster entity" colo="Backup Hadoop Cluster: CompanyName" xmlns="uri:falcon:cluster:0.1"> <interfaces> <interface type="readonly" endpoint="hftp://cluster2name001.mycompany.com:50070" version="2.2.0"/> <interface type="write" endpoint="hdfs://cluster2name001.mycompany.com:8020" version="2.2.0"/> <interface type="execute" endpoint="cluster2name003.mycompany.com:8050" version="2.2.0"/> <interface type="workflow" endpoint="http://cluster2name002.mycompany.com:11000/oozie/" version="4.0.0"/> <interface type="messaging" endpoint="tcp://cluster2name004.mycompany.com:61616?daemon=true" version="5.1.6"/> </interfaces> <locations> <location name="staging" path="/apps/falcon/my-backup-cluster/staging"/> <location name="temp" path="/tmp"/> <location name="working" path="/apps/falcon/my-backup-cluster/working"/> </locations> <ACL owner="someUnixIDGoesHere" group="someUnixGroupGoesHere" permission="0700"/> </cluster>
This is half the work in getting Falcon setup for your workflows. Let’s recap what we did:
- We defined a cluster xml
- We established necessary staging and working directories for Falcon to use (with specific ownership and permissions)
- Submitted a cluster to Falcon
- Checked existence of cluster in Falcon store
- Submitted a backup cluster to Falcon
Tune in next time where we’ll get into the nitty gritty of feeds and processes, in the same way we broke down clusters this time. Post any questions, feedback or topic suggestions in the comments!