This series is designed to be the ultimate guide on Apache Falcon, a data governance pipeline for Hadoop. Falcon excels at giving you control over workflow scheduling, data retention and replication, and data lineage. This guide will (hopefully) excel at helping you understand and use Falcon effectively.
Welcome, friends and all, to the first entry in The Hadoopsters guide to Apache Falcon. Most of you who use Hadoop are aware of the tools on your stack: Hive, Oozie, Pig, Flume, Sqoop, just to name a few. But you may or may not be aware of Falcon, a top-level Apache project that’s beginning to garner great interest because of its feature set. It has a total of 18 committers, 7 of whom are from a major Hadoop vendor: Hortonworks. It’s also played a huge role in my recent work in Hadoop.
Since I’ve brought up our friends at Horton, why don’t I let them introduce you to Falcon… take it away, sizzle text I copy-pasted from the HW website!
Apache Falcon: A framework for managing data life cycle in Hadoop clusters.
Apache Falcon addresses enterprise challenges related to Hadoop data replication, business continuity, and lineage tracing by deploying a framework for data management and processing. Falcon centrally manages the data lifecycle, facilitates quick data replication for business continuity and disaster recovery, and provides a foundation for audit and compliance by tracking entity lineage and collecting audit logs.
So now you know about Falcon! Well, you know of it, but you (at least if you’re here) don’t know much about how to use it, or better yet, be a power user of it! Hopefully, that’s what you’re here to do, and with a bevy of examples, explanations, and helpful hints through this guide, we can (again, hopefully) make you into the Grand Falcon Wizard you know you are, deep down inside.
Before getting into much code, let’s focus on the high level offerings of Falcon first. Here are some fast facts:
- is an XML-based entity-driven system that is made up of clusters, feeds, and processes
- simplifies and abstracts the complexity of data management and workflow scheduling
- is compatible with any scripts, tools, or Hadoop components you use (Pig, Hive, etc)
- offers automatic data replication for disaster recovery and cluster mirroring
- can show lineage history of scheduled jobs
- can apply retention policies on data to remove it when it expires from usefulness
- is in general pretty dope
Now. Let’s talk about how it works.
How Falcon Works – Entities
Falcon is a pipeline. It’s about ingesting and/or processing data, replicating it if you wish, and doing it all on an automated and scheduled basis. A pipeline is defined by 3 key attributes:
- a cluster entity
- a feed entity
- a process entity
A Cluster entity defines where data, tools, and processes live on your Hadoop cluster. Think of it as a “Facebook Profile” for your Hadoop cluster (things like the namenode address, Oozie URL, etc.), which Falcon uses to execute the other two entities: feeds and processes.
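To make that concrete, here’s a sketch of what a Cluster entity XML might look like. Everything in it — the name, the colo, the hostnames, ports, versions, and paths — is a placeholder I made up for illustration; yours will reflect your own cluster:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical cluster entity: swap in your own endpoints and paths -->
<cluster name="primaryCluster" description="Primary Hadoop cluster"
         colo="primary-datacenter" xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- Where Falcon reads data from (e.g. for replication) -->
    <interface type="readonly" endpoint="hftp://namenode.example.com:50070" version="2.2.0"/>
    <!-- Where Falcon writes data (HDFS namenode) -->
    <interface type="write" endpoint="hdfs://namenode.example.com:8020" version="2.2.0"/>
    <!-- YARN ResourceManager, for executing jobs -->
    <interface type="execute" endpoint="resourcemanager.example.com:8050" version="2.2.0"/>
    <!-- Oozie server, which actually runs the workflows -->
    <interface type="workflow" endpoint="http://oozie.example.com:11000/oozie/" version="4.0.0"/>
    <!-- ActiveMQ broker Falcon uses for messaging -->
    <interface type="messaging" endpoint="tcp://falcon.example.com:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <!-- HDFS scratch space Falcon needs for itself -->
    <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
    <location name="working" path="/apps/falcon/primaryCluster/working"/>
  </locations>
</cluster>
```

Notice that it really is just a “profile”: a list of the addresses and directories Falcon needs to know about before it can do anything else.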
A Feed entity defines where data lives on your cluster (in HDFS). The feed is designed to tell Falcon where your data (that’s either ingested, processed, or both) lives, so it can retain it (through retention policies) and replicate it (through replication policies) on or from your Cluster (again, defined through the Cluster entity). A Feed is typically (but doesn’t have to be) the output of a Process…
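Here’s a hedged sketch of a Feed entity. The feed name, paths, dates, and retention limit are all hypothetical; the `${YEAR}-${MONTH}-${DAY}` tokens are Falcon’s built-in date pattern variables for partitioned directories:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical feed entity: points Falcon at data arriving hourly in HDFS -->
<feed name="rawEventsFeed" description="Raw event data landed in HDFS"
      xmlns="uri:falcon:feed:0.1">
  <!-- How often a new instance of this data is expected -->
  <frequency>hours(1)</frequency>
  <clusters>
    <!-- Must match the name of an existing Cluster entity -->
    <cluster name="primaryCluster" type="source">
      <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <!-- Retention policy: delete instances older than 90 days -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <!-- Where each dated instance of the data lives in HDFS -->
    <location type="data"
              path="/data/raw/events/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="falcon-user" group="users" permission="0755"/>
  <schema location="/none" provider="/none"/>
</feed>
```

The retention element is where the “remove it when it expires from usefulness” magic from the fast facts actually gets configured.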
A Process entity defines what action or “process” will be taking place in a pipeline. Most typically, the Process links to an Oozie workflow (which you can learn more about here and here), which contains a series of actions to execute (such as shell scripts, Java JARs, Hive actions, Pig actions, Sqoop actions, you name it) on your cluster. A Process also, by definition, takes Feeds as inputs and outputs, and is where you define how often a workflow should run.
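And a sketch of a Process entity to round out the trio. Again, the names, dates, and HDFS path are made-up placeholders; note how it references the Cluster and Feed entities by name, which is the “linking” we’ll talk about next:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical process entity: runs an Oozie workflow hourly -->
<process name="cleanseEventsProcess" xmlns="uri:falcon:process:0.1">
  <clusters>
    <!-- Which cluster(s) this process runs on -->
    <cluster name="primaryCluster">
      <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
    </cluster>
  </clusters>
  <!-- How many instances may run concurrently, and in what order -->
  <parallel>1</parallel>
  <order>FIFO</order>
  <!-- How often the workflow should fire -->
  <frequency>hours(1)</frequency>
  <outputs>
    <!-- Output feed: "when I finish, the data will be where this feed points" -->
    <output name="output" feed="rawEventsFeed" instance="now(0,0)"/>
  </outputs>
  <!-- The Oozie workflow that does the actual work, by HDFS path -->
  <workflow name="cleanseEventsWorkflow" version="2.0.0" engine="oozie"
            path="/apps/workflows/cleanse-events"/>
  <!-- If the workflow fails, retry up to 3 times, 15 minutes apart -->
  <retry policy="periodic" delay="minutes(15)" attempts="3"/>
</process>
```

The `engine="oozie"` and `path` attributes are where Falcon hands off to the Oozie workflow mentioned above.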
That’s great, you say? But how do I make them, you say? Well, by creating XML, of course! Falcon entities are defined separately as .xml files and “linked” together by references to each other. For example: you can create a Process that operates on some data weekly with a script, and produces some new data. For that, you would require one (1) Process that calls that script, and a Feed pointing to where that new data will live when the Process is done. Notice how the Feed has no role in the actual creation or “output” of data — the Process did it all. That’s because the Feed isn’t really a literal output, per se; it’s just a mapping of the place data is expected to be as a result of the Process it’s the “output” of. Make sense? Don’t worry, it will when you see the XML yourself. Think of it, for now, as a sign pointing Falcon to your data, so when it wants to put replication and retention policies on it, it knows exactly where to go.
Another example is using Feeds as an input to a Process, not just as an output. Say you want to run that same pipeline from before, but now you don’t want it to just run once a week like before… now you want it to run once a week BUT wait on the data (that the Process will operate on) to arrive in the cluster. That’s simple with Falcon. Just define a new Feed that’s mapped to the location where the input data should be in the cluster (HDFS), and list it in the Process as an input. One Feed as an input (required for the Process to start), and another Feed as an output (an indication of where data will be when a Process runs/finishes). Now, the Process you defined will not start until the data that needs to be there is there. And yes, this is configurable with folder/file dates, timestamps, and all that. Don’t worry — we’ll get there. 🙂
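In the Process XML, that input Feed shows up as an `<inputs>` block alongside the `<outputs>` block. A hypothetical sketch (the feed names and instance expressions are illustrative — `now(0,0)` is Falcon’s expression language for “the instance corresponding to the current nominal time”):

```xml
<!-- Inside a process entity: the process waits for this feed instance to exist -->
<inputs>
  <input name="input" feed="landedEventsFeed" start="now(0,0)" end="now(0,0)"/>
</inputs>
<outputs>
  <!-- ...and declares where its results will land when it finishes -->
  <output name="output" feed="cleansedEventsFeed" instance="now(0,0)"/>
</outputs>
```

The `start`/`end` expressions are what give you that “wait on the data to arrive” behavior: Falcon holds the Process instance until the input Feed instances in that window are available.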
Random Facts about Entities:
- Feeds do not require Processes; they can be run alone (if you just want replication and retention on some data in HDFS).
- Processes do not require Feeds; they can be run alone (if you don’t need data to be replicated or retained as it relates to this workflow).
- Clusters require neither Feeds nor Processes, and are largely initial “configurations” for Falcon. Processes and Feeds both, however, require a Cluster entity or entities. They have to know who they report to, right?
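Once you’ve written your entity XMLs, you hand them to Falcon with its CLI. A quick hedged sketch of how that usually goes (the file and entity names here match the hypothetical examples above, not anything you’ll have on your cluster; submission order matters because of the dependencies we just covered — Cluster first, then Feeds, then Processes):

```shell
# Submit entities to Falcon (validates and registers them, doesn't run anything yet)
falcon entity -type cluster -submit -file primaryCluster.xml
falcon entity -type feed -submit -file rawEventsFeed.xml
falcon entity -type process -submit -file cleanseEventsProcess.xml

# Schedule the feed and process so Falcon actually starts running them
falcon entity -type feed -schedule -name rawEventsFeed
falcon entity -type process -schedule -name cleanseEventsProcess
```

We’ll walk through the CLI (and the difference between submit, schedule, suspend, and delete) properly in a later entry.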
Wrap-Up… For Now!
That’s pretty much the high level on Falcon: it’s a pipeline you build in the form of entities (basically the pieces of the workflow) that together define a workflow, its frequency, and a number of policies like lateness, retention, and replication. All of that gives you a single place where you can centrally manage your workflows, replicate data for disaster recovery needs, and more.
But this is just the start of our Falcon journey! Next time we’ll dive into entities more deeply, breaking down what each tag/line does and how you can leverage it. If you’re interested in getting your hands dirty right now, I’d highly recommend the following Falcon tutorials provided by Hortonworks, who is a major supporter of the Falcon project (like myself).
- Define and Process Data Pipelines in Hadoop with Apache Falcon
- Processing Data Pipeline with Apache Falcon
- Mirroring Datasets between Hadoop clusters with Apache Falcon
And while these tutorials are handy and useful for getting you off the ground, I plan and hope to make this guide a monumentally more helpful resource for those with problems Falcon can help solve. Much like the rest of Hadoopsters, I plan to share all the insight into Falcon that I have, so stick with me!