This series is designed to be the ultimate guide on Apache Falcon, a data governance pipeline for Hadoop. Falcon excels at giving you control over workflow scheduling, data retention and replication, and data lineage. This guide will (hopefully) excel at helping you understand and use Falcon effectively.
Welcome back, friends, to the third installment of the Hadoopsters Guide to the Galaxy: Apache Falcon. Previously, we introduced you to Falcon Cluster entities. In this guide, we’ll dive deeper into Feeds and showcase how they work and are defined.
This tutorial will extensively cover how to define and submit a feed entity to Falcon. It’s very useful if you want to replicate and retain some data on your cluster as part of your workflow pipeline. In fact, its best use case is replicating final data from one cluster to another, without also pushing all the temp files and other junk that’s typical in a workflow or staged pipeline.
Defining & Submitting a Feed Entity
Below is what a feed entity/definition looks like. It represents a feed that:
- replicates data at ‘/path/to/my/data/’ daily at 23:00 UTC from my-primary-cluster to my-backup-cluster as someUnixUser
- retains data in ‘/path/to/my/data/’ for 9999 months
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="my-example-feed" description="my-example-feed" xmlns="uri:falcon:feed:0.1">
  <frequency>days(1)</frequency>
  <timezone>UTC</timezone>
  <clusters>
    <cluster name="my-primary-cluster" type="source">
      <validity start="2015-05-01T23:00Z" end="2099-12-31T23:00Z"/>
      <retention limit="months(9999)" action="delete"/>
      <locations>
        <location type="data" path="/path/to/my/data/"/>
      </locations>
    </cluster>
    <cluster name="my-backup-cluster" type="target">
      <validity start="2015-05-01T23:00Z" end="2099-12-31T23:00Z"/>
      <retention limit="months(9999)" action="delete"/>
      <locations>
        <location type="data" path="/path/to/my/data/"/>
      </locations>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/path/to/my/data/"/>
    <location type="stats" path="/none"/>
    <location type="meta" path="/none"/>
  </locations>
  <ACL owner="someUnixUser" group="someUnixGroup" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
Alright, so that’s quite a bit of code. Let’s break it down, shall we? The first line is standard XML boilerplate you’re likely familiar with; the table below breaks down the rest of the code:
| Tag | Description |
| --- | --- |
| <feed> | Required. Indicates the start of a feed definition. The parent tag. |
| name, description | Required. Attributes that name (must be unique) and describe your feed. |
| <frequency> | Required. A feed has a frequency, which tells Falcon how often to run this feed. Minutes, hours, days, and months are allowed. Examples: hours(5) for every five hours, days(7) to run once a week, and months(1) to run once a month. |
| <timezone> | Optional. Tells Falcon what timezone the validity start and end times correspond to. Falcon defaults to UTC. |
| <clusters> | Required. Contains all associated clusters for this feed. |
| <cluster> | Required. Replication and retention will be set up on this cluster for this feed. You can set up a cluster as a source or a target. Sources are where data starts, and targets are where data is replicated to through a pull-based DistCp mechanism. Think of the source as the cluster on which your pipeline operates, and the target as the cluster that receives a clean copy of the final and/or raw data. |
| name, type | Required. Attributes of the cluster tag, where you give a cluster entity name and a type of either source or target. Must reference a cluster entity that is already registered in Falcon. |
| <validity> | Required. Combined with the frequency to determine the window of time in which a Falcon feed can execute. |
| start, end | Required. Attributes of the validity tag, where you give a start time and an end time for the feed. The feed job can start at the given start time, and continue at the given <frequency> until the end time arrives. The specific time given also indicates at what time during the day the job runs. |
| <retention> | Required. Applies a data retention policy to the data at the location path for this cluster. Use limit to determine how long the path is retained, and action to determine what to do with the data after the limit runs out. Actions ‘delete’ and ‘archive’ are currently allowed, though archive is not yet implemented (as of 11-2015). |
| <locations> | Required. Contains the feed’s location entries. |
| <location> | Required. A location has a type and a path: type is the kind of location (‘data’, ‘stats’, or ‘meta’), and path is the absolute HDFS path to it. ‘data’ is the one you primarily care about, since it points to your feed’s data. Falcon uses this path to do replication and retention: it replicates the path between clusters (source and target) and applies retention policies on data on both ends. |
| <ACL> | Required. The ACL (Access Control List) tag is useful for implementing permission requirements and provides a way to set different permissions for specific users or named groups. |
For now, we’ll ignore in-depth detailing of the following tags until later (for now just keep them as they are above, their defaults):
- <location stats> – for use with stats
- <location meta> – for use with metadata
- <schema location> – for use with Hive integration
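Before you hand your definition to Falcon, it’s worth a quick sanity check that the file is at least well-formed XML. Here’s a minimal sketch using xmllint (an assumption on our part — it ships with libxml2 and may or may not be on your edge node; the heredoc writes a deliberately trimmed-down skeleton just for illustration, not a complete feed):

```shell
# Write a trimmed-down feed skeleton (illustrative only; a real feed
# needs the clusters, locations, ACL, and schema tags shown above).
cat > my-example-feed.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="my-example-feed" description="my-example-feed" xmlns="uri:falcon:feed:0.1">
  <frequency>days(1)</frequency>
  <timezone>UTC</timezone>
</feed>
EOF

# --noout checks well-formedness without printing the document back out.
if command -v xmllint >/dev/null 2>&1; then
  xmllint --noout my-example-feed.xml && echo "well-formed"
else
  echo "xmllint not installed; skipping check"
fi
```

This only catches XML syntax mistakes (unclosed tags, bad quoting); Falcon itself validates the entity against its schema at submit time.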
So, in a nutshell, that’s what a Feed entity looks like! You can submit it like you did a Cluster entity…
Submitting a Feed to Falcon
Submitting entities is how you register cluster, feed, and process entities with Falcon. The process of submitting is simply that of submitting the XML entity to Falcon via a command line action. That looks like this:
falcon entity -type feed -submit -file my-example-feed.xml
That’s it. Just be in the directory where my-example-feed.xml lives (most likely on the edge node of your Hadoop cluster) and where you can call Falcon via the above ‘falcon’ command, and it should submit without issue. To learn more about the makeup of this command, and get a breakdown, visit the previous tutorial.
Pretty simple, right?
If you have Falcon already installed and humming, go ahead and try to submit your feed. If everything went fine, you shouldn’t see any returns or errors. If you do get an error, post it below in a comment and we’ll see if we can work it out.
If it did work, let’s ensure it’s actually there with a simple check:
falcon entity -type feed -list
You should see your feed, my-example-feed, in the resulting list of entities.
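If you’d rather script this existence check than eyeball the listing, here’s a small sketch (it assumes the falcon CLI is on your PATH and uses our example feed’s name; the exact listing format can vary by Falcon version, so we just grep for the name):

```shell
# Scriptable existence check for a submitted feed.
# Assumption: the falcon CLI is on PATH; FEED_NAME matches our example.
FEED_NAME="my-example-feed"
if falcon entity -type feed -list 2>/dev/null | grep -q "$FEED_NAME"; then
  echo "$FEED_NAME is registered with Falcon"
else
  echo "$FEED_NAME not found (or the falcon CLI is unavailable)"
fi
```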
If the command went through, you have a feed entity in Falcon! Yay! But we still have to schedule it. Submitting it just got it into Falcon’s store; now we have to tell Falcon to get it on the schedule (you did give it validity start and end times, after all)!
Scheduling a Feed in Falcon
falcon entity -type feed -schedule -name my-example-feed
Try the above command. Did it go great? Did it say it was scheduled successfully? Awesome! If it returned an error, post it in the comments below — we’ll try to help out.
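Beyond watching the schedule command’s output, you can ask Falcon for the entity’s status directly. A sketch, again assuming the falcon CLI is on your PATH (in our experience a submitted-but-unscheduled entity reports SUBMITTED and a scheduled one RUNNING, though you should trust what your Falcon version prints):

```shell
# Query the current status of our example feed.
# Assumption: the falcon CLI is on PATH; ENTITY_NAME matches our example.
ENTITY_NAME="my-example-feed"
falcon entity -type feed -status -name "$ENTITY_NAME" 2>/dev/null \
  || echo "status check failed (is the falcon CLI available?)"
```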
If all went well, you’ve successfully set up and scheduled a feed. Let’s recap what we did:
- We defined a feed xml
- Submitted a feed to Falcon
- Checked existence of feed in Falcon store
- Scheduled a feed in Falcon
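The steps above can be collected into one small script. This is a sketch, assuming the falcon CLI is on your PATH and my-example-feed.xml sits in your working directory (likely on your edge node):

```shell
#!/bin/sh
# Sketch of the full submit -> verify -> schedule flow from this tutorial.
# Assumptions: falcon CLI on PATH; my-example-feed.xml in the current directory.
FEED_NAME="my-example-feed"
FEED_FILE="my-example-feed.xml"

falcon entity -type feed -submit -file "$FEED_FILE"    # 1. register the feed
falcon entity -type feed -list | grep "$FEED_NAME"     # 2. confirm it's in the store
falcon entity -type feed -schedule -name "$FEED_NAME"  # 3. put it on the schedule
```

If any step errors out, fix it before moving on — scheduling a feed that never registered will get you nowhere.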
Tune in next time, where we’ll get into the nitty-gritty of processes, the same way we broke down feeds this time (and clusters last time). Post any questions, feedback, or topic suggestions in the comments!
<< Previous Tutorial | Next Tutorial >>