Guide to Apache Falcon #3: Feed Entity Definitions

This series is designed to be the ultimate guide on Apache Falcon, a data governance pipeline for Hadoop. Falcon excels at giving you control over workflow scheduling, data retention and replication, and data lineage. This guide will (hopefully) excel at helping you understand and use Falcon effectively.

Welcome back, friends, to the third installment of the Hadoopsters Guide to the Galaxy: Apache Falcon. Previously, we introduced you to Falcon Cluster entities. In this installment, we’ll dive deeper into Feeds and showcase how they work and are defined.

This tutorial will extensively cover how to define and submit a feed entity to Falcon. It’s very useful if you want to replicate and retain data on your cluster as part of your workflow pipeline. In fact, its best use case is replicating final data from one cluster to another, without also pushing all the temp files and other junk that’s typical in a workflow or staged pipeline.

Defining & Submitting a Feed Entity

Below is what a feed entity/definition looks like. It represents a feed that:

  • replicates data at ‘/path/to/my/data/’ daily at 23:00 UTC from my-primary-cluster to my-backup-cluster as someUnixUser
  • retains data in ‘/path/to/my/data/’ for 9999 months
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="my-example-feed" description="my-example-feed" xmlns="uri:falcon:feed:0.1">
 <frequency>days(1)</frequency>
 <timezone>UTC</timezone>
 <clusters>
 <cluster name="my-primary-cluster" type="source">
 <validity start="2015-05-01T23:00Z" end="2099-12-31T23:00Z"/>
 <retention limit="months(9999)" action="delete"/>
 <locations>
 <location type="data" path="/path/to/my/data/"/>
 </locations>
 </cluster>
 <cluster name="my-backup-cluster" type="target">
 <validity start="2015-05-01T23:00Z" end="2099-12-31T23:00Z"/>
 <retention limit="months(9999)" action="delete"/>
 <locations>
 <location type="data" path="/path/to/my/data/"/>
 </locations>
 </cluster>
 </clusters>
 <locations>
 <location type="data" path="/path/to/my/data/"/>
 <location type="stats" path="/none"/>
 <location type="meta" path="/none"/>
 </locations>
 <ACL owner="someUnixUser" group="someUnixGroup" permission="0755"/>
 <schema location="/none" provider="none"/>
</feed>

Alright, so that’s quite a bit of code. Let’s break it down, shall we? The first line is standard XML stuff you’re likely familiar with; the table below breaks down the rest of the code:

<feed> Required. Indicates the start of a feed definition; the parent tag.
name, description Required. Attributes of the feed tag: a unique name for your feed, and a description of it.
<frequency> Required. A feed has a frequency, which tells Falcon how often to run this feed. Minutes, hours, days, and months are allowed. Examples: hours(5) to run every five hours, days(7) to run once a week, and months(1) to run once a month.
<timezone> Optional. Tells Falcon what timezone the validity start and end times correspond to. Falcon defaults to UTC.
<clusters> Required. Contains all clusters associated with this feed.
<cluster> Required. Replication and retention will be set up on this cluster for this feed. A cluster can be set up as a source or a target: sources are where data starts, and targets are where data is replicated to via a pull-based DistCp mechanism. Think of the source as the cluster on which your pipeline operates, and the target as the cluster that receives a clean copy of the final and/or raw data.
name, type Required. Attributes of the cluster tag: the name must reference a cluster entity already registered in Falcon, and the type is either source or target.
<validity> Required. Combined with the frequency, determines the window of time in which a Falcon feed can execute.
start, end Required. Attributes of the validity tag: a start time and an end time for the feed. The feed job can start at the given start time and continues at the given <frequency> until the end time arrives. The specific time given also determines at what time of day the job runs.
<retention> Required. Applies a data retention policy to the data at the location path for this cluster. Use limit to set how long the data is retained, and action to say what happens once the limit expires. The actions ‘delete’ and ‘archive’ are currently allowed, though archive is not yet implemented (as of 11-2015).
<locations> Required. Contains the feed’s locations. A location has a type and a path: type is one of ‘data’, ‘stats’, or ‘meta’, and path is the absolute HDFS path. Data is the one you primarily care about, since it points to your feed’s data.
<location> Required. Falcon uses the location specified to do replication and retention: it is this path that Falcon replicates between clusters (source and target) and applies retention policies to on both ends.
<ACL> Required. The ACL (Access Control List) tag implements permission requirements, providing a way to set permissions for the owning user and group.
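
To make those knobs concrete, here’s a minimal sketch of the two tags you’d change to run the feed above weekly and retain only 90 days of data. The days(7) and days(90) values are hypothetical, chosen purely for illustration:

  <!-- Hypothetical alternative: run weekly instead of daily -->
  <frequency>days(7)</frequency>

  <!-- Hypothetical alternative: keep 90 days of data per cluster, then delete -->
  <retention limit="days(90)" action="delete"/>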

We’ll save in-depth detail on the following tags for later (for now, just keep them as they are above, at their defaults):

  • <location stats> – for use with stats
  • <location meta> – for use with metadata
  • <schema location> – for use with Hive integration

So, in a nutshell, that’s what a Feed entity looks like! You can submit it like you did a Cluster entity…

Submitting a Feed to Falcon

Submitting entities is how you register cluster, feed, and process entities with Falcon. Submitting simply means handing the XML entity to Falcon via a command line action. That looks like this:

falcon entity -type feed -submit -file my-example-feed.xml

That’s it. Just be in the directory where my-example-feed.xml lives (most likely on the edge node of your Hadoop cluster) and where you can call Falcon via the above ‘falcon’ command, and it should submit without issue. To learn more about the makeup of this command, and get a breakdown, visit the previous tutorial.

Pretty simple, right?

If you have Falcon already installed and humming, go ahead and try to submit your feed. If everything went fine, you shouldn’t see any returns or errors. If you do get an error, post it below in a comment and we’ll see if we can work it out.

If it did work, let’s ensure it’s actually there with a simple check:

falcon entity -type feed -list

You should see an output akin to this:

(FEED) my-example-feed

If the command went through, you have a feed entity in Falcon! Yay! But we still have to schedule it. Submitting it just got it into Falcon’s store, now we have to tell Falcon to get it on the schedule (you did give it validity start and end times after all)!

Scheduling a Feed in Falcon

falcon entity -type feed -schedule -name my-example-feed

Try the above command. Did it go great? Did it say it was scheduled successfully? Awesome! If it returned an error, post it in the comments below — we’ll try to help out.
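
If you’d like a firmer confirmation than the CLI’s reply, the Falcon CLI also has a status check (a quick sketch; a scheduled feed should report a running state, while a merely submitted one reports as submitted):

falcon entity -type feed -status -name my-example-feed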

If all went well, you’ve successfully set up and scheduled a feed. Let’s recap what we did:

  • We defined a feed xml
  • Submitted a feed to Falcon
  • Checked existence of feed in Falcon store
  • Scheduled a feed in Falcon
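
For easy reference, here’s the full sequence of commands from this tutorial in one place:

falcon entity -type feed -submit -file my-example-feed.xml
falcon entity -type feed -list
falcon entity -type feed -schedule -name my-example-feed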

Tune in next time, where we’ll get into the nitty-gritty of Processes, in the same way we broke down Feeds this time (and Clusters last time). Post any questions, feedback, or topic suggestions in the comments!

<< Previous Tutorial | Next Tutorial >>
