This series is designed to be the ultimate guide on Apache Falcon, a data governance pipeline for Hadoop. Falcon excels at giving you control over workflow scheduling, data retention and replication, and data lineage. This guide will (hopefully) excel at helping you understand and use Falcon effectively.
Welcome back, friends, to the sixth installment of the Hadoopsters Guide to Apache Falcon. Previously, we introduced you to integrating Oozie with Falcon. In this iteration, we’ll take a look at our options regarding monitoring.
This tutorial covers the options, both out of the box and custom-built, that you have at your disposal to monitor the status and completion of your deployed Falcon workflows (feeds and processes). Remember: feeds are responsible for the retention and replication of a data set, and a process is responsible for the ingestion/transformation/doingOfThingsTo of that data set, whatever that may be.
With that in mind, let’s look at what tools we have at our disposal to follow those jobs.
Monitoring Falcon Workflows
As of print time, out of the box, you can monitor your Falcon workflows via:
- Falcon UI
- Oozie UI
- Hue (Oozie Tab)
- JMS Messages
Falcon UI
You know how to deploy jobs now, or at least you should if you read the previous tutorials! If you're comfortable with those, then logging into the native web UI is your next step. Apache Falcon comes with a lightweight web-based user interface that showcases information about jobs you've deployed with Falcon. From it, you can view the details and properties of a job, as well as its instance history and a dependency graph (how your clusters, feeds, and processes relate to one another).
There is a new UI available in recent versions of Falcon (0.6.1 and 0.7), which is available to users of HDP 2.3 (the Hortonworks distro, my current favorite). It's pretty slick.
Ask your local admin about its location on your cluster! Whichever node Falcon is installed on, port 15000 (the default) should do the trick. In a sample case, that might be mycompanyhdp002.mycompany.com:15000, if Falcon is installed on node 2 and running on its default port.
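The same status information the Falcon UI displays can also be pulled straight from Falcon's REST API, which is handy for scripting health checks. A minimal sketch in Python (stdlib only); the host name and entity name below are hypothetical placeholders, and we assume Falcon's default port of 15000:

```python
from urllib.request import urlopen
from urllib.parse import quote

def falcon_status_url(host, entity_type, entity_name, user="falcon", port=15000):
    """Build the Falcon REST URL for an entity's status.

    Falcon exposes: /api/entities/status/{entity-type}/{entity-name}
    where entity_type is one of "cluster", "feed", or "process".
    """
    return (f"http://{host}:{port}/api/entities/status/"
            f"{entity_type}/{quote(entity_name)}?user.name={user}")

def fetch_status(host, entity_type, entity_name):
    # Requires a live Falcon server; returns the raw response body.
    with urlopen(falcon_status_url(host, entity_type, entity_name)) as resp:
        return resp.read().decode("utf-8")

# Hypothetical host and process name:
print(falcon_status_url("mycompanyhdp002.mycompany.com", "process", "my-process"))
# → http://mycompanyhdp002.mycompany.com:15000/api/entities/status/process/my-process?user.name=falcon
```

Pointed at a real cluster, `fetch_status` returns Falcon's status payload for the entity, which you can poll on a cron as a cheap liveness check.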
Oozie UI
Your Falcon jobs use Oozie as their workflow engine, so it stands to reason that you can track those Oozie jobs in the Oozie UI, right? Right! Log in to the Oozie web UI and find your jobs by name. You'll notice two for each job/process you schedule: one for Falcon (which Falcon governs) and one for your workflow, which is actually a sub-workflow of the aforementioned Falcon workflow.
So, again, that's two Oozie workflows per scheduled job:
- a Falcon-owned parent workflow with two actions:
  - a sub-workflow action (your original workflow, which is the second Oozie workflow)
  - a post-processing action (a follow-up step Falcon uses to do cleanup and send alerts)
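You can confirm that parent/sub-workflow pairing programmatically through Oozie's REST API (`/v2/jobs` with a `filter` parameter). A sketch, stdlib only; note that the `FALCON_PROCESS_DEFAULT_` naming prefix below is an assumption about how Falcon names its Oozie apps on your version, so check what actually appears in your Oozie UI and adjust:

```python
import json
from urllib.request import urlopen
from urllib.parse import urlencode

def oozie_jobs_url(oozie_base, entity_name, status="RUNNING"):
    """Build an Oozie REST query for workflows tied to a Falcon entity.

    Oozie's filter syntax is semicolon-separated key=value pairs.
    The FALCON_PROCESS_DEFAULT_ prefix is an assumption; verify it
    against the app names your Oozie UI actually shows.
    """
    filt = f"name=FALCON_PROCESS_DEFAULT_{entity_name};status={status}"
    return f"{oozie_base}/v2/jobs?" + urlencode({"jobtype": "wf", "filter": filt})

def list_falcon_workflows(oozie_base, entity_name):
    # Requires a live Oozie server (default port 11000).
    with urlopen(oozie_jobs_url(oozie_base, entity_name)) as resp:
        body = json.load(resp)
    return [(wf["id"], wf["appName"], wf["status"])
            for wf in body.get("workflows", [])]
```

Against a live server you'd call something like `list_falcon_workflows("http://oozienode:11000/oozie", "my-process")` (hypothetical host and entity) and expect to see the Falcon parent workflow alongside your sub-workflow.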
Hue (Oozie Tab)
You can use Hue, too! Hue is a Cloudera-backed, Ambari-like suite of web-based tools for interfacing with a Hadoop cluster. It provides access to Pig, Hive/Beeline, HDFS via a file browser, and, you guessed it, Oozie! In the Hue navigation bar there's a blue-and-yellow Oozie logo; click it to view running/in-progress Falcon/Oozie jobs, as well as jobs in a terminal state (failed/succeeded).
JMS Messages
Falcon natively publishes JMS messages from its service on Hadoop via a publish/subscribe model. You can write code (like we have) to subscribe to Falcon's messages and log them for later use. Heck, take our advice: store them in HDFS and put a web dashboard on top. We use this for monitoring the health of jobs and their state upon completion (failed or succeeded).
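Falcon's notifications come out of an embedded ActiveMQ broker, and one lightweight way to tap in from Python is over STOMP (a protocol ActiveMQ can expose, conventionally on port 61613) using the third-party stomp.py package. This is a sketch under several assumptions: the topic name, the STOMP port, and the flat key=value message body are all defaults/guesses you should verify against what your broker actually emits.

```python
def parse_falcon_message(body):
    """Parse a comma-separated key=value payload into a dict.

    The exact wire format depends on your Falcon version; this assumes
    a flat text body like "entityName=my-feed,status=SUCCEEDED".
    """
    out = {}
    for part in body.split(","):
        if "=" in part:
            key, _, value = part.partition("=")
            out[key.strip()] = value.strip()
    return out

def subscribe(host, port=61613, topic="/topic/FALCON.ENTITY.TOPIC"):
    """Blocks on a live broker; topic and port are assumed defaults."""
    import stomp  # third-party: pip install stomp.py

    class FalconListener(stomp.ConnectionListener):
        def on_message(self, frame):
            info = parse_falcon_message(frame.body)
            # Log, ship to HDFS, feed a dashboard, etc.
            print(f"{info.get('entityName')} -> {info.get('status')}")

    conn = stomp.Connection([(host, port)])
    conn.set_listener("falcon-monitor", FalconListener())
    conn.connect(wait=True)
    conn.subscribe(destination=topic, id="falcon-monitor", ack="auto")
```

The parsing half is the part worth keeping regardless of transport: once each message is a dict, appending it as a line of JSON to a file in HDFS gives you exactly the kind of history a web dashboard can sit on top of.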
That’s just a quick glance at how you can monitor Falcon jobs. I would have written more, but this layover in Atlanta is killing my morale, so I’m going to take a break and come back with more information in a later tutorial. Thanks for reading! As always, leave questions/feedback/ideas in the comments — and check out my Falcon JMS Capturing code on Github!