This series is designed to be the ultimate guide on Apache Falcon, a data governance pipeline for Hadoop. Falcon excels at giving you control over workflow scheduling, data retention and replication, and data lineage. This guide will (hopefully) excel at helping you understand and use Falcon effectively.
Welcome back, friends, to the seventh installment of the Hadoopsters Guide to Apache Falcon.
If you’ve been following the epic saga known as Landon’s Missed Connecting Flight in the Atlanta Airport, you’ll know that last time we introduced you (briefly) to monitoring jobs on Falcon with a suite of built-in tools on Hadoop. I even teased how you could use something so boring as JMS to your advantage with some free open source code!
But today, let’s do something practical: let’s fix/upgrade something in production.
There will come many times in your job/career/life-long relationship with Apache Falcon where you will need to update your code. Maybe it’s a bug fix, maybe it’s ever-changing business requirements, or maybe you just want that sucker to run more or less frequently!
Don’t worry. Falcon’s update command has your back. He’s the true homie. Whether it’s an update to the Falcon process.xml or an update to the Oozie workflow.xml, this handy built-in function of the Falcon CLI will no doubt come to your aid.
With that said, let’s clear our worried minds from the troubles of updating jobs in production. Let’s talk about Falcon Update.
Falcon Update has, up to this point, been described as the overall savior of Hadoop-kind, but really, it’s just an impressive CLI (command line interface) command. It looks like this:

falcon entity -update
Wow. That was impressive, wasn’t it? That’s it folks, this concludes another Hadoopsters Guide to Apache Falcon.
Kidding. That’s just the base command. It won’t do anything without the required arguments! A full command looks a little something like this:
falcon entity -type feed -name myfeed -update -file myfeed.xml
falcon entity -type process -name myprocess -update -file myprocess.xml
See the three parts you can change (process, myprocess, myprocess.xml)? Those are the parts you modify for the entity you want to update. Here are the rules:
- If you’re updating a feed, pass feed to the -type argument
- If you’re updating a process, pass process to the -type argument
- Pass the full name of the entity you want to update to the -name argument
- Pass the path of the new version of the entity XML you’d like to run in place of the old one to the -file argument
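To make that concrete: suppose you have a process named cleanse-process (a hypothetical name) and you’ve revised its entity XML locally. Applying the rules above, the update would look something like this:

```shell
# Hypothetical example: update the process "cleanse-process" in place,
# using a revised entity XML saved locally as cleanse-process.xml.
falcon entity -type process -name cleanse-process -update -file cleanse-process.xml
```

This needs a running Falcon server to actually do anything, of course, so treat it as a template rather than something to copy-paste blindly.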
If you run that command, it should update the process in Falcon in-place, ensuring no downtime for the job. Future instances will reflect the changes you made. Falcon will not retroactively run instances with your new properties or tag choices. If you’d like that to happen, you can re-run an old instance (tutorial coming later).
What Can/Can’t I Update?
At the entity level, you can update Processes and Feeds. You cannot currently update Clusters. No word yet on if that will change.
At the tag level, you can change most things with Falcon Update. That means you can change any <property> tag in the <properties> block. You can change the frequency. You can change largely anything except the start and end time of the job. If you’d like to change those, simply delete and reschedule the entity with the new start and end times.
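Here’s a trimmed-down sketch of a process entity XML (the names, paths, and values are all made up) annotated with what Falcon Update will and won’t accept changes to:

```xml
<!-- Hypothetical process entity, abbreviated for illustration. -->
<process name="myprocess" xmlns="uri:falcon:process:0.1">
    <clusters>
        <cluster name="my-primary-cluster">
            <!-- NOT updatable: changing start/end requires delete + reschedule -->
            <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <!-- Updatable: e.g. switch days(1) to hours(1) to run more often -->
    <frequency>days(1)</frequency>
    <!-- Updatable: any <property> tag in the <properties> block -->
    <properties>
        <property name="queueName" value="default"/>
    </properties>
    <workflow engine="oozie" path="/apps/myapp/workflow.xml"/>
</process>
```

Change what you need in a copy of this file, then hand it to the falcon update command shown earlier.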
Beyond those exceptions, you can update just about anything with Falcon. I use it largely to change the properties mentioned above, or the frequency. Those are the most likely use cases for Falcon Update.
Can I update my Oozie Workflow?
Absolutely! Yes, you can. First, a little history:
When you deploy a Falcon workflow (submit and schedule, specifically), Falcon will build a home for this new workflow in its “store” on HDFS. This is usually under /apps/falcon/, and in a sub-folder titled after the name of the cluster entity associated with your workflow entities. If your cluster is called my-primary-cluster (as in the previous tutorials), your workflow contents will live in /apps/falcon/my-primary-cluster/staging.
At submit/schedule time, Falcon will copy the contents it requires (including the Oozie workflow you placed in HDFS) to this area for use in deploying instances. It does not reference the original workflow every time an instance runs. That’s important for those of you (including myself from 6 months ago) who want to make changes.
Now, onto how to update your Oozie workflow.xml with Falcon Update seamlessly…
Please refer to our previous tutorials to learn more about building and deploying an Oozie workflow to HDFS, but updating the workflow itself is easy. Make any changes you want to the workflow.xml, and put it back in HDFS. When you run the falcon update command, it will explicitly check to see if the Oozie workflow.xml differs from the one it currently has in its “store.” If it does, it will ingest the new one and future instances will leverage it.
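Assuming your workflow lives at /apps/myapp/workflow.xml in HDFS (a made-up path; use wherever your process entity actually points), the round trip looks something like this:

```shell
# 1. Overwrite the workflow.xml in HDFS with your edited copy.
hdfs dfs -put -f workflow.xml /apps/myapp/workflow.xml

# 2. Re-run Falcon Update against the same process entity XML.
#    Falcon notices the workflow differs from the copy in its "store"
#    and ingests the new one for all future instances.
falcon entity -type process -name myprocess -update -file myprocess.xml
```

Both steps require a live Hadoop cluster with Falcon running, so again, this is a template of the sequence rather than a runnable script on its own.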
You should see Falcon respond to your update command by printing a new Oozie bundle ID, which links to the newly ingested (read: your updated) workflow. Success! All new instances will use this workflow, and you’re done.
That’s just a quick glance at how you can update Falcon jobs. I would have written more, but this layover in Atlanta is still killing my morale, so I’m going to take a break and come back with more information in yet another later tutorial.
Thanks for reading! As always, leave questions/feedback/ideas in the comments — and check out my Falcon JMS Capturing code on Github!
<< Previous Tutorial | Next Tutorial >>