Guide to Apache Oozie #2: Understanding Workflows

oozie_282x1178

This series is designed to be a “get off the ground” guide to Apache Oozie, a job scheduling framework for Hadoop. Oozie offers multi-action workflow scheduling, ability to run actions in parallel, and great APIs. This guide is designed to help you answer your Oozie technical questions.

Last time, and also the first time, we introduced you to Apache Oozie at a high level. Today’s post will cover the finer details of Oozie, and some examples to get you started with building workflows.

First, let’s talk about how workflows are comprised. But before even that, let’s define some terms for reference:

  • Workflow – an XML file composed of nodes
  • Node – a grouping of related XML tags to form something whole (usually an action)
  • Action – a node that does a type of action in a Hadoop cluster

Workflows are composed of several sequential tags in an XML file. At the most nutshell level, an Oozie workflow is an .xml file called workflow.xml. It lives in HDFS most of the time – it can also live in a local space (linux side) of an edge or worker node — but HDFS is the standard for most applications.

Workflows are composed of nodes. Nodes are just groupings of XML tags that are related. Most nodes are actions, primarily. Actions are the substance; the core of what makes a workflow a workflow. After all, what you want is a string of sequential and potentially parallel actions that form a complete end-to-end job, right? Right.

So, what are actions, and what kind of actions does Oozie offer?

Oozie Actions

Oozie actions are just XML tags with different properties. Oozie has a bevy of baked-in actions for you to choose from and implement, each allowing you to perform common operations on a Hadoop cluster. There’s a pig action for pig scripts, a hive action for Hive queries and commands, shell and SSH actions for bash scripts/commands, a java action for calling an actionable Jar, and a Map-Reduce action for performing a Map-Reduce job.

There are a handful of other actions, such as newer additions like the Spark Action (to run an Apache Spark job), and classic favorites like the Sqoop Action (to leverage Apache Sqoop to ingest RDBMS data from outside the cluster). Whatever your use case, if you can run it on Hadoop, you can pretty much plug it into Oozie with one of these options.

Controlling the Flow

So we now know that an Oozie workflow is comprised of one or more actions, right? Right. While that’s all well and good, we need to learn how we control the flow of this workflow, as well as tell it how to react in certain scenarios. We don’t want it just sitting there drooling when an exception hits, right? Right.

Oozie actions inherently want to make decisions. It’s part of their nature. By default, a given action has two choices that you define explicitly in the workflow.xml: what to do when the action failed, and what to do when the action succeeded. Those choices are referred to as the <error to> and <ok to> tags, respectively. As you might guess, the <error to> tag tells the workflow which node/action to proceed to in the event of a failure/error, and the <ok to> tag tells the workflow which node/action to proceed to in the case of success (no problems, proceed as normal).

In most cases, the <error to> tag points to what’s called a “kill” tag. This is a non-action node in Oozie that tells the workflow to terminate immediately and report as a failure (the result of said failure being an internal “kill”). When the kill tag is reached, no other nodes or actions in the workflow will run – it has been killed after all. In a scenario where, for example, you have a workflow with 3 actions – each one dependent on the output of the previous action. In different terms, the first action reads a file and produces some output, the second action does the same, but uses the output of the first action and produces some new output, and the third action does the same, but uses the output of the second actions nad produces some new, final output.

For either of these actions, because of their dependence on one another downstream, it’s sensible to have the workflow fail if an issue occurs in processing.

On the other hand, we have the <ok to> tag, which represents the polar opposite turnout, but the same philosophy. If the action succeeds, it will proceed to the node/action you tell it to start next in the workflow sequence. Using the example of the 3 action workflow from before, the first action would error to kill, and ok to action 2. The second action would error to kill, and ok to action 3. The third action would error to kill, and ok to end.

Wait, what’s “end”? You never mentioned anything about that, random Hadoop tutorial author!! That’s right. But don’t worry, it’s, again, the polar opposite of kill, but philosophically the same.

The “end” node is a non-action node in Oozie that tells the workflow to terminate, but as a success. Get the difference? End is a successful completion of the workflow (a green light), and Kill is the non-successful, pre-emptive end of the workflow (a big fat red light).

So, going back to our 3 action example. If all went well with all 3 actions, the input would have gone through action 1, 2, and 3, and the workflow will have ended successfully. If any action had encountered an issue, it would have terminated early as a failure.

An Example Oozie Workflow

<?xml version="1.0" encoding="UTF-8"?>

<workflow-app name="my-workflow" xmlns="uri:oozie:workflow:0.2">

	<start to="my-first-hive-action" />

	<action name="my-first-hive-action">
		<hive xmlns="uri:oozie:hive-action:0.2">
			<job-tracker>cluster003.company.com:8050</job-tracker>
			<name-node>hdfs://cluster001.company.com:8020</name-node>
			<script>/scripts/select_from_table.hql</script>
			<param>hiveTable=some_random_database.some_random_table</param>
		</hive>
		<ok to="end"/>
		<error to="kill"/>
	</action>

	<kill name="kill">
		<message>Action failed, error
			message[${wf:errorMessage(wf:lastErrorNode())}]</message>
	</kill>

	<end name="end" />

</workflow-app>

So that’s what an easy Oozie workflow looks like! It’s just one action! What if we built that 3-action workflow I’ve been describing?

<?xml version="1.0" encoding="UTF-8"?>

<workflow-app name="my-workflow" xmlns="uri:oozie:workflow:0.2">

	<start to="my-first-hive-action" />

	<action name="my-first-hive-action">
		<hive xmlns="uri:oozie:hive-action:0.2">
			<job-tracker>cluster003.company.com:8050</job-tracker>
			<name-node>hdfs://cluster001.company.com::8020</name-node>
			<script>/scripts/select_from_table.hql</script>
			<param>hiveTable=some_random_database.some_random_table1</param>
		</hive>
		<ok to="my-second-hive-action"/>
		<error to="kill"/>
	</action>

		<action name="my-second-hive-action">
		<hive xmlns="uri:oozie:hive-action:0.2">
			<job-tracker>cluster003.company.com:8050</job-tracker>
			<name-node>hdfs://cluster001.company.com::8020</name-node>
			<script>/scripts/select_from_table.hql</script>
			<param>hiveTable=some_random_database.some_random_table2</param>
		</hive>
		<ok to="my-third-hive-action"/>
		<error to="kill"/>
	</action>

		<action name="my-third-hive-action">
		<hive xmlns="uri:oozie:hive-action:0.2">
			<job-tracker>cluster003.company.com:8050</job-tracker>
			<name-node>hdfs://cluster001.company.com::8020</name-node>
			<script>/scripts/select_from_table.hql</script>
			<param>hiveTable=some_random_database.some_random_table3</param>
		</hive>
		<ok to="end"/>
		<error to="kill"/>
	</action>

	<kill name="kill">
		<message>Action failed, error
			message[${wf:errorMessage(wf:lastErrorNode())}]</message>
	</kill>

	<end name="end" />

</workflow-app>

*Heads up, /scripts/select_from_table.hq is an HDFS reference, not a local one. This keeps our workflow able to fail-over and make use of the scale Hadoop offers, and not rely on a script’s location on a single node (like an edge node).

Now you know more about what composes an Oozie workflow, and are better prepared to build one yourself. In the next tutorial, I’ll show you how to submit an Oozie workflow to Oozie and watch it run before your very eyes!

<< Previous Tutorial | Next Tutorial >>

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s