Apache Airflow Basics

So Apache Airflow is getting pretty popular now (understatement), so I figured I’d take some time to explain what it is, how to install it, and shed some light on how it all works. It’s awesome, trust me.

What is Airflow?

Apache Airflow is a workflow manager, similar to Luigi or Oozie. As with any job scheduler, you define a schedule and the work to be done, and Airflow takes care of the rest. Where Airflow shines, though, is in how everything works together.

Airflow workflows are written in Python code. Oh, and another thing: “workflows” in Airflow are known as “DAGs,” or “Directed Acyclic Graphs.” What does that mean? Directed == the work flows in only one direction, i.e. there’s a distinct start and end. Acyclic == there aren’t any loops in the workflow. Graph == the workflow can be abstracted into nodes and edges, i.e. the nodes are where the work happens and the edges are the inputs/outputs connecting the various tasks.

Assuming you have everything installed, an Airflow DAG is super easy to create and schedule:

#!/usr/bin/env python

from datetime import timedelta

import airflow
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

# 1. define default arguments (applied to every task in the DAG)
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': ['airflow@hadoopsters.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    }

# 2. define functionality
def hello_world():
    return "Hello World!"

# 3. create dag (the schedule belongs on the DAG itself, not in default_args)
dag = DAG("example_dag", default_args=default_args, schedule_interval="@daily")

my_first_operator = PythonOperator(task_id="first_task",
    python_callable=hello_world,
    dag=dag)

# 4. construct dag relationships (with a single task there's nothing to
# chain -- see the sketch further down for wiring up multiple tasks)

While it’s not a slick one-liner like some things in Python, this example is a full-fledged Airflow DAG. Heck, the majority of those 30-odd lines is just setting up things you don’t strictly need, but I wanted to include them anyway.

Here’s what it does:

  • It will run every day at midnight (that’s the @daily schedule)
  • It will execute whatever arbitrary Python code is in the hello_world function
  • It will automatically retry once, after waiting 5 minutes, in the event of a failure
  • It will send an email in the event of a failure (that’s email_on_failure)
  • It will keep historical logs of its runs
  • It will backfill the days it’s “missed” (the two days since start_date, in this case)

All of this functionality, in four simple steps, written in Python.
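
One more thing about step 4: with a single task there are no relationships to declare, so that part of the example is a bit of a placeholder. Once a DAG has multiple tasks, you wire up the graph with the >> (and <<) operators. Here’s a minimal sketch, assuming a second, purely hypothetical task:

# a hypothetical second task, just to illustrate dependencies
def goodbye_world():
    return "Goodbye World!"

my_second_operator = PythonOperator(task_id="second_task",
    python_callable=goodbye_world,
    dag=dag)

# first_task must finish before second_task starts
my_first_operator >> my_second_operator

Each >> adds one directed edge to the graph; that single line is the “Directed” part of your Directed Acyclic Graph.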

Here’s the best part:

Version Control. With Airflow, your workflows are now trackable code! You can check it in, make branches, plug it into CI/CD frameworks, conduct code reviews… you get the idea. This is so powerful and should make any proper Hadoopster very excited.
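
For example (and this is just a sketch, assuming pytest and a configured Airflow install), a common CI test fails the build whenever any DAG file in the repo no longer parses:

from airflow.models import DagBag

def test_dags_import_cleanly():
    # DagBag() parses every file in your configured dags folder
    dag_bag = DagBag()
    # import_errors maps file path -> stack trace for anything that broke
    assert dag_bag.import_errors == {}, dag_bag.import_errors

Run that on every pull request and a broken DAG never makes it to the scheduler.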

More to come soon, stay tuned!
