Apache Airflow Basics

So Apache Airflow is getting pretty popular now (understatement) so I figured I’d take some time to explain what it is, how to install it, and shed some light into how it all works. It’s awesome, trust me.  Continue reading


Apache Crunch Tutorial 8: Writing to Orc File Format


This post is the eighth in a hopefully substantive and informative series of posts about Apache Crunch, a framework for enabling Java developers to write Map-Reduce programs more easily for Hadoop.

Continue reading

Preparing for the HDPCD Exam: Data Ingestion


In order to do big data, you need… DATA. No surprise there! Hadoop is a different beast than other environments and getting data into HDFS can be a bit intimidating if you’re not familiar. If only there were good documentation about these tasks…

Luckily there is good documentation! This post will cover the basics involved in ingesting data into a Hadoop cluster using the HDPCD Exam study guide.  Continue reading

Loading Data into Hive Using a Custom SerDe

Welcome back! If you read my previous post, you know that we’ve run into an issue with our Chicago crime data that we just loaded into HIve. Specifically, one of the columns has commas included implicitly in the row data. Read on to learn how to fix this!

Continue reading

Installing and Running Apache NiFi on your HDP Cluster

NiFi Workspace
Hey everyone,

I learned today about a cool ETL/data pipeline/make your life easier tool that was recently released by the NSA (not kidding) as a way to manage the flow of data in and out of system: Apache NiFi. To me, that functionality seems to match PERFECTLY with what people like to do with Hadoop. This guide will just set up NiFi, not do anything with it (that’ll come later!)

Continue reading