Apache Airflow is getting pretty popular now (understatement), so I figured I’d take some time to explain what it is, how to install it, and shed some light on how it all works. It’s awesome, trust me.
This post is the eighth in a hopefully substantive and informative series of posts about Apache Crunch, a framework that makes it easier for Java developers to write MapReduce programs for Hadoop.
In order to do big data, you need… DATA. No surprise there! Hadoop is a different beast from other environments, and getting data into HDFS can be a bit intimidating if you’re not familiar with it. If only there were good documentation about these tasks…
Luckily there is good documentation! This post will cover the basics involved in ingesting data into a Hadoop cluster using the HDPCD Exam study guide.
Welcome back! If you read my previous post, you know that we’ve run into an issue with the Chicago crime data that we just loaded into Hive. Specifically, one of the columns has commas embedded in the row data. Read on to learn how to fix this!
I learned today about a cool ETL/data pipeline/make-your-life-easier tool that was recently released by the NSA (not kidding) as a way to manage the flow of data in and out of a system: Apache NiFi. To me, that functionality seems to match PERFECTLY with what people like to do with Hadoop. This guide will just set up NiFi, not do anything with it (that’ll come later!).