Apache Airflow is getting pretty popular now (understatement), so I figured I’d take some time to explain what it is, show how to install it, and shed some light on how it all works. It’s awesome, trust me.
It’s time to break down what they mean, how to use them, and how to get the best of both worlds.
Welcome back! If you read my previous post, you know that we’ve run into an issue with the Chicago crime data that we just loaded into Hive. Specifically, one of the columns has commas embedded in its row data. Read on to learn how to fix this!
After a brief hiatus in the great state of Alaska, I’m back to discuss actually analyzing data on the new Hadoop cluster that we set up together in previous blog posts. Specifically, we’ll be looking at crime data from the City of Chicago from 2001 to the day this was first written, 8/26/2015. There are a couple of things we need to take care of before we get started though, Sherlock.
I learned today about a cool ETL/data pipeline/make-your-life-easier tool that was recently released by the NSA (not kidding) as a way to manage the flow of data in and out of a system: Apache NiFi. To me, that functionality seems to match PERFECTLY with what people like to do with Hadoop. This guide will just set up NiFi, not do anything with it (that’ll come later!).