How to Build Data History in Hadoop with Hive: Part 1


The Wind Up

One of the key benefits of Hadoop is its capacity for storing large quantities of data. With HDFS (the Hadoop Distributed File System), Hadoop clusters are capable of reliably storing petabytes of your data.

A popular usage of that immense storage capability is storing and building history for your datasets. You can not only utilize it to store years of data you might currently be deleting, but you can also build on that history! And, you can structure the data within a Hadoop-native tool like Hive and give analysts SQL-querying ability to that mountain of data! And it’s pretty cheap!

…And the Pitch!

In this tutorial, we’ll walk through why this is beneficial, and how we can implement it on a technical level in Hadoop. Something for the business guy, something for the developer tasked with making the dream come true.

The point of Hadoopsters is to teach concepts related to big data, Hadoop, and analytics. To some, this article will be too simple — low hanging fruit for the accomplished dev. This article is not necessarily for you, captain know-it-all — it’s for someone looking for a reasonably worded, thoughtfully explained how-to on building data history in native Hadoop. We hope to accomplish that here.

Let’s get going. Continue reading

How to Sqoop an RDBMS Source Directly to a Hive Table In Any Format

This tutorial will accomplish a few key feats that make ingesting data to Hive far less painless. In this writeup, you will learn not only how to Sqoop a source table directly to a Hive table, but also how to Sqoop a source table in any desired format (ORC, for example) instead of just plain old text.

Continue reading

Create a Hive UDF: More Flexible Array Access



This article will show you how to create a simple UDF that offers more flexibility in interacting with arrays in Hive, such as a negative indexing approach to element access. Continue reading

How to Create a Simple Hive UDF


There are many functions in Hive that can help analyze your data. But there are times when you need more functionality, sometimes custom. Or at least functionality that is possible without paragraphs of ugly, layered-sub-queried SQL.

That’s where Hive UDFs come in very handy. Continue reading

Preparing for the HDPCD Exam: Data Analysis With Hive


With your data now in HDFS in an “analytic-ready” format (it’s all cleaned and in common formats), you can now put a Hive table on top of it.

Apache Hive is a RDBMS-like layer for data in HDFS that allows you to run batch or ad-hoc queries in a SQL-like language. This post will go over what you need to know about Apache Hive in preparation for the HDPCD Exam.  Continue reading

Preparing for the HDPCD Exam: Data Transformation


So after getting data into HDFS, it’s often not pretty. At the very least, it’s a little disorganized, sparse, and in generally not ready for analytics. It’s a Certified Developer’s job to clean it up a little.

That’s where Apache Pig can come in handy! This post will cover the basics in transforming data in HDFS using Apache Pig for preparation of the HDPCD Exam.  Continue reading

Loading Data into Hive Using a Custom SerDe

Welcome back! If you read my previous post, you know that we’ve run into an issue with our Chicago crime data that we just loaded into HIve. Specifically, one of the columns has commas included implicitly in the row data. Read on to learn how to fix this!

Continue reading

Analyzing Chicago Crime Data with Apache Hive on HDP 2.3

After a brief hiatus in the great state of Alaska, I’m back to discuss actually analyzing data on your new Hadoop cluster that we set up together in previous blog posts. Specifically we’ll be looking at crime data from the City of Chicago from 2001 to the day this was first written, 8/26/2015. There’s a couple things we need to take care of before we get started though, Sherlock.

Continue reading