How to Build Data History in Hadoop with Hive: Part 1


The Wind Up

One of the key benefits of Hadoop is its capacity for storing large quantities of data. With HDFS (the Hadoop Distributed File System), Hadoop clusters are capable of reliably storing petabytes of your data.

A popular usage of that immense storage capability is storing and building history for your datasets. You can not only utilize it to store years of data you might currently be deleting, but you can also build on that history! And, you can structure the data within a Hadoop-native tool like Hive and give analysts SQL-querying ability to that mountain of data! And it’s pretty cheap!

…And the Pitch!

In this tutorial, we’ll walk through why this is beneficial, and how we can implement it on a technical level in Hadoop. Something for the business guy, something for the developer tasked with making the dream come true.

The point of Hadoopsters is to teach concepts related to big data, Hadoop, and analytics. To some, this article will be too simple — low hanging fruit for the accomplished dev. This article is not necessarily for you, captain know-it-all — it’s for someone looking for a reasonably worded, thoughtfully explained how-to on building data history in native Hadoop. We hope to accomplish that here.

Let’s get going. Continue reading

How to Sqoop an RDBMS Source Directly to a Hive Table In Any Format

This tutorial will accomplish a few key feats that make ingesting data to Hive far less painless. In this writeup, you will learn not only how to Sqoop a source table directly to a Hive table, but also how to Sqoop a source table in any desired format (ORC, for example) instead of just plain old text.

Continue reading

Apache Crunch Tutorial 8: Writing to Orc File Format


This post is the eighth in a hopefully substantive and informative series of posts about Apache Crunch, a framework for enabling Java developers to write Map-Reduce programs more easily for Hadoop.

Continue reading

Guide to Apache Falcon #3: Feed Entity Definitions


This series is designed to be the ultimate guide on Apache Falcon, a data governance pipeline for Hadoop. Falcon excels at giving you control over workflow scheduling, data retention and replication, and data lineage. This guide will (hopefully) excel at helping you understand and use Falcon effectively. Continue reading