The Wind Up
One of the key benefits of Hadoop is its capacity for storing large quantities of data. With HDFS (the Hadoop Distributed File System), Hadoop clusters are capable of reliably storing petabytes of your data.
A popular usage of that immense storage capability is storing and building history for your datasets. You can not only use it to store years of data you might currently be deleting, but you can also build on that history! And, you can structure the data within a Hadoop-native tool like Hive and give analysts the ability to query that mountain of data with SQL! And it’s pretty cheap!
…And the Pitch!
In this tutorial, we’ll walk through why this is beneficial, and how we can implement it on a technical level in Hadoop. Something for the business guy, something for the developer tasked with making the dream come true.
The point of Hadoopsters is to teach concepts related to big data, Hadoop, and analytics. To some, this article will be too simple — low-hanging fruit for the accomplished dev. This article is not necessarily for you, captain know-it-all — it’s for someone looking for a reasonably worded, thoughtfully explained how-to on building data history in native Hadoop. We hope to accomplish that here.
Let’s get going.
A Real Life Application: Sales Data for a Retailer
Part 1: Giving That Old Data a Good Home
Use Case: Let’s say we’re a retail company with 7 years of digitally available sales data. We keep 2 years of it in an RDBMS for analysts to use, and write anything older than that to tape, purging it from the RDBMS. Why? Because, for reasons too numerous to go over here, it’s just awfully expensive to grow history in this RDBMS (largely due to hardware upgrade costs).
But good news! We bought a Hadoop cluster! We not only want to pour all of our data into it, we want that data we’ve purged to tape as well! WE WANT IT ALL, BABY! Thankfully, our Hadoop cluster has plenty of affordable storage space and the specs to handle querying it.
So… how do we do that?
Using a Hadoop-native tool called Sqoop, we copy the current 2 years of retail sales data from the RDBMS to Hadoop, store the data in HDFS, and structure Hive schemas atop it. This makes 2 years of the same sales data available in Hadoop for analysts to query with SQL (thanks, Hive!). Awesome, right?
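As a rough sketch of what that Sqoop step might look like, here’s a hypothetical import. The JDBC URL, database name (`sales_db`), table name (`transactions`), credentials, and date filter are all placeholders — swap in your own environment’s values:

```shell
# Import the most recent 2 years of sales data from the RDBMS into Hive.
# Connection details, table names, and the cutoff date are hypothetical.
sqoop import \
  --connect jdbc:mysql://rdbms-host:3306/sales_db \
  --username etl_user \
  --password-file /user/etl/.dbpass \
  --table transactions \
  --where "sale_date >= '2014-01-01'" \
  --hive-import \
  --hive-table sales.transactions \
  --num-mappers 4
```

The `--hive-import` flag tells Sqoop to land the data in HDFS and create/load the Hive table in one pass, and `--num-mappers` controls how many parallel tasks pull from the source database — tune it to what your RDBMS can comfortably serve.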
But wait… we could query this same data on that RDBMS system… how is this different from just using it there? Well, besides the fact that we were trying to fill a data lake, it isn’t much different. You could argue that this was the first of many steps toward the data lake/promised land. You could argue that we now have the ability to blend our sales data with other large datasets on the cheap!
Those are both true statements, and doing so in the RDBMS would carry significant costs. Scaling Hadoop, on the flip side, is cheap (relatively speaking).
So… what should we do?
We get the 5 years of data that’s stored on tape into Hadoop. You see, most companies, especially older ones, have a habit of pushing data to tape when it reaches a certain age. In our case, anything older than 2 years isn’t retained. That’s data that hungry analysts could be using.
So… let’s get it.
We use a series of custom-built or off-the-shelf ETL tools and some good ol’ one-time manual effort (perhaps a few weeks at most) to copy the data from tape, convert it to a delimited format (like CSV), and push it into HDFS. There, we add it to the other 2 years of sales data in Hive, and voilà: 7 years of sales data available to analysts!
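The landing step above can be sketched in two parts: push the converted CSV files into HDFS, then define a Hive external table over them so they’re instantly queryable. All paths, database names, and columns here are hypothetical — the real schema would mirror whatever the tape extracts contain:

```shell
# Land the converted tape extracts in HDFS
# (staging path and target directory are hypothetical).
hdfs dfs -mkdir -p /data/sales/archive
hdfs dfs -put /staging/tape_extracts/*.csv /data/sales/archive/

# Expose the delimited files through a Hive external table.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS sales.transactions_archive (
  sale_id   BIGINT,
  store_id  INT,
  sale_date STRING,
  amount    DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/sales/archive';
"
```

Using an EXTERNAL table means Hive reads the files where they already sit in HDFS — dropping the table later won’t delete the underlying data, which is a nice safety property for a one-time historical load.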
This is an instant win for many reasons, as analysts can, right off the bat, analyze trends, growth, and patterns across far more data than they previously had reasonable access to. With Hadoop and Hive, we can give that to them in a language they already know: SQL.
In Part 2, we’ll talk technical — how to structure datasets in Hive to best accommodate queries that scale and the users who run them.