Don’t Just Plug That Disk In

You have to be a bit surgical.

As you progress in your big data journey with Hadoop, you may find that your datanodes’ hard drives are gradually filling up. A tempting fix is to simply plug more hard drives into your servers: you’ve got spare slots, and adding entirely new nodes is expensive (and a little tedious). The temptation is especially strong when hard drives start failing on your datanodes.

Unless you want to spend a long time fixing your cluster’s data distribution, I urge you,

Don’t Just Plug That Disk In.
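To see why, consider how a new disk actually enters the picture: it’s just one more entry in the datanode’s directory list. A sketch of the relevant hdfs-site.xml property (the paths here are hypothetical):

    <!-- hdfs-site.xml: each comma-separated directory lives on its own disk -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <!-- /grid/3 is the freshly added disk; it starts empty while
           /grid/0 through /grid/2 stay as full as they were before -->
      <value>/grid/0/hdfs/data,/grid/1/hdfs/data,/grid/2/hdfs/data,/grid/3/hdfs/data</value>
    </property>

HDFS will happily start placing new blocks on the empty disk, but it won’t redistribute the blocks already piled up on the old ones, and the classic hdfs balancer only evens data out between nodes, not between the disks inside a single node.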

Continue reading

How to Build Data History in Hadoop with Hive: Part 1


The Wind Up

One of the key benefits of Hadoop is its capacity for storing large quantities of data. With HDFS (the Hadoop Distributed File System), Hadoop clusters are capable of reliably storing petabytes of your data.

A popular use of that immense storage capability is storing and building history for your datasets. Not only can you keep years of data you might currently be deleting, you can build on that history! You can structure it with a Hadoop-native tool like Hive and give analysts SQL access to that mountain of data! And it’s pretty cheap!
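To give a taste of what that looks like in practice, here’s one common shape the idea takes: a date-partitioned Hive table where each load lands in its own partition. This is a minimal sketch with hypothetical table and column names, not the tutorial’s actual code:

    -- History accrues one partition at a time; old data is never rewritten.
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_history (
      order_id BIGINT,
      amount   DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
    LOCATION '/data/sales_history';

    -- Each daily load simply registers a new partition.
    ALTER TABLE sales_history ADD IF NOT EXISTS PARTITION (load_date='2016-08-01');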

…And the Pitch!

In this tutorial, we’ll walk through why this is beneficial, and how we can implement it on a technical level in Hadoop. Something for the business guy, something for the developer tasked with making the dream come true.

The point of Hadoopsters is to teach concepts related to big data, Hadoop, and analytics. To some, this article will be too simple — low-hanging fruit for the accomplished dev. This article is not necessarily for you, captain know-it-all — it’s for someone looking for a reasonably worded, thoughtfully explained how-to on building data history in native Hadoop. We hope to accomplish that here.

Let’s get going. Continue reading

Spark History Server Automatic Cleanup

I wonder how much paper you’d need to print 1.5 TB of logs…

If you’ve been running Spark applications for a few months, you might start to notice some odd behavior from the history server (default port 18080). Specifically, it’ll take forever to load the page, show links to applications that don’t exist, or even crash. Three parameters take care of this once and for all. Continue reading
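For reference, the knobs in question are presumably the history server’s cleaner settings, which ship disabled. A sketch of spark-defaults.conf on the history server’s host (the retention values are just examples):

    # Turn on periodic cleanup of old event logs (off by default)
    spark.history.fs.cleaner.enabled    true
    # How often the cleaner checks for expired logs
    spark.history.fs.cleaner.interval   1d
    # Event logs older than this are deleted
    spark.history.fs.cleaner.maxAge     7d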

Move a Running Command to a Background Process


You just kicked off a command on the command line and one of three things happens:

  1. You have to leave your computer and run off to a meeting or go talk to your boss,
  2. You realize you’ve made a terrible mistake: you underestimated how much data or work that command has to deal with, and it’s probably going to take a few hours,
  3. You were testing a command, liked what you saw, and now want to let it run for real.

Now what?
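One common answer, sketched here for bash or zsh (the full post walks through the details and caveats):

    # First press Ctrl+Z to suspend the running foreground command, then:
    bg            # resume the suspended job in the background
    jobs          # confirm its job number (e.g. %1)
    disown -h %1  # detach it from the shell so it survives your logout

Note that the command’s output will still point at your terminal, which is why starting fresh with nohup and a redirect is sometimes the better move.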

Continue reading

How to Sqoop an RDBMS Source Directly to a Hive Table In Any Format

This tutorial will accomplish a few key feats that make ingesting data into Hive far less painful. In this writeup, you will learn not only how to Sqoop a source table directly to a Hive table, but also how to Sqoop it in any desired format (ORC, for example) instead of just plain old text.
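One way to pull this off (not necessarily this post’s exact recipe) is Sqoop’s HCatalog integration, which lets you name the storage format right in the import. A sketch with placeholder connection details and table names:

    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username etl_user -P \
      --table orders \
      --hcatalog-database default \
      --hcatalog-table orders \
      --create-hcatalog-table \
      --hcatalog-storage-stanza "STORED AS ORC"

Changing the storage stanza is all it takes to target a different Hive-supported format.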

Continue reading

Transplanting an Edge Node: Loxodontectomy


Loxodontectomy: Elephant-removal (and replacement)

The rapid pace of the big data community can quickly leave Hadoop environments out-of-date. Many great tools provide ways to simply upgrade your software without too much hassle. Unfortunately, earlier versions of the Hortonworks Data Platform (HDP) are a bit clunky to upgrade. A recent project of mine involved upgrading an older (<HDP 2.1) version of HDP to v2.4. Upgrading the whole stack would have been a very time-consuming process (more than two weeks), so we decided to just transplant the edge node into a brand-new cluster. Continue reading

Create a Hive UDF: More Flexible Array Access

This article will show you how to create a simple UDF that offers more flexibility in interacting with arrays in Hive, such as a negative indexing approach to element access. Continue reading
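To give a flavor of the idea, here’s a minimal sketch (with illustrative class and type choices, not the article’s actual code): a UDF that accepts negative indices and counts back from the end of the array, Python-style, which Hive’s built-in arr[i] syntax won’t do.

    import java.util.List;

    import org.apache.hadoop.hive.ql.exec.UDF;

    // Illustrative sketch: evaluate(array, -1) returns the last element.
    public class FlexibleArrayAccess extends UDF {
        public String evaluate(List<String> array, int index) {
            if (array == null || array.isEmpty()) {
                return null;
            }
            // Negative indices count back from the end of the array.
            int i = (index < 0) ? array.size() + index : index;
            if (i < 0 || i >= array.size()) {
                return null; // out of range: return NULL, as Hive does for arr[i]
            }
            return array.get(i);
        }
    }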

Apache NiFi 1.0.0-BETA!

Apache NiFi is changing the way people move huge amounts of data. It’s never been easier (or cheaper) to move and transform raw data into meaningful insight.

With that being said, the release notes for Apache NiFi 1.0.0-BETA came out a couple of days ago! These are very exciting times as this project becomes more mature.

Here are the biggest updates:

  • A brand new UI (!!!)
    • Cleaner lines, better toolbars, more modern look. Things will be easier to find.


  • Zero-master clustering
    • This means that NiFi’s become much more resilient to failure while staying lightweight. Individual nodes “elect” a leader to handle coordination between them. This is similar to how YARN High Availability works.
  • Actual multi-user tenancy
    • There’s a drop-down for admins to create users and set permissions, visibility, etc. for individual users. Before, there were really only two types of users: Admins and Managers. Admins ruled the whole workflow; Managers could make changes to workflows. Now, individual components can be controlled for individual users.
  • Version control for templates
    • This is really huge. The upload/implement template function of NiFi was fairly well hidden and obscure before. With version control, I’d hope that a user would be able to check in/out various branches of the template XML files easily.

Beyond that, there are a few more built-in processors now and a ton of bug fixes.

Please note, I’m not a contributor/committer of Apache NiFi, but I do follow and use the tool fairly regularly. It’s exciting to see the community making these changes. If you see something in the release notes that excites you, let us know below!

How to Create a Simple Hive UDF


There are many functions in Hive that can help analyze your data. But there are times when you need more functionality, something custom; or at least something that isn’t possible without paragraphs of ugly, layered, sub-queried SQL.

That’s where Hive UDFs come in very handy. Continue reading
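For a sense of how little code a simple UDF takes, here’s a minimal sketch (the class name and behavior are illustrative, not this article’s exact example):

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // The simplest flavor of Hive UDF: extend UDF and define evaluate().
    // This one just upper-cases a string.
    public class ToUpper extends UDF {
        public Text evaluate(Text input) {
            if (input == null) {
                return null;
            }
            return new Text(input.toString().toUpperCase());
        }
    }

Packaged into a jar, it’s registered with ADD JAR and CREATE TEMPORARY FUNCTION, after which it can be called like any built-in.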