Don’t Just Plug That Disk In

You have to be a bit surgical.

As you progress in your big data journey with Hadoop, you may find that your datanodes’ hard drives are gradually filling up. A tempting fix is to simply plug more hard drives into your servers: you’ve got extra drive slots, and adding entirely new nodes is an expensive (and a little tedious) task. This is particularly relevant when hard drives start failing on your datanodes.
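For context, "just plugging it in" usually amounts to mounting the new drive and appending it to the DataNode's list of data directories, roughly like the sketch below (the device name, mount point, and existing directory list are made up for illustration):

    # Mount the new drive on the datanode (example device and mount point)
    mkdir -p /grid/5
    mount /dev/sdf1 /grid/5

    # Append the new mount point to dfs.datanode.data.dir in hdfs-site.xml
    # (through Ambari or by editing the property directly), e.g.:
    #   dfs.datanode.data.dir = /grid/0,/grid/1,/grid/2,/grid/3,/grid/4,/grid/5

    # Restart the DataNode so it picks up the new directory

HDFS will happily start writing new blocks to the empty disk, but it won't move any of the existing blocks off the nearly full drives, and that lopsided data distribution is exactly what the full post digs into.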

Unless you want to spend a long time fixing your cluster’s data distribution, I urge you,

Don’t Just Plug That Disk In.

Continue reading

Spark History Server Automatic Cleanup

I wonder how much paper you’d need to print 1.5 TB of logs…

If you’ve been running Spark applications for a few months, you might start to notice some odd behavior with the history server (default port 18080). Specifically, it’ll take forever to load the page, show links to applications that don’t exist, or even crash. Three parameters take care of this once and for all. Continue reading
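For reference, the three settings are almost certainly the history server's built-in cleaner properties. A minimal sketch, assuming the standard spark-defaults.conf location on the history server host and illustrative retention values:

    # Enable the history server's event-log cleaner and set its retention
    cat >> /etc/spark/conf/spark-defaults.conf <<'EOF'
    spark.history.fs.cleaner.enabled   true
    spark.history.fs.cleaner.interval  1d
    spark.history.fs.cleaner.maxAge    7d
    EOF

    # Restart the history server so the cleaner starts running

With those in place, event logs older than maxAge get purged on every cleaner interval instead of piling up forever.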

Move a Running Command to a Background Process


You just kicked off a command on the command line and one of three things happens:

  1. You have to leave your computer and run off to a meeting or go talk to your boss,
  2. You realize you’ve made a terrible mistake: the command has far more data or work to chew through than you thought, and it’s probably going to take a few hours, or
  3. You were testing a command, liked that it was working, and now want to let it run to completion.

Now what?
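Here’s a minimal sketch of the usual escape hatch, assuming a bash-like shell and that the command is job %1:

    # 1. Suspend the running command by pressing Ctrl+Z in its terminal

    # 2. Resume it in the background
    bg %1

    # 3. Mark it so the shell won't send it SIGHUP when you log out
    disown -h %1

    # Confirm it's still chugging along
    jobs -l

One caveat worth knowing up front: the command’s output keeps going to the same terminal unless you redirected it before launching it.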

Continue reading

How to Sqoop an RDBMS Source Directly to a Hive Table In Any Format

This tutorial will accomplish a few key feats that make ingesting data into Hive far less painful. In this writeup, you will learn not only how to Sqoop a source table directly to a Hive table, but also how to land it in any desired format (ORC, for example) instead of just plain old text.
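As a preview, one common recipe leans on Sqoop’s HCatalog integration to create and load an ORC-backed Hive table in a single import. The JDBC URL, credentials, table names, and mapper count below are placeholders:

    # Placeholder connection details -- swap in your own source and target
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table orders \
      --hcatalog-database default \
      --hcatalog-table orders_orc \
      --create-hcatalog-table \
      --hcatalog-storage-stanza "STORED AS ORC" \
      -m 4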

Continue reading

Transplanting an Edge Node: Loxodontectomy


Loxodontectomy: Elephant-removal (and replacement)

The rapid pace of the big data community can quickly leave Hadoop environments out-of-date. Many great tools provide ways to simply upgrade your software without too much hassle. Unfortunately, earlier versions of the Hortonworks Data Platform (HDP) are a bit clunky to upgrade. A recent project of mine involved upgrading an older (< HDP 2.1) version of HDP to v2.4. Upgrading the whole stack would have been a very time-consuming process (more than two weeks), so we decided to just transplant the edge node into a brand new cluster. Continue reading

Ambari Views: Introduction


If you’ve spent any time with a Hortonworks Data Platform cluster, you’re familiar with Ambari. It’s one of the finest open source cluster management tools: it lets you easily launch a cluster, add or remove nodes, change configurations, and add services to your cluster. Using Ambari takes a lot of the guesswork out of managing a Hadoop cluster, and I absolutely love it.

The one downside of Ambari is that it can be tedious to add functionality to the core client. For that reason, the smart people building the tool at Apache decided to add something called an Ambari View. An Ambari View is a way to extend the functionality of Ambari without going down the rabbit hole of modifying Ambari’s source code. Views are essentially plug-and-play tools that only require a restart of the Ambari server to work.
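Deploying one is typically just a matter of dropping the View’s jar where the Ambari server looks for views and bouncing that server; the jar name below is a made-up example:

    # Copy the packaged View onto the Ambari server host
    cp hello-world-view-1.0.0.jar /var/lib/ambari-server/resources/views/

    # Restart the Ambari server (not the Hadoop cluster) to load it
    ambari-server restart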

In the following blog post, I’ll discuss getting your first View off the ground and share several tips about actually using Views.

Next Post: Apache Ambari: Hello World!

Moving Data Within or Between Hadoop Clusters with DistCP


Copying small chunks of data in and around Hadoop is relatively trivial, but moving larger chunks can be time-consuming or needlessly complicated. Sometimes you even want to move data between Hadoop clusters (if you have two or more). In this article, I’ll show you a great way to handle all of these scenarios. Continue reading
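The tool in question is Hadoop’s built-in DistCp, which fans the copy out as a MapReduce job. A quick sketch, with example paths and NameNode addresses:

    # Copy a directory within a single cluster
    hadoop distcp /data/raw/events /backup/raw/events

    # Copy between clusters, only transferring new or changed files
    # and preserving ownership and permissions
    hadoop distcp -update -p \
      hdfs://nn-cluster1:8020/data/raw/events \
      hdfs://nn-cluster2:8020/data/raw/events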

Benefits of the ORC File Format in Hadoop, and Using It in Hive


As a developer or engineer in the Hadoop and big data space, you tend to hear a lot about file formats. Each has its own benefits and trade-offs: storage savings, splittability, compression time, decompression time, and much more. All of these factors play a huge role in which file formats you use for your projects, or adopt as a team- or company-wide standard. Continue reading
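As a quick taste before the full write-up, declaring an ORC-backed table in Hive only takes a storage clause; the table definition and compression choice below are just examples:

    # Create an ORC table from the shell via the Hive CLI (beeline works too)
    hive -e "
      CREATE TABLE page_views_orc (
        user_id BIGINT,
        url     STRING,
        ts      TIMESTAMP
      )
      STORED AS ORC
      TBLPROPERTIES ('orc.compress' = 'SNAPPY');
    "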