How Random Sampling in Hive Works, And How to Use It

Random sampling is a technique in which each member of the population has an equal probability of being chosen. A sample chosen randomly is meant to be an unbiased representation of the total population.

In the big data world, we have an enormous total population: a population that can prove tricky to truly sample randomly. Thankfully, Hive has a few tools for realizing the dream of random sampling in the data lake. Continue reading
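
As a small preview of one such tool, here's a minimal sketch of a common Hive sampling pattern, run here through Spark SQL with Hive support (the same statement works directly in Hive or beeline); the database, table, and sample size are made up for illustration:

    import org.apache.spark.sql.SparkSession

    // Hypothetical example: shuffle rows across reducers randomly with
    // DISTRIBUTE BY / SORT BY rand(), then take a fixed-size sample.
    val spark = SparkSession.builder()
      .appName("hive-random-sample-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val sample = spark.sql(
      """SELECT *
        |FROM my_db.my_table
        |DISTRIBUTE BY rand()
        |SORT BY rand()
        |LIMIT 1000""".stripMargin)

    sample.show()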


How to Build Optimal Hive Tables Using ORC, Partitions and Metastore Statistics


Creating Hive tables is a common experience for all of us who use Hadoop. It enables us to mix and merge datasets into unique, customized tables. And there are many ways to do it.

We have some recommended tips for Hive table creation that can increase your query speeds and reduce the storage space of your tables. And it’s simpler than you might think. Continue reading
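
As a quick taste, here's a minimal sketch of the kind of table layout the post is about: ORC storage, a partition column, and metastore statistics. It's written against a Hive-enabled SparkSession (the equivalent HiveQL runs directly in Hive as well), and every database, table, and column name is hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("optimal-hive-table-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical table: ORC storage with a date partition column.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS my_db.events (
        |  event_id STRING,
        |  payload  STRING
        |)
        |PARTITIONED BY (event_date STRING)
        |STORED AS ORC""".stripMargin)

    // Populate the metastore with table-level statistics for better query
    // planning (exact ANALYZE syntax varies a bit between Hive and Spark SQL
    // versions, especially for partitioned tables).
    spark.sql("ANALYZE TABLE my_db.events COMPUTE STATISTICS")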

How to Join Static Data with Streaming Data (DStream) in Spark

Today we’ll briefly showcase how to join a static dataset in Spark with a streaming “live” dataset, otherwise known as a DStream. This is helpful in a number of scenarios, like when you have a live stream of data from Kafka (or RabbitMQ, Flink, etc.) that you want to join with tabular data you queried from a database (or a Hive table, a file, or anything else you can normally consume into Spark). Continue reading
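
To make the idea concrete, here's a minimal sketch (not the post's actual code): a small static lookup RDD joined against each micro-batch of a DStream via transform(). A socket stream stands in for Kafka, and all names, keys, and ports are hypothetical:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("static-join-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Static dataset, e.g. something you loaded once from a database or Hive table.
    val lookup = ssc.sparkContext.parallelize(Seq(("user1", "US"), ("user2", "DE")))

    // Live stream keyed the same way (a socket stream stands in for Kafka here).
    val events = ssc.socketTextStream("localhost", 9999)
      .map(line => (line.split(",")(0), line))

    // transform() exposes each micro-batch as an RDD, so a normal pair-RDD join applies.
    val joined = events.transform(batch => batch.join(lookup))
    joined.print()

    ssc.start()
    ssc.awaitTermination()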

Managing LDAP Users in Ambari

Adding users to a Hadoop cluster can be a little time-intensive.

I’ve managed Hadoop clusters for just a little while now, and I’ve discovered that the user management side of Ambari is a little rough around the edges. Specifically, there’s no easy way to manage Ambari LDAP users from within Ambari, despite LDAP being a very popular way to provision and manage user access.

There is the command ambari-server sync-ldap [--users user.csv | --groups groups.csv] for adding users or groups, but that can be an issue if access to the ambari user or server is limited. Additionally, the command-line utility doesn’t innately have any control over HDFS directories (either creating or deleting them) upon a user or group sync, creating extra steps in the user creation process.

To address this, I present:
ambari-ldap-manager

Continue reading

How to Write ORC Files and Hive Partitions in Spark


ORC, or Optimized Row Columnar, is a popular big data file storage format. Its rise in popularity is due to its high performance, strong compression, and progressively broader support among top-level Apache products like Hive, Crunch, Cascading, Spark, and more.

I recently wanted/needed to write ORC files from my Spark pipelines, and found specific documentation lacking. So, here’s a way to do it. Continue reading
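
In a nutshell, it looks something like this minimal sketch, assuming Spark 2.x with Hive support; the source table, partition column, and output path are all made up:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("orc-writer-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical source DataFrame with an event_date column to partition on.
    val df = spark.table("my_db.raw_events")

    // Write ORC files laid out as Hive-style partitions (one directory per date).
    df.write
      .mode("overwrite")
      .format("orc")
      .partitionBy("event_date")
      .save("/apps/hive/warehouse/my_db.db/events_orc")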

How to Set or Change Log Level in Spark Streaming

Logs can really add up. Let’s learn to make like a tree and reduce them via convenient built-in methods.

Apache Spark alone, by default, generates a lot of information in its logs. Spark Streaming creates a metric ton more (in fairness, there’s a lot going on). So, how do we lower that gargantuan wall of text to something more manageable?

One way is to lower the log level on the SparkContext, which can be retrieved from the StreamingContext. Simply:
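
    // ssc here is the StreamingContext; its underlying SparkContext accepts a
    // string log level such as "WARN", "ERROR", or "INFO" ("WARN" is just an example).
    ssc.sparkContext.setLogLevel("WARN")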

Pretty easy, right?

What I Learned Building My First Spark Streaming App

Get it? Spark Streaming? Stream… it’s a stream…. sigh.

I’ve been working with Hadoop, Map-Reduce and other “scalable” frameworks for a little over 3 years now. One of the latest and greatest innovations in our open source space has been Apache Spark, a parallel processing framework that’s built on the paradigm Map-Reduce introduced, but packed with enhancements, improvements, optimizations and features. You probably know about Spark, so I don’t need to give you the whole pitch.

You’re likely also aware of its main components:

  • Spark Core: the parallel processing engine written in the Scala programming language
  • Spark SQL: allows you to programmatically use SQL in a Spark pipeline for data manipulation
  • Spark MLlib: machine learning algorithms ported to Spark for easy use by devs
  • Spark GraphX: a graph processing library built on the Spark Core engine
  • Spark Streaming: a framework for processing live, high-speed streaming data

Spark Streaming is what I’ve been working on lately. Specifically, building apps in Scala that utilize Spark Streaming to stream data from Kafka topics, do some on-the-fly manipulations and joins in memory, and write newly augmented data in “near real time” to HDFS.
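
For context, a stripped-down sketch of that shape of pipeline looks roughly like the following (this is not the actual app; it uses the spark-streaming-kafka-0-10 integration, and every broker, topic, group id, and path below is hypothetical):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.KafkaUtils

    val conf = new SparkConf().setAppName("streaming-pipeline-sketch")
    val ssc = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "sketch-group",
      "auto.offset.reset"  -> "latest"
    )

    // Read a Kafka topic as a direct stream of ConsumerRecords.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // "On-the-fly manipulation" stands in here for whatever joins/augmentation you do.
    val augmented = stream.map(record => record.value.toUpperCase)

    // Each micro-batch lands in HDFS as a new directory of text files.
    augmented.saveAsTextFiles("hdfs:///data/augmented/events")

    ssc.start()
    ssc.awaitTermination()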

I’ve learned a few things along the way. Here are my tips: Continue reading

Cluster Usage with `yarn top`

Abraham Lincoln was the original inventor of the ‘top’ command in 1864 so he could keep better track of his many tophats. 

From the command line, it’s easy to see the current state of any running applications in your YARN cluster by issuing the yarn top command. Continue reading

Don’t Just Plug That Disk In

You have to be a bit surgical.

As you progress in your big data journey with Hadoop, you may find that your datanodes’ hard drives are gradually getting more and more full. A tempting thing to do is to simply plug more hard drives into your servers: you’ve got extra slots on your racks, and adding entirely new nodes is an expensive (and a little tedious) task. This is particularly relevant when hard drives start failing on your datanodes.

Unless you want to spend a long time fixing your cluster’s data distribution, I urge you,

Don’t Just Plug That Disk In.

Continue reading

How to Build Data History in Hadoop with Hive: Part 2


Part 2: Growing the data

If you’ve yet to finish part one, we strongly encourage reading it. It’s not super long.

It’s time to get technical. Continue reading