Random sampling is a technique in which each member of a population has an equal probability of being chosen. A randomly chosen sample is meant to be an unbiased representation of the total population.
In the big data world, we have an enormous total population: a population that can prove tricky to truly sample randomly. Thankfully, Hive has a few tools for realizing the dream of random sampling in the data lake.
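To make the "equal probability" idea concrete, here is a minimal sketch in plain Python rather than Hive. The `simple_random_sample` helper and the `row_ids` list are hypothetical names made up for illustration; they just stand in for records in a much larger table.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct members, each equally likely to be chosen."""
    rng = random.Random(seed)  # seeded generator so a run can be reproduced
    return rng.sample(population, k)


# Hypothetical population: row IDs standing in for records in a large table.
row_ids = list(range(1, 1001))

sample = simple_random_sample(row_ids, k=10, seed=42)
print(sample)
```

Sampling without replacement like this needs the whole population in reach of one process, which is exactly what gets tricky at data lake scale.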
Creating Hive tables is a common experience for all of us who use Hadoop. It enables us to mix and merge datasets into unique, customized tables. And there are many ways to do it.
We have some recommended tips for Hive table creation that can increase your query speeds and reduce the storage space of your tables. And it’s simpler than you might think.
Today we’ll briefly showcase how to join a static dataset in Spark with a streaming “live” dataset, otherwise known as a DStream. This is helpful in a number of scenarios, such as when you have a live stream of data from Kafka (or RabbitMQ, Flink, etc.) that you want to join with tabular data you queried from a database (or a Hive table, a file, or anything else you can normally consume into Spark).
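As a rough illustration of the idea, and not actual Spark code, the sketch below uses plain Python: a stream is treated as a sequence of small batches of key/value pairs, and each batch is inner-joined by key against a static lookup table. The `join_batches` function and the sample data are hypothetical and exist only for this example.

```python
def join_batches(static_table, batches):
    """Inner-join each micro-batch of (key, value) pairs with a static table.

    Records whose key is missing from the static table are dropped,
    which mirrors the behavior of an inner join.
    """
    for batch in batches:
        yield [
            (key, (value, static_table[key]))
            for key, value in batch
            if key in static_table
        ]


# Hypothetical static dataset, e.g. rows queried once from a database.
users = {"u1": "Alice", "u2": "Bob"}

# Hypothetical "live" data arriving as two micro-batches.
events = [
    [("u1", "click"), ("u3", "view")],
    [("u2", "purchase")],
]

joined = list(join_batches(users, events))
for batch in joined:
    print(batch)
```

The static side is loaded once and reused for every batch, which is the main appeal of this kind of join.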
I’ve been working with Hadoop, Map-Reduce and other “scalable” frameworks for a little over 3 years now. One of the latest and greatest innovations in our open source space has been Apache Spark, a parallel processing framework that’s built on the paradigm Map-Reduce introduced, but packed with enhancements, improvements, optimizations and features. You probably know about Spark, so I don’t need to give you the whole pitch.
You’re likely also aware of its main components:
Spark Core: the parallel processing engine written in the Scala programming language
Spark SQL: allows you to programmatically use SQL in a Spark pipeline for data manipulation
Spark MLlib: machine learning algorithms ported to Spark for easy use by devs
Spark GraphX: a graph processing library built on the Spark Core engine
Spark Streaming: a framework for handling data that is live-streaming at high speed
Spark Streaming is what I’ve been working on lately. Specifically, I’ve been building apps in Scala that use Spark Streaming to stream data from Kafka topics, do some on-the-fly manipulations and joins in memory, and write the newly augmented data to HDFS in “near real time”.
One of the key benefits of Hadoop is its capacity for storing large quantities of data. With HDFS (the Hadoop Distributed File System), Hadoop clusters are capable of reliably storing petabytes of your data.
A popular usage of that immense storage capability is storing and building history for your datasets. You can not only utilize it to store years of data you might currently be deleting, but you can also build on that history! And, you can structure the data within a Hadoop-native tool like Hive and give analysts SQL-querying ability to that mountain of data! And it’s pretty cheap!
…And the Pitch!
In this tutorial, we’ll walk through why this is beneficial, and how we can implement it on a technical level in Hadoop. Something for the business guy, something for the developer tasked with making the dream come true.
The point of Hadoopsters is to teach concepts related to big data, Hadoop, and analytics. To some, this article will be too simple — low hanging fruit for the accomplished dev. This article is not necessarily for you, captain know-it-all — it’s for someone looking for a reasonably worded, thoughtfully explained how-to on building data history in native Hadoop. We hope to accomplish that here.
This tutorial will accomplish a few key feats that make ingesting data to Hive far less painful. In this writeup, you will learn not only how to Sqoop a source table directly to a Hive table, but also how to Sqoop a source table in any desired format (ORC, for example) instead of just plain old text.
There are many functions in Hive that can help analyze your data. But there are times when you need more functionality, sometimes custom. Or at least functionality that is possible without paragraphs of ugly, layered-sub-queried SQL.
If you’ve spent any time with a Hortonworks Data Platform cluster, you’re familiar with Ambari. It’s one of the finest open source cluster management tools, and it allows you to easily launch a cluster, add or remove nodes, change configurations and add services to your cluster. Using Ambari takes a lot of the guesswork out of managing a Hadoop cluster and I absolutely love it.
The one downside of Ambari is that it can be tedious to add functionality to the core client. For that reason, the smart people building the tool in Apache decided to add something called an Ambari View. An Ambari View is a way to extend the functionality of Ambari without going down the rabbit hole of modifying Ambari’s source code. Views are essentially plug-and-play tools that only require restarting the Ambari server to work.
In the following blog post, I’ll discuss getting your View off the ground and show you several tips for actually using Views.