Create a Hive UDF: More Flexible Array Access

This article will show you how to create a simple UDF that offers more flexibility in interacting with arrays in Hive, such as a negative indexing approach to element access. Continue reading
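
As a taste of what the post builds toward, here is a minimal, hypothetical sketch of such a UDF using Hive's GenericUDF API. The class name, argument checks, and exact behavior are my assumptions for illustration, not necessarily the code from the post itself.

```java
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;

// Hypothetical UDF: element access on a Hive ARRAY with Python-style
// negative indexing (-1 is the last element, -2 the second to last, ...).
public class ArrayElementUDF extends GenericUDF {

  private ListObjectInspector arrayOI;
  private PrimitiveObjectInspector indexOI;

  @Override
  public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    if (args.length != 2
        || !(args[0] instanceof ListObjectInspector)
        || !(args[1] instanceof PrimitiveObjectInspector)) {
      throw new UDFArgumentException("expected (array, int index)");
    }
    arrayOI = (ListObjectInspector) args[0];
    indexOI = (PrimitiveObjectInspector) args[1];
    // The return type is whatever type the array holds.
    return arrayOI.getListElementObjectInspector();
  }

  @Override
  public Object evaluate(DeferredObject[] args) throws HiveException {
    Object array = args[0].get();
    Object index = args[1].get();
    if (array == null || index == null) {
      return null;
    }
    int n = arrayOI.getListLength(array);
    int i = PrimitiveObjectInspectorUtils.getInt(index, indexOI);
    if (i < 0) {
      i += n;            // translate a negative index from the end of the array
    }
    if (i < 0 || i >= n) {
      return null;       // out of range: return NULL instead of failing the query
    }
    return arrayOI.getListElement(array, i);
  }

  @Override
  public String getDisplayString(String[] children) {
    return "array_element(" + children[0] + ", " + children[1] + ")";
  }
}
```

Packaged into a jar, it is registered the same way as any other UDF: ADD JAR followed by CREATE TEMPORARY FUNCTION, after which it can be called in a query like a built-in.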

How to Create a Simple Hive UDF

Hive ships with many built-in functions that can help analyze your data. But there are times when you need something more, sometimes something custom, or at least functionality you can get without paragraphs of ugly, deeply nested sub-queried SQL.

That’s where Hive UDFs come in very handy. Continue reading
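
For context before clicking through: at its simplest, a Hive UDF is just a Java class with an evaluate() method. Here is a bare-bones, hypothetical example (the class name and behavior are made up for illustration):

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical example of the simplest kind of Hive UDF: one value in, one value out.
public final class ToUpperUDF extends UDF {

  public Text evaluate(Text input) {
    if (input == null) {
      return null;        // pass NULLs through, as built-in functions do
    }
    return new Text(input.toString().toUpperCase());
  }
}
```

Once compiled into a jar, ADD JAR and CREATE TEMPORARY FUNCTION make it callable from any Hive query.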

Ambari Views: Introduction

If you’ve spent any time with a Hortonworks Data Platform cluster, you’re familiar with Ambari. It’s one of the finest open source cluster management tools out there, letting you easily launch a cluster, add or remove nodes, change configurations, and add services. Using Ambari takes a lot of the guesswork out of managing a Hadoop cluster, and I absolutely love it.

The one downside of Ambari is that it can be tedious to add functionality to the core client. For that reason, the smart people building the tool at Apache decided to add something called an Ambari View. An Ambari View is a way to extend the functionality of Ambari without going down the rabbit hole of modifying Ambari’s source code. Views are essentially plug-and-play tools that only require an Ambari restart to work.

In the following blog post, I’ll discuss getting your View off the ground and show you several tips about actually using them.

Next Post: Apache Ambari: Hello World!

Apache Crunch Tutorial 8: Writing to Orc File Format

This post is the eighth in a (hopefully) substantive and informative series of posts about Apache Crunch, a framework that makes it easier for Java developers to write MapReduce programs for Hadoop.

Continue reading
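
For readers new to the series, the general shape of a Crunch program looks something like the sketch below. The paths and filtering logic are invented, and the ORC-specific target that this particular post covers is deliberately left out here so I don't misstate its API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.crunch.FilterFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;

// Hypothetical minimal Crunch pipeline: read text from HDFS, drop blank lines,
// write text back out. Real pipelines swap in richer PTypes and targets
// (such as the ORC target this post discusses).
public class SimpleCrunchPipeline {

  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(SimpleCrunchPipeline.class, new Configuration());

    PCollection<String> lines = pipeline.readTextFile("/data/input");

    PCollection<String> nonEmpty = lines.filter(new FilterFn<String>() {
      @Override
      public boolean accept(String line) {
        return line != null && !line.trim().isEmpty();
      }
    });

    pipeline.writeTextFile(nonEmpty, "/data/output");
    pipeline.done();
  }
}
```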

Preparing for the HDPCD Exam: Data Analysis With Hive

With your data in HDFS in an “analytic-ready” format (cleaned up and in common formats), you can put a Hive table on top of it.

Apache Hive is an RDBMS-like layer for data in HDFS that allows you to run batch or ad-hoc queries in a SQL-like language. This post will go over what you need to know about Apache Hive in preparation for the HDPCD Exam.  Continue reading
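
To give a flavor of what that looks like, here is a hedged sketch of putting an external Hive table over already-cleaned HDFS data and querying it over JDBC. The host, credentials, schema, and paths are invented for illustration, and on the exam itself you would more likely type the same statements into the Hive shell.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical sketch: define an external Hive table over cleaned HDFS data
// and run a query against it via HiveServer2's JDBC interface.
public class HiveExternalTableExample {

  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      // External table: Hive points at the existing HDFS directory
      // instead of moving the data into its own warehouse.
      stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS drivers ("
          + " driver_id INT, name STRING, city STRING)"
          + " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
          + " LOCATION '/data/clean/drivers'");

      try (ResultSet rs = stmt.executeQuery(
               "SELECT city, COUNT(*) FROM drivers GROUP BY city")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }
}
```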

Preparing for the HDPCD Exam: Data Transformation

Data that has just landed in HDFS is often not pretty. At the very least, it’s a little disorganized, sparse, and generally not ready for analytics. It’s a Certified Developer’s job to clean it up a little.

That’s where Apache Pig can come in handy! This post will cover the basics of transforming data in HDFS with Apache Pig, in preparation for the HDPCD Exam.  Continue reading
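
To give a flavor of the kind of transformation involved, here is a hedged sketch that drives a small Pig cleanup job from Java via PigServer. The paths, schema, and filters are invented, and on the exam the same Pig Latin would normally live in a .pig script run with the pig command.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Hypothetical sketch: load raw CSV from HDFS, drop malformed rows,
// keep only the fields needed downstream, and store the result back to HDFS.
public class PigCleanupExample {

  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Load raw data with a declared schema.
    pig.registerQuery("raw = LOAD '/data/raw/trucks.csv' USING PigStorage(',')"
        + " AS (truck_id:chararray, driver_id:chararray, miles:int);");

    // Filter out bad records and trim the relation down to what we need.
    pig.registerQuery("clean = FILTER raw BY truck_id IS NOT NULL AND miles > 0;");
    pig.registerQuery("slim = FOREACH clean GENERATE truck_id, miles;");

    // Write the transformed data back to HDFS for analysis.
    pig.store("slim", "/data/clean/trucks");
  }
}
```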

Guide to Apache Falcon #2: Cluster Entity Definitions

This series is designed to be the ultimate guide to Apache Falcon, a data governance pipeline for Hadoop. Falcon excels at giving you control over workflow scheduling, data retention and replication, and data lineage. This guide will (hopefully) excel at helping you understand and use Falcon effectively. Continue reading

How to Run a Jar in Oozie with Java Actions

You probably know how jars work. Jars, short for Java Archives, are zipped-up packages of Java class files, with or without dependencies included. In most cases a jar contains just your application code, and its dependencies live elsewhere and are provided on the classpath. While we’ll cover that topic another day, let’s focus on the task at hand: getting your jar running in Oozie. Continue reading
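
Before diving into the post, it helps to remember what an Oozie Java action actually runs: the public static void main of a class inside your jar. A hypothetical, minimal main class might look like the sketch below; the output path and marker file are made up, and the workflow.xml that names this class in its <main-class> element is not shown.

```java
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical main class for an Oozie Java action. The workflow.xml (not shown)
// would point <main-class> at this class and pass the output path as an <arg>.
public class OozieJavaActionMain {

  public static void main(String[] args) throws Exception {
    String outputPath = args.length > 0 ? args[0] : "/tmp/oozie-java-action-demo";

    // Write a small marker file to HDFS to prove the action ran.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    try (OutputStream out = fs.create(new Path(outputPath, "_ran"))) {
      out.write("hello from an Oozie Java action\n".getBytes("UTF-8"));
    }
  }
}
```

By Oozie convention, the jar containing this class goes in the workflow's lib/ directory next to workflow.xml, which puts it on the action's classpath automatically.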