Create a Hive UDF: More Flexible Array Access

 

This article shows how to create a simple UDF that makes working with arrays in Hive more flexible, for example by supporting negative indexes for element access.
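As a rough illustration of the negative-indexing idea, here is a minimal sketch of the index-resolution logic in plain Java. The class and method names (`FlexibleArrayAccess`, `elementAt`) are hypothetical, and returning `null` for out-of-range indexes is an assumption made for this sketch rather than the behavior of the UDF in the full article. A real Hive UDF would wrap logic like this in a class built against Hive's UDF API, which is not shown here.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical helper showing negative-index element access.
 * Not a Hive UDF by itself: it only holds the index logic a UDF could call.
 */
final class FlexibleArrayAccess {

    private FlexibleArrayAccess() {
        // Utility class, not meant to be instantiated.
    }

    /**
     * Returns the element at the given index. Negative indexes count from
     * the end of the list, so -1 is the last element and -2 the one before it.
     * Returns null for a null list or an out-of-range index (an assumption
     * made for this sketch).
     */
    static <T> T elementAt(List<T> values, int index) {
        if (values == null) {
            return null;
        }
        int size = values.size();
        // Shift negative indexes so that -1 maps to size - 1.
        int resolved = index < 0 ? size + index : index;
        if (resolved < 0 || resolved >= size) {
            return null;
        }
        return values.get(resolved);
    }

    public static void main(String[] args) {
        List<String> letters = Arrays.asList("a", "b", "c", "d");
        System.out.println(elementAt(letters, 0));   // first element
        System.out.println(elementAt(letters, -1));  // last element
        System.out.println(elementAt(letters, -5));  // out of range
    }
}
```

Keeping the index arithmetic in a small static method like this makes it easy to unit test without a Hive installation.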

Apache NiFi 1.0.0-BETA!

Apache NiFi is changing the way people move huge amounts of data. It’s never been easier (or cheaper) to move and transform raw data into meaningful insight.

That said, the release notes for Apache NiFi 1.0.0-BETA came out a couple of days ago! These are exciting times as the project matures.

Here are the biggest updates:

  • A brand new UI (!!!)
    • Cleaner lines, better toolbars, more modern look. Things will be easier to find.

  • Zero master clustering
    • This means that NiFi has become much more resilient to failure while staying lightweight. Individual nodes “elect” a leader to handle coordination across the cluster. This is similar to how YARN High-Availability works.
  • Actual multi-user tenancy
    • There’s a drop-down for admins to create users and set permissions, visibility, etc. Before, there were really only two types of users: Admins, who ruled the whole workflow, and Managers, who could make changes to workflows. Now, access to individual components can be controlled for individual users.
  • Version control for templates
    • This is really huge. The upload/implement template function of NiFi was fairly well hidden and obscure before. With version control, I’d hope that a user would be able to check in/out various branches of the template XML files easily.

Beyond that, there are a few more built-in processors now and a ton of bug fixes.

Please note, I’m not a contributor/committer of Apache NiFi, but I do follow and use the tool fairly regularly. It’s exciting to see the community making these changes. If you see something in the release notes that excites you, let us know below!

How to Create a Simple Hive UDF

There are many functions in Hive that can help analyze your data. But there are times when you need more functionality, sometimes custom. Or at least functionality that doesn’t take paragraphs of ugly, deeply nested subqueries to achieve.

That’s where Hive UDFs come in very handy.

Ambari Views: Introduction

If you’ve spent any time with a Hortonworks Data Platform cluster, you’re familiar with Ambari. It’s one of the finest open source cluster management tools, and it lets you easily launch a cluster, add or remove nodes, change configurations, and add services. Using Ambari takes a lot of the guesswork out of managing a Hadoop cluster, and I absolutely love it.

The one downside of Ambari is that it can be tedious to add functionality to the core client. For that reason, the smart people building the tool at Apache decided to add something called an Ambari View. An Ambari View is a way to extend the functionality of Ambari without going down the rabbit hole of modifying Ambari’s source code. Views are essentially plug-and-play tools that only require restarting your cluster to work.

In the following blog post, I’ll discuss getting your first View off the ground and share several tips for actually using Views.

Next Post: Apache Ambari: Hello World!

Apache Crunch Tutorial 9: Reading & Ingesting ORC Files

This post is the ninth in a hopefully substantive and informative series of posts about Apache Crunch, a framework for enabling Java developers to write Map-Reduce programs more easily for Hadoop.

Guide to Apache Oozie #2: Understanding Workflows

This series is designed to be a “get off the ground” guide to Apache Oozie, a job scheduling framework for Hadoop. Oozie offers multi-action workflow scheduling, the ability to run actions in parallel, and great APIs. This guide is designed to help you answer your Oozie technical questions.

Guide to Apache Oozie #1: Introducing Oozie

This series is designed to be a “get off the ground” guide to Apache Oozie, a job scheduling framework for Hadoop. Oozie offers multi-action workflow scheduling, the ability to run actions in parallel, and great APIs. This guide is designed to help you answer your Oozie technical questions.

Getting Started with Apache Zeppelin

If the Hindenburg taught us anything, it’s that you don’t want to mix Zeppelins with Sparks.

Apache Zeppelin, however, is a wonderful tool that combines Apache Spark with interactive data analytics and shareable notebooks and makes your big data usable!

Just don’t fill your computer with hydrogen gas.

Let’s get it installed and then put it to work!