How to Create a Simple Hive UDF

Hive ships with many built-in functions that can help you analyze your data. But there are times when you need more functionality, sometimes custom, or at least functionality that doesn’t require paragraphs of ugly, layered, sub-queried SQL.

That’s where Hive UDFs come in very handy.

A UDF, or user-defined function, is just that: a custom function written by the user to serve a specific purpose. A UDF is most commonly written in Java and, at its simplest, is little more than a few lines of code that take one or more column values from a record as input and return an output.

To create a UDF, you must create a Java class that extends Hive’s UDF class. You can download our free open source template project from GitHub (coming soon), which can be modified out of the box for your needs. Most importantly, it will help you work through and understand this tutorial on UDFs.

In a nutshell, your Java class:

  • Imports the org.apache.hadoop.hive.ql.exec.UDF class
  • Includes an evaluate method whose arguments are the data types the Hive function will accept. This is where the “custom functionality” comes into play.

In our example code below, our class is called HelloColumn, and its method is called evaluate (this is the method name Hive expects to find in a UDF class). The class extends UDF.

package com.hadoopsters.hive;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple UDF that returns its input with the string "Hello, " prepended.
public class HelloColumn extends UDF {

    public Text evaluate(Text input) {
        // Pass null values through untouched.
        if (input == null) return null;
        return new Text("Hello, " + input.toString());
    }
}

Our evaluate function takes one argument of type Text (org.apache.hadoop.io.Text), called input, which represents the column value Hive passes into the function for each record. This will make more sense toward the end of the tutorial when we actually use this function.
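As an aside on why the method name matters: Hive looks up a method named evaluate on your class and picks the overload that fits the argument types. Here is a rough, simplified sketch of that idea using plain Java reflection. This is not Hive’s actual resolver; GreetingFunctions and EvaluateDispatchSketch are made-up names, and String/Integer stand in for Hive’s Text and friends so the sketch runs without Hive on the classpath.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for a UDF class with two evaluate overloads.
class GreetingFunctions {
    // Chosen when the argument is a String.
    public String evaluate(String input) {
        if (input == null) return null;
        return "Hello, " + input;
    }

    // Chosen when the argument is an Integer.
    public String evaluate(Integer input) {
        if (input == null) return null;
        return "Hello, employee #" + input;
    }
}

class EvaluateDispatchSketch {
    // Simplified name-based lookup; assumes arg is non-null.
    static Object call(Object udf, Object arg) {
        try {
            Method m = udf.getClass().getMethod("evaluate", arg.getClass());
            m.setAccessible(true); // the demo class above is not public
            return m.invoke(udf, arg);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not call evaluate", e);
        }
    }

    public static void main(String[] args) {
        GreetingFunctions udf = new GreetingFunctions();
        System.out.println(call(udf, "Carl")); // uses the String overload
        System.out.println(call(udf, 42));     // uses the Integer overload
    }
}
```

The takeaway is only that the name evaluate is a convention the framework relies on, so a method called anything else would not be found.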

This part is optional but highly recommended: inside evaluate, we first check input to see if it’s null. If it is, we return null. We clearly can’t do anything with that record, or don’t want to. If your use case calls for something else, handle this condition differently.

Under normal circumstances, where we have received something legitimate (not null) in input, we can do something to or with it. In this case, we simply return it with the string “Hello, ” in front of it. That’s it. Dead simple UDF right here, folks.
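If you want to play with the null handling before building a JAR, here is a minimal plain-Java sketch of the two options. It assumes String as a stand-in for Hive’s Text so it runs without Hive installed; NullHandlingSketch, greetOrDefault, and the “stranger” placeholder are made up for illustration.

```java
// Plain-Java stand-in for the evaluate logic (String replaces Hive's Text).
class NullHandlingSketch {
    // Same policy as HelloColumn: a null input yields a null output.
    static String greetOrNull(String input) {
        if (input == null) return null;
        return "Hello, " + input;
    }

    // Hypothetical alternative: substitute a placeholder instead of returning null.
    static String greetOrDefault(String input, String placeholder) {
        return "Hello, " + (input == null ? placeholder : input);
    }

    public static void main(String[] args) {
        System.out.println(greetOrNull("Carl"));
        System.out.println(greetOrNull(null));
        System.out.println(greetOrDefault(null, "stranger"));
    }
}
```

Returning null keeps the function consistent with how missing values usually flow through a query; the placeholder version is only worth it if downstream consumers can’t cope with nulls.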

Once you’ve built this class, package it into a JAR and get it into Hadoop (on an edge node or in HDFS; we recommend HDFS) so it can be added to Hive. Once it’s in HDFS, run these commands in Hive:

USE my_database;

CREATE FUNCTION HelloColumn AS 'com.hadoopsters.hive.HelloColumn' USING JAR 'hdfs://myhadoopcluster.mycompany.com/my/hive/udfs/hive-udf-template-1.0-SNAPSHOT.jar';

This results in a function that takes text as input and returns that same text with our new phrase in front of it. For example, suppose we ran a query like this:

SELECT HelloColumn(name) FROM employees;

Let’s pretend the employees table has three records, with Carl, Regina, and Susie as the respective values in the name column. When we run the above query, we would get the following output back:

Hello, Carl

Hello, Regina

Hello, Susie

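This is exactly how other row-level Hive functions behave, like trim, upper, or lower: one value in, one value out, for every record. (Aggregates like sum and max are a different kind of function, since they collapse many records into one result.)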
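To make the row-at-a-time behavior concrete, here is a small plain-Java sketch that mimics what the query does with our three names, next to an aggregate for contrast. It is only an analogy built on Java streams, not how Hive executes queries; RowWiseSketch and its method names are made up, and String again stands in for Text.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class RowWiseSketch {
    // Stand-in for HelloColumn.evaluate, using String instead of Text.
    static String greet(String name) {
        if (name == null) return null;
        return "Hello, " + name;
    }

    // Row-level function: one output value per input row.
    static List<String> applyPerRow(List<String> names) {
        return names.stream().map(RowWiseSketch::greet).collect(Collectors.toList());
    }

    // Aggregate, for contrast: many rows in, a single value out.
    // Assumes the list contains no nulls.
    static int longestNameLength(List<String> names) {
        return names.stream().mapToInt(String::length).max().orElse(0);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Carl", "Regina", "Susie");
        applyPerRow(names).forEach(System.out::println); // three lines, one per row
        System.out.println(longestNameLength(names));    // one number for all rows
    }
}
```

A function like HelloColumn lives in the first camp, which is why the query returns as many rows as the table has.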

Side notes:

  • You can create temporary functions (just add the word temporary in front of function in your create statement) as well as permanent ones.
  • Functions, like tables, are tied to a database when created. They’re not limited to use with that database, but they must be associated with one when created, even if it’s default. That’s just how it works.
  • When you want to use a function, either run the “use my_database” command in Hive first or qualify the function name with its database (for example, my_database.HelloColumn). Otherwise your function won’t be recognized, just like a table.
  • You can drop functions just like you do tables: drop function my_function;

Let me know how it works for you! Start playing around with this concept to see what you can make!

Next time, we’ll make a function that does something cooler than put the phrase “Hello” in front of a value. We’ll create an array function that extends how we can use and manipulate arrays in Hive, particularly with negative/wrap-around indexing!
