Create a Hive UDF: More Flexible Array Access

 

webHeaderHadoopstersNew

This article will show you how to create a simple UDF that offers more flexibility in interacting with arrays in Hive, such as a negative indexing approach to element access.

An array function in Hive, for the sake of this article, is a function that creates or allows you to manipulate/access values in an array. Potentially the most common of these, split, allows you to split a string column on a delimiter, like a comma, and place the split values that are in between those delimiters into an array. You can then access the elements of that array if you know their exact position.

Assume in the below example that you have a table of sales data, which includes a string column of comma-seperated products called purchasedProducts. Assume also that we want the first element of the list (the first product someone purchased). Like so:

select split(purchasedProducts, ',')[0] from sales;

That’s probably the most common, recurring scenario in data processing, and we frankly do it a lot in the big data & programatic world.

Normal array accessing with brackets is fine if you know exactly how many elements your array has or will have. But what if you want something more dynamic? What if you always want the last element in the array, but the length of the array always changes? You could use a length function and sub-query that in some frankly ugly SQL.

But what if we could do that in a short, one-word function instead? What if we could walk backwards through an array with negative indexing, like you can in Python? Wouldn’t that make for some shorter functional code? Wouldn’t that look and feel better? Yes.

And by creating a super simple UDF, that’s easily attainable. This tutorial will show you how. If you’ve never built a UDF before, I recommend you start with the previous tutorial, which offers more insight into what UDFs are, how they’re created, and how they’re deployed, registered, and used on a cluster. Spoiler: it’s not hard at all.

For this tutorial, we’re going to just give you the Java UDF code for an array that supports negative indexing (that’s how we solve the length problem, read up on the topic if you’re unfamiliar here), and let you do with it what you please.

Here’s the code (explained below):

package com.hadoopsters.hive;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;

import java.util.List;

/**
 * This Hive UDF was designed to fetch a single value from an array, and supports negative indexing.
 * The user provides the array and the desired position. Eg: 0, 1, 2, -1 (last item), -2 (second to last item), etc.
 */
@Description(
        name="fetch_from_array",
        value="Returns x, where x is the nth value in the array. Supports negative indexing.",
        extended="select fetch_from_array(lpd, 0), fetch_from_array(lpd, -1) from didb.ke_dedupe_hlpd limit 1;"
)
public class FetchFromArray extends UDF {
  
  public String evaluate(List<String> given_array, Integer index) {

    // Evalulate the inputs
    if(given_array == null || index > given_array.size()) return null;

    // Prepare an Integer to store Desired Value
    Integer desired_position;

    // If Given Index position is Negative, Count back from end of Array
    if(index < 0){
      desired_position = given_array.size()+index;
    }
    // Otherwise, use the provided Index
    else {
      desired_position = index;
    }

    // Return the cleaned HLPD that has any shared attributes with the LPD removed
    return given_array.get(desired_position).toString();
  }
}

The above class, FetchFromArray

  • inherits the UDF functionality from the Apache Hive libraries
  • houses a method called evaluate that takes a String record (data) and an Integer argument (desired element by its position in the array)
    • the record is evaluated to ensure it’s not null, and that the desired position does not exceed the size of the array
    • if the desired position (index) is negative, it means the user wants to go to the end of the array and count backwards (-1 is the last element, -2 is second to last, etc)
    • if the desired position is 0 or otherwise, it means the user wants the position of the element requested (0 for first, 1 for second, 2 for third, etc)
    • return the value from the desired position in the array

That’s it! Pretty simple, right? Now, let’s use it by pushing it to the cluster (HDFS), and use it in Hive on one of our data sets.

Thanks for reading! Code is available on our GitHub (coming soon), and is free to use, change and share. Go nuts, Hadoopsters.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s