Create a Hive UDF: More Flexible Array Access



This article will show you how to create a simple UDF that offers more flexibility in interacting with arrays in Hive, such as a negative indexing approach to element access.
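To give a feel for what negative indexing means here, below is a minimal sketch of the index-resolution logic in plain Java. It is not the UDF from the article: the class name `FlexibleArrayAccess` and the method `elementAt` are hypothetical, and a real Hive UDF would wrap logic like this behind Hive's own UDF classes. Returning `null` for an out-of-range index is a choice made for this sketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Hive.
public class FlexibleArrayAccess {

    /**
     * Returns the element at the given index. Negative indexes count
     * from the end of the list, so -1 is the last element and -2 is the
     * one before it. Returns null for a null list or an index that falls
     * outside the list in either direction.
     */
    public static <T> T elementAt(List<T> array, int index) {
        if (array == null) {
            return null;
        }
        int size = array.size();
        // Translate a negative index into its position from the start.
        int resolved = index < 0 ? size + index : index;
        if (resolved < 0 || resolved >= size) {
            return null;
        }
        return array.get(resolved);
    }

    public static void main(String[] args) {
        List<String> letters = Arrays.asList("a", "b", "c");
        System.out.println(elementAt(letters, 0));   // first element
        System.out.println(elementAt(letters, -1));  // last element
        System.out.println(elementAt(letters, 5));   // out of range
    }
}
```

The single `size + index` translation is the whole trick: once the index is resolved, the bounds check and lookup are the same as for ordinary positive indexes.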


Guide to Apache Falcon #3: Feed Entity Definitions


This series is designed to be the ultimate guide on Apache Falcon, a data governance pipeline for Hadoop. Falcon excels at giving you control over workflow scheduling, data retention and replication, and data lineage. This guide will (hopefully) excel at helping you understand and use Falcon effectively.

What is a Hadoop Developer?


The Big Data industry has a problem: what makes a Hadoop Developer? Is it someone who has general knowledge about the many tools available in a typical Hadoop ecosystem? Or is it someone who regularly commits to the Apache projects and pushes Hadoop to new levels? I think it’s somewhere between the two.

Apache Crunch Tutorial #5: Hadoop Configurations


This post is the fifth in a hopefully substantive and informative series of posts about Apache Crunch, a framework that makes it easier for Java developers to write MapReduce programs for Hadoop.

Benefits of the ORC File Format in Hadoop, and Using It in Hive


As a developer/engineer in the Hadoop and Big Data space, you tend to hear a lot about file formats. Each has its own benefits and trade-offs: storage savings, splittability, compression time, decompression time, and much more. All of these factors play a huge role in which file formats you use for your projects, or adopt as a team or company-wide standard.