Apache Crunch Tutorial 9: Reading & Ingesting ORC Files

This post is the ninth in a hopefully substantive and informative series of posts about Apache Crunch, a framework that makes it easier for Java developers to write MapReduce programs for Hadoop.

Continue reading

Guide to Apache Falcon #5: Tighter Oozie Integration

This series is designed to be the ultimate guide on Apache Falcon, a data governance pipeline for Hadoop. Falcon excels at giving you control over workflow scheduling, data retention and replication, and data lineage. This guide will (hopefully) excel at helping you understand and use Falcon effectively. Continue reading

Guide to Apache Falcon #4: Process Entity Definitions

This series is designed to be the ultimate guide on Apache Falcon, a data governance pipeline for Hadoop. Falcon excels at giving you control over workflow scheduling, data retention and replication, and data lineage. This guide will (hopefully) excel at helping you understand and use Falcon effectively. Continue reading

Guide to Apache Falcon #3: Feed Entity Definitions

This series is designed to be the ultimate guide on Apache Falcon, a data governance pipeline for Hadoop. Falcon excels at giving you control over workflow scheduling, data retention and replication, and data lineage. This guide will (hopefully) excel at helping you understand and use Falcon effectively. Continue reading

Guide to Apache Falcon #2: Cluster Entity Definitions

This series is designed to be the ultimate guide on Apache Falcon, a data governance pipeline for Hadoop. Falcon excels at giving you control over workflow scheduling, data retention and replication, and data lineage. This guide will (hopefully) excel at helping you understand and use Falcon effectively. Continue reading

Preparing for the HDPCD Exam: Data Ingestion

To do big data, you need… DATA. No surprise there! Hadoop is a different beast from other environments, and getting data into HDFS can be a bit intimidating if you’re not familiar with it. If only there were good documentation about these tasks…

Luckily, there is good documentation! This post covers the basics of ingesting data into a Hadoop cluster, following the HDPCD Exam study guide. Continue reading

Guide to Apache Falcon #1: Introducing Falcon

This series is designed to be the ultimate guide on Apache Falcon, a data governance pipeline for Hadoop. Falcon excels at giving you control over workflow scheduling, data retention and replication, and data lineage. This guide will (hopefully) excel at helping you understand and use Falcon effectively. Continue reading

What is a Hadoop Developer?

The Big Data industry has a problem: what makes a Hadoop Developer? Is it someone who has general knowledge about the many tools available in a typical Hadoop ecosystem? Or is it someone who regularly commits to the Apache projects and pushes Hadoop to new levels? I think it’s somewhere in between. Continue reading

Moving Data Within or Between Hadoop Clusters with DistCp

Copying chunks of data in and around Hadoop is relatively trivial, but moving larger chunks can be time-consuming or needlessly complicated. Sometimes you even want to move data between Hadoop clusters (if you have two or more). In this article, I’ll show you a great way to handle all of these scenarios. Continue reading

Analyzing Chicago Crime Data with Apache Hive on HDP 2.3

After a brief hiatus in the great state of Alaska, I’m back to discuss actually analyzing data on the new Hadoop cluster that we set up together in previous blog posts. Specifically, we’ll be looking at crime data from the City of Chicago from 2001 to the day this was first written, 8/26/2015. There are a couple of things we need to take care of before we get started though, Sherlock.

Continue reading