Preparing for the HDPCD Exam: Data Ingestion


To do big data, you need… DATA. No surprise there! Hadoop is a different beast from other environments, and getting data into HDFS can be a bit intimidating if you’re not familiar with it. If only there were good documentation about these tasks…

Luckily there is good documentation! This post will cover the basics involved in ingesting data into a Hadoop cluster using the HDPCD Exam study guide. 

The first part of the study guide indicates that a Certified Developer should be able to read data from any relational database, load data from any log source, and interact with the WebHDFS API.

If you’re familiar with Hadoop, many of these commands will seem trivial, but just like high school mathematics, you need the basics before you can understand the complex stuff later on.

  • Input a local file into HDFS using the Hadoop file system shell

hdfs dfs -put <filename> <location>

  • Make a new directory in HDFS using the Hadoop file system shell

  • Note that the -p flag will make an entire path for you if the directories don’t exist in the first place.
hdfs dfs -mkdir [-p] <path>
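
In practice the two commands above tend to travel together: make sure the target directory exists, then upload into it. Here is a minimal sketch of that, assuming a made-up helper named ingest_local_file and a made-up DRY_RUN switch (neither is part of Hadoop). With DRY_RUN=1 it only prints the hdfs commands, which is handy for checking paths before touching the cluster.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Hadoop.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# ingest_local_file <local_file> <hdfs_dir>
# Creates the HDFS directory (with any missing parents), then uploads the file.
ingest_local_file() {
    local_file=$1
    hdfs_dir=$2
    # The upload is attempted only if the directory step succeeds.
    run hdfs dfs -mkdir -p "$hdfs_dir" &&
        run hdfs dfs -put "$local_file" "$hdfs_dir"
}
```

For example, DRY_RUN=1 ingest_local_file access.log /user/me/logs prints the mkdir command followed by the put command without running either (the file and directory names here are placeholders).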

  • Import data from a table in a relational database into HDFS

sqoop import --connect jdbc:<type>://<hostname>/<database> --username <username> --password <password> --table <table_name>

  • Import the results of a query from a relational database into HDFS

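  • The “WHERE $CONDITIONS” token is required when using --query. Sqoop replaces it with a different range condition on the --split-by column for each mapper, which is how the Map/Reduce tasks divide the work against the RDBMS. Keep the query in single quotes so the shell doesn’t expand $CONDITIONS, and note that --query also requires --target-dir.
sqoop import --connect <connect_string> --username <username> --password <password> --query 'select * from <db>.<table_name> WHERE $CONDITIONS' --split-by <db>.<table_name>.<column_name> --target-dir <target_dir>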
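
To make the role of $CONDITIONS a little more concrete: Sqoop looks up the minimum and maximum of the split-by column, divides that range among the mappers, and each mapper runs its own copy of the query with $CONDITIONS swapped for its slice of the range. The sketch below imitates that idea in plain shell. It is an illustration only: the function show_splits, the table sales.orders, and the column id are invented for the example, and the exact predicate text Sqoop generates may differ.

```shell
#!/bin/sh
# Illustration only: imitate how a free-form query is divided among mappers.
# show_splits <query> <split_column> <min> <max> <num_mappers>
show_splits() {
    query=$1 col=$2 min=$3 max=$4 mappers=$5

    # The query must carry the literal $CONDITIONS token.
    case $query in
        *\$CONDITIONS*) ;;
        *) echo "query must contain \$CONDITIONS" >&2; return 1 ;;
    esac

    # Text before and after the token, so the predicate can be spliced in.
    before=${query%%\$CONDITIONS*}
    after=${query#*\$CONDITIONS}

    # Slice width, rounded up so the slices cover the whole range.
    step=$(( (max - min + mappers) / mappers ))
    lo=$min
    i=1
    while [ "$i" -le "$mappers" ]; do
        hi=$(( lo + step ))
        # The final slice is closed at max so the top value is not lost.
        if [ "$i" -eq "$mappers" ] || [ "$hi" -gt "$max" ]; then
            echo "mapper $i: ${before}${col} >= ${lo} AND ${col} <= ${max}${after}"
            break
        fi
        echo "mapper $i: ${before}${col} >= ${lo} AND ${col} < ${hi}${after}"
        lo=$hi
        i=$(( i + 1 ))
    done
}

show_splits 'SELECT * FROM sales.orders WHERE $CONDITIONS' id 1 1000 4
```

Each printed line is the query one mapper would run over its own slice. Without the token there would be nowhere to put the range condition, so every mapper would pull the full result set.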

  • Import a table from a relational database into a new or existing Hive table

sqoop import --connect <connect_string> --username <username> --password <password> --table <table_name> --hive-import [--hive-overwrite] --hive-table <hive_table>

  • Insert or update data from HDFS into a table in a relational database

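  • By default the export only inserts rows. Add --update-key to update rows that match on that column, and --update-mode allowinsert to also insert the rows that don’t match (if the database supports it).
sqoop export --connect <connect_str> --username <username> --password <password> --table <table_name> --export-dir <export_path> [--update-key <column_name>] [--update-mode allowinsert]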

  • Use WebHDFS to create and write to a file in HDFS

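  • I honestly didn’t know this before writing this post; I’ve always used Hue or just the command line to make files in HDFS. It’s good to know, though. WebHDFS listens on the NameNode’s HTTP port (50070 by default in Hadoop 2), not the RPC port 8020. The first request carries no data and is answered with a redirect to a DataNode; the file content goes to that address in a second request.
curl -i -X PUT 'http://<namenode>:<http_port>/webhdfs/v1/<PATH>?op=CREATE'
curl -i -X PUT -T <local_file> '<location_from_redirect>'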
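
If you want to script the WebHDFS upload, the fiddly part is pulling the DataNode address out of the NameNode’s redirect response and sending the file there. Here is a rough sketch under some assumptions: the host name is a placeholder, 50070 is assumed as the NameNode HTTP port (the Hadoop 2 default), the cluster uses simple authentication via user.name, and the function names are my own.

```shell
#!/bin/sh
# Placeholders for your cluster; override them in the environment.
NAMENODE_HOST=${NAMENODE_HOST:-namenode.example.com}
NAMENODE_HTTP_PORT=${NAMENODE_HTTP_PORT:-50070}

# Build the CREATE URL for an absolute HDFS path and a user name.
webhdfs_create_url() {
    hdfs_path=$1
    user=$2
    echo "http://${NAMENODE_HOST}:${NAMENODE_HTTP_PORT}/webhdfs/v1${hdfs_path}?op=CREATE&user.name=${user}&overwrite=false"
}

# webhdfs_put <local_file> <hdfs_path> <user>
webhdfs_put() {
    local_file=$1
    hdfs_path=$2
    user=$3
    # Step 1: ask the NameNode where to write. No file data is sent yet;
    # the reply is a redirect whose Location header names a DataNode.
    location=$(curl -s -i -X PUT "$(webhdfs_create_url "$hdfs_path" "$user")" |
        tr -d '\r' | awk 'tolower($1) == "location:" { print $2 }')
    [ -n "$location" ] || { echo "no redirect received" >&2; return 1; }
    # Step 2: send the file content to the DataNode URL from step 1.
    curl -s -i -X PUT -T "$local_file" "$location"
}
```

Nothing is sent until webhdfs_put is called, for example webhdfs_put data.csv /user/me/data.csv me (all three arguments are placeholders).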

  • Given a Flume configuration file, start a Flume agent

bin/flume-ng agent -n $agent_name -c conf -f conf/

  • Given a configured sink and source, configure a Flume memory channel with a specified capacity

  • The part between the two “# Memory channel” comments is what actually configures the Flume memory channel. I included the rest of the file because, out of context, it’s very confusing.
# Name the components on this agent
# Note the name of the agent, a1, should be specified as $agent_name in the above command
a1.sources = r1
a1.sinks = k1

# Specify your channel names below:
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
# Memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Memory channel

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
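
A quick word on the two numbers in the memory channel: capacity is the most events the channel will hold at once, and transactionCapacity is the most events a source can put, or a sink can take, in a single transaction. Flume refuses a memory channel whose transactionCapacity is larger than its capacity, so it can be worth checking before starting the agent. The sketch below does that for a properties file; the function names and the check itself are my own, not something Flume ships.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of Flume.

# get_prop <file> <key>: print the value of "key = value", skipping comments.
get_prop() {
    awk -F'=' -v key="$2" '
        /^[ \t]*#/ { next }
        { k = $1; gsub(/[ \t]/, "", k) }
        k == key { v = $2; gsub(/[ \t]/, "", v); print v; exit }
    ' "$1"
}

# check_memory_channel <file> <agent> <channel>
check_memory_channel() {
    file=$1
    prefix="$2.channels.$3"
    type=$(get_prop "$file" "$prefix.type")
    capacity=$(get_prop "$file" "$prefix.capacity")
    txn=$(get_prop "$file" "$prefix.transactionCapacity")

    if [ "$type" != "memory" ]; then
        echo "$prefix is not a memory channel"
        return 1
    fi
    # This sketch expects both values to be set explicitly.
    if [ -z "$capacity" ] || [ -z "$txn" ]; then
        echo "capacity and transactionCapacity must both be set"
        return 1
    fi
    if [ "$txn" -gt "$capacity" ]; then
        echo "transactionCapacity ($txn) exceeds capacity ($capacity)"
        return 1
    fi
    echo "ok: capacity=$capacity transactionCapacity=$txn"
}
```

Run against the configuration above (agent a1, channel c1), it would report capacity 1000 and transactionCapacity 100 as a valid pair.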

So that’s it! The first part of the HDPCD Exam study guide only covers the basics of data ingestion. Hive and Pig are next!
