In order to do big data, you need… DATA. No surprise there! Hadoop is a different beast from other environments, and getting data into HDFS can be a bit intimidating if you’re not familiar with it. If only there were good documentation about these tasks…
Luckily there is good documentation! This post will cover the basics involved in ingesting data into a Hadoop cluster using the HDPCD Exam study guide.
The first part of the study guide indicates that a Certified Developer should be able to read data from a relational database, load data from a log source, and interact with the WebHDFS API.
Many of these commands will seem trivial if you’re already familiar with Hadoop, but just like high school mathematics, you need the basics before you can understand the complex stuff later on.
Input a local file into HDFS using the Hadoop file system shell
hdfs dfs -put <filename> <location>
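As a concrete example (the file name and path here are made up, and a running cluster is assumed):

```shell
# Copy a local file into the current user's HDFS home directory
hdfs dfs -put sales.csv /user/hdfs/sales.csv

# Verify the upload
hdfs dfs -ls /user/hdfs
```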
Make a new directory in HDFS using the Hadoop file system shell
- Note that the -p flag will make an entire path for you if the directories don’t exist in the first place.
hdfs dfs -mkdir [-p] <path>
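For instance, to create a nested path in one shot (path is hypothetical):

```shell
# -p creates every missing directory along the path, like mkdir -p on Linux
hdfs dfs -mkdir -p /user/hdfs/raw/2016/01
```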
Import data from a table in a relational database into HDFS
sqoop import --connect jdbc:<type>://<hostname>/<database> --username <username> --password <password> --table <table_name>
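Filled in for a MySQL database (the host, database name, and credentials below are made up for illustration):

```shell
# Import the "customers" table from a MySQL database into HDFS
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop_user \
  --password sqoop_pass \
  --table customers
```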
Import the results of a query from a relational database into HDFS
- The “WHERE $CONDITIONS” token is required whenever you use --query. Sqoop substitutes it with split boundaries so that each map task imports a distinct slice of the result set from the RDBMS.
sqoop import --connect jdbc:<type>://<hostname>/<database> --username <username> --password <password> --query 'SELECT * FROM <table_name> WHERE $CONDITIONS' --split-by <column_name> --target-dir <hdfs_path>
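A worked example against the same hypothetical MySQL database, importing only a filtered subset of a table (note that your own WHERE clause and $CONDITIONS are combined with AND):

```shell
# Import only high-value orders; Sqoop replaces $CONDITIONS with split
# predicates so each map task pulls a distinct slice of the result set
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop_user --password sqoop_pass \
  --query 'SELECT * FROM orders WHERE total > 100 AND $CONDITIONS' \
  --split-by order_id \
  --target-dir /user/hdfs/high_value_orders
```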
Import a table from a relational database into a new or existing Hive table
sqoop import --connect <connect_string> --username <username> --password <password> --table <table_name> --hive-import [--hive-overwrite] --hive-table <hive_table>
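With the placeholders filled in (again, host and credentials are hypothetical), importing straight into a Hive table looks like this:

```shell
# Import the table and register it in Hive in one step;
# --hive-overwrite would replace the data if the Hive table already exists
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop_user --password sqoop_pass \
  --table customers \
  --hive-import \
  --hive-table default.customers
```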
Insert or update data from HDFS into a table in a relational database
- By default, sqoop export only inserts rows. To update existing rows as well, add --update-key with the column(s) to match on, plus --update-mode allowinsert so unmatched rows are still inserted.
sqoop export --connect <connect_string> --username <username> --password <password> --table <table_name> --export-dir <export_path> [--update-key <column_name>] [--update-mode allowinsert]
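An upsert-style export, with hypothetical names throughout:

```shell
# Rows whose customer_id already exists in the target table are UPDATEd;
# everything else is INSERTed
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop_user --password sqoop_pass \
  --table customers \
  --export-dir /user/hdfs/customers \
  --update-key customer_id \
  --update-mode allowinsert
```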
Use WebHDFS to create and write to a file in HDFS
- I honestly didn’t know this before writing this post, I’ve always used Hue or just the command line to make files in HDFS. It’s good to know though.
- Note that WebHDFS listens on the NameNode’s HTTP port (50070 by default on HDP 2.x), not the RPC port 8020. Creating a file is a two-step dance: the first PUT returns a 307 redirect to a DataNode, and you then send the file contents to the redirected URL.
curl -i -X PUT 'http://<namenode>:<http_port>/webhdfs/v1/<PATH>?op=CREATE[&overwrite=<true|false>][&blocksize=<LONG>][&replication=<SHORT>][&permission=<OCTAL>][&buffersize=<INT>]'
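The two steps, with a hypothetical NameNode host and user:

```shell
# Step 1: ask the NameNode where to write; it responds with HTTP 307
# and a Location header pointing at a DataNode
curl -i -X PUT \
  'http://namenode.example.com:50070/webhdfs/v1/user/hdfs/test.txt?op=CREATE&user.name=hdfs'

# Step 2: send the file contents to the URL from the Location header
curl -i -X PUT -T test.txt '<Location header URL from step 1>'
```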
Given a Flume configuration file, start a Flume agent
bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties
Given a configured sink and source, configure a Flume memory channel with a specified capacity
- The part between the “# Memory channel” comments is what actually configures the Flume memory channel. I included the rest of the file because, out of context, it’s very confusing.
# Name the components on this agent
# Note the name of the agent, a1, should be specified as $agent_name in the above command
a1.sources = r1
a1.sinks = k1
# Specify your channel's names below:
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
# Memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Memory channel

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
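To see the whole pipeline work end to end (assuming the config above is saved as conf/flume-conf.properties and Flume is installed):

```shell
# Start the agent; -n must match the agent name used in the config (a1)
bin/flume-ng agent -n a1 -c conf -f conf/flume-conf.properties

# In another terminal, send a test event to the netcat source;
# the logger sink prints it in the agent's log output
echo "hello flume" | nc localhost 44444
```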
So that’s it! The first part of the HDPCD Exam study guide only covers the basics of data ingestion. Hive and Pig are next!