If you have a Hadoop cluster, it’s rare that you don’t have some traditional row-column data you want to query. To do queries on that RDBMS (Relational Database Management System) data, you’ll want to pull that data from its system (perhaps a SQL Server, Oracle Database, or Teradata warehouse), and store it on Hadoop.
Thankfully, one of the oldest and most reliable Apache projects for doing this very thing is at your disposal: Apache Sqoop. Sqoop is a command-line tool for transferring data between a relational database (SQL) and Hadoop’s HDFS (Hadoop Distributed File System). It supports incremental loads of single tables, free form SQL queries, and even exports back out to SQL systems (if you want to pull data, do some processing on Hadoop, and push it back).
In today’s example, I’ll show you how to pull an entire Teradata table to Hadoop in a just a few short steps. Sqoop has two primary modes: import and export. As you’d expect, an Import command allows you to import data to Hadoop from RDBMS, and an Export command allows you to push data from HDFS to an RDBMS system. Let’s focus on importing for now.
Assuming you have Sqoop installed on your system, you just need a single command to get a table pulled to Hadoop from Teradata. Work with your friendly neighborhood admin to get it if not, and if you’re your own admin, installation steps are here.
A Sqoop Import command looks like this:
/usr/bin/sqoop import \ --driver com.teradata.jdbc.TeraDriver \ --connect jdbc:teradata://myTeradataHostname/Database=myDatabase \ --username myUsername \ --password -P \ --table tableIWantToPull \ --target-dir /path/in/hdfs/to/store/table/data \ --split-by columnToSplitBy -m numMappers \ --fields-terminated-by '^' \ --fetch-size=100000
Now while that’s all well and cool, you might not know what each piece does — and it’s important that you do. Let’s break it down, line by line:
|1||/usr/bin/sqoop import||This statement tells the Sqoop library what command you’d like to use (in this scenario, it’s import). What precedes it is simply the location where Sqoop Library is stored.|
|2||–connect jdbc:teradata://myTeradataHostname/Database=myDatabase||Parameter that tells Sqoop where the database is you want to connect to (myTeradataHostname) and the name of the database you want to query (myDatabase).|
|3||–connection-manager org.apache.sqoop.teradata.TeradataConnManager||This line specifies what we call a “connector.” It allows us to use Sqoop with a variety of relational databases. This line specifies the Java package where the Teradata Connection Manager for Sqoop is located. Another option I’ve used is the Teradata proprietary connector.|
|4||–username usernameGoesHere||This is where you fill in your Teradata username.|
|5||–password -P||This is where you could paste in your raw Teradata password, but it’s more secure to use -P. The -P command has you enter in the password at execution time, instead of storing it in the command (insecure).|
|6||–table tableIWantToPull||This is where you’d replace the variable in bold (tableIWantToPull ) with the name of the table or view you wish to import.|
|7||–target-dir /path/in/hdfs/to/store/table/data||This parameter is where you specify the directory in HDFS that you’d like the imported data to land.|
|8||–split-by primaryKeyorUniqueColumnNameGoesHere -m 10||There are two components to this parameter: split-by and -m. Split-by allows you to specify how to chunk up the records you are importing so it can be processed as a mapreduce job on Hadoop. It’s good to use a parameter that is more unique, such as a group number or calendar date. It’s followed by the -m parameter, which allows you to specify how many mappers (task workers) to suggest that the job use. Keep note that each mapper will make an individual connection to Teradata, so make sure to be aware of connection limits.|
|9||–fields-terminated-by ‘^’||This parameter tells Sqoop how the different columns are terminated/separated in the SQL data when it lands on HDFS. This carrot is a safe choice, as opposed to commas, but tabs are also safe ‘\t’.|
|10||–fetch-size=100000||This parameter tells Sqoop roughly how many records each mapper can fetch and process if it needs to. This line is fairly standard for Teradata. Optional.|