How to Sqoop an RDBMS Source Directly to a Hive Table In Any Format

This tutorial will accomplish a few key feats that make ingesting data into Hive far less painful. In this writeup, you will learn not only how to Sqoop a source table directly to a Hive table, but also how to Sqoop it in any desired format (ORC, for example) instead of just plain old text.

Many Sqoop tutorials will show you how to Sqoop a data source as raw data to HDFS, and then slap a Hive Schema atop it. That’s easy. For more information on Sqoop (if needed), check out our earlier tutorial.

In this tutorial, we take the approach of making the Hive table first (based on knowledge of the source table), and then Sqoop-ing the data directly into it using Sqoop's HCatalog integration.

Why is this useful? Because this method:

  1. Couples the source table schema to your Hive table schema, which lets Sqoop compare the two column-to-column and alert you at runtime if something is off (a column type is wrong, or a column is missing on either side).
  2. Lets you Sqoop the source data into your Hive table in any format you have available: text, ORC, etc., even with compression. The key is in how your table's schema is defined in Hive. It's surprisingly easy.

The Hive Code

So let’s say we have a table in Teradata, and we’ve made a Hive Schema that matches it column-to-column: pretty familiar stuff. Here’s what that may look like:

create table my_database.my_table
(
product_id INT,
product_desc CHAR(100)
)
STORED AS ORC
LOCATION '/somewhere/in/hdfs/is/my_table/'
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
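
Before running the import, a quick sanity check in the Hive shell confirms that the format, compression, and location actually took effect (the table name here simply matches the example above):

-- Show the storage format, compression, and HDFS location of the new table
DESCRIBE FORMATTED my_database.my_table;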

The Sqoop Code

And our Sqoop statement looks like this (as a one-liner first, then the same command broken across lines for readability):

sqoop import -Dmapred.job.queue.name=default --driver com.teradata.jdbc.TeraDriver --connect jdbc:teradata://myTDserver.mycompany.com/Database=some_remote_database --username billy --password p4ssw0rd --table some_remote_table --hcatalog-database my_database --hcatalog-table my_table --split-by product_id

sqoop import -Dmapred.job.queue.name=default \
--driver com.teradata.jdbc.TeraDriver \
--connect jdbc:teradata://myTDserver.mycompany.com/Database=some_remote_database \
--username billy \
--password p4ssw0rd \
--table some_remote_table \
--hcatalog-database my_database \
--hcatalog-table my_table \
--split-by product_id

The Result

Run that Hive command to create the table, then run this Sqoop command to fill it with data, and we now have a complete copy of our desired source table in Hive, in ORC format, Snappy compressed, and at the HDFS location we specified.
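
If you want to spot-check the result, a couple of quick Hive queries against the example table will do it (this is just a sanity check, not part of the Sqoop job):

-- Confirm rows landed and eyeball a sample
SELECT COUNT(*) FROM my_database.my_table;
SELECT * FROM my_database.my_table LIMIT 10;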

That’s right, you don’t have to Sqoop tables as text: using its HCatalog integration, Sqoop can write out in a number of supported formats (we’re using ORC with Snappy compression above), and the format is driven entirely by how the Hive table is defined.
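
For example, here is a rough sketch of the same table defined as Parquet with Snappy compression instead of ORC. The table name is made up for illustration, product_desc is declared STRING for broad Parquet compatibility, and you should confirm your Hive version supports STORED AS PARQUET:

-- Same columns as the ORC example, but stored as Snappy-compressed Parquet
create table my_database.my_table_parquet
(
product_id INT,
product_desc STRING
)
STORED AS PARQUET
LOCATION '/somewhere/in/hdfs/is/my_table_parquet/'
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');

The Sqoop command doesn’t change at all; only --hcatalog-table (and --hcatalog-database, if needed) points at the new table.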

Give it a shot, and see how much easier it makes your data integration life! This tutorial has been tested and validated on a Hortonworks 2.x stack of Hadoop.
