Preparing for the HDPCD Exam: Data Analysis With Hive


With your data now in HDFS in an “analytic-ready” format (it’s all cleaned and in common formats), you can put a Hive table on top of it.

Apache Hive is an RDBMS-like layer for data in HDFS that allows you to run batch or ad-hoc queries in a SQL-like language. This post will go over what you need to know about Apache Hive in preparation for the HDPCD Exam.

 


The third and final part of the provided study guide indicates that a Certified Developer should be able to analyze and query data in HDFS using Apache Hive. Hive is an RDBMS-like service used in Hadoop for running SQL-like queries on massive amounts of data. Living in Hadoop, it naturally uses the M/R architecture to execute queries and return results. It has many different optimizations available to speed up queries, but those are out of scope for this article. Here, we will examine the basic file types and parameters that are most common in Hive.

For all of this, the Apache Hive documentation is your holy text. Live by it.

Due to how Hive scripts work, I’m going to give an example early on, then explain things as the post continues. This will make it easier to visualize how all the parts work together.

Many of the ad-hoc commands should be run either on the command line, in a third-party SQL assistant program (like SquirrelSQL), or through Beeline, the secure command-line interface to Hive.
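
For example, a minimal Beeline invocation might look like this (the host, port, and username are placeholders; point it at your own HiveServer2):

beeline -u jdbc:hive2://<hiveserver2-host>:10000 -n <username> -f script.hql
beeline -u jdbc:hive2://<hiveserver2-host>:10000 -n <username> -e "select * from db.test_table limit 10;"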

  • script.hql

create [external] table <db>.<tablename> (
c1 <type>,
...,
cn <type>
)
partitioned by (<c#>) clustered by (<c#>) into ### buckets stored as [ORC|AVRO|TEXTFILE...]
[as select * from <table_name>];

load data [local] inpath '<path>' into table <tablename>;

insert into table <table_name> [partition (col)] values (c1, c2, ..., cn)[, (more rows)];

  • Write and execute a Hive query

select * from db.test_table;

  • Define a Hive-managed table

  • That’s the first statement in the Hive script above. If you’re at all familiar with relational databases, you should be right at home with Hive. With a managed table, Hive controls both the metadata and the data files themselves, so dropping the table also deletes the data.
create table db.test_table(c1 string, c2 string);

  • Define a Hive external table

  • That’s the same as above, but this time include the “external” and “location” keywords in the create statement. The “location” keyword specifies where the table’s data lives in HDFS, and dropping an external table removes only the metadata, leaving the files in place. This is very useful if you don’t want to lose data when a table is accidentally dropped, for example.
create external table db.test_name(c1 string, c2 string) location '/hdfs/path/to/file';

  • Define a partitioned Hive table

  • Note that the partition column goes in the “partitioned by” clause with its type, not in the regular column list; a loading sketch follows the create statement.
create table db.test_name(c2 string) partitioned by (c1 string);
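  • Loading a specific partition then looks like this; a small sketch, where db.staging_table is a hypothetical table holding the incoming rows:
insert into table db.test_name partition (c1 = 'some_value') select c2 from db.staging_table;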

  • Define a bucketed Hive table

  • Buckets are kind of like partitions, but instead of one directory per value, rows are hashed on a column into a fixed number of files. Think of buckets as the number of reducers you always want the data grouped into per query; a sampling example follows the create statement.
create table db.test_name(c1 string, c2 string) clustered by (c1) into 24 buckets;
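  • One payoff is cheap sampling, since Hive can read just one bucket instead of the whole table (a sketch against the table above):
select * from db.test_name tablesample (bucket 1 out of 24 on c1);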

  • Define a Hive table that uses the ORCFile format

  • ORC is the Optimized Row Columnar format. Very good on read times, slower on write times. Great with compression! Read more about it here. Note: you can’t just load raw text into ORC tables. Load the text into a temporary “regular” Hive table, then insert from there into the ORC table (see the sketch after the create statement below). Sexy.
create table db.test_orc(c1 string, c2 string) stored as orc;
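  • The load-then-insert pattern mentioned above looks roughly like this (a sketch; db.staging_text is a hypothetical plain-text table with the same columns):
load data inpath '/hdfs/path/to/raw' into table db.staging_text;
insert into table db.test_orc select c1, c2 from db.staging_text;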

  • Create a new ORCFile table from the data in an existing non-ORCFile Hive table

    • Hey I just mentioned this. Here’s how you do it.
create table db.test_orc stored as orc as select c1, c2 from db.test_text;

  • Specify the storage format of a Hive table

    • We already did this up top; just switch out the storage type in the table creation. If you’re really savvy, you might even be writing your own file format, in which case remember to specify the SerDe that handles it (the class name below is just a placeholder).
create table db.test_orc(c1 string, c2 string) row format serde 'com.org.java.class.whatever.StorageType' stored as [ORC|TEXTFILE|...];

  • Specify the delimiter of a Hive table

    • The unicode character “\001” (Ctrl-A) used here is Hive’s default field delimiter. For raw delimited files you ingest (like CSVs), swap in ',' or whatever the files actually use.
create table db.test_text(c1 string, c2 string) row format delimited fields terminated by '\001';

  • Load data into a Hive table from a local directory

load data local inpath '/local/path' into table db.test_table;

  • Load data into a Hive table from an HDFS directory

  • Exactly the same as above but without the “local” keyword. One gotcha: a load from an HDFS path moves the files into the table’s directory rather than copying them.
load data inpath '/hdfs/path' into table db.test_table;

  • Load data into a Hive table as the result of a query

insert into table db.text_table select * from db.orc_table;

  • Load a compressed data file into a Hive table

    • Hadoop and Hive can natively handle many common compression codecs. Gzip is probably the most common, but be aware that gzip files are not splittable (a single .gz file is read by a single mapper); bzip2 is splittable and plays better with map/reduce. Either way, you can just point at the compressed files with load or location statements and Hive will decompress them transparently on read. If you’re using a format that isn’t supported natively, you again need to specify a SerDe to handle serialization and deserialization for you. Fun stuff! A quick sketch follows.
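    • A minimal sketch, assuming a gzipped, comma-delimited file already sits in HDFS (the table name and path are illustrative):
create table db.raw_events(c1 string, c2 string) row format delimited fields terminated by ',' stored as textfile;
load data inpath '/hdfs/path/events.csv.gz' into table db.raw_events;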

  • Update a row in a Hive table

    • Updating rows in Hive is a relatively new thing. Some people even think it shouldn’t have been implemented in the first place, since it breaks one of the core concepts of Hadoop: write data once and only once; if something changes in a record, make a new one. But whatever, it’s good to know. One caveat: update (and delete) only work on transactional tables; a sketch of such a table follows the update statement.
update <table_name> set column = value [where CLAUSE];
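    • A minimal sketch of a transactional table definition, assuming ACID support is enabled on the cluster (the table name and bucket count are illustrative):
create table db.acid_table(c1 string, c2 string) clustered by (c1) into 4 buckets stored as orc tblproperties ('transactional'='true');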

  • Delete a row from a Hive table

    • Again, deleting a row from Hive is a relatively new concept. It is also slightly controversial since it too violates the whole “write once; leave alone” ideology of Hadoop.
delete from <table_name> [where CLAUSE];

  • Insert a new row into a Hive table

    • Granular level loading: make sure the number of values you’re inserting matches the number of columns the Hive table is expecting.
insert into table db.text_table values ('v1', 'v2');

  • Join two Hive tables

    • Practically identical to a join in any other SQL-like language, just with a little less functionality: only equality join conditions are supported (at the time of writing). An outer-join variant follows the example below.
SELECT a.* FROM a JOIN b ON (a.id = b.id);
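    • Outer joins work too; for example, to keep every row of a even when b has no match:
SELECT a.*, b.id FROM a LEFT OUTER JOIN b ON (a.id = b.id);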

  • Run a Hive query using Tez

    • The Tez execution engine is the classic Map/Reduce engine on steroids: instead of writing intermediate results to disk between stages, it can keep them in memory, which speeds up execution greatly. It also has smarter planning that cuts unnecessary steps out of an M/R-style execution plan. Highly recommended if it fits your use case.
set hive.execution.engine=tez;

  • Run a Hive query using vectorization

    • Hive vectorization takes some of the CPU load off scans, filters, aggregates, and joins and speeds things up. Instead of processing the table row by row, Hive takes chunks (vectors) of around a thousand rows and runs the simple operations on each chunk. That’s a hand-wavy explanation, but it gets the point across. Note that it only applies to ORC-backed tables, and if you forget it’s set to true it can break other queries and cause you all sorts of pain and agony, so I generally only set it when I really need it.
set hive.vectorized.execution.enabled = true;

  • Output the execution plan for a Hive query

    • The “explain” command is nifty: it shows you the plan Hive will use for your specific query, including the stages it will run, the dependencies between them, and the work done in each one. This is good to know if you’re trying to optimize a query or something. A concrete example follows.
explain [query];
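    • For example, against the hypothetical table from earlier:
explain select c1, count(*) from db.test_table group by c1;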

  • Use a subquery within a Hive query

    • Selection Inception. Coming this summer to a theatre near you.
select col from ( select a + b as col from <table_name> ) subq;

  • Output data from a Hive query that is totally ordered across multiple reducers

    • This is “order by”: don’t let the objective’s awkward wording confuse you. Plain order by guarantees a total ordering of the output (by default by funneling everything through a single reducer); a contrasting sort by example follows.
select * from text_table order by c1 asc;
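    • For contrast, sort by only orders rows within each reducer; pair it with distribute by to send all rows for a given key to the same reducer:
select * from text_table distribute by c1 sort by c1 asc;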

  • Set a Hadoop or Hive configuration property from within a Hive query

    • We did this before with Tez and Hive vectorization, but here it is again; a Hadoop-side property example follows.
set hive.execution.engine=tez;
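    • The same mechanism works for Hadoop/MapReduce properties; for example (the value is purely illustrative):
set mapreduce.job.reduces=8;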

That’s it! The third and final part of the HDPCD Exam study guide covers using Hive to analyze data in HDFS. Obviously, Hive is just one of the many tools you can now use to run analysis in Hadoop, but it is a core tool and is in use at many companies around the world.

That’s all folks! See you next time!


