It’s time to break down what internal and external Hive tables mean, how to use them, and how to get the best of both worlds.
If you use Hadoop, you probably use Hive. After all, it’s one of, if not the, most prominent of the RDBMS-ish tools that the Hadoop ecosystem offers. It’s not truly an RDBMS, and it’s not as tightly built (yet) as some of the leading (but older) RDBMS systems like Teradata, MySQL, DB2, etc.
But it’s pretty darn good, and is definitely sufficient for SQL-style queries of data sets both gargantuan and minuscule on your cluster. You can build and define schemas/tables, and you can put data underneath them in the file system, HDFS. But one key feature of Hive tends to trip users, analysts, and developers up: table type, of which there are two: internal and external.
In this post, I’ll briefly explore what makes an internal table what it is, and what makes an external table what it is: on the surface, and at a deeper level, as it relates to the Hive metastore.
Internal vs External: The Difference
The main difference between an internal table and an external table is simply this:
An internal table is also called a managed table, meaning it’s “managed” by Hive. In practice, that means one thing: when you drop an internal table, Hive deletes both the schema/table definition AND physically deletes the data/rows associated with that table from the Hadoop Distributed File System (HDFS).
An external table is not “managed” by Hive. When you drop an external table, the schema/table definition is deleted and gone, but the data/rows associated with it are left alone. The table’s rows are not deleted.
TL;DR: When you drop an internal table, the table and its data are deleted. When you drop an external table, the table definition is dropped, but the data is not touched. You could drop an external schema, recreate it, and be right back where you were – data just as you remember it.
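The drop semantics above can be sketched with a pair of minimal table definitions (the database, table, and HDFS path names here are hypothetical, just for illustration):

```sql
-- Internal (managed) table: DROP TABLE deletes the schema AND the data
CREATE TABLE my_db.events_internal (id string, ts timestamp);

-- External table: DROP TABLE deletes only the schema; the files stay put
CREATE EXTERNAL TABLE my_db.events_external (id string, ts timestamp)
LOCATION '/data/events/';

DROP TABLE my_db.events_internal;  -- data under the warehouse dir is removed
DROP TABLE my_db.events_external;  -- files under /data/events/ remain in HDFS
```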
Though the persistence of data is a big difference between the usage of the two table types in Hive, it’s not the only consideration you should make when choosing between the two for your data science project, reporting script, or ETL pipeline.
By default, when you create an internal table, its location will be in the Hive Warehouse (typically at the following HDFS location: /apps/hive/warehouse/). An external table requires you to specify a location in HDFS where the data for the table you’re creating will live. You can freely insert and modify these tables with insert into, insert overwrite, and drop, regardless of whether they’re internal or external.
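If you’re ever unsure which kind of table you’re looking at (or where its data lives), Hive will tell you. Assuming a table named my_db.my_table:

```sql
-- Shows the table's type and its HDFS location, among other metadata
DESCRIBE FORMATTED my_db.my_table;
-- In the output, look for:
--   Table Type:  MANAGED_TABLE or EXTERNAL_TABLE
--   Location:    the HDFS path where the data lives
```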
When to use an Internal Table
- When your data is temporary.
- When Hive is really the only tool using/manipulating the data.
- When you want Hive to completely manage the lifecycle of the table and its data (ref: SO).
When to use an External Table
- When an external process is landing the data to the cluster (like SFTP, Flume, or Sqoop), and you’re really only using Hive as one way to view the data.
- When your data is used by other applications besides Hive.
- When you want to make multiple schemas atop the same data location without the fear of one person’s table drop resulting in loss of the data.
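That last point is worth a quick sketch: because external tables are just schemas pointing at files, two teams can define tables over the same HDFS directory independently (the database and path names here are hypothetical):

```sql
-- Two external tables pointing at the same HDFS directory
CREATE EXTERNAL TABLE team_a.clicks (ip string, email string)
STORED AS ORC
LOCATION '/data/clicks/';

CREATE EXTERNAL TABLE team_b.clicks (ip string, email string)
STORED AS ORC
LOCATION '/data/clicks/';

-- Either team can drop its table without destroying the other's data
DROP TABLE team_a.clicks;  -- files under /data/clicks/ are untouched
```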
The Best of Both Worlds
As noted above, an internal table lives in the Hive Warehouse by default. However, you can specify an external location for an internal table. THAT’S RIGHT! It doesn’t have to be in the Hive Warehouse to be internal. Kind of weird sounding logic, isn’t it?
Here’s how you do it:
```sql
-- Creates an internal table from a SELECT statement (CTAS),
-- using an external location!
CREATE TABLE my_db.my_table
STORED AS ORC
LOCATION '/my/cool/table/location/'
AS
SELECT
  click.ip,
  click.email,
  MAX(date_time) AS last_seen
FROM my_db.some_cool_table click
WHERE ip IS NOT NULL
  AND email IS NOT NULL
  AND feed_date BETWEEN '2016-05-01' AND '2016-05-31'
GROUP BY ip, email;
```
This results in a table with the MOST flexibility. Let me tell you why:
- It’s outside of the Hive warehouse, so we can fully control permissions of the custom HDFS directory we’ve assigned it. This lets us control who can read/write/execute in that data location.
- It’s stored as ORC, a highly compressed, efficient data storage format. Read-based queries on this table will be lightning fast (compared to text).
- It’s an Internal table, which allows us to drop the table and its data all at once if we need.
- This is particularly useful if we need to offer this table as readable to users who don’t have access to my particular Hive database in the Hive Warehouse, while we maintain the freedom to drop the table and the data together.
If you’d prefer this to be an external table, where the data is protected from being deleted if someone were to run a drop table command, you can’t simply add the word ‘external’ to the statement above: at the time of writing, CTAS does not work with external tables. Instead, you’ll have to create the external table’s schema first, and then insert into it.
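As an aside, Hive can also flip an existing table between managed and external after the fact, via a table property. A minimal sketch, assuming a table named my_db.my_table (note that the 'TRUE'/'FALSE' values are case-sensitive in some Hive versions):

```sql
-- Convert an internal (managed) table to external:
-- from now on, dropping it leaves the data alone
ALTER TABLE my_db.my_table SET TBLPROPERTIES ('EXTERNAL'='TRUE');

-- And back to managed, if you want drops to delete the data again
ALTER TABLE my_db.my_table SET TBLPROPERTIES ('EXTERNAL'='FALSE');
```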
That looks like this:
```sql
-- Create the external table's schema first...
CREATE EXTERNAL TABLE my_db.my_table (
  ip        string,
  email     string,
  last_seen timestamp
)
STORED AS ORC
LOCATION '/my/cool/table/location/';

-- ...then load it with an insert
INSERT OVERWRITE TABLE my_db.my_table
SELECT
  click.ip,
  click.email,
  MAX(date_time) AS last_seen
FROM my_db.some_cool_table click
WHERE ip IS NOT NULL
  AND email IS NOT NULL
  AND feed_date BETWEEN '2016-05-01' AND '2016-05-31'
GROUP BY ip, email;
```
I hope this has shed some light on the not-so-mystifying world of Hive tables. Leave any questions/feedback in the comments!