Creating Hive tables is a common experience to all of us that use Hadoop. It enables us to mix and merge datasets into unique, customized tables. And, there are many ways to do it.
We have some recommended tips for Hive table creation that can increase your query speeds and optimize and reduce the storage space of your tables. And it’s simpler than you might think.
For analysts and data scientists, implementing these methods means you can spend more time learning from the data than learning how to use it, or waiting for queries to return. For engineers and developers, this means you can lower your block count and file sizes, and make your analysts and data scientists happy at the same time.
In a nutshell, the steps are:
Create Your Table in ORC Format
This is the standard way of creating a basic Hive table. The only difference? Instead of using the default storage format of TEXT, this table uses ORC, a columnar file format in Hive/Hadoop that uses compression, indexing, and separated-column storage to optimize your Hive queries and data storage. With this created, data can be freely inserted into it, and data will be converted to this ORC format on-the-fly!Create Table Statement
You can even do it through a convenient CTAS statement. The CTAS method will not work if you want to create a partitioned, external, or list-bucketing table. Use the CTAS method for simple datasets like lookups, or quick and dirty temporary tables, not enormous ones you plan to keep — even small tables can gain from ORC’s benefits.Create Table As SELECT
Partition Your Table
Partitioning your tables is a fantastic way to improve the processing times of queries on your table. This is because of how Hive scans partitions to execute job tasks in parallel — partitioning your data logically assists the job planner in that process.Partitioning a Table
If you want a turbo-boost to your queries, use partitioning and the ORC format on your tables. It’s highly recommended. If we had a Hadoopsters Approves stamp, we’d use it here.Partitioning + ORC
Analyze Your Table When You Make Changes To It
Analyzing a table (also known as computing statistics) is a built-in Hive operation that you can execute to collect metadata on your table. This can vastly improve query times on the table because it collects the row count, file count, and file size (bytes) that make up the data in the table and gives that to the query planner before execution. By running this query, you collect that information and store it in the Hive Metastore (metadata store), which will make future queries on this table more optimal. Every time the contents of the table change, stats should be recollected.Analyze a Table
You can (and really should) analyze partitioned tables as well. In fact, it’s particularly helpful for tables with partitions, as, again, it assists in query planning and optimization when querying the table.Analyze a Partitioned TableYou can also analyze the columns of your table and/or partitions. This is a more intense stat-collecting function that collects metadata on columns you specify, and stores that information in the Hive Metastore for query optimization. That information includes, per column, The number of distinct values, The number of NULL values, Minimum or maximum K values where K could be given by a user, Histogram: frequency and height balanced, Average size of the column, Average or sum of all values in the column if their type is numerical, and Percentiles of the value.Analyzing Table ColumnsMake sure (when you’re ready to query your table) you have these Hive settings enabled to ensure you make the most use out of your calculated stats!
Use ORC, Partitioning and Analyzing for a Powerful Combo
- Build your table with partitions, ORC format, and SNAPPY compression.
- Analyze your table when you make changes or add a partition, and analyze the partition.
- Analyze the columns you use most often (or all of them) at the partition level when you add a partition.