Finding Physical Records in Hive with Virtual Columns

hive-logo

How many times have you been querying your data in Hive, only to come across some gnarly looking (or mostly null) records? Maybe it’s only one or two entries, perhaps it’s thousands – either way, you have some (seemingly) busted data, and you’re not happy about it.

Let’s talk about some ways to pinpoint problems in data in Hive, leveraging the tools available on our stack.

Pinpointing the Issue

First of all, let’s fix our mindset: the data might not actually be the problem – it could be with how Hive simply interprets it. After all, Hive is a RDBMS-like system that extends the schema-on-read philosophy that Hadoop champions.

So, it could be that your trouble data has a newline in it, it might have the same delimiter embedded within a string column that’s used to delimit the lines themselves, and so on. I’ve seen data with carriage returns embedded in string descriptions because previous users copy-pasted data from an OS editor like Windows, resulting in Hive showing “nulls” for those records, even though the actual text content of the data was concrete at first glance.

All we have to do is leverage some cool Hive features to help guide us to the location in the raw data where the issue is (at the byte level, to be more specific). In shorter terms, we’re going to let Hive tell us exactly where the records live in HDFS at the most specific level (again, to the byte).

Leveraging Virtual Columns in Hive

Hive has a concept of virtual columns, which are metadata-like pieces of information that can be included with any old query. Here’s an example: say I have a bunch of comma-separated records in a Hive table, and four (4) of them are showing up as nulls. Let’s say I can, for the sake of brevity, see those four (4) records by doing something as simple as this:

select * from my_table limit 10;

Let’s say six (6) of these records are good, four (4) are filled with bad bad nulls. Obviously this is a dead simple query and I’d be using more WHERE clauses in all likelihood to find my trouble records, but just roll with it.

Say that query shows me those records. I know I can find these records in a single text file if I either open it up in an editor or cat it on command line and CONTROL+F for it.

But what if this data is the result of a map-reduce operation and comes in multiple files? What if it’s just spread across multiple files in general? Well, you’re in luck. Hive has a virtual column (the first of two I’ll show you) called INPUT__FILE__NAME. Yes, those are double underscores between the words input, file and name. If you spell it any other way, Hive will throw an error.

Using this column in your select query will return the absolute path to the location of the file in which the record you’re viewing lives. See and believe:

hive> select INPUT__FILE__NAME, * from my_table limit 10;
OK
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00004	3	MILL 	MILLWORK
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000	90	MGMT 	MANAGEMENT
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000	19	STOR 	HOME ORGANIZATION
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00001	103	ADJ3 	ADJ BUDGET BALANCING - 3
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000	99	INCHR	INCORRECTLY ASSIGNED HOURS
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00002	9	FURN 	FURNITURE
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00002	107	ADJ7 	ADJ BUDGET BALANCING - 7
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000	5	ELCR 	ROUGH ELECTRICAL
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000	80	SUPP 	STORE SUPPORT
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00003	50	SERV 	STORE SERVICES

Now that’s super handy. But what if I’m crazy, and I want to see TO THE BYTE where each record I’m selecting lives? Well I am crazy, and I DO want to see the byte offset of those troublesome little records – which I can achieve with another virtual column called BLOCK__OFFSET__INSIDE__FILE.

hive> select INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, * from my_table limit 10;
OK
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00004	31	3	MILL 	MILLWORK
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000	62	90	MGMT 	MANAGEMENT
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000	94	19	STOR 	HOME ORGANIZATION
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00001	125	103	ADJ3 	ADJ BUDGET BALANCING - 3
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000	156	99	INCHR	INCORRECTLY ASSIGNED HOURS
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00002	188	9	FURN 	FURNITURE
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00002	219	107	ADJ7 	ADJ BUDGET BALANCING - 7
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000	250	5	ELCR 	ROUGH ELECTRICAL
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000	282	80	SUPP 	STORE SUPPORT
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00003	313	50	SERV 	STORE SERVICES

Isn’t that cool? That new row shows what byte each record starts at, also known as its global file position or byte offset. And while you could peruse that file byte by byte (character by character), that’s painful.

Pro tip, use Hue’s file viewer to view the file itself (if it’s raw text), and you can navigate to the very byte you want to inspect. Using that, you can pull the row out to your favorite text editor like Sublime and look for newlines, carriage returns, or whatever else might be causing Hive to throw nulls in your face.

It doesn’t solve every problem, but it does help pinpoint it. I hope you find a way to use Hive’s virtual columns to find issues in your data (but I hope more that you simply have no issues in your data)!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s