As a developer/engineer in the Hadoop and Big Data space, you tend to hear a lot about file formats. All have their own benefits and trade-offs: storage savings, split-ability, compression time, decompression time, and much more. All of these factors play a huge role in what file formats you use for your projects, or as a team or company-wide standard.
One of the newer, more Hadoop-centric file formats is ORC. And you should probably give it a look if you’re running or looking to run a Hadoop cluster soon.
ORC, or Optimized Row Columnar, is a file format that provides a highly efficient way to store Hive data on Hadoop. It became a top-level project for Apache last year, and was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data in HDFS. In terms of raw speed, it can be up to 3x as fast as raw text. It claims to be the smallest columnar storage format available for Hadoop — and after roughly 8 months of experimenting with it in a number of pipelines and situations, that claim might very well be true.
You can learn more about ORC’s architecture, flexibility, type support, and much more via the official Apache website. For now, I’ll focus this specific post on the statistical benefits I’ve observed in using the format, as well as how you can leverage it in your next project if you’re interested.
Let’s start with talking about how you can start playing with ORC right now, on your own Hadoop setup, with no additional installation required. I’m talking about Hive tables in ORC format.
Convert a Hive Table to ORC
Converting the contents of a Hive table on Hadoop from one standard format (text, avro, etc) to ORC is trivial in Hive, since ORC became supported in version 0.11. All you need is a Hive table (external or internal, it doesn’t matter) with data already in it. From there, you simply define a table with the same column attributes, but stored as ORC. Here’s an example of what that new ORC table definition could look like:
create external table tempOrcExample ( emp_nbr int, dpt_nbr int, cal_dt string, tot_hrs_wrk_nbr string ) stored as orc location'/tmp/orc-example/';
Once you’ve defined this table in Hive (you can literally copy and paste the above code on command line, either by first typing hive and then pasting the above code, or by doing it in quotes inside a hive -e command), you just need to insert the contents of the old table into your new ORC table. Hive will automatically read the data out of the old table, and write it in ORC format for you in the new table. That can be done with the command below:
insert into table tempOrcExample select * from old_table_with_data;
Performance: ORC vs Text
Both formats have their pros and cons, but the big winner for me so far has been ORC for a myriad of reasons, at least when it comes to storing Hive data on Hadoop.
- pro: high level of compression
- pro: high level of speed on read/write
- con: schema is partially-embedded in data
- pro: flexible usage across applications
- pro: easy to read across applications (and human eyes) in raw
- con: heavyweight
- con: no compression
- con: slow by comparison
Statistics & Benchmarks
Speed of Writing – Winner (Text)
In my current favorite development framework, Apache Crunch (which I’ve started a tutorial series on recently), data can be written out as both raw text or ORC. Several benchmark tests were run to determine the write speed on a moderately beefy Hadoop cluster.
- For a data set of 300,000 lines (in the form of raw text), Crunch could ingest it, apply business rules, run calculations, and generate 12,020,373 records
- in raw text in 35 minutes exactly.
- in ORC form in 38 minutes exactly.
File Size – Winner (ORC)
- For the formerly mentioned data set of 12,020,373 records:
- Raw text had a file size in HDFS of 495.4 MB.
- ORC file had a file size in HDFS of 4.0 MB.
Hive Query-ability – Winner (ORC)
- For the formerly mentioned data set of 12,020,373 records, the query ‘select count(*) from table’ was run:
- Table atop raw text file ran with 37 mappers, 1 reducer, and ran for 45 seconds.
- Table atop ORC file ran with 1 mapper, 1 reducer, and ran for 7 seconds.
In general, myself and my colleagues have seen impressive read and write times for ORC between Hive tables, and the level of compression is alarmingly good.
Overall, the results aren’t totally in for whether you should go text or ORC for your next project — mostly because requirements always change, and formats are evolving. Heck, you might need a mix of both in all likelihood. ORC is still in development and undergoing enhancements, but looks to be a huge player in the Hadoop file format discussion going forward — and I’m very excited to see how it fairs.
For more tutorials, tips and tricks on Hadoop like this one, subscribe to this blog! Become a Hadoopster!