After getting data into HDFS, it's often not pretty. At the very least, it's a little disorganized, sparse, and generally not ready for analytics. It's a Certified Developer's job to clean it up a little.
That's where Apache Pig comes in handy! This post covers the basics of transforming data in HDFS using Apache Pig, in preparation for the HDPCD Exam.
The second part of the study guide indicates that a Certified Developer should be able to transform and manipulate data in HDFS using Apache Pig. Pig is a scripting language built specifically for Hadoop. It compiles its operations down to MapReduce jobs and is able to handle tremendous amounts of data.
Due to how Pig scripts work, I’m going to give an example early on, then explain things as the post continues. This will make it easier to visualize how all the parts work together.
-- script.pig
a = load 'filename' using PigStorage(<delim_char>) [as (c1:<type>, ..., cn:<type>)];
b = foreach a generate c1;
c = group a by c1;
d = filter a by c1 == <num>;
e = order a by <field> ASC|DESC;
f = distinct a;
g = group a by c1 parallel 18;
h = join a by c1 [LEFT|RIGHT|FULL] [OUTER], b by c1 [USING 'replicated' | 'skewed' | 'merge'];
store d into '<hdfs_path>';
dump b;
Write and execute a Pig script
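- No statement in script.pig covers this one directly, so here is a minimal sketch (the script and file names are placeholders): save your statements in a .pig file and hand it to the pig command, or run statements interactively in the Grunt shell.

```shell
pig script.pig           # run a script in the default (MapReduce) mode
pig -x local script.pig  # run against the local filesystem instead of HDFS
pig                      # no arguments: open the interactive Grunt shell
```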
Load data into a Pig relation without a schema
- That's line 2 of script.pig. The square brackets mark the optional schema clause; leave that clause out entirely to load data without a schema.
a = load 'filename' using PigStorage(<delim_char>);
Load data into a Pig relation with a schema
- That's the same statement as above, but this time include the schema clause (without the square brackets):
a = load 'filename' using PigStorage(<delim_char>) as (c1:<type>, ..., cn:<type>);
Load data from a Hive table into a Pig relation
- This is a little different so I’ve explicitly outlined it below:
pig -useHCatalog  # first, invoke the Pig shell with the -useHCatalog flag
A = load 'db.tablename' using org.apache.hive.hcatalog.pig.HCatLoader();  -- loads data from a Hive table
store A into 'db.tablename2' using org.apache.hive.hcatalog.pig.HCatStorer();  -- stores data into a Hive table
Use Pig to transform data into a specified format
- This topic is a little cryptic, but it just means projecting and transforming the fields of a relation. In this case, the generate command "returns" column c1 from each tuple of a:
b = foreach a generate c1;
Transform data to match a given Hive schema
- This is the same as above; just make sure you generate the columns in the order the Hive schema expects.
b = foreach a generate c1, c2, c3, ..., cN;
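If the column types don't already line up with the Hive table, you can also cast fields as you generate them. A small sketch, assuming a hypothetical Hive schema expecting (int, chararray):

```pig
-- cast fields inline so the output types match the target Hive schema
b = foreach a generate (int)c1, (chararray)c2;
```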
Group the data of one or more Pig relations
c = group a by c1;
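For intuition, group produces one tuple per distinct key, with a bag of the matching input tuples as the second field. A sketch with made-up data:

```pig
-- suppose a contains: (x,1), (x,2), (y,3)
c = group a by c1;
-- c now contains: (x, {(x,1),(x,2)}) and (y, {(y,3)})
-- the two fields of c are named 'group' and 'a'
```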
Use Pig to remove records with null values from a relation (see comments for other options)
d = filter a by c1 is not null;
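One of the other options hinted at above: SPLIT can separate null and non-null records into two relations in a single pass. A sketch:

```pig
-- partition a into null-free and null-only relations in one statement
split a into good if c1 is not null, bad if c1 is null;
```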
Store the data from a Pig relation into a folder in HDFS
store d into '/path/to/location';
Store the data from a Pig relation into a Hive table
store A into 'db.tablename2' using org.apache.hive.hcatalog.pig.HCatStorer();  -- stores data into a Hive table
Sort the output of a Pig relation
- Be sure to specify ASCending or DESCending!
e = order a by c1 [ASC|DESC];
Remove the duplicate tuples of a Pig relation
f = distinct a;
Specify the number of reduce tasks for a Pig MapReduce job
g = group a by c1 parallel 18;
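The PARALLEL clause sets the reducer count for that one operator only. To set a default for every reduce-side operator in the script, Pig also supports default_parallel:

```pig
set default_parallel 18;  -- applies to all subsequent reduce-side operators
g = group a by c1;        -- uses 18 reducers without a PARALLEL clause
```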
Join two datasets using Pig
h = join a by c1 LEFT OUTER, b by c1;
Perform a replicated join using Pig
h = join a by c1 LEFT, b by c1 USING 'replicated';
Run a Pig job using Tez
pig -x tez script.pig
Within a Pig script, register a JAR file of User Defined Functions
register ./myjar.jar;  -- register the JAR so Pig can find the UDF classes
define toUpper myjar.UPPER();  -- alias the UDF
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE toUpper(name);
DUMP B;
Within a Pig script, define an alias for a User Defined Function
define toUpper myjar.UPPER();
Within a Pig script, invoke a User Defined Function
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE toUpper(name);
DUMP B;
That should be all! The second part of the HDPCD Exam study guide covers using Pig to transform data in HDFS. Obviously, Pig is just one of the many tools now available for transforming and manipulating data in Hadoop, but it is a core tool and is in use at companies all around the world.
Data analysis with Hive is coming next! Stay tuned!