Preparing for the HDPCD Exam: Data Transformation

So after getting data into HDFS, it’s often not pretty. At the very least, it’s a little disorganized, sparse, and generally not ready for analytics. It’s a Certified Developer’s job to clean it up a little.

That’s where Apache Pig comes in handy! This post will cover the basics of transforming data in HDFS using Apache Pig, in preparation for the HDPCD Exam.

The second part of the study guide indicates that a Certified Developer should be able to transform and manipulate data in HDFS using Apache Pig. Pig is a scripting language built specifically for Hadoop. It natively runs its operations as Map/Reduce jobs and is able to handle tremendous amounts of data.

Due to how Pig scripts work, I’m going to give an example early on, then explain things as the post continues. This will make it easier to visualize how all the parts work together.

  • script.pig

-- script.pig
a = load 'filename' using PigStorage('<delim_char>') [as (c1:<type>, ..., cn:<type>)];
b = foreach a generate c1;
c = group a by c1;
d = filter a by c1 == <num>;
e = order a by <field> ASC|DESC;
f = distinct a;
g = group a by c1 parallel 18;
h = join a by c1 [LEFT|RIGHT|FULL] [OUTER], b by c1 [using 'replicated' | 'skewed' | 'merge'];
store d into '<hdfs_path>';

dump b;

  • Write and execute a Pig script

pig script.pig

  • Load data into a Pig relation without a schema

  • That’s on line 2 of script.pig. Dropping the bracketed as (...) clause loads the data without a schema:
a = load 'filename' using PigStorage('<delim_char>');
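
With no schema, you refer to fields by position ($0, $1, ...) instead of by name. A quick sketch (the file name and columns here are made up for illustration):

```pig
-- no schema: fields are addressed positionally
a = load 'people.csv' using PigStorage(',');
b = foreach a generate $0, $2;  -- project the first and third columns
dump b;
```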

  • Load data into a Pig relation with a schema

  • That’s the same line 2 as above, but this time include the schema. (The square brackets just mark the clause as optional; don’t type them in your statement.)
a = load 'filename' using PigStorage('<delim_char>') as (c1:<type>, ..., cn:<type>);

  • Load data from a Hive table into a Pig relation

  • This is a little different so I’ve explicitly outlined it below:
pig -useHCatalog  # first invoke the Pig shell with the -useHCatalog flag
A = load 'db.tablename' using org.apache.hive.hcatalog.pig.HCatLoader(); -- this loads data
store A into 'db.tablename2' using org.apache.hive.hcatalog.pig.HCatStorer(); -- this stores data

  • Use Pig to transform data into a specified format

  • This topic is a little cryptic, but it just means reshaping a relation based on the fields it is made of. In this case, generate is “returning” only the c1 field of each tuple in a:
b = foreach a generate c1;
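
For a concrete (hypothetical) example, suppose a holds sales records; generate can project fields and also compute new ones:

```pig
-- assumed sample schema, for illustration only
a = load 'sales.csv' using PigStorage(',') as (item:chararray, qty:int, price:float);
b = foreach a generate item, qty * price as total;  -- keep item, derive a new field
```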

  • Transform data to match a given Hive schema

  • This is the same as above; just make sure you generate the fields in the order (and with the types) that the Hive schema is anticipating:
b = foreach a generate c1, c2, c3, ..., cN;

  • Group the data of one or more Pig relations

c = group a by c1;
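
It helps to remember what group actually produces: one tuple per key, holding the key (named group) and a bag of all matching tuples from a. A minimal sketch, assuming a has item and qty fields:

```pig
c = group a by item;
-- each tuple of c looks like: (key, {all tuples of a with that key})
-- the bag keeps the name of the original relation, so aggregates read naturally:
d = foreach c generate group, SUM(a.qty) as total_qty;
```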

  • Use Pig to remove records with null values from a relation (see comments for other options)

d = filter a by c1 is not null;
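
To drop tuples that have a null in any of several fields, chain the tests (the field names here are placeholders):

```pig
-- keep only tuples where both listed fields are non-null
d = filter a by (c1 is not null) and (c2 is not null);
```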

  • Store the data from a Pig relation into a folder in HDFS

store d into '/path/to/location';

  • Store the data from a Pig relation into a Hive table

store A into 'db.tablename2' using org.apache.hive.hcatalog.pig.HCatStorer(); -- this stores data

  • Sort the output of a Pig relation

  • Be sure to specify ASCending or DESCending!
e = order a by c1 [ASC|DESC];

  • Remove the duplicate tuples of a Pig relation

f = distinct a;

  • Specify the number of reduce tasks for a Pig MapReduce job

g = group a by c1 parallel 18;
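
The parallel clause sets the reducer count for that one statement. A script-wide default can be set instead; both forms below are standard Pig:

```pig
set default_parallel 18;  -- default reducer count for the whole script
g = group a by c1;        -- uses 18 reducers
-- a per-statement parallel clause still overrides the default:
g2 = group a by c1 parallel 4;
```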

  • Join two datasets using Pig

h = join a by c1 left outer, b by c1;

  • Perform a replicated join using Pig

h = join a by c1 left, b by c1 using 'replicated';
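
One detail worth remembering: in a replicated join, the relation(s) listed after the first are loaded into memory on every map task, so the small relation goes last. A sketch with made-up file names:

```pig
big   = load 'big_dataset' using PigStorage('\t') as (c1:int, v:chararray);
small = load 'small_lookup' using PigStorage('\t') as (c1:int, name:chararray);
-- 'small' is listed last, so it is the one replicated into memory
h = join big by c1, small by c1 using 'replicated';
```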

  • Run a Pig job using Tez

pig -x tez script.pig

  • Within a Pig script, register a JAR file of User Defined Functions

register ./myjar.jar;
define toUpper myjar.UPPER();
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE toUpper(name);
DUMP B;

  • Within a Pig script, define an alias for a User Defined Function

define toUpper myjar.UPPER();

  • Within a Pig script, invoke a User Defined Function

A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE toUpper(name);
DUMP B;

That should be all! The second part of the HDPCD Exam study guide covers using Pig to transform data in HDFS. Pig is, of course, just one of the many tools now available for transforming and manipulating data in Hadoop, but it is a core tool and is in use at companies all around the world.

Data analysis with Hive is coming next! Stay tuned!

<< Previous || Next >>

8 thoughts on “Preparing for the HDPCD Exam: Data Transformation”

  1. Marco Martinez April 20, 2016 / 5:21 am

    Hi! I’m preparing for this certification and I’m writing a similar guide! I noticed that you are using < and > symbols to show specific commands, maybe using [ ] would help? just a thought. However I think that this posts are great! it’s really nice to have something to compare my answers and preparation, better yet coming from a Master Hadoopster!

    • James Barney April 20, 2016 / 2:29 pm

      Thanks for the input! I was originally intending to use the tags as things the user would have to specify whereas [ ] would be options that were either required or available with a certain command/feature. Good idea!

      • Marco Martinez April 20, 2016 / 3:31 pm

        Oh I’m sorry, I meant that “&lt;” (ampersand and lt) is showing in the code snippets rather than <, maybe that could confuse new people?

  2. Chenna April 20, 2016 / 8:13 pm

    Hi,
    Thanks for posting this. Very helpful indeed. But for “Use Pig to remove records with null values from a relation” given != NULL is not working instead I used :
    filtered_data = FILTER data BY NOT(c1 IS NULL);

    This gave me correct answer.

    Thanks!

    • James Barney April 21, 2016 / 7:02 pm

      Thanks! What version of Pig are you using? It could be that things have changed in Pig since the writing of this article. I’ll make a note anyway!
