Apache Crunch Tutorial 2: Setting up a Project


This post is the second in a hopefully substantive and informative series about Apache Crunch, a framework that helps Java developers write MapReduce programs for Hadoop more easily.

In my first tutorial on Apache Crunch, I talked about the benefits of Crunch and showed some basic driver code to help you understand what Crunch can do at an entry level. In today’s entry, I’d like to walk you through getting Crunch set up on your local machine so you can start playing with it yourself. If you’ve done this already, you’ll love the next tutorial on Java objects and materialization (coming soon).

Let’s talk about what you need:

Setup

Install Eclipse or IntelliJ (but seriously though, if you’re just getting set up, get IntelliJ, it’s amazing). You’ve got this step down, I’m sure.

Install Maven. You can do this in less than 5 minutes on any major operating system (Windows, Mac, Linux) by following the installation steps on the Maven site. If you’re on a Mac (or use Homebrew on Linux), this is an even simpler process:

Open Terminal (or equivalent command line). Enter these commands exactly (one at a time):

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew install maven

The first command installs Homebrew. It will ask you to confirm by pressing Enter, so do so. It will also ask for your password; enter it. You should see an ‘Installation Successful’ statement upon completion. The second command uses Homebrew to install Maven. The download is about 10 MB, and the output should say as much when it completes.
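Before moving on, it’s worth confirming that both tools actually landed on your PATH. Here’s a minimal sketch of how you might check. Note that check_tool is a hypothetical helper name I made up for this post, not something Maven or Homebrew ships with.

```shell
# Hypothetical helper (not part of Maven or Homebrew):
# report whether a command is available on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: NOT found"
    return 1
  fi
}

# Homebrew is optional if you installed Maven another way.
check_tool brew || true

# Print Maven's version banner only if the mvn launcher is available.
if check_tool mvn; then
  mvn --version || true
fi
```

If mvn shows up as NOT found, open a fresh terminal window first; a new shell session often picks up the updated PATH.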

Create a Crunch project (the Apache Crunch site has a comprehensive guide). You can do this with a few short commands. Note: the package name and project name in the commands below can be customized. For example, you don’t have to call your package com.hadoopsters.bigdata; you can call it mycompany.banana.suitcase, but it’s best to follow Java package naming conventions. The same applies to crunchdemo: you can call it MyCrunchDemoSupreme. It’s up to you.

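  1. Open Terminal (or equivalent command line). Navigate to your development work area, such as an Eclipse Workspace or code project folder on your Mac.
  2. Enter these commands exactly (one at a time). The first line runs Maven; the lines after it are your answers to its prompts, in order: the archetype number, the archetype version number, your groupId, and your artifactId.
    mvn archetype:generate -Dfilter=org.apache.crunch:crunch-archetype
    1
    23
    com.hadoopsters.bigdata
    crunchdemo

     The version list grows as new Crunch releases appear, so check which number matches the version you want. In the output below, the list has 24 entries and the default (24, version 0.13.0) was accepted by hitting ENTER.
  3. Prompt will say 1.0-SNAPSHOT, but just hit ENTER.
  4. Prompt will suggest your groupId as the package name (com.hadoopsters.bigdata here), but just hit ENTER.
  5. Prompt will say “Y:”, but just hit ENTER.
  6. Your Crunch project should be created in the current folder in a directory called crunchdemo (or whatever you named it).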

Expected output:

[INFO] Generating project in Interactive mode
[INFO] No archetype defined. Using maven-archetype-quickstart (org.apache.maven.archetypes:maven-archetype-quickstart:1.0)
Choose archetype:
1: remote -> org.apache.crunch:crunch-archetype (Create a basic, self-contained job for Apache Crunch.)
Choose a number or apply filter (format: [groupId:]artifactId, case sensitive contains): : 1
Choose org.apache.crunch:crunch-archetype version:
1: 0.4.0-incubating
2: 0.5.0-incubating
3: 0.6.0
4: 0.7.0
5: 0.7.0-hadoop2
6: 0.8.0
7: 0.8.0-hadoop2
8: 0.8.1
9: 0.8.1-hadoop2
10: 0.8.2
11: 0.8.2-hadoop2
12: 0.8.3
13: 0.8.3-hadoop2
14: 0.8.4
15: 0.8.4-hadoop2
16: 0.9.0
17: 0.9.0-hadoop2
18: 0.10.0
19: 0.10.0-hadoop2
20: 0.11.0
21: 0.11.0-hadoop2
22: 0.12.0
23: 0.12.0-hadoop2
24: 0.13.0
Choose a number: 24: 
Downloading: https://repo.maven.apache.org/maven2/org/apache/crunch/crunch-archetype/0.13.0/crunch-archetype-0.13.0.jar
Downloaded: https://repo.maven.apache.org/maven2/org/apache/crunch/crunch-archetype/0.13.0/crunch-archetype-0.13.0.jar (15 KB at 19.1 KB/sec)
Downloading: https://repo.maven.apache.org/maven2/org/apache/crunch/crunch-archetype/0.13.0/crunch-archetype-0.13.0.pom
Downloaded: https://repo.maven.apache.org/maven2/org/apache/crunch/crunch-archetype/0.13.0/crunch-archetype-0.13.0.pom (4 KB at 13.3 KB/sec)
Define value for property 'groupId': : com.hadoopsters.bigdata
Define value for property 'artifactId': : crunchdemo
Define value for property 'version': 1.0-SNAPSHOT: :
Define value for property 'package': com.hadoopsters.bigdata: :
Confirm properties configuration:
groupId: com.hadoopsters.bigdata
artifactId: crunchdemo
version: 1.0-SNAPSHOT
package: com.hadoopsters.bigdata
Y: :
[INFO] ----------------------------------------------------------------------------
[INFO] Using following parameters for creating project from Archetype: crunch-archetype:0.13.0
[INFO] ----------------------------------------------------------------------------
[INFO] Parameter: groupId, Value: com.hadoopsters.bigdata
[INFO] Parameter: artifactId, Value: crunchdemo
[INFO] Parameter: version, Value: 1.0-SNAPSHOT
[INFO] Parameter: package, Value: com.hadoopsters.bigdata
[INFO] Parameter: packageInPathFormat, Value: com/hadoopsters/bigdata
[INFO] Parameter: package, Value: com.hadoopsters.bigdata
[INFO] Parameter: version, Value: 1.0-SNAPSHOT
[INFO] Parameter: groupId, Value: com.hadoopsters.bigdata
[INFO] Parameter: artifactId, Value: crunchdemo
[INFO] project created from Archetype in dir: /Users/landon/Desktop/DevWorkspace/crunchdemo
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:35 min
[INFO] Finished at: 2015-10-01T22:42:09-04:00
[INFO] Final Memory: 13M/120M
[INFO] ------------------------------------------------------------------------
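If you’d rather not answer the prompts by hand, the same answers can be passed as properties up front. What follows is a sketch, not the officially recommended invocation: generate_crunch_project and the three variables are names I made up, and I’m assuming version 0.13.0 as seen in the output above. The -D properties themselves are standard options of Maven’s archetype:generate goal.

```shell
# Assumed values, taken from the interactive session above; change them freely.
GROUP_ID="com.hadoopsters.bigdata"
ARTIFACT_ID="crunchdemo"
ARCHETYPE_VERSION="0.13.0"

# Hypothetical wrapper: pass "echo" as the first argument for a dry run
# that only prints the command; pass nothing to actually run Maven.
generate_crunch_project() {
  ${1:-} mvn archetype:generate \
    -DarchetypeGroupId=org.apache.crunch \
    -DarchetypeArtifactId=crunch-archetype \
    -DarchetypeVersion="$ARCHETYPE_VERSION" \
    -DgroupId="$GROUP_ID" \
    -DartifactId="$ARTIFACT_ID" \
    -Dversion=1.0-SNAPSHOT \
    -Dpackage="$GROUP_ID" \
    -DinteractiveMode=false
}

# Dry run: show the command without contacting Maven Central.
generate_crunch_project echo
```

Once the printed command looks right, call generate_crunch_project with no argument to create the project for real.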

If everything went well, you should have a Crunch project ready to go! Let’s see what’s in it by importing the project to IntelliJ.
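If you want a quick peek from the command line before opening the IDE, something like the sketch below works. inspect_project is a hypothetical helper of mine, and I’m only assuming the standard Maven layout (a pom.xml at the top and Java sources under src); the exact files the archetype generates may differ between versions.

```shell
# Hypothetical helper: confirm a directory looks like a Maven project
# and list the Java sources it contains.
inspect_project() {
  dir="$1"
  if [ ! -f "$dir/pom.xml" ]; then
    echo "no pom.xml in $dir"
    return 1
  fi
  echo "pom.xml found in $dir"
  # Sources live under src/ in the standard Maven layout.
  find "$dir/src" -name '*.java' 2>/dev/null | sort
}

# Run from the folder where you generated the project.
inspect_project crunchdemo || true
```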

Import into IntelliJ (as Maven project)

[Six screenshots of the import steps, captured 2015-10-01]

Now you have a Crunch project, and can start playing with things in the MemPipeline on your local machine (or MapReduce and Spark if you’re so bold, though I’d recommend getting familiar with Crunch in local form first). Definitely walk through the WordCount example on the Apache Crunch website, and see how it works!
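To make sure the generated project compiles outside the IDE too, you can build it with Maven. This is a rough sketch: build_project is a made-up helper name, and I’m assuming the project directory is called crunchdemo and that the build drops its jars into target, which is Maven’s default.

```shell
# Hypothetical helper: build the project with Maven if everything is in place.
build_project() {
  dir="${1:-crunchdemo}"
  if [ ! -f "$dir/pom.xml" ]; then
    echo "skip: no pom.xml in $dir"
    return 1
  fi
  if ! command -v mvn >/dev/null 2>&1; then
    echo "skip: mvn not on PATH"
    return 1
  fi
  # -q keeps Maven quiet; package compiles, runs tests, and builds the jar(s).
  (cd "$dir" && mvn -q package) && ls "$dir"/target/*.jar
}

build_project crunchdemo || true
```

The first build downloads a lot of dependencies, so give it a few minutes.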

Next time, we’ll write our first Crunch program in a MemPipeline, and explore more advanced topics like Java objects and materialization.

