After a brief hiatus in the great state of Alaska, I’m back to discuss actually analyzing data on the Hadoop cluster we set up together in previous blog posts. Specifically, we’ll be looking at crime data from the City of Chicago, from 2001 to the day this was first written, 8/26/2015. There are a couple of things we need to take care of before we get started, though, Sherlock.
So if you shut down your cluster last time, please reboot it by starting up VirtualBox and double clicking your Virtual Machine (VM). Next, let’s make sure everything is up and running on the cluster. Do this by navigating to Ambari (user: admin, password: admin) again and verifying that there aren’t any errors or stopped services. If there are any, just try restarting them by selecting the option on each service. Once it’s all green, continue below!
You’ll notice that every time you start up your VM, you’re logged in as the user root. This user is the superuser in the Linux VM that our Hadoop ecosystem runs on. If you want more information, read here, but for now just know that this user can read, write, and delete any file on the file system. As a result, doing things as root can be dangerous if you don’t know what you’re doing (you could accidentally delete key system files). It’s also kind of cheating, since you’re an all-powerful user in the system. You will basically never be able to log in as root on any real, production system. So let’s make a new user account for us to play around with as we begin: sherlock.
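If you’re ever unsure which account you’re working as, two standard Linux commands will tell you (nothing here is Hadoop-specific):

```shell
# Print the current user's name and numeric ID; root is always UID 0.
whoami
id -u
```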
To create this user, ensure you are root and execute the following commands:
useradd sherlock
passwd sherlock
The second command lets you set a password for your new user. Since this is a closed VM, I just set mine to nothing by hitting ‘Enter’ twice, despite Linux’s warnings about a bad password.
You shouldn’t get any errors, and there will now be a new user in your Linux VM. Next, still as root, issue the following commands to create a space for this user in HDFS and hand ownership of it over to sherlock:
hadoop fs -mkdir /user/sherlock
hadoop fs -chown sherlock /user/sherlock
Note that this directory in HDFS is NOT created automatically for you upon creating a new user.
sherlock DOES, however, have a home directory created automatically in the Linux environment: /home/sherlock/. You can switch users to sherlock from root by typing:
su - sherlock
Once you’re sherlock, we need to get us some data. For this specific tutorial, we’re going to look at the crime data freely available from the City of Chicago. Download it by clicking the following link:
Once that’s done, you should have a csv file somewhere in your downloads with the name Crime_-_2001_to_present.csv. How do we move this to our cluster though?
There are a number of ways to do this, such as WinSCP (on Windows), but we’ll use a nice service that comes with your cluster: Hue. Access it by navigating to this link: http://127.0.0.1:8000. You should be logged in automatically as the user hue. Hue is a great way to interact with your cluster as a developer. It’s a nice GUI that lets you browse the file system, move files around in HDFS, schedule jobs to crunch data, and interact with data on the fly. It’s also how we’ll get our data onto the cluster in this example.
Click on the brown file browser icon at the top of the webpage, choose ‘Upload > Files’, and select Crime_-_2001_to_present.csv.
That should finish relatively quickly. Congratulations, you’ve just uploaded your first data file to your Hadoop cluster!
Let’s go back to the command line and our super-sleuth, sherlock.
Make a directory for the data and copy the file into it (Hive will expect a directory, not a bare file, when we point a table at this location):
hadoop fs -mkdir -p /user/sherlock/chicago_crimes
hadoop fs -cp /user/hue/Crime_-_2001_to_present.csv /user/sherlock/chicago_crimes/
Next, let’s start the Hive Command Line Interface:
hive
And create a new database and table:
CREATE DATABASE chicago_crimes;

CREATE EXTERNAL TABLE chicago_crimes.all_crimes(
  id string,
  casenumber string,
  caldate string,
  block string,
  iucr string,
  primarytype string,
  description string,
  locationdescription string,
  arrest boolean,
  domestic boolean,
  beat string,
  district string,
  ward string,
  communityarea string,
  fbicode string,
  xcoordinate string,
  ycoordinate string,
  year string,
  updatedon string,
  latitude decimal(10,0),
  longitude decimal(10,0),
  location string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/sherlock/chicago_crimes';
If you’re familiar with RDBMS systems, you’ll be able to figure out what’s going on here. The only really different thing to know about is the ‘external’ keyword before the table name. It tells Hive to leave the source data alone in its current location. If you omit the ‘external’ keyword, Hive creates a managed table and moves your data from its current location into its own warehouse directory.
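One way to convince yourself of the difference, sketched against the table we just created: Hive reports the table type in its metadata, and dropping an external table removes only that metadata, never the underlying file.

```sql
-- Look for "Table Type: EXTERNAL_TABLE" in the output
DESCRIBE FORMATTED chicago_crimes.all_crimes;

-- For an EXTERNAL table, this would drop only Hive's metadata;
-- the CSV in /user/sherlock/chicago_crimes would be untouched.
-- (For a managed table, the data would be deleted too.)
-- DROP TABLE chicago_crimes.all_crimes;
```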
Anyways, once you submit that query, let’s make sure the data is all there:
select count(*) from chicago_crimes.all_crimes;
I got 288,126 reported crimes. Due to the dynamic nature of the file you downloaded earlier, you might get more than I did.
There were around 2.7 million people living in Chicago in 2015, according to US Census data. That’s about 1 reported crime for every 10 people in Chicago.
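With the table in place, you can already start asking questions of the data. For example, here’s a sketch of a per-year breakdown, using the year column from the DDL above:

```sql
SELECT year, COUNT(*) AS reported_crimes
FROM chicago_crimes.all_crimes
GROUP BY year
ORDER BY year;
```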
This is obviously just the beginning of our adventure with this data. If you want to see what we’ll be tackling next, stick this query into Hive and try to see what is “wrong” with our data in this table:
select * from chicago_crimes.all_crimes where locationdescription = 'SCHOOL';
HINT: Loading data is NEVER as easy as you want it to be.
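As a teaser, here’s one classic CSV pitfall worth checking for, sketched with plain shell tools and a made-up row (the values below are hypothetical, not taken from the actual file): a quoted field that contains a comma gets torn apart by anything that naively splits on every comma, which is exactly what ‘fields terminated by ','’ does.

```shell
# A made-up CSV row whose quoted last field contains a comma
row='10000092,HY190059,"(41.8, -87.6)"'

# Naive comma splitting (what Hive's simple delimited format does)
# counts 4 fields here, not 3 -- the quoted field is split in half.
echo "$row" | awk -F',' '{print NF}'   # -> 4
```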