“I don’t know where to start”
I recently spoke to about a dozen aspiring data scientists, and a very common concern was, “There are just so many different programming languages, and so many different software packages and databases. I feel overwhelmed, and I don’t know where to start!”
In this post, I will explain everything you need to learn to get started as a data “hacker.”
What is a data hacker?
Harlan Harris and Vincent Granville have both written articles about the different types of data scientists. Harris’s article is more about the roles data scientists play, whereas Granville’s is more about their skills. Harris breaks data scientists into four types: Data Businesspeople (technical managers, data entrepreneurs), Data Creatives (hackers, jack-of-all-trades types), Data Developers (big data programmers), and Data Researchers (usually PhDs in computer science or statistics working in academia or in large research labs). I consider myself a jack-of-all-trades, so I think I fit into the Data Creative type. In this article, I will focus on how to become a data hacker (what Harris calls a Data Creative).
So how do I become a Data Hacker?
Hackers tend to have a broad set of technical skills, although they may not be true experts at any one skill:
– Statistical Programming
– Machine Learning
– Visualization
– Reporting/Dashboarding
– Databases
– Big Data
– Data “Munging”
This is a long list! How does someone actually learn all of these things in a reasonable amount of time? The key is to pick a single comprehensive technology stack, and do everything using that stack.
The R-Hadoop technology stack
R is a free, open-source statistical programming language originally based on the S programming language. Here are a few reasons why R is a great place to start for data analysis:
– It’s completely free: SAS and SPSS are expensive to get started with, and you often have to pay extra for add-on modules if you want to try new methods
– It’s comprehensive: almost any statistical or machine-learning task you could think of has pre-built libraries for you to use in R.
– R is easy to learn, and especially good for hacking: you don’t need much programming experience to get started doing useful work in R (see the short example after this list)
– R is a full-fledged programming language: unlike SAS or SPSS, R is not just a procedural language for doing data analysis
– R is great for getting a job, especially in the tech industry
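To give a flavor of how little code it takes, here is a minimal sketch using the mtcars dataset that ships with R. Nothing beyond base R is needed, and it runs in a fresh session:

```r
# A few lines of R are enough to do real analysis.
# mtcars ships with R, so no extra packages are required.
data(mtcars)
summary(mtcars$mpg)                      # quick descriptive statistics
fit <- lm(mpg ~ wt + hp, data = mtcars)  # linear regression: mpg vs. weight and horsepower
summary(fit)                             # coefficients, p-values, R-squared
plot(mtcars$wt, mtcars$mpg,              # base-graphics scatterplot
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```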
Hadoop is a free, open-source distributed computing framework. Hadoop is used for all aspects of Big Data: storage, databases, analysis, and even modeling. Hadoop is used at many of the top companies in the world, including Facebook, Twitter, and LinkedIn. When you hear about Hadoop, you typically hear about MapReduce, which is a framework that allows you to solve (extremely) large-scale data processing problems on a cluster of commodity computers. Here are a few reasons why Hadoop is a great way to get started with Big Data:
– Again, it’s completely free
– It’s easy to get started, even if you don’t have your own cluster of computers: check out Cloudera for an online trial and a VM you can download for free
– Hadoop is comprehensive: almost any Big Data storage or processing problem can be solved within the Hadoop ecosystem
– Hadoop is great for getting a job: it seems like it’s on every data science job ad nowadays!
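To make the MapReduce idea concrete, here is a tiny word count written in plain R. This is only a conceptual sketch of the map/shuffle/reduce pattern running locally, not actual Hadoop code; a real Hadoop job distributes these same steps across a cluster:

```r
# A purely local illustration of the MapReduce pattern (not Hadoop code):
# map each document to (word, 1) pairs, then reduce by summing counts per word.
docs <- c("big data is big", "hadoop makes big data manageable")

# Map step: emit one count per word in each document
mapped <- unlist(lapply(docs, function(doc) {
  words <- unlist(strsplit(doc, "\\s+"))
  setNames(rep(1, length(words)), words)
}))

# Shuffle + reduce step: group by word and sum the counts
reduced <- tapply(mapped, names(mapped), sum)
reduced
# big = 3, data = 2, hadoop = 1, is = 1, makes = 1, manageable = 1
```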
The R-Hadoop stack allows you to do almost everything you need for data hacking:
– Statistical Programming: R has packages for data exploration, statistical tests, regression, and everything else you could imagine.
– Machine Learning: The caret package is a wrapper for dozens of machine-learning algorithms, and makes it easy to train, tune, and test machine-learning models (a short caret sketch follows this list).
– Visualization: The ggplot2 package allows you to make professional-looking, fully customizable 2D plots (a ggplot2 example follows this list).
– Reporting/Dashboarding: The knitr package allows you to generate beautiful, dynamic reports with R. The shiny package is a web framework for building stylish, interactive web apps with R.
– Databases: Hive is a highly scalable data warehousing system built on Hadoop (originally developed at Facebook) for ad-hoc, SQL-style querying of huge datasets. Cassandra (used by Netflix) and HBase (used by Twitter) are other distributed databases that serve different needs and integrate with the Hadoop ecosystem.
– Big Data: This is what Hadoop was made for. Hadoop allows you to store and process essentially unlimited amounts of data on commodity hardware (you don’t need a supercomputer anymore). And depending on how big you mean by “big” data, R has some spectacular libraries for working directly with it, like data.table.
– Data “Munging”: Data munging refers to “cleaning” and rearranging data into more useful forms (think parsing unusual date formats, removing malformed values, turning columns into rows, etc.). Both R and Hadoop have facilities for this. R is awesome and easy for small to moderate-sized data sets, and Hadoop allows you to write your own programs to clean and rearrange really big data sets when needed (a data.table sketch follows this list).
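Here is a minimal caret sketch, assuming the caret and randomForest packages (and their dependencies) are installed. It trains and evaluates a classifier on R’s built-in iris data:

```r
# Minimal caret sketch: train and evaluate a classifier on the built-in iris data.
library(caret)

set.seed(42)
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[in_train, ]
testing  <- iris[-in_train, ]

# caret gives the same train()/predict() interface for dozens of model types;
# swapping method = "rf" for, say, "glmnet" is a one-word change.
model <- train(Species ~ ., data = training, method = "rf",
               trControl = trainControl(method = "cv", number = 5))

predictions <- predict(model, newdata = testing)
confusionMatrix(predictions, testing$Species)
```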
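And a minimal ggplot2 example, again using a dataset that ships with R:

```r
# Minimal ggplot2 sketch: a scatterplot with a fitted line, using the built-in mtcars data.
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                 # one point per car
  geom_smooth(method = "lm") +   # fitted regression line with a confidence band
  labs(title = "Fuel efficiency vs. weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon")
```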
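Finally, a small data-munging sketch with data.table (assuming the package is installed). The little data frame here is made up purely for illustration:

```r
# Minimal data-munging sketch with data.table.
library(data.table)

dt <- data.table(
  id     = 1:5,
  date   = c("03/14/2015", "07/04/2015", "not a date", "12/25/2015", "01/01/2016"),
  amount = c("10.5", "20", "??", "15.75", "8")
)

# Parse an unusual date format; malformed values become NA instead of crashing.
dt[, date := as.Date(date, format = "%m/%d/%Y")]

# Coerce amounts to numeric (the malformed value becomes NA), then drop bad rows.
dt[, amount := suppressWarnings(as.numeric(amount))]
clean <- dt[!is.na(date) & !is.na(amount)]

# Aggregate: total amount per month.
clean[, .(total = sum(amount)), by = .(month = format(date, "%Y-%m"))]
```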
Note: You should do all of this on a Unix-based system
R and Hadoop can both be run on Windows, but it is much more natural and easier to use a Unix-based system. There may be a bit of a learning curve if Unix is new to you, but the rewards are incredibly high, and it looks great on a resume.
What the R-Hadoop stack is not great for
R and Hadoop can cover most use-cases, but there are situations where you’ll want to use something else. For example, Python has libraries that make text-mining much easier and more scalable than it is in R. And sometimes, if you’re building a web app, Shiny is just not flexible enough, so you’ll want to use a more traditional web framework. But for most purposes, most of the time, you can get by with R and Hadoop.
Why just stick to one technology stack?
Some of you might be saying: “Shouldn’t you always just use the right tool for the job? Why would you let yourself get so sucked into one ecosystem?” That is a very good point, but I think there are huge advantages to focusing on one thing, especially when you are starting out. First, you will waste a lot of time if you keep switching training paths and resources, because the startup costs of learning a new technology are so high. Second, it is extremely useful and motivating to focus on one technology, because getting good at one thing is the fastest way to start solving real-world problems (instead of the toy examples you usually work through when learning a new technology). And finally, R and Hadoop are often the best tools for the job anyway!
So how do I get started?
Here’s how I recommend you get started: first, work through some “toy” examples and a little formal training. Then pick a simple project that interests you and try it out! You will learn a lot by overcoming the natural roadblocks that arise when working with real data. Here are some good resources to get you started:
Intro to R
- Code School’s TryR is a free, short, interactive online intro to R
- Coursera’s Computing for Data Analysis is a free, comprehensive online course on R
Intro to Hadoop
- Cloudera Live and Cloudera’s VM are great intros to what Hadoop is all about
Project ideas
- Find a dataset of historical results from the World Series, create a couple of visualizations using ggplot2, and build a simple web app in Shiny to display them (a minimal Shiny sketch follows this list).
- Build a classification model to identify survivors of the Titanic using the Kaggle Titanic dataset and R’s caret package
- Pull the 1990 and 2000 US Census data into a Hive database on Amazon Web Services. Can you find any surprising demographic differences between 1990 and 2000 with a few Hive queries?
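To show how little code the first project idea requires, here is a minimal Shiny sketch. The file name (world_series.csv) and its columns (year, winning_league) are assumptions for illustration; adapt them to whatever dataset you actually find:

```r
# Minimal Shiny + ggplot2 sketch for the World Series project idea.
# "world_series.csv" and its columns (year, winning_league) are hypothetical.
library(shiny)
library(ggplot2)

ws <- read.csv("world_series.csv")

ui <- fluidPage(
  titlePanel("World Series history"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("years", "Year range:",
                  min = min(ws$year), max = max(ws$year),
                  value = range(ws$year), sep = "")
    ),
    mainPanel(plotOutput("wins_plot"))
  )
)

server <- function(input, output) {
  output$wins_plot <- renderPlot({
    # Filter to the selected year range, then count titles by league
    subset_ws <- ws[ws$year >= input$years[1] & ws$year <= input$years[2], ]
    ggplot(subset_ws, aes(x = winning_league)) +
      geom_bar() +
      labs(x = "Winning league", y = "Number of titles")
  })
}

shinyApp(ui = ui, server = server)
```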