Why you should learn Spark if you want to be a data scientist

You should learn Spark
About a year ago, I said that the best place to start for most aspiring data scientists is to learn R and Hadoop. Hadoop is still useful and is a very marketable skill to have, but Spark is quickly emerging as the new Big Data framework of choice. In this post, I will talk through some new developments that make it a great career choice to spend some time learning Spark.

Don’t lots of people still use Hadoop?

Absolutely. Hadoop is still extremely popular and useful, and you should still understand its basics. But in the last couple of years, Spark has become probably the trendiest (and most lucrative) big data technology in the world. Add in the fact that Spark is much faster than Hadoop for many workloads and that you can write Spark programs in Python (and R, though less completely), and focusing on Spark is a no-brainer.

What is Spark?

Spark is a big data computation framework like Hadoop. In fact, Spark can use HDFS (the Hadoop Distributed Filesystem) as its filesystem. The main reason that Spark is so much faster than Hadoop is that Hadoop repeatedly reads and writes the intermediate steps of MapReduce jobs to disk, whereas Spark caches most of its computations in memory. The reads and writes to disk are slow, which is why even simple Hive queries can take minutes or more to complete.

Spark’s main abstraction is the Resilient Distributed Dataset (RDD). As indicated by its name, an RDD is a dataset that is distributed across the Spark compute nodes. When an RDD is created, it is stored in memory, which allows you to query or transform it repeatedly without writing to or reading from disk. This in-memory caching also makes Spark ideal for training machine-learning models, since training ML models typically involves iterative computations on a single dataset (e.g. repeatedly adjusting weights in an artificial neural network via gradient descent).

What about Hive or “SQL on Hadoop”?

One of the coolest things about Spark is that it has built-in data connectors to many different kinds of data sources, including Hive. But Spark takes it one step further with SparkSQL, which can be much faster than Hive.

How to run Spark

Note: if you just want a taste of Spark, you can read the code examples in Getting Started with Apache Spark.

Option A: Install it locally

Like Hadoop, Spark is meant to run on a cluster of machines. Nevertheless, you can run Spark on a single machine for learning and testing purposes. There are several steps involved, but I was able to get Spark up and running on my Windows laptop in an hour or so. If things don’t work perfectly for trickier dependencies, you can always try Docker, the virtualization and containerization software. It gives you a completely fresh, standard Linux instance on your Windows or Mac PC (and you can take the same thing you build locally and deploy it easily to the cloud).

Option B: Use AWS EMR (or another cloud computing service)

I really like this option, because it allows you to get practice with another extremely marketable technology, and you don’t have to install a bunch of stuff on your computer. Amazon Web Services (AWS) Elastic MapReduce (EMR) is a web service that allows you to spin up your own Hadoop and Spark clusters using a point and click web interface. This is the way that I started. There are a few steps you have to do first, including signing up for an AWS account and setting up SSH keys to connect to EMR. However, spending a few hours getting started on AWS will help you over and over again if you are trying to learn data science. And you can put EMR on your resume, too. Best of all, there is a limited free tier for first time users.

Resources for learning Spark

I always recommend learning by doing as much as possible. With Spark, there are some good online tutorials to help you get started. But first, it might help to spend an hour learning about how Spark works, so that the tutorials make a little more sense. Another tip: start with things you know, and try to learn one thing at a time. For example, if you already know Hive, try using Spark to query Hive. If you already know Python, use PySpark instead of trying to also learn Scala.

Spark explained

These two links are a good place to start to get a basic understanding of what Spark is.
The 5-Minute Guide to Understanding the Significance of Apache Spark
Spark overview from e-book “Mastering Apache Spark”

Tutorials and Examples

Most people learn best by example, so I have included a few good tutorials with plenty of examples.
Getting Started with Apache Spark: Includes a few nice, simple examples of using Spark (requires an email address to access)
Quick Start: A very light quick start that you can use after you install Spark on your computer
Spark Examples: A collection of several useful examples to get the basics of Spark


Spark is the trendiest technology out there for data scientists. And compared to Hadoop, it is not too hard to get started. I highly recommend that you give it a try.

Action steps

  1. Figure out how to run Spark, either locally or in AWS
  2. Spend one hour (but not more than that) reading about Spark’s architecture and its execution model. Don’t worry
    if you don’t understand everything right away.
  3. Pick one (and only one) Spark tutorial to try out, and follow every step. This is not the time to get too creative.
    Just stay focused and stick to one thing at a time.
  4. Try a simple project of your own. Keep it simple to start with, so you can boost your confidence and motivation.

Getting started with machine learning with Python and scikit-learn

Why Python?

For most people, I recommend getting started with R, because the tools in R for exploratory data analysis and visualization are easier and more comprehensive than the tools in Python. However, if you have a computer science background or if you want to jump on the fast track to high-performance machine-learning, then you might want to start with Python. Python is an awesome programming language, because it is easy to write, readable, and well-documented, and it is very fast if you do it right. It also has extensive libraries for scientific computing, stats, and machine learning.

The Python scientific stack

Python has a comprehensive and integrated scientific computing stack that has an incredible combination of performance, ease of use, and depth. It is made up of several libraries and utilities, including:

  • numpy: Fast and easy array computations and manipulations. Includes “broadcasting” and “fancy” indexing, which give Python arrays some of the simple syntax of R vectors.
  • scipy: Scientific computing functionality, including optimized matrix operations and data structures, numerical optimization, calculus functions, etc.
  • pandas: Data frames in Python. A little more complicated than R data frames, but with much better performance (more like R data tables than data frames in practice)
  • scikit-learn: Machine learning in Python. Fast and easy to use.
  • Jupyter (previously IPython Notebook): A browser-based notebook for scientific computing in Python and other languages.
  • Matplotlib: A nice plotting library. Many people use Seaborn as a user-friendly alternative.
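To give a flavor of how these pieces fit together, here is a tiny sketch using numpy broadcasting and a pandas data frame (illustrative data only):

```python
import numpy as np
import pandas as pd

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Broadcasting: the column means are subtracted from every row at once,
# with no explicit loop -- much like R's vectorized arithmetic.
centered = X - X.mean(axis=0)

# The same data as a labeled pandas data frame.
df = pd.DataFrame(X, columns=["a", "b"])
col_means = df.mean()

print(centered)        # [[-1. -1.]
                       #  [ 1.  1.]]
print(col_means["a"])  # 2.0
```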

A couple of quick tutorials to get started

Kaggle has a great tutorial series on getting started with Python. It takes you through the basics of loading, manipulating, and transforming data, and of building a random forest machine learning model. It uses the Titanic survival dataset to walk you through all of these skills in a practical case study. It is best to start at Part I, though you can skip it if you are impatient.

The Kaggle tutorials should get you started and start building your confidence. After these, I recommend the scikit-learn quick start tutorial. This gives a bit of the bigger picture on scikit-learn and the concepts of machine learning. The scikit-learn documentation taught me so much, and I highly recommend it. The scikit-learn website also contains lots of code examples, although the examples can seem a bit complex at first.

Where to go from here

If you really sit down and work through these tutorials, you will be ready to try some more examples on your own. I recommend checking out a couple more straightforward Kaggle competitions like Give Me Some Credit. Machine learning is a craft, and it takes practice to get good at it, but the payoff can be huge.

Creating a great data science resume


“I haven’t heard back from any companies”

I hear a familiar story from a lot of aspiring data scientists: “I have sent out my resume to 25 companies, and I haven’t heard back from any of them! I have pretty good skills, and I think I have a pretty good resume. I don’t know what’s going on!”

Your resume probably sucks

My immediate conclusion after hearing your story: your resume probably sucks. If you are not getting any responses from any companies, and your skills are a reasonable match for the job description, then it almost certainly means that you are getting sabotaged by a bad resume.

What is the purpose of a resume?

The only real purpose of a resume is to get job interviews. That’s it. The purpose of a resume is not to:

  • list all of your job experience
  • list all of your technical skills
  • show off your great educational background

Your resume should explicitly include only the exact items that will help you get a job interview.

What makes a good resume?

A good resume tells a story that is targeted to the job description and company. And furthermore, someone reading the resume should be able to understand that story in less than 20 seconds. If you keep these principles in mind, it is actually not too hard to write a decent resume.

Crafting your story

The first thing you need to do when creating your resume is to come up with your story. This part is a little tricky, but it is extremely important and even a little fun. Your story should be simple and compelling, and it should be a good fit for the job description. A good strategy for coming up with a great resume story is to think of two or three things that are interesting about yourself and tie them together. You should start out with a simple story in plain English (something I learned in Ramit Sethi’s excellent Dream Job course).

For example, a story that works for me is, “I am an experienced data scientist, I have a great math background, and I am good at explaining complicated stuff.” If I were applying to a more software development-focused data science job, a possible story for me could be, “I have experience building really fast and accurate machine-learning models in Python. I also understand big data technology like Hadoop.” For a more business-focused role, a story could be, “I have experience using stats and machine-learning to find useful insights in data. I also have experience presenting those insights with dashboards and automated reports, and I am good at public speaking.” When you come up with your story, don’t be afraid to try some different ones on for size. All of the three stories I just wrote are true about me. It’s all about positioning yourself the right way for the company.

What skills and technologies should I list?

People often ask me what skills and technologies they should have on their resume. There are really three main questions here.

  1. How proficient do I have to be before I put a skill or technology on my resume?
  2. Which things should I emphasize?
  3. Which things should I not include?

Question 1: What am I allowed to include?

My general rule of thumb is that you should not put something on your resume unless you have actually used it. Just having read about it does not count. Generally, you don’t have to have used it in a massive-scale production environment, but you should have at least used it in a personal project.

Question 2: What should I emphasize?

In order to decide what to emphasize, you have two great sources of information. One is the job description itself. If the job description is all about R, you should obviously emphasize R. Another, more subtle, source is the collection of skills that current employees list on LinkedIn. If someone is part of your network or has a public profile, you can see their LinkedIn profile (if you can’t see their profile, it might be worth getting a free trial for LinkedIn premium). If all of the team members have 30 endorsements for Hive, then they probably use Hive at work. You should definitely list Hive if you know it.

Question 3: What should I not include?

Because your resume is there to tell a targeted story in order to get an interview, you really should not have any skills or technologies listed that do not fit with that story. For example, if your story is all about being a “PhD in Computer Science with deep understanding of neural networks and the ability to explain technical topics,” you probably should not include your experience with WordPress. Including general skills like HTML and CSS is probably good, but you probably do not need to list that you are an expert in Knockout.JS and elastiCSS. This advice is doubly true for non-technical skills like “customer service” or “phone direct sales.” Including things like that actually makes the rest of your resume look worse, because it emphasizes that you have been focused on a lot of things other than data science, and — worse — that you do not really understand what the team is looking for. If you want to include something like that to add color to your resume, you should add it in the “Additional Info” section at the end of the resume, not in the “Skills and Technologies” section.

What if I have no experience?

If you have no working experience as a data scientist, then you have to figure out how to signal that you can do the job anyway. There are three main ways to do this: independent projects, education, and competence triggers.

Independent projects

If you don’t have any experience as a data scientist, then you absolutely have to do independent projects. Luckily, it is very easy to get started. The simplest way is to do a Kaggle competition. Kaggle is a competition site for data science problems, and there are lots of great problems with clean datasets. I wrote a step-by-step tutorial for trying your first competition using R. I recommend working through a couple of Kaggle tutorials and posting your code on Github. Posting your code is extremely important. In fact, having a Github repository posted online is a powerful signal that you are a competent data scientist (it is a competence trigger, which we will discuss in a moment).

Kaggle is the simplest way to complete independent projects, but there are many other ways. There are three parts to completing an independent data science project:

  1. Coming up with an idea
  2. Acquiring the data
  3. Analyzing the data and/or building a model

Kaggle is great, because steps 1 and 2 are completed for you. But a huge amount of data science is exactly those parts, so Kaggle can’t fully prepare you for a job as a data scientist. I will help you now with steps 1 and 2 by giving you a list of a few ideas for independent data science projects. I encourage you to steal these.

  1. Use Latent Semantic Analysis to extract topics from tweets. Pull the data using the Twitter API.
  2. Use a bag of words model to cluster the top questions on /r/AskReddit. Pull the data using the Reddit API.
  3. Identify interesting traffic volume spikes for certain Wikipedia pages and correlate them to news events. Access and analyze the data by using AWS Open Datasets and Amazon Elastic MapReduce.
  4. Find topic networks in Wikipedia by examining the link graph in Wikipedia. Use another AWS Open Dataset.
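To give idea 1 a concrete shape, here is a toy sketch of Latent Semantic Analysis with scikit-learn (made-up documents standing in for tweets; everything here is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

tweets = [
    "great game tonight the team played well",
    "the team lost the final game",
    "new phone has a faster chip",
    "chip shortage delays the new phone",
]

# LSA = a TF-IDF term-document matrix followed by truncated SVD.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(tweets)

lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(X)  # each row: the document's weight on each topic

print(doc_topics.shape)  # (4, 2)
```

In a real project you would feed in tweets pulled from the Twitter API instead of this hand-written list.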

I mention a few other sample projects in Becoming a Data Hacker.


Education

Another way to prove your ability is through your educational background. If you have a Masters or a PhD in a relevant field, you should absolutely list relevant coursework and brag about your thesis. Make sure that you put your thesis work in the context of data science as much as possible. Be creative! If you really can’t think of any way that your thesis is relevant to data science, then you probably should not make a big deal out of it on your resume.

Competence triggers and social proof

Competence triggers are usually discussed in the context of interviews, but they play a particularly important role in data science resumes. Competence triggers are behaviors or attributes of a person that “trigger” others to see them as competent. In an interview, a typical competence trigger is having a strong, firm handshake or being appropriately dressed. There are a few key competence triggers that will really boost your resume:

  • A Github page
  • A Kaggle profile
  • A StackExchange or Quora profile
  • A technical blog

Why do these boost your resume so much? The reason is that data scientists use these tools to share their own work and find answers to questions. If you use these tools, then you are signaling to data scientists that you are one of them, even if you haven’t ever worked as a data scientist. Even better, a good reputation on sites like StackExchange or Quora gives you social proof.

Don’t worry about doing all of these at once. I absolutely think you should have a Github page, and you should post code from your independent projects there. If you have performed decently well in a couple of Kaggle competitions, then your Kaggle profile will be impressive, too. Answering questions on StackExchange or Quora can be a bit of a distraction from your real work, so it should not be a priority. And starting your own blog is great, but probably not necessary. As an alternative to a blog, you can focus on writing good documentation in a README in your Github repositories.

Resume rules of thumb

As you write your resume, there are a few basic rules of thumb to keep in mind.

  1. Keep it to one side of one page: Most recruiters only look at a resume for a few seconds. They should be able to see that you are a good candidate immediately, without turning the page.
  2. Use simple formatting: Don’t do anything too fancy. It should not be hard to parse what your resume says.
  3. Use appropriate industry lingo, but otherwise keep it simple: Again, this goes to readability.
  4. Don’t use weird file types: PDF is good, but you should probably also attach a DOCX file. You basically should not use any other file formats, because your resume is useless if people can’t open it.

Do I need to include a cover letter?

A lot of job applications say that a cover letter is optional. Typically, you should include a cover letter anyway. Make sure the cover letter is not too generic. Actually explain why you would be a good fit for the role and the company. Do a little research, and be positive. Remember the rule about resumes, though: don’t make the cover letter too long.

If you are just “casually” sending your resume to a current employee, it is okay to skip the cover letter. This is another example of how networking is critical. The best way to get an interview is to be recommended by a current employee. If you can do this, then your resume will float to the top of the pile automatically.

My annotated resume

If you want to see what my resume looks like, here’s a link to it in Google Drive. It’s not perfect, but it doesn’t have to be. Keep that in mind. Remember, your resume is there to get you an interview. It is not your magnum opus. Happy hunting!

PS: Be sure to sign up for my email list if you want more content like this, and please leave any questions in the Comments.

From Grad Student to Data Scientist


“I don’t wanna be a professor”

A few days ago, I got an email from a friend of a friend named Jeremy. Jeremy is finishing up his PhD in Cognitive Science, and he wants to become a data scientist or a software engineer. In other words, Jeremy does not want to become a professor or postdoc. Luckily, he has learned quite a lot about machine learning in the course of his studies (you can find him on LinkedIn if you are hiring). But he still had a lot of questions for me, and I think a lot of readers have the same questions as he did.

Am I qualified?

Grad students are a bit of an anxious bunch (at least I was). They spend all of their time around other grad students, postdocs, and professors (i.e., Really Smart People). Unfortunately, spending all of your time around Really Smart People can make you think that you don’t really belong, and even that you might be an intellectual fraud. Therefore, lots of grad students I know don’t realize how smart and capable they are.

Here’s the truth: If you have spent a few years in a tough, technical grad program studying tough, technical stuff, then you are smart enough for data science. Don’t believe those guys you read on StackOverflow or Quora who say that you have to have implemented stochastic gradient descent in assembly and deployed it to Hadoop on a Raspberry Pi (and published your results in Science) in order to be a “real” data scientist. That’s just not true. In fact, most data scientists are just pretty good hackers who can learn as they go and get the job done. If you can do that, then you are qualified to become a data scientist. Of course, that doesn’t mean that you will be any good at data science when you’re starting out, but you already know that you can learn fast.

What job title should I go for?

Most likely, your first job title should be “Data Scientist.” As a basic rule of thumb, if you have a PhD, try not to start as a “Junior Data Scientist”. Just plain “Data Scientist” is a great title, and “Senior Data Scientist” is probably out of reach for a first job. With a Masters or less, “Junior Data Scientist” is okay, but you’ll want to remove that “Junior” quickly. “Data Analyst” or “Analyst” or “Senior Analyst” can be really good, depending on the company (with a PhD, go for Senior Analyst). But you will want to fight hard for the Data Scientist title to expand your options for changing jobs in the future.

What skills do I need to have?

To really call yourself a Data Scientist, you need to be able to work with huge and messy datasets, build machine-learning models, or (preferably) both. Caveat: this advice is generic, and if you have a very specific type of company you are interested in, make sure you learn the skills you need for that company. For example, if you are trying to get a job in financial engineering, you should probably understand time series.

Big Data

You should be able to handle millions of records of structured data (like CSVs containing mostly numbers) or unstructured data (like raw text). How you do this is mostly up to you. Maybe you have some experience with high-performance computing, or maybe you have used Hadoop or Spark. Maybe you are just really good at writing optimized Python or R code.

Machine Learning

You should be familiar with the main types of machine-learning models (classification, clustering, etc.), and you should have actually implemented them. You should know how to choose the right type of model for the problem, and (even more importantly) you should know how to evaluate the models appropriately (training and test sets, cross-validation, AUC, etc.). I commonly see people who are dabbling in machine-learning who build models with no understanding of how to tell if the models are any good. This is probably worse than not building a model at all.
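A minimal sketch of that evaluation workflow in scikit-learn (synthetic data, illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)

# Hold out a test set so the model is judged on data it never saw.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Cross-validation gives a more stable estimate than a single split.
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()

print(round(auc, 3), round(cv_auc, 3))
```

The point is not this particular model: it is that the held-out AUC, not the training accuracy, is what tells you whether the model is any good.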

What technology do I need to know?

Big Data Tools: Hadoop, Spark, SQL, …?

Learning some Big Data tools is not absolutely essential, but it makes a huge difference on a resume. To get your first job, you do not need to be an expert on Hadoop or Spark, but you should at least learn a little SQL to query databases. You can learn that at SQL Zoo. If you want to impress, you should download the Cloudera VM and work through the Hadoop tutorial. You can also try out the Spark Sandbox from Hortonworks. Hadoop is still the standard, but Spark is incredibly trendy. If you can actually do a real project with Hadoop or Spark, that’s even better. If you are very tech-savvy, you can spin up an Amazon Web Services instance and try out some open datasets.
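If you want to practice the basics with zero setup, Python’s built-in sqlite3 module gives you a real SQL engine (toy data, illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE rides (city TEXT, fare REAL)")
conn.executemany("INSERT INTO rides VALUES (?, ?)",
                 [("NYC", 12.5), ("NYC", 7.0), ("SF", 20.0)])

# The bread-and-butter query pattern: filter, group, aggregate.
rows = conn.execute(
    "SELECT city, COUNT(*), AVG(fare) FROM rides GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('NYC', 2, 9.75), ('SF', 1, 20.0)]
conn.close()
```

The same SELECT/GROUP BY patterns carry over directly to Hive, SparkSQL, and production databases.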

Machine-Learning: R or Python?

To land your first job, you need to be good at one of two languages: R or Python. R is amazing for reporting, dashboarding and prototyping. It has many more models and statistical tests built-in than Python. Python is amazing for building fast models out of the box, and it has the added advantage of being more understandable by engineers.

If you want to do more reporting, presentations and business-focused data science, or if you will just be building prototypes of systems, then R is great. But if you want to work on a Data Science team that deploys models to production, you will most likely need to learn Python.

As a rule of thumb, if you are a Computer Science grad student, you should just learn Python. You will like it better than R, and it will fit better with the rest of your skill set. You should learn the packages numpy, scipy, pandas, and scikit-learn. If you are an Econ grad student, you should probably learn R first. Your skill set probably lends itself to more business-focused data science, and R is absolutely incredible for this. You should learn the packages ggplot2, plyr (or dplyr), reshape2, caret, and Shiny. Many Data Scientists (including me) know both R and Python really well.

CV or Resume?

One really important difference between the academic world and the business world is that most (American) companies ask for a resume, not a CV. I’m not being pedantic: there is a huge difference between a resume and a CV. I often see resumes from grad students that are way too long and way too academic. Most academic CVs are just lists of degrees, papers, conference presentations, committee memberships, and so on. Many CVs are several pages long. As Ramit Sethi says, unless you are Bill Gates or Barack Obama, your resume should probably only be one page (that is, one side of one sheet of paper). It’s not that you don’t have two pages of accomplishments, it’s more that a resume is not a list of accomplishments. Your resume is supposed to be a piece of paper that a hiring manager can look at for ten seconds and decide if you merit an interview. That is it! So you should only include things on your resume that fit that purpose. It should be clear at a glance that you have a great educational background and that you have the skills and leadership qualities needed for the job.

Bias against Nerds

One of the biggest hurdles you will have to overcome in transitioning from grad school to the business world is the bias that a lot of business people have against academics. Business people do believe that you are smart, but many of them reflexively question your soft skills. Most successful people did not get there by writing academic papers. They got there through business knowledge, hard experience and finely-honed instincts. So it’s understandable that they wouldn’t understand why a 28-year-old computer scientist who has been studying GPU architectures could be worth $100k a year. You have to convince them that you won’t just continue your thesis research on their dime. After all, the business makes money from sales, not from cool algorithms (at least not directly). The way to overcome their concerns is to do your best to really understand their business, and to come prepared to explain how your work in data can make a contribution (at least look at their website beforehand, for goodness’ sake). Also, please don’t wear your awesome Firefly t-shirt to the interview.

A few extra ideas

In the process of reviewing this post, Jeremy mentioned a few things to me that he had personally discovered:

  • Networking: “Hit up all of your contacts”
  • Indeed.com: “Great job alerts for specific job titles”
  • Kaggle and LinkedIn job postings: Tend to be super targeted (and therefore useful)

Jeremy also mentioned that some people had recommended the Insight Data Fellows program. Sounds great, but I don’t have any personal experience there.

This sounds intimidating…

Yes, this does seem daunting, but remember: it can’t be harder than writing a thesis. You have energy, passion, and drive. Not to mention a top-quality brain. You just have to put it all together, and in no time, you can have the sexiest job of the 21st century. If you have any questions, please leave a comment. And stay tuned for more data science career tips in the future.

Machine Learning with R: An Irresponsibly Fast Tutorial

Machine learning without the hard stuff


As I said in Becoming a data hacker, R is an awesome programming language for data analysts, especially for people just getting started. In this post, I will give you a super quick, very practical, theory-free, hands-on intro to writing a simple classification model in R, using the caret package. If you want to skip the tutorial, you can find the R code here. Quick note: if the code examples look weird on mobile, try a desktop (you can’t do the tutorial on your phone, anyway!).

The caret package

One of the biggest barriers to learning for budding data scientists is that there are so many different R packages for machine learning. Each package has different functions for training the model, different functions for getting predictions out of the model and different parameters in those functions. So in the past, trying out a new algorithm was often a huge ordeal. The caret package solves this problem in an elegant and easy-to-use way. Caret contains wrapper functions that allow you to use the exact same functions for training and predicting with dozens of different algorithms. On top of that, it includes sophisticated built-in methods for evaluating the effectiveness of the predictions you get from the model. I recommend that you do all of your machine-learning work in caret, at least as long as the algorithm you need is supported. There's a nice little intro paper to caret here.
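A sketch of what that uniformity looks like (illustrative, using R’s built-in iris data rather than anything from this tutorial):

```r
library(caret)

# The same train() interface works for very different algorithms --
# only the `method` argument changes.
model_rf  <- train(Species ~ ., data = iris, method = "rf")
model_knn <- train(Species ~ ., data = iris, method = "knn")

# And the same predict() interface works for both fitted models.
predict(model_rf,  newdata = iris[1:3, ])
predict(model_knn, newdata = iris[1:3, ])
```

Swapping algorithms becomes a one-word change instead of learning a whole new package API.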

The Titanic dataset

Most of you have heard of a movie called Titanic. What you may not know is that the movie is based on a real event, and Leonardo DiCaprio was not actually there. The folks at Kaggle put together a dataset containing data on who survived and who died on the Titanic. The challenge is to build a model that can look at characteristics of an individual who was on the Titanic and predict the likelihood that they would have survived. There are several useful variables that they include in the dataset for each person:

  • pclass: passenger class (1st, 2nd, or 3rd)
  • sex
  • age
  • sibsp: number of Siblings/Spouses Aboard
  • parch: number of Parents/Children Aboard
  • fare: how much the passenger paid
  • embarked: where they got on the boat (C = Cherbourg; Q = Queenstown; S = Southampton)

So what is a classification model anyway?

For our purposes, machine learning is just using a computer to “learn” from data. What do I mean by “learn?” Well, there are two main types of learning:

  • supervised learning: Think of this as pattern recognition. You give the algorithm a collection of labeled examples (a training set), and the algorithm then attempts to predict labels for new data points. The Titanic Kaggle challenge is an example of supervised learning, in particular classification.
  • unsupervised learning: Unsupervised learning occurs when there is no training set. A common type of unsupervised learning is clustering, where the computer automatically groups a bunch of data points into different “clusters” based on the data.

Installing R and RStudio

In order to follow this tutorial, you will need to have R set up on your computer. Here's a link to a download page: Inside R Download Page. I also recommend RStudio, which provides a simple interface for writing and executing R code: download it here. Both R and RStudio are totally free and easy to install.

Installing the required R packages

Go ahead and open up RStudio (or just R, if you don't want to use RStudio). For this tutorial, you need to install the caret package and the randomForest package (you only need to do this part once, even if you repeat the tutorial later).
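If you haven't installed them before, something like the following should do it (this downloads from CRAN, so you need an internet connection):

```r
# Install caret and randomForest from CRAN (one-time setup)
install.packages("caret")
install.packages("randomForest")
```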

Loading the required R packages

Now we have to load the packages into the working environment (unlike installing the packages, this step has to be done every time you restart your R session).
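Loading is just two library calls:

```r
# Load the packages into the current R session
library(caret)
library(randomForest)
```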

Loading in the data

Go to the Kaggle download page to find the dataset. Download train.csv and test.csv, and be sure to save them to a place you can remember (I recommend a folder on your desktop called “Titanic”). You might need to sign up for Kaggle first (you should be using Kaggle anyway!).

To load in the data, you first set the R working directory to the place where you downloaded the data.

For example, I downloaded mine to a directory on my Desktop called Titanic, so I typed in
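something like this (your path will differ, so treat this one as a placeholder):

```r
# Point R at the folder containing train.csv and test.csv
setwd("~/Desktop/Titanic")
```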

Now, in order to load the data, we will use the read.table function

This command reads in the file “train.csv”, using the delimiter “,”, including the header row as the column names, and assigns it to the R object trainSet.
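A sketch of that call, assuming train.csv is in your working directory:

```r
# Read the training data: comma-separated, with a header row
trainSet <- read.table("train.csv", sep = ",", header = TRUE)
```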

Let's read in the testSet also:
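Same call, different file:

```r
# Read the test data the same way
testSet <- read.table("test.csv", sep = ",", header = TRUE)
```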

Now, just for fun, let's take a look at the first few rows of the training set:
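Assuming trainSet loaded correctly, head shows the first six rows:

```r
# Peek at the first few rows of the training data
head(trainSet)
```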

You'll see that each row has a column “Survived,” which is 1 if the person survived and 0 if they didn't. Now, let's compare the training set to the test set:
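Take the same peek at the test set:

```r
# Peek at the first few rows of the test data
head(testSet)
```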

The big difference between the training set and the test set is that the training set is labeled, but the test set is unlabeled. On Kaggle, your job is to make predictions on the unlabeled test set, and Kaggle scores you based on the percentage of passengers you correctly label.

Testing for useful variables

The single most important factor in being able to build an effective model is not picking the best algorithm, or using the most advanced software package, or understanding the computational complexity of the singular value decomposition. Most of machine learning is really about picking the best features to use in the model. In machine learning, a “feature” is really just a variable or some sort of combination of variables (like the sum or product of two variables).

So in a classification model like the Titanic challenge, how do we pick the most useful variables to use? The most straightforward way (but by no means the only way) is to use crosstabs and conditional box plots.

Crosstabs for categorical variables

Crosstabs show the interactions between two variables in a very easy to read way. We want to know which variables are the best predictors for “Survived,” so we will look at the crosstabs between “Survived” and each other variable. In R, we use the table function:
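In R that's one line; with Survived as the first argument it indexes the rows, and Pclass indexes the columns:

```r
# Crosstab of Survived (rows) against passenger class (columns)
table(trainSet$Survived, trainSet$Pclass)
```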

Looking at this crosstab, we can see that “Pclass” could be a useful predictor of “Survived.” Why? The first column of the crosstab shows that of the passengers in Class 1, 136 survived and 80 died (i.e. 63% of first class passengers survived). On the other hand, in Class 2, 87 survived and 97 died (i.e. only 47% of second class passengers survived). Finally, in Class 3, 119 survived and 372 died (i.e. only 24% of third class passengers survived). Damn, that's messed up.

We definitely want to use Pclass in our model, because it clearly has strong predictive value for whether someone survived. Now, you can repeat this process for the other categorical variables in the dataset and decide which variables you want to include (I'll show you which ones I picked later in the post).

Plots for continuous variables

Plots are often a better way to identify useful continuous variables than crosstabs are (this is mostly because crosstabs aren't so natural for numerical variables). We will use “conditional” box plots to compare the distribution of each continuous variable, conditioned on whether the passengers survived or not ('Survived' = 1 or 'Survived' = 0). You may need to install the *fields* package first, just like you installed *caret* and *randomForest*.
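With the fields package, a conditional box plot is one call: bplot.xy plots the distribution of its second argument conditioned on its first. A sketch, assuming trainSet from earlier:

```r
# One-time install if needed:
# install.packages("fields")
library(fields)

# Box plots of Age, conditioned on Survived (0 or 1)
bplot.xy(trainSet$Survived, trainSet$Age)

# summary() also reveals how many NA values Age has
summary(trainSet$Age)
```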

The box plots of Age for those who survived and those who died are nearly the same. That means that Age probably did not have a large effect on whether one survived or not. The y-axis is Age and the x-axis is Survived (Survived = 1 if the person survived, 0 if not).

[Box plot of Age, conditioned on Survived]

Also, there are lots of NA’s. Let’s exclude the variable Age, because it probably doesn’t have a big impact on Survived, and also because the NA’s make it a little tricky to work with.
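The same call works for Fare:

```r
library(fields)

# Box plots of Fare, conditioned on Survived (0 or 1)
bplot.xy(trainSet$Survived, trainSet$Fare)
```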

As you can see, the box plots of Fare are quite different for those who survived and those who died. Again, the y-axis is Fare and the x-axis is Survived (Survived = 1 if the person survived, 0 if not).

[Box plot of Fare, conditioned on Survived]

Also, there are no NA’s for Fare. Let’s include this variable.

Training a model

Training the model uses a pretty simple command in caret, but it's important to understand each piece of the syntax. First, we have to convert Survived to a Factor data type, so that caret builds a classification model instead of a regression model. Then, we use the train command to train the model (go figure!). You may be asking what a random forest algorithm is. You can think of it as training a bunch of different decision trees and having them vote (remember, this is an irresponsibly fast tutorial). Random forests work pretty well in *lots* of different situations, so I often try them first.
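Put together, the training step looks something like this. The formula `Survived ~ ...` means “predict Survived from these variables”; the variable list here is the set I settled on from the crosstabs and box plots above, and the seed value is arbitrary (it just makes the cross-validation folds reproducible):

```r
# Convert Survived to a factor so caret trains a classifier, not a regression
trainSet$Survived <- factor(trainSet$Survived)

# Set a seed so the cross-validation folds are reproducible
set.seed(42)

# Train a random forest, evaluated with 5-fold cross-validation
model <- train(Survived ~ Pclass + Sex + SibSp + Embarked + Parch + Fare,
               data = trainSet,
               method = "rf",
               trControl = trainControl(method = "cv", number = 5))
```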

Evaluating the model

For the purposes of this tutorial, we will use cross-validation scores to evaluate our model. Note: in real life (i.e. not Kaggle), most data scientists also split the training set further into a training set and a validation set, but that is for another post.

What is cross-validation?

Cross-validation is a way to evaluate the performance of a model without needing any data other than the training data. It sounds complicated, but it's actually a pretty simple trick. Typically, you randomly split the training data into 5 equally sized pieces called “folds” (so each piece of the data contains 20% of the training data). Then, you train the model on 4/5 of the data, and check its accuracy on the 1/5 of the data you left out. You then repeat this process with each split of the data. At the end, you average the percentage accuracy across the five different splits of the data to get an average accuracy. Caret does this for you, and you can see the scores by looking at the model output:
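Printing the trained model object is enough to see them:

```r
# Print the cross-validation summary, including the mtry tuning results
model
```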

There are a few things to look at in the model output. The first thing to notice is where it says “The final value used for the model was mtry = 5.” The value mtry is a hyperparameter of the random forest model that determines how many variables the model randomly tries as candidates at each split of the trees. The table shows different values of mtry along with their corresponding average accuracies (and a couple of other metrics) under cross-validation. Caret automatically picks the value of the hyperparameter mtry that was the most accurate under cross-validation. This approach is called using a “tuning grid” or a “grid search.”

As you can see, with mtry = 5, the average accuracy was 0.8170964, or about 82 percent. As long as the training set isn't too fundamentally different from the test set, we should expect that our accuracy on the test set should be around 82 percent, as well.

Making predictions on the test set

Using caret, it is easy to make predictions on the test set to upload to Kaggle. You just have to call the predict method on the model object you trained. Let's make the predictions on the test set and add them as a new column.
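The call looks like this (this is the version that triggers the error discussed next):

```r
# Predict Survived for the test set and attach it as a new column
testSet$Survived <- predict(model, newdata = testSet)
```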

Uh, oh! There is an error here! When you get this type of error in R, it means that you are trying to assign a vector of one length to a vector of a different length, so the two vectors don't line up. So how do we fix this problem?

One annoying thing about caret and randomForest is that if there is missing data in the variables you are using to predict, it will just not return a prediction at all (and it won't throw an error!). So we have to find the missing data ourselves.
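summary() reports the NA count for each column, which makes the culprit easy to spot:

```r
# Check each column of the test set for missing values
summary(testSet)
```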

As you can see, the variable “Fare” has one NA value. Let's fill (“impute”) that value in with the mean of the “Fare” column (there are better and fancier ways to do this, but that is for another post). We do this with an ifelse statement. Read it as follows: if an entry in the column “Fare” is NA, then replace it with the mean of the column (removing the NA's when you take the mean); otherwise, leave it the same.
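In code, that reads:

```r
# Replace any missing Fare with the mean of the non-missing Fares
testSet$Fare <- ifelse(is.na(testSet$Fare),
                       mean(testSet$Fare, na.rm = TRUE),
                       testSet$Fare)
```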

Okay, now that we fixed that missing value, we can try again to run the predict method.
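Same call as before, but this time it returns a prediction for every row:

```r
# Predict Survived for the (now complete) test set
testSet$Survived <- predict(model, newdata = testSet)
```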

Let's remove the unnecessary columns that Kaggle doesn't want, and then write the testSet to a csv file.
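Kaggle only wants PassengerId and Survived, so a sketch of the write-out step (column names taken from the Kaggle files) might look like:

```r
# Keep only the columns Kaggle expects, then write a CSV without row names
submission <- testSet[, c("PassengerId", "Survived")]
write.table(submission, file = "submission.csv",
            col.names = TRUE, row.names = FALSE, sep = ",")
```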

Uploading your predictions to Kaggle

Uploading predictions is easy. Just go to the Kaggle page for the competition, click “Make a submission” on the sidebar, and select the file submission.csv. Click “Submit,” and then Kaggle will score your results on the test set.


Well, we didn't win, but we did pretty well. In fact, we beat several hundred other people and one of the benchmarks created by Kaggle! Our accuracy on the test set was 77%, which is pretty close to the cross-validation results of 82%. Not bad for our first model ever.

You can find the R code here.

Improving the model

This post only scratches the surface of what you can do with R and caret. Here are a few ideas for things to try in order to improve the model.

  • Try including different variables in the model: leave some out or add some in
  • Try combining variables into more useful variables: sometimes you can multiply or add variables together, or concatenate different categorical variables together
  • Try transforming the existing variables in clever ways: maybe turn a numerical variable into a categorical variable based on different ranges (e.g. 0-10, 10-90, 90-100)
  • Try a different algorithm: maybe neural networks, logistic regression or gradient boosting machines work better. Better yet, train a few different types of models and combine the results by averaging the probabilities (this is called ensembling)

Next steps

Okay, so you've done one machine learning classification tutorial and submitted a solution to Kaggle. That's an awesome start, and it's more than the vast majority of people ever do. So what's next? Here are a few things you can do:

  • Try another Kaggle competition! There are a few competitions out there that are great for learning, like Give Me Some Credit or Don’t Get Kicked. The forums contain lots of great advice and example solutions.
  • Learn more about predictive analytics and caret. The book Applied Predictive Modeling was written by Max Kuhn, the creator of caret. I haven't read it, but it comes highly recommended. His blog has also been incredibly useful to me.
  • Keep reading this blog! I will continue to post about practical machine learning. If you'd like, you can subscribe to my email list on the sidebar so that you never miss a post.