Data Science is Cool

A cool data scientist. Image: Eleazar Paradise / https://www.flickr.com/photos/eleazarparadise/4970198148/in/photostream/

Let’s face it, data science is cool. And since it’s cool, there are lots of great jobs out there for data people. Unfortunately, lots of people who would love data science just don’t know how to land their first data science job. In this post, I will tell you the three critical steps you need to know to get started.

My Story

The Five Stages of Stormtrooper Grief. Image: JD Hancock / Flickr

Two years ago, I was in a PhD program studying math. I really loved learning math (and still do), but I wasn’t sure that I wanted to become a professor. At one point, I had a project fall apart completely after a year of work. I went through all five stages of grief in about 3 hours, and then I decided:

I Need to Do Something Else!

I knew I needed to do something other than becoming a professor, but I still wasn’t sure what to do. There were a few options I could think of. I could become a teacher, maybe in a high school, but that came with a whole host of difficulties (not least of which was the low pay). I could become a software developer, but software development wouldn’t use any of the math or stats skills I learned in school (and besides, I was not a great programmer).

Enter Data Science

It was at this point that I just started googling randomly. Eventually, I heard about a field called “Marketing Analytics,” where you would use statistical modeling to design targeted marketing programs. Now this was cool! I found the site Meetup.com and started to go to local data-focused meetups. It was here that I first learned about Big Data, machine learning, and “Data Science” (which didn’t sound very science-y at all, but did sound cool). I decided that I wanted a job doing Data Science, and soon!

Okay, so how do I actually get a job?

For me, getting a job in data science was 80 percent networking and 20 percent skills, not the other way around. I think many people make the mistake of thinking that they have to be world-class experts at convolutional neural networks and Hadoop MapReduce before they even talk to a single employer. On the contrary, I think that for getting into the field of data science, your professional connections and your credibility are even more important than your skills. Here’s why: data science is such a huge and poorly defined field that there is no single technical skill set that every “data scientist” shares. For most jobs, it is much more important to demonstrate to employers that you have general, transferable data-related technical skills than to be an expert in the specific technologies they mention in the job description. (Hint: the “requirements” on a job description usually go well beyond what you actually need to do the job. “They are more like guidelines.”)

Step 1: The right kind of networking

You always hear, “It’s not what you know, it’s who you know.” People often say that in a cynical or resigned way, and they are really trying to imply that you can’t get a good job without already being a member of the local yacht club. On the contrary, I think there’s a more positive, proactive way to manage the “who you know” part of getting a data science job. Ramit Sethi calls it natural networking. When many people network, they are just there to use the people in their network for a job referral or to make a sale (I’ve been a victim of a sales call disguised as “networking” several times). The key to natural networking is to build up and maintain a network of professional contacts with whom you are building an authentic, two-way relationship. This means, for example, that your network should consist of people that you can help, not just people who you think can help you. For example, I was referred to my current job by some people that I met at a Meetup, and now I have paid it forward a few times by referring others to jobs and offering career advice. It’s a pretty simple and fun way to contribute, and it is an essential part of building an authentic professional network.

So how should you get started building your network? There are three strategies that I found useful in my own job search:

– Go to local data science Meetups at least once a month
– Systematically reach out on email and LinkedIn to at least 1 person a week
– Try to have at least one coffee meeting/informational interview a month (preferably one a week)

Step 2: What skills do I really need?

The skills you need really depend on the job you’re looking for. If you’re looking to be a high-performance data software engineer, you probably need to know Java and to really understand Hadoop MapReduce. If you want to do marketing analytics for an ad agency, you might need to know SAS. But what if you aren’t sure where to start? My recommendation is to learn R and a little bit about Hadoop. I talk about this at length, including the best R packages to learn, in my article Becoming a Data Hacker. The key point of the article is to pick one technology stack to focus on, and to do at least one project in that technology stack.

Step 3: The interview

From the company’s perspective, most job interviews are all about establishing two things:
– Can the candidate do the job?
– Is the candidate a good fit for the team?

The first question mainly comes down to technical skills and subject-matter knowledge, and the second one mostly comes down to personality and social skills. The best way to nail the first question is to have done real, independent data science projects, so you can talk about data science in convincing detail. It helps to have also read widely and gone to data science meetups, so that you can talk about data science in a general context.

The best way to nail the second question is to know how to answer the innocuous-sounding “tell us about yourself” questions appropriately. This is harder than it sounds, and it’s where interview practice is critical. For example, a typical interview question is: “What is your biggest weakness?” A naive person might just answer the question directly: “I procrastinate too much.” An inexperienced cynic might try to give a weakness that’s really a strength: “Sometimes I work too hard.” But a master interviewer would say something like: “Well, I am pretty good at managing projects, but I struggle sometimes with managing people. To improve this, I am taking an online management course, and I’ve taken on small leadership roles within my team lately.” The master interviewer gives an honest answer, but they also assuage the fears of the interview team, and they demonstrate self-awareness by showing that they are aware of the problem and trying to fix it. Always think about what the interviewer is really asking, and answer the real question, not just the surface question.

Next steps

There’s more to come on this topic. But this short post should be enough to get you started. For now, here are a few action steps:

– Sign up for your first meetup. Go to meetup.com and search for data-related meetup groups in your local area. Here are some keywords to search: “big data,” “hadoop,” “analytics,” “R,” “data science,” “NoSQL.”
– Pick a small project to try for learning purposes (some ideas at the bottom of this post). It doesn’t have to be particularly serious or hard, and it will be an awesome way to learn some new skills and beef up your credentials.

Your typical, everyday data hacker. Image: elhombredenegro / https://www.flickr.com/photos/77519207@N02/6818192898/in/photostream/

“I don’t know where to start”

I recently spoke to about a dozen aspiring data scientists, and a very common concern was, “There are just so many different programming languages, and so many different software packages and databases. I feel overwhelmed, and I don’t know where to start!”

In this post, I will explain everything you need to learn to get started as a data “hacker.”

What is a data hacker?

Harlan Harris and Vincent Granville have both written articles about the different types of data scientists. Harris’s article is more about the roles of data scientists, whereas Granville’s article is more about the skills of data scientists. Harris breaks data scientists into 4 types: Data Businesspeople (technical managers, data entrepreneurs), Data Creatives (hackers, jack-of-all-trades types), Data Developers (big data programmers), and Data Researchers (usually PhDs in computer science or statistics working in academia or in large research labs). I consider myself a jack-of-all-trades, so I think I fit into the Data Creative type. In this article, I will focus on how to become a data hacker (called a Data Creative by Harlan Harris).

So how do I become a Data Hacker?

Hackers tend to have a broad set of technical skills, although they may not be true experts at any one skill:
– Statistical Programming
– Machine Learning
– Visualization
– Reporting/Dashboarding
– Databases
– Big Data
– Data “Munging”

This is a long list! How does someone actually learn all of these things in a reasonable amount of time? The key is to pick a single comprehensive technology stack, and do everything using that stack.

The R-Hadoop technology stack

R is a free, open-source statistical programming language originally based on the S programming language. Here are a few reasons why R is a great place to start for data analysis:
– It’s completely free: SAS and SPSS are expensive to get started with, and you often have to pay extra for new methods if you want to try them out
– It’s comprehensive: almost any statistical or machine-learning task you could think of has pre-built libraries for you to use in R.
– R is easy to learn, and especially good for hacking: you don’t need to have a lot of programming experience to get started doing useful work in R
– R is a full-fledged programming language: unlike SAS or SPSS, R is not just a procedural language for doing data analysis
– R is great for getting a job, especially in the tech industry

Hadoop is a free, open-source distributed computing framework. Hadoop is used for all aspects of Big Data: storage, databases, analysis, and even modeling. Hadoop is used at many of the top companies in the world, including Facebook, Twitter, and LinkedIn. When you hear about Hadoop, you typically hear about MapReduce, which is a framework that allows you to solve (extremely) large-scale data processing problems on a cluster of commodity computers. Here are a few reasons why Hadoop is a great way to get started with Big Data:
– Again, it’s completely free
– It’s easy to get started, even if you don’t have your own cluster of computers: check out Cloudera for an online trial and a VM you can download for free
– Hadoop is comprehensive: almost any Big Data storage or processing problem can be solved within the Hadoop ecosystem
– Hadoop is great for getting a job: it seems like it’s on every data science job ad nowadays!
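
To make the MapReduce idea a little more concrete, here is a minimal sketch of the classic word-count example. It is written in plain Python and runs on a single machine, so it only imitates what Hadoop does across a cluster. The function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration and are not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    On a real cluster the framework does this between map and reduce;
    here it is just a dictionary of lists.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data science is cool"]))
```

The point of splitting the work this way is that each map call only looks at one line and each reduce call only looks at one key, so in principle many of them can run at the same time on different machines.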

The R-Hadoop stack allows you to do almost anything you need to for data hacking:
– Statistical Programming: R has packages for data exploration, statistical tests, regression, and everything else you could imagine.
– Machine Learning: The caret package is a wrapper for dozens of machine learning algorithms, and makes it easy to train, tune, and test machine-learning models.
– Visualization: The ggplot2 package allows you to make professional-looking, fully customizable 2D plots.
– Reporting/Dashboarding: The knitr package allows you to generate beautiful, dynamic reports with R. The shiny package is a web framework for building stylish, interactive web apps with R.
– Databases: Hive is a highly-scalable data warehousing system built on Hadoop for ad-hoc SQL-style querying of huge datasets (developed at Facebook). HBase (used by Twitter) is a NoSQL database built on top of Hadoop’s file system, and Cassandra (used by Netflix) is a separate distributed database that is not built on Hadoop but can be used alongside it.
– Big Data: This is what Hadoop was made for. Hadoop allows you to store and process essentially unlimited amounts of data on commodity hardware (you don’t need a supercomputer anymore). And depending on how big your “big” data really is, R has some spectacular libraries for working with it directly, like data.table.
– Data “Munging”: Data munging refers to “cleaning” data and rearranging data in more useful ways (think parsing unusual date formats, removing malformed values, turning columns into rows, etc.). Both R and Hadoop have facilities for this. R is awesome and easy for small to moderate-sized data sets, and Hadoop allows you to write your own programs to clean and rearrange really big data sets when needed.
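
If “munging” still sounds abstract, here is a rough sketch of the three chores mentioned in the last bullet: parsing inconsistent date formats, dropping malformed values, and turning columns into rows. It uses only the Python standard library to stay self-contained. The column names (date, clicks, signups), the list of date formats, and the helper names are all assumptions invented for the example, not taken from any real dataset or package.

```python
from datetime import datetime

# Assumed set of date formats that might show up in a messy file.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def parse_date(text):
    """Try each known format in turn; return None if none of them match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_number(text):
    """Return a float, or None for missing or malformed values like 'n/a'."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def munge(rows, value_columns):
    """Clean 'wide' rows and reshape them to 'long' (one row per measurement)."""
    long_rows = []
    for row in rows:
        day = parse_date(row["date"])
        if day is None:
            continue  # drop rows whose date cannot be parsed
        for column in value_columns:
            value = parse_number(row.get(column))
            if value is not None:  # drop malformed or empty cells
                long_rows.append(
                    {"date": day.isoformat(), "variable": column, "value": value}
                )
    return long_rows


if __name__ == "__main__":
    raw = [
        {"date": "2014-03-01", "clicks": "10", "signups": "2"},
        {"date": "03/02/2014", "clicks": "n/a", "signups": "3"},
        {"date": "not a date", "clicks": "5", "signups": "1"},
    ]
    for record in munge(raw, ["clicks", "signups"]):
        print(record)
```

Most real munging work is just more of this: deciding what counts as a valid value, and deciding what shape the data should have before you analyze it.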

Note: You should do all of this on a Unix-based system

R and Hadoop can both be run on Windows, but it is much easier and more natural to use a Unix-based system. There might be a bit of a learning curve, but the rewards of learning Unix are incredibly high, and it’s great for a resume.

What the R-Hadoop stack is not great for

R and Hadoop can cover most use-cases, but there are situations where you’ll want to use something else. For example, Python has libraries that make text-mining much easier and more scalable than it is in R. And sometimes, if you’re building a web app, Shiny is just not flexible enough, so you’ll want to use a more traditional web framework. But for most purposes, most of the time, you can get by with R and Hadoop.

Why just stick to one technology stack?

Some of you might be saying: “Shouldn’t you always just use the right tool for the job? Why would you let yourself get so sucked into one ecosystem?” That is a very good point, but I think that there are huge advantages to focusing on one thing, especially when you are starting out. First of all, you will waste lots of time if you switch training paths and resources all of the time, because the startup costs of learning a new technology are so high. Secondly, it is extremely useful and motivating to focus on one technology, because getting good at one thing is the fastest way to be able to solve real-world problems (instead of the toy examples you usually use when learning a new technology). And finally, R and Hadoop are often the best tools for the job, anyway!

So how do I get started?

Here’s how I recommend you get started: first, start with “toy” examples and a little formal training. Then, pick a simple project that interests you, and then try it out! You will learn a lot by overcoming the natural roadblocks that arise in working with real data. Here are some good resources to get you started:

Intro to R

– Code School’s TryR is a free, short, interactive online intro to R
– Coursera’s Computing for Data Analysis is a free, comprehensive online course on R

Intro to Hadoop

Cloudera Live and Cloudera’s VM are great intros to what Hadoop is all about

Project ideas

– Find a dataset of historical results from the World Series, create a couple of visualizations using ggplot2, and create a simple web-app in Shiny to display the visualizations.
– Build a classification model to identify survivors of the Titanic using the Kaggle Titanic dataset and R’s caret package.
– Pull the 1990 and 2000 US Census data into a Hive database on Amazon Web Services. Can you find any surprising demographic differences between 1990 and 2000 with a few Hive queries?
