From Grad Student to Data Scientist

“I don’t wanna be a professor”

A few days ago, I got an email from a friend of a friend named Jeremy. Jeremy is finishing up his PhD in Cognitive Science, and he wants to become a data scientist or a software engineer. In other words, Jeremy does not want to become a professor or postdoc. Luckily, he has learned quite a lot about machine learning in the course of his studies (you can find him on LinkedIn if you are hiring). But he still had a lot of questions for me, and I think a lot of readers have the same questions as he did.

Am I qualified?

Grad students are a bit of an anxious bunch (at least I was). They spend all of their time around other grad students, postdocs, and professors (i.e., Really Smart People). Unfortunately, spending all of your time around Really Smart People can make you think that you don’t really belong, and even that you might be an intellectual fraud. As a result, lots of grad students I know don’t realize how smart and capable they are.

Here’s the truth: If you have spent a few years in a tough, technical grad program studying tough, technical stuff, then you are smart enough for data science. Don’t believe those guys you read on Stack Overflow or Quora who say that you have to have implemented stochastic gradient descent in assembly and deployed it to Hadoop on a Raspberry Pi (and published your results in Science) in order to be a “real” data scientist. That’s just not true. In fact, most data scientists are just pretty good hackers who can learn as they go and get the job done. If you can do that, then you are qualified to become a data scientist. Of course, that doesn’t mean that you will be any good at data science when you’re starting out, but you already know that you can learn fast.

What job title should I go for?

Most likely, your first job title should be “Data Scientist.” As a basic rule of thumb, if you have a PhD, try not to start as a “Junior Data Scientist.” Just plain “Data Scientist” is a great title, and “Senior Data Scientist” is probably out of reach for a first job. With a Master’s or less, “Junior Data Scientist” is okay, but you’ll want to remove that “Junior” quickly. “Data Analyst,” “Analyst,” or “Senior Analyst” can be really good, depending on the company (with a PhD, go for Senior Analyst). But you will want to fight hard for the Data Scientist title to expand your options for changing jobs in the future.

What skills do I need to have?

To really call yourself a Data Scientist, you need to be able to work with huge and messy datasets, build machine-learning models, or (preferably) both. Caveat: this advice is generic, and if you have a very specific type of company you are interested in, make sure you learn the skills you need for that company. For example, if you are trying to get a job in financial engineering, you should probably understand time series.

Big Data

You should be able to handle millions of records of structured data (like CSVs containing mostly numbers) or unstructured data (like raw text). How you do this is mostly up to you. Maybe you have some experience with high-performance computing, or maybe you have used Hadoop or Spark. Maybe you are just really good at writing optimized Python or R code.
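As one small illustration of handling more rows than you would want to hold in memory, here is a minimal sketch that streams a CSV one row at a time and keeps only running totals. It uses just the Python standard library; the function name `mean_by_key` and the column names `city` and `sales` are made up for this example, and a tiny in-memory string stands in for a real multi-million-row file.

```python
import csv
import io
from collections import defaultdict


def mean_by_key(lines, key_col, value_col):
    """Stream rows one at a time and return {key: mean of value_col}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    # DictReader pulls one line at a time, so memory use stays flat
    # no matter how many rows the source contains.
    for row in csv.DictReader(lines):
        sums[row[key_col]] += float(row[value_col])
        counts[row[key_col]] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical stand-in for a very large file on disk.
sample = io.StringIO("city,sales\nA,10\nB,4\nA,20\nB,6\n")
result = mean_by_key(sample, "city", "sales")
print(result)
```

With a real file you would pass `open(path, newline="")` instead of the `StringIO` object. The same pattern (read a piece, update a summary, discard the piece) is the basic idea that tools like Hadoop and Spark scale up across many machines.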

Machine Learning

You should be familiar with the main types of machine-learning models (classification, clustering, etc.), and you should have actually implemented them. You should know how to choose the right type of model for the problem, and (even more importantly) you should know how to evaluate the models appropriately (training and test sets, cross-validation, AUC, etc.). I commonly see people dabbling in machine learning who build models with no understanding of how to tell whether the models are any good. This is probably worse than not building a model at all.
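To make the evaluation vocabulary a little more concrete, here is a rough sketch of two of the ideas mentioned above: holding out a test set, and computing AUC. It is written in plain Python so that nothing is hidden; the helper names `split_train_test` and `auc` are my own for this example, and the labels and scores are toy numbers rather than output from a real model. In practice you would more likely use scikit-learn’s `train_test_split` and `roc_auc_score`.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def auc(labels, scores):
    """Chance that a random positive outscores a random negative.

    Ties count as half a win. This pairwise version is O(n^2), which is
    fine for a demonstration but not for large datasets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: true labels and the scores some model assigned on a test set.
train, test = split_train_test(range(8))
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(len(train), len(test), toy_auc)
```

An AUC of 0.5 means the scores rank positives no better than chance, and 1.0 means every positive outranks every negative. The important habit is that the score is computed on rows the model never saw during training.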

What technology do I need to know?

Big Data Tools: Hadoop, Spark, SQL, …?

Learning some Big Data tools is not absolutely essential, but it makes a huge difference on a resume. To get your first job, you do not need to be an expert on Hadoop or Spark, but you should at least learn a little SQL to query databases. You can learn that at SQL Zoo. If you want to impress, you should download the Cloudera VM and work through the Hadoop tutorial. You can also try out the Spark Sandbox from Hortonworks. Hadoop is still the standard, but Spark is incredibly trendy. If you can actually do a real project with Hadoop or Spark, that’s even better. If you are very tech-savvy, you can spin up an Amazon Web Services instance and try out some open datasets.
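If you have never written SQL, the following sketch gives a feel for the kind of query worth being comfortable with. It uses Python’s built-in `sqlite3` module with an in-memory database so that it runs without any setup; the `orders` table and its contents are invented for this example.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 7.5)],
)

# Count orders and total spend per customer, keep customers who spent
# more than 10 in total, and list the biggest spenders first.
query = """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 10
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
conn.close()

for customer, n_orders, total in rows:
    print(customer, n_orders, total)
```

Filtering, grouping, aggregating, and sorting (plus joins, which are not shown here) cover a large share of day-to-day querying, and the syntax carries over with small changes to most other SQL databases.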

Machine-Learning: R or Python?

To land your first job, you need to be good at one of two languages: R or Python. R is amazing for reporting, dashboarding and prototyping. It has many more models and statistical tests built-in than Python. Python is amazing for building fast models out of the box, and it has the added advantage of being more understandable by engineers.

If you want to do more reporting, presentations and business-focused data science, or if you will just be building prototypes of systems, then R is great. But if you want to work on a Data Science team that deploys models to production, you will most likely need to learn Python.

As a rule of thumb, if you are a Computer Science grad student, you should just learn Python. You will like it better than R, and it will fit better with the rest of your skill set. You should learn the packages numpy, scipy, pandas, and scikit-learn. If you are an Econ grad student, you should probably learn R first. Your skill set probably lends itself to more business-focused data science, and R is absolutely incredible for this. You should learn the packages ggplot2, plyr (or dplyr), reshape2, caret, and Shiny. Many Data Scientists (including me) know both R and Python really well.

CV or Resume?

One really important difference between the academic world and the business world is that most (American) companies ask for a resume, not a CV. I’m not being pedantic: there is a huge difference between a resume and a CV. I often see resumes from grad students that are way too long and way too academic. Most academic CVs are just lists of degrees, papers, conference presentations, committee memberships, and so on. Many CVs are several pages long.

As Ramit Sethi says, unless you are Bill Gates or Barack Obama, your resume should probably only be one page (that is, one side of one sheet of paper). It’s not that you don’t have two pages of accomplishments; it’s that a resume is not a list of accomplishments. Your resume is supposed to be a piece of paper that a hiring manager can look at for ten seconds to decide whether you merit an interview. That is it! So you should only include things on your resume that fit that purpose. It should be clear at a glance that you have a great educational background and that you have the skills and leadership qualities needed for the job.

Bias against Nerds

One of the biggest hurdles you will have to overcome in transitioning from grad school to the business world is the bias that a lot of business people have against academics. Business people do believe that you are smart, but many of them reflexively question your soft skills. Most successful business people did not get there by writing academic papers. They got there through business knowledge, hard experience and finely honed instincts. So it’s understandable that they wouldn’t see why a 28-year-old computer scientist who has been studying GPU architectures could be worth 100k a year. You have to convince them that you won’t just continue your thesis research on their dime. After all, the business makes money from sales, not from cool algorithms (at least not directly). The way to overcome their concerns is to do your best to really understand their business, and to come prepared to explain how your work in data can make a contribution (at least look at their website beforehand, for goodness’ sake). Also, please don’t wear your awesome Firefly t-shirt to the interview.

A few extra ideas

In the process of reviewing this post, Jeremy mentioned a few things to me that he had personally discovered:

  • Networking: “Hit up all of your contacts”
  • Indeed.com: “Great job alerts for specific job titles”
  • Kaggle and LinkedIn job postings: Tend to be super targeted (and therefore useful)

Jeremy also mentioned that some people had recommended the Insight Data Fellows program. Sounds great, but I don’t have any personal experience there.

This sounds intimidating…

Yes, this does seem daunting, but remember: it can’t be harder than writing a thesis. You have energy, passion, and drive, not to mention a top-quality brain. You just have to put it all together, and in no time, you can have the sexiest job of the 21st century. If you have any questions, please leave a comment. And stay tuned for more data science career tips in the future.
