Why you should learn Spark if you want to be a data scientist

About a year ago, I said that the best place for most aspiring data scientists to start is learning R and Hadoop. Hadoop is still useful and is a very marketable skill to have, but Spark is quickly emerging as the new Big Data framework of choice. In this post, I will talk through some developments that make learning Spark a great use of your time and a smart career move.

Don’t lots of people still use Hadoop?

Absolutely. Hadoop is still extremely popular and useful, and you should still understand the basics. But in the last couple of years, Spark has become probably the trendiest (and most lucrative) big data technology in the world. Add in the fact that Spark is much faster than Hadoop MapReduce for many workloads and that you can write Spark programs in Python (and R, though with less complete support), and it is a no-brainer to focus on Spark.

What is Spark?

Spark is a big data computation framework like Hadoop. In fact, Spark can use HDFS (the Hadoop Distributed File System) as its filesystem. The main reason that Spark is so much faster than Hadoop is that Hadoop MapReduce writes the intermediate results of each job to disk and reads them back for the next step, whereas Spark can keep intermediate results in memory. Those reads and writes to disk are slow, which is why even simple Hive queries (which traditionally run as MapReduce jobs) can take minutes or more to complete.

Spark’s main abstraction is the Resilient Distributed Dataset (RDD). As its name indicates, an RDD is a dataset that is distributed across the Spark compute nodes. RDDs are evaluated lazily, and when you ask Spark to cache one, it is kept in memory after it is first computed, which allows you to query or transform it repeatedly without writing to or reading from disk. This in-memory caching also makes Spark well suited to training machine-learning models, since training typically involves iterative computations on a single dataset (e.g. repeatedly adjusting weights in an artificial neural network via gradient descent).
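To make the caching idea concrete, here is a minimal PySpark sketch of an iterative computation over one cached RDD. It assumes PySpark and a Java runtime are installed. The function names (point_gradient, fit_slope), the tiny one-weight model y = w * x, and the default step count and learning rate are all made up for illustration; this is not how Spark's own ML library trains models.

```python
def point_gradient(w, point):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w, for one point."""
    x, y = point
    return 2.0 * (w * x - y) * x


def fit_slope(points, steps=50, learning_rate=0.01):
    """Fit y = w * x by gradient descent, reusing one cached RDD on every step.

    Hypothetical helper for illustration; the defaults are arbitrary.
    """
    # Imported here so the pure helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("rdd-cache-sketch")
        .getOrCreate()
    )
    try:
        # cache() only marks the RDD; the first action below materializes it in memory.
        rdd = spark.sparkContext.parallelize(points).cache()
        n = rdd.count()

        w = 0.0
        for _ in range(steps):
            # Every pass reads the cached partitions instead of rebuilding the dataset.
            total = rdd.map(lambda p: point_gradient(w, p)).reduce(lambda a, b: a + b)
            w -= learning_rate * (total / n)
        return w
    finally:
        spark.stop()
```

Calling fit_slope([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]) should move w toward 2, the slope of that made-up data. The point of the sketch is the loop: the same RDD is scanned on every step, which is exactly the access pattern that benefits from keeping it in memory.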

What about Hive or “SQL on Hadoop”?

One of the coolest things about Spark is that it has built-in data connectors to many different kinds of data sources, including Hive. But Spark takes it one step further with Spark SQL, which lets you run SQL queries against that data from inside a Spark program and can be much faster than Hive.
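As a rough sketch of what a Spark SQL query looks like from Python, the example below registers a small in-memory table and aggregates it with plain SQL. It assumes PySpark and a Java runtime are installed. The table name (sales), the rows, and both function names are invented for illustration. Querying real Hive tables would additionally need a Spark session built with enableHiveSupport() and a configured Hive metastore, which this sketch does not cover.

```python
# Made-up rows: (region, amount)
SALES = [("east", 10.0), ("west", 4.5), ("east", 2.5)]

QUERY = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"


def totals_plain(rows):
    """The same aggregation in plain Python, to show what QUERY computes."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def totals_with_spark_sql(rows):
    """Run QUERY with Spark SQL and return {region: total} (hypothetical helper)."""
    # Imported here so totals_plain can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("spark-sql-sketch")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(rows, schema=["region", "amount"])
        # Expose the DataFrame to SQL under the name used in QUERY.
        df.createOrReplaceTempView("sales")
        return {row["region"]: row["total"] for row in spark.sql(QUERY).collect()}
    finally:
        spark.stop()
```

If you already know SQL from Hive, the query string is the familiar part; the rest is just loading data and giving it a table name.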

How to run Spark

Note: if you just want a taste of Spark, you can read the code examples in Getting Started with Apache Spark.

Option A: Install it locally

Like Hadoop, Spark is meant to be run on a cluster of machines. Nevertheless, you can run Spark on a single machine for learning and testing purposes. There are several steps involved, but I was able to get Spark up and running on my Windows laptop in an hour or so. If some of the trickier dependencies don’t work perfectly, you can always try Docker, the containerization software. It lets you run a completely fresh, standard Linux environment on your Windows PC or Mac (and you can take the same thing you build locally and deploy it easily to the cloud).
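Once a local install works, a quick way to check it is a small word count in local mode. The sketch below assumes PySpark is installed (for example with pip install pyspark) and that a Java runtime is available. The function names are made up for illustration, and "local[*]" simply tells Spark to run on all cores of the one machine instead of on a cluster.

```python
def tokenize(line):
    """Split a line of text into lowercase words."""
    return line.lower().split()


def count_words(lines):
    """Count words with Spark running in local mode (hypothetical helper)."""
    # Imported here so tokenize can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("local-word-count")
        .getOrCreate()
    )
    try:
        counts = (
            spark.sparkContext.parallelize(lines)
            .flatMap(tokenize)                 # one record per word
            .map(lambda word: (word, 1))       # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b)   # sum the counts per word
        )
        return dict(counts.collect())
    finally:
        spark.stop()
```

If a call such as count_words(["Spark is fast", "Hadoop is popular"]) returns a dictionary of word counts without errors, your local setup is working.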

Option B: Use AWS EMR (or another cloud computing service)

I really like this option, because it allows you to get practice with another extremely marketable technology, and you don’t have to install a bunch of stuff on your computer. Amazon Web Services (AWS) Elastic MapReduce (EMR) is a web service that lets you spin up your own Hadoop and Spark clusters using a point-and-click web interface. This is how I started. There are a few things you have to do first, including signing up for an AWS account and setting up SSH keys to connect to EMR. However, spending a few hours getting started on AWS will help you over and over again if you are trying to learn data science. And you can put EMR on your resume, too. Best of all, there is a limited free tier for first-time users.

Resources for learning Spark

I always recommend learning by doing as much as possible. With Spark, there are some good online tutorials to help you get started. But first, it might help to spend an hour learning about how Spark works, so that the tutorials make a little more sense. Another tip: start with things you know, and try to learn one thing at a time. For example, if you already know Hive, try using Spark to query Hive. If you already know Python, use PySpark instead of trying to also learn Scala.

Spark explained

These two links are a good place to start to get a basic understanding of what Spark is.
– The 5-Minute Guide to Understanding the Significance of Apache Spark
– Spark overview from e-book “Mastering Apache Spark”

Tutorials and Examples

Most people learn best by example, so I have included a few good tutorials with plenty of examples.
– Getting Started with Apache Spark: Includes a few nice, simple examples of using Spark (requires an email address to access)
– Quick Start: A very light quick start that you can use after you install Spark on your computer
– Spark Examples: A collection of several useful examples to get the basics of Spark

Conclusion

Spark is the trendiest technology out there for data scientists. And compared to Hadoop, it is not too hard to get started. I highly recommend that you give it a try.

Action steps

– Figure out how to run Spark, either locally or in AWS.
– Spend one hour (but not more than that) reading about Spark’s architecture and its execution model. Don’t worry if you don’t understand everything right away.
– Pick one (and only one) Spark tutorial to try out, and follow every step. This is not the time to get too creative. Just stay focused and stick to one thing at a time.
– Try a simple project of your own. Keep it simple to start with, so you can boost your confidence and motivation.