Getting started with machine learning with Python and scikit-learn

Why Python?

For most people, I recommend getting started with R, because its tools for exploratory data analysis and visualization are easier to use and more comprehensive than Python's. However, if you have a computer science background, or if you want a fast track to high-performance machine learning, you might want to start with Python. Python is an awesome programming language: it is easy to write, readable, and well documented, and it can be very fast if you do it right, which in practice means leaning on its optimized numerical libraries rather than writing plain Python loops. It also has extensive libraries for scientific computing, statistics, and machine learning.

The Python scientific stack

Python has a comprehensive and integrated scientific computing stack that has an incredible combination of performance, ease of use, and depth. It is made up of several libraries and utilities, including:

numpy: Fast and easy array computations and manipulations. Includes “broadcasting” and “fancy” indexing, which give Python arrays some of the simple syntax of R vectors.
scipy: Scientific computing functionality, including optimized matrix operations and data structures, numerical optimization, calculus functions, etc.
pandas: Data frames in Python. A little more complicated than R data frames, but with much better performance (closer in practice to R's data.table than to base R data frames).
scikit-learn: Machine learning in Python. Fast and easy to use.
Jupyter (previously IPython Notebook): A browser-based notebook for scientific computing in Python and other languages.
Matplotlib: A nice plotting library. Many people use Seaborn, which is built on top of Matplotlib, as a more user-friendly interface.
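
To make the numpy entry above a little more concrete, here is a minimal sketch of broadcasting and "fancy" indexing. The array contents and variable names are made up purely for illustration.

```python
import numpy as np

# Broadcasting: subtract the per-column mean from every row.
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
col_means = data.mean(axis=0)   # shape (3,)
centered = data - col_means     # (2, 3) minus (3,) is applied row by row

# Fancy indexing: select elements with a list of positions...
values = np.array([10, 20, 30, 40, 50])
picked = values[[0, 2, 4]]

# ...or with a boolean mask, much like filtering a vector in R.
large = values[values > 25]
```

Neither operation needs an explicit loop, which is a large part of why numpy code tends to be both short and fast.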

A couple of quick tutorials to get started

Kaggle has a great tutorial series on getting started with Python. It walks you through the basics of loading, manipulating, and transforming data, and then building a random forest machine learning model, using the Titanic survival dataset as a practical case study. It is best to start at Part I, although you can probably skip it if you are impatient.

Getting Started with Python Part I: Covers basic CSV loading and numpy arrays
Getting Started with Python Part II: Introduces the extremely useful pandas data frame concept
Getting Started with Random Forests: Introduces the scikit-learn library and shows you how to build a basic machine learning model
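
For a rough idea of where these tutorials end up, here is a minimal sketch of the same workflow: put the data in a pandas data frame, pick feature columns, and fit a random forest with scikit-learn. It is not the tutorials' own code. The rows are made-up toy values, and the column names are hypothetical ones loosely modelled on the Titanic data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy, made-up rows so the example is self-contained.
# A real run would load the competition file with pd.read_csv instead.
df = pd.DataFrame({
    "age":       [22, 38, 26, 35, 54, 2, 27, 14],
    "fare":      [7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07],
    "is_female": [0, 1, 1, 1, 0, 0, 1, 1],
    "survived":  [0, 1, 1, 1, 0, 0, 1, 1],
})

# Separate the feature columns from the target column.
features = ["age", "fare", "is_female"]
X = df[features]
y = df["survived"]

# Fit a random forest; random_state makes the run repeatable.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Predicting on the training rows only shows the API;
# it says nothing about how well the model generalizes.
predictions = model.predict(X)
```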

The Kaggle tutorials should get you going and help build your confidence. After these, I recommend the scikit-learn quick start tutorial, which gives more of the bigger picture on scikit-learn and the concepts of machine learning. The scikit-learn documentation taught me so much, and I highly recommend it. The scikit-learn website also contains lots of code examples, although they can seem a bit complex at first.
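
As a taste of the bigger picture, most scikit-learn models follow the same pattern: create an estimator, call fit on training data, then predict or score on data the model has not seen. Here is a minimal sketch of that pattern, assuming the iris dataset bundled with scikit-learn and a support vector classifier with default settings. The choice of dataset and model is mine for illustration and is not necessarily what the quick start tutorial uses.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load a small dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows to evaluate on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Every estimator exposes the same fit / predict / score interface.
clf = SVC()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of correct test predictions

print(f"Held-out accuracy: {accuracy:.2f}")
```

Because the interface is shared, swapping SVC for another classifier usually means changing only the import and the line that creates the estimator.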

Where to go from here

If you really sit down and work through these tutorials, you will be ready to try some more examples on your own. I recommend checking out a couple of the more straightforward Kaggle competitions, such as Give Me Some Credit. Machine learning is a craft, and it takes practice to get good at it, but the payoff can be huge.
