Will Stanton's Data Science Blog

Getting started with machine learning with Python and scikit-learn

June 1, 2016 by Will Stanton

Why Python?

For most people, I recommend getting started with R, because its tools for exploratory data analysis and visualization are easier to use and more comprehensive than Python's. However, if you have a computer science background, or if you want to jump on the fast track to high-performance machine learning, then you might want to start with Python. Python is an awesome programming language: it is easy to write, readable, and well documented, and it can be very fast if you lean on vectorized libraries such as numpy instead of plain Python loops. It also has extensive libraries for scientific computing, statistics, and machine learning.

The Python scientific stack

Python has a comprehensive, integrated scientific computing stack that offers an incredible combination of performance, ease of use, and depth. It is made up of several libraries and utilities, including:

  • numpy: Fast and easy array computations and manipulations. Includes “broadcasting” and “fancy” indexing, which give Python arrays some of the simple syntax of R vectors.
  • scipy: Scientific computing functionality, including optimized matrix operations and data structures, numerical optimization, calculus functions, etc.
  • pandas: Data frames in Python. A little more complicated than R data frames, but with much better performance (in practice, more like R's data.table than its built-in data frames).
  • scikit-learn: Machine learning in Python. Fast and easy to use.
  • Jupyter (previously IPython Notebook): A browser-based notebook for scientific computing in Python and other languages.
  • Matplotlib: A nice plotting library. Many people use Seaborn as a user-friendly alternative.
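
To make "broadcasting" and "fancy" indexing from the numpy bullet a little more concrete, here is a minimal sketch. The array names and values are made up for illustration; only the numpy calls themselves are real.

```python
import numpy as np

# Made-up values, purely for illustration.
heights_cm = np.array([[150.0, 160.0, 170.0],
                       [180.0, 190.0, 200.0]])

# Broadcasting: a scalar is applied to every element of the array.
heights_m = heights_cm / 100.0

# Broadcasting: a length-3 row is applied to each of the two rows.
column_offsets = np.array([0.0, 1.0, 2.0])
shifted = heights_cm + column_offsets

# "Fancy" indexing: select elements with a boolean mask...
values = np.array([10, 20, 30, 40, 50])
large = values[values > 25]

# ...or with a list of integer positions.
picked = values[[0, 2, 4]]

print(shifted)
print(large, picked)
```

If you are coming from R, the boolean mask should feel familiar: it reads much like `values[values > 25]` on an R vector.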

A couple of quick tutorials to get started

Kaggle has a great tutorial series on getting started with Python. It takes you through the basics of loading, manipulating, and transforming data, and then building a random forest machine learning model. It uses the Titanic survival dataset to walk you through all of these skills in a practical case study. It is best to start at Part I, although you can probably skip it if you are impatient.

  • Getting Started with Python Part I: Covers basic CSV loading and numpy arrays
  • Getting Started with Python Part II: Introduces the extremely useful pandas data frame concept
  • Getting Started with Random Forests: Introduces the scikit-learn library and shows you how to build a basic machine learning model
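
As a rough preview of where these tutorials end up, here is a minimal sketch of the same workflow: put the data in a pandas data frame, clean it up, and fit a random forest with scikit-learn. This is not the Kaggle tutorial code. The table below is a tiny made-up stand-in with Titanic-like column names, and the `IsFemale` column and the choice of features are my own assumptions for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data with Titanic-like columns; NOT the real dataset.
passengers = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3, 1, 2, 3, 1],
    "Sex":      ["female", "male", "female", "male",
                 "male", "female", "male", "female"],
    "Age":      [29.0, None, 35.0, 22.0, 54.0, None, 19.0, 41.0],
    "Survived": [1, 0, 1, 0, 0, 1, 0, 1],
})

# Manipulate: fill missing ages with the median of the known ages.
passengers["Age"] = passengers["Age"].fillna(passengers["Age"].median())

# Transform: encode the text column as 0/1 so the model can use it.
passengers["IsFemale"] = (passengers["Sex"] == "female").astype(int)

features = passengers[["Pclass", "Age", "IsFemale"]]
target = passengers["Survived"]

# Build the model: a random forest of 100 trees, seeded for repeatability.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(features, target)

# Predict on the training rows, just to show the call; a real project
# would predict on data the model has not seen.
predictions = model.predict(features)
print(predictions)
```

With the real data you would load the table with `pd.read_csv` instead of typing it in by hand; the rest of the steps keep the same shape.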

The Kaggle tutorials should get you going and start building your confidence. After these, I recommend the scikit-learn quick start tutorial, which gives more of the bigger picture on scikit-learn and the concepts of machine learning. The scikit-learn documentation taught me so much, and I highly recommend it. The scikit-learn website also contains lots of code examples, although they can seem a bit complex at first.
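
One piece of that bigger picture is that scikit-learn models share the same small interface: you `fit` on training data, then `predict` or `score` on data the model has not seen. Here is a minimal sketch using the iris dataset that ships with scikit-learn. The choice of logistic regression, the 25% test split, and the variable names are assumptions for illustration, and the import paths assume a reasonably recent scikit-learn release.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: 150 flowers, 4 measurements each.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows so we can check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Every scikit-learn estimator follows the same fit / predict pattern.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# score() reports accuracy on the held-out rows for classifiers.
test_accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.2f}")
```

Swapping in a different model, such as a random forest, should only require changing the line that constructs `model`.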

Where to go from here

If you really sit down and work through these tutorials, you will be ready to try some more examples on your own. I recommend checking out a couple of the more straightforward Kaggle competitions, such as Give Me Some Credit. Machine learning is a craft, and it takes practice to get good at it, but the payoff can be huge.
