“Python is too slow!”
Python has an undeserved reputation among software engineers as being “slow” or “not for production loads.” This is mostly a misconception: Python can be extremely fast and robust, but it doesn’t happen automatically. The reputation isn’t entirely baseless, though. Python is dynamically typed and interpreted, and doing anything multithreaded tends to be hard.
Python comes “batteries included,” so there are lots of built-in functions that make it easy to hack together a script to solve a problem. This is a double-edged sword: you can develop software quickly in Python, but you can also develop slow and unreliable software very quickly in Python.
Does it really need to be fast?
The most important thing you need to ask yourself is: Does this code really need to be fast? In a lot of cases, it actually does not. Donald Knuth famously said, “Premature optimization is the root of all evil.” He means that you should never optimize code unless there is a compelling reason to do so. It is almost always better to write working code first, even if it is not as fast as it could be.
The best thing about Python is that it is very simple to solve common problems with it. Most of the time, the simple, out-of-the-box Stack Overflow solution to a problem will work for you. Do not get caught in the trap of overengineering code before you need to! Now, on to the optimization methods.
Make Use of Optimized Functions
The reference implementation of Python, CPython, is written in the (very fast) programming language C. This means that the precompiled parts of Python can be incredibly fast. For example, with scipy, you can take advantage of extremely fast matrix decomposition algorithms from LAPACK. These are highly optimized and run compiled Fortran and C under the covers. For functions like these, calling them from Python is about as fast as calling them from any other language. Just make sure to use the right BLAS and LAPACK builds; the Anaconda distribution of Python sets all of this up for you by default.
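Here’s a minimal sketch of the idea (the matrix sizes are made up for illustration): solving a linear system through scipy.linalg dispatches to LAPACK routines, so the heavy numerical work happens in compiled code, not in the interpreter.

    import numpy as np
    from scipy import linalg

    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 1000))
    b = rng.standard_normal(1000)

    # lu_factor/lu_solve call LAPACK's getrf/getrs under the covers;
    # the interpreter only orchestrates, the decomposition runs in
    # compiled code.
    lu, piv = linalg.lu_factor(A)
    x = linalg.lu_solve((lu, piv), b)

    print(np.allclose(A @ x, b))  # True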
Be Orthodox
Since Python is “batteries included,” there is a package for everything. Make sure you read the documentation for the package and do things the orthodox way when you can; the orthodox way is typically fast. For example, if you are doing computations on numpy arrays, use vectorized functions instead of loops. More generally, if you are writing lots of code to do a common operation, it is worth checking whether there is a better way. Recently, I had to compute the pairwise distances between a bunch of vectors. You could do this with a loop, but I realized I was really computing an outer product, which is something numpy can do natively. The solution was literally one line of code, and it was pretty darn fast.
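Here’s a rough sketch of the same idea (the data is invented, and this isn’t the exact one-liner from my anecdote): broadcasting lets numpy handle all the pairwise differences at once inside its compiled loops, instead of looping over pairs in Python.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 3))  # 200 vectors in R^3

    def pairwise_loop(X):
        # The unorthodox way: an explicit Python loop over every pair.
        n = len(X)
        D = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                D[i, j] = np.linalg.norm(X[i] - X[j])
        return D

    def pairwise_vectorized(X):
        # The orthodox way: broadcasting computes all pairwise
        # differences in one shot.
        diff = X[:, None, :] - X[None, :, :]  # shape (n, n, 3)
        return np.sqrt((diff ** 2).sum(axis=-1))

    print(np.allclose(pairwise_loop(X), pairwise_vectorized(X)))  # True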
Use Appropriate Data Structures
This is true in any language, but you should always check whether you are using the right data structure. Python has plenty of data structures, both built in and in the standard library, and it is worth checking whether one is already suited to your needs. For example, I once wrote a function that checked millions of items for membership in a list. Then I realized I could use a set instead of a list, since the order of the items was not important. After that, I figured out that I could use a frozenset instead of a set, since the collection rarely changed. These changes alone improved the performance of the function by an order of magnitude.
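Here’s a minimal sketch of the list-versus-set difference (the sizes and values are made up): membership tests on a list scan element by element, while a set hashes straight to the answer.

    import timeit

    items = list(range(1_000_000))
    as_set = set(items)
    needle = 999_999  # worst case for the list: it sits at the very end

    # Membership in a list is O(n); in a set it is O(1) on average.
    print(timeit.timeit(lambda: needle in items, number=100))
    print(timeit.timeit(lambda: needle in as_set, number=100))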
Use Cython
In some cases, you can rewrite the hot functions in your script in Cython, a superset of Python that compiles down to C. C is an incredibly fast language that allows you fine-grained control of things like memory management, and in some cases you can take advantage of this to speed up parts of your Python script by a huge amount.
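Here’s a minimal sketch using Cython’s “pure Python” mode (the function is invented for illustration): the cython type annotations let the compiler generate a tight C loop, and the file still runs as ordinary Python if you never compile it.

    # fib.py -- compile with `cythonize -i fib.py` (requires Cython)
    import cython

    def fib(n: cython.int) -> cython.long:
        # With C-typed locals, Cython turns this loop into plain C
        # arithmetic instead of interpreted bytecode.
        a: cython.long = 0
        b: cython.long = 1
        i: cython.int
        for i in range(n):
            a, b = b, a + b
        return a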
Make It Parallel
In some cases, a problem can be solved in a parallel way, and you might be able to use a tool like IPython Parallel to speed things up. You can also “manually parallelize” by opening multiple screen sessions or virtual machines to run different parts of a script. (This is more of a hack and is only suitable for manually executed batch jobs.)
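Here’s a sketch of what that looks like with IPython Parallel (assuming you’ve already started a local cluster with `ipcluster start -n 4`; the task function is invented):

    import ipyparallel as ipp

    rc = ipp.Client()              # connect to the running cluster
    view = rc.load_balanced_view()

    def slow_square(x):
        import time
        time.sleep(1)              # stand-in for real work
        return x * x

    # Eight one-second tasks finish in roughly two seconds on four engines.
    results = view.map_sync(slow_square, range(8))
    print(results)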
Python also has facilities for multiprocessing and multithreading. These sound like the same thing, but they are not: a process may be made up of multiple threads. Multiple threads within a single process share the same memory space, so threads can share objects; two different processes cannot share the same memory space. There is a great discussion of this on Stack Overflow.
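To make the distinction concrete, here’s a minimal sketch (the counter is invented for illustration): all of the threads below mutate the same dictionary, which separate processes could not do without explicit shared memory or message passing.

    import threading

    counter = {"n": 0}
    lock = threading.Lock()

    def work():
        for _ in range(100_000):
            with lock:  # the dict is shared, so guard the updates
                counter["n"] += 1

    threads = [threading.Thread(target=work) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(counter["n"])  # 400000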
Python is infamous for its Global Interpreter Lock, also known as the GIL. (I was asked about it in an interview once, and I totally blew it!) The GIL prevents more than one thread from executing Python bytecode at a time, and it is the reason that multithreading in Python is difficult. The GIL exists because of the way that CPython manages memory; in fact, it does not exist in Jython or IronPython, two implementations of Python that are not written in C. Of course, those implementations lack some of the other features and packages of CPython. You do not need to understand much about the GIL. The only thing you really need to understand is that multithreading is usually hard in Python, but multiprocessing usually is not. In fact, multiprocessing is pretty easy.
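Here’s how easy it can be, as a minimal sketch (the worker function is made up): each worker in the pool is a separate process with its own interpreter, and therefore its own GIL, so the work really does run on multiple cores.

    from multiprocessing import Pool

    def square(x):
        return x * x

    if __name__ == "__main__":
        # Four worker processes, each with its own interpreter and GIL.
        with Pool(processes=4) as pool:
            results = pool.map(square, range(10))
        print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]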
For example, when training scikit-learn models, you can fit the models for each cross-validation fold in separate worker processes. This can save a lot of time.
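As a sketch (the dataset and model are chosen just for illustration), scikit-learn exposes this through the n_jobs parameter, which hands the folds to parallel worker processes:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)

    # n_jobs=-1 lets joblib fit the five folds in parallel,
    # one worker process per available core.
    scores = cross_val_score(clf, X, y, cv=5, n_jobs=-1)
    print(scores)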
Should you use Python when things need to be fast?
Some people might read this and think, “If you have to do all of this stuff to make Python fast, shouldn’t you just use another programming language?” This is a reasonable question, but I think it misses what makes Python great. Python is almost always fast enough, and it is remarkably fast and easy to develop applications in. It also has the most extensive collection of data science packages outside of R. Most of the time, you do not need to write fast code; you need to write code fast.