The high-level nature of Python makes it very easy to program, read, and reason about code. Many programmers report being more productive in Python. For example, Robert Kern once told me that "Python gets out of my way" when I asked him why he likes Python. Others express it as "Python fits your brain." My experience resonates with both of these comments.
It is not rare, however, to need to do many calculations over a lot of data. No matter how fast computers get, there will always be cases where you still need the code to be as fast as you can get it. In those cases, I first reach for NumPy which provides high-level expressions of fast low-level calculations over large arrays. With NumPy's rich slicing and broadcasting capabilities, as well as its full suite of vectorized calculation routines, I can quite often do the number crunching I am trying to do with very little effort.
Even with NumPy's fast vectorized calculations, however, there are still times when either the vectorization is too complex, or it uses too much memory. It is also sometimes just easier to express the calculation with a simple loop. For those parts of the application, there are two general approaches that work really well to get you back to compiled speeds: weave or Cython.
Weave is a sub-package of SciPy and allows you to inline arbitrary C or C++ code into an extension module that is dynamically loaded into Python and executed in-line with the rest of your Python code. The code is compiled and linked at run-time the very first time the code is executed. The compiled code is then cached on-disk and made available for immediate later use if it is called again.
Cython is an extension-module generator for Python that allows you to write Python-looking code (Python syntax with type declarations) that is then pre-compiled to an extension module for later dynamic linking into the Python run-time. Cython translates Python-looking code into "not-for-human-eyes" C-code that compiles to reasonably fast C-code. Cython has been gaining a lot of momentum in recent years as people who have never learned C, can use Cython to get C-speeds exactly where they want them starting from working Python code. Even though I feel quite comfortable in C, my appreciation for Cython has been growing over the past few years, and I know am an avid supporter of the Cython community and like to help it whenever I can.
Recently I re-did the same example that Prabhu Ramachandran first created several years ago which is reported here. This example solves Laplace's equation over a 2-d rectangular grid using a simple iterative method. The code finds a two-dimensional function, u, where ∇2 u = 0, given some fixed boundary conditions.
The pure Python solution is the following:
This code takes a very long time to run in order to converge to the correct solution. For a 100x100 grid, visually-indistinguishable convergence occurs after about 8000 iterations. The pure Python solution took an estimated 560 seconds (9 minutes) to finish (using IPython's %timeit magic command).
Using NumPy, we can speed this code up significantly by using slicing and vectorized (automatic looping) calculations that replace the explicit loops in the Python-only solution. The NumPy update code is:
Using num_update as the calculation function reduced the time for 8000 iterations on a 100x100 grid to only 2.24 seconds (a 250x speed-up). Such speed-ups are not uncommon when using NumPy to replace Python loops where the inner loop is doing simple math on basic data-types.
Quite often it is sufficient to stop there and move on to another part of the code-base. Even though you might be able to speed up this section of code more, it may not be the critical path anymore in your over-all problem. Programmer effort should be spent where more benefit will be obtained. Occasionally, however, it is essential to speed-up even this kind of code.
Even though NumPy implements the calculations at compiled speeds, it is possible to get even faster code. This is mostly because NumPy needs to create temporary arrays to hold intermediate simple calculations in expressions like the average of adjacent cells shown above. If you were to implement such a calculation in C/C++ or Fortran, you would likely create a single loop with no intermediate temporary memory allocations and perform a more complex computation at each iteration of the loop.
In order to get an optimized version of the update function, we need a machine-code implementation that Python can call. Of course, we could do this manually by writing the inner call in a compilable language and using Python's extension facilities. More simply, we can use Cython and Weave which do most of the heavy lifting for us.
Cython is an extension-module writing language that looks a lot like Python except for optional type declarations for variables. These type declarations allow the Cython compiler to replace generic, highly dynamic Python code with specific and very fast compiled code that is then able to be loaded into the Python run-time dynamically. Here is the Cython code for the update function:
This code looks very similar to the original Python-only implementation except for the additional type-declarations. Notice that even NumPy arrays can be declared with Cython and Cython will correctly translate Python element selection into fast memory-access macros in the generated C code. When this function was used for each iteration in the inner calculation loop, the 8000 iterations on a 100x100 grid took only 1.28 seconds.
For completeness, the following shows the contents of the setup.py file that was also created in order to produce a compiled-module where the cy_update function lived.
The extension module was then built using the command: python setup.py build_ext --inplace
An older, but still useful, approach to speeding up code is to use weave to directly embed a C or C++ implementation of the algorithm into the Python program directly. Weave is a module that surrounds the bit of C or C++ code that you write with a template to on-the-fly create an extension module that is compiled and then dynamically loaded into the Python run-time. Weave has a caching mechanism so that different strings or different types of inputs lead to a new extension module being created, compiled, and loaded. The first time code using weave runs, the compilation has to take place. Subsequent runs of the same code will load the cached extension module and run the machine code.
For this particular case, an update routine using weave looks like:
The inline function takes a string of C or C++ code plus a list of variable names that will be pushed from the Python namespace into the compiled code. The inline function takes this code and the list of variables and either loads and executes a function in a previously-created extension module (if the string and types of the variables have been previously created) or else creates a new extension module before compiling, loading, and executing the code.
Notice that weave defines special macros so that U2 allows referencing the elements of the 2-d array u using simple expressions. Weave also defines the special C-array of integers Nu to contain the shape of the u array. There are also special macros similarly defined to access the elements of array u if it would have been a 1-, 3-, or 4-dimensional array (U1, U3, and U4). Although not used in this snippet of code, the C-array Su containing the strides in each dimension and the integer Du defining the number of dimensions of the array are both also defined.
Using the weave_update function, 8000 iterations on a 100x100 grid took only 1.02 seconds. This was the fastest implementation of all of the methods used. Knowing a little C and having a compiler on hand can certainly speed up critical sections of code in a big way.
After I originally published this post, I received some great feedback in the Comments section that encouraged me to add some parameters to the Cython solution in order to get an even faster solution. I was also reminded about pyximport and given example code to make it work more easily. Basically by adding some compiler directives to Cython to avoid some checks at each iteration of the loop, Cython generated even faster C-code. To the top of my previous Cython code, I added a few lines:
I then saved this new file as _laplace.pyx, and added the following lines to the top of the Python file that was running the examples:
This provided an update function cy_update2 that resulted in the very fastest implementation (943 ms) for 8000 iterations of a 100x100 grid.
The following table summarizes the results which were all obtained on a 2.66 Ghz Intel Core i7 MacBook Pro with 8GB of 1067 Mhz DDR3 Memory. The relative speed column shows the speed relative to the NumPy implementation.
Clearly when it comes to doing a lot of heavy number crunching, Pure Python is not really an option. However, perhaps somewhat surprisingly, NumPy can get you most of the way to compiled speeds through vectorization. In situations where you still need the last ounce of speed in a critical section, or when it either requires a PhD in NumPy-ology to vectorize the solution or it results in too much memory overhead, you can reach for Cython or Weave. If you already know C/C++, then weave is a simple and speedy solution. If, however, you are not already familiar with C then you may find Cython to be exactly what you are looking for to get the speed you need out of Python.
It is not rare, however, to need to do many calculations over a lot of data. No matter how fast computers get, there will always be cases where you still need the code to be as fast as you can get it. In those cases, I first reach for NumPy which provides high-level expressions of fast low-level calculations over large arrays. With NumPy's rich slicing and broadcasting capabilities, as well as its full suite of vectorized calculation routines, I can quite often do the number crunching I am trying to do with very little effort.
Even with NumPy's fast vectorized calculations, however, there are still times when either the vectorization is too complex, or it uses too much memory. It is also sometimes just easier to express the calculation with a simple loop. For those parts of the application, there are two general approaches that work really well to get you back to compiled speeds: weave or Cython.
Weave is a sub-package of SciPy and allows you to inline arbitrary C or C++ code into an extension module that is dynamically loaded into Python and executed in-line with the rest of your Python code. The code is compiled and linked at run-time the very first time the code is executed. The compiled code is then cached on-disk and made available for immediate later use if it is called again.
Cython is an extension-module generator for Python that allows you to write Python-looking code (Python syntax with type declarations) that is then pre-compiled to an extension module for later dynamic linking into the Python run-time. Cython translates Python-looking code into "not-for-human-eyes" C-code that compiles to reasonably fast C-code. Cython has been gaining a lot of momentum in recent years as people who have never learned C, can use Cython to get C-speeds exactly where they want them starting from working Python code. Even though I feel quite comfortable in C, my appreciation for Cython has been growing over the past few years, and I know am an avid supporter of the Cython community and like to help it whenever I can.
Recently I re-did the same example that Prabhu Ramachandran first created several years ago which is reported here. This example solves Laplace's equation over a 2-d rectangular grid using a simple iterative method. The code finds a two-dimensional function, u, where ∇2 u = 0, given some fixed boundary conditions.
Pure Python Solution
The pure Python solution is the following:
from numpy import zeros from scipy import weave dx = 0.1 dy = 0.1 dx2 = dx*dx dy2 = dy*dy def py_update(u): nx, ny = u.shape for i in xrange(1,nx-1): for j in xrange(1, ny-1): u[i,j] = ((u[i+1, j] + u[i-1, j]) * dy2 + (u[i, j+1] + u[i, j-1]) * dx2) / (2*(dx2+dy2)) def calc(N, Niter=100, func=py_update, args=()): u = zeros([N, N]) u[0] = 1 for i in range(Niter): func(u,*args) return u
This code takes a very long time to run in order to converge to the correct solution. For a 100x100 grid, visually-indistinguishable convergence occurs after about 8000 iterations. The pure Python solution took an estimated 560 seconds (9 minutes) to finish (using IPython's %timeit magic command).
NumPy Solution
Using NumPy, we can speed this code up significantly by using slicing and vectorized (automatic looping) calculations that replace the explicit loops in the Python-only solution. The NumPy update code is:
def num_update(u): u[1:-1,1:-1] = ((u[2:,1:-1]+u[:-2,1:-1])*dy2 + (u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))
Using num_update as the calculation function reduced the time for 8000 iterations on a 100x100 grid to only 2.24 seconds (a 250x speed-up). Such speed-ups are not uncommon when using NumPy to replace Python loops where the inner loop is doing simple math on basic data-types.
Quite often it is sufficient to stop there and move on to another part of the code-base. Even though you might be able to speed up this section of code more, it may not be the critical path anymore in your over-all problem. Programmer effort should be spent where more benefit will be obtained. Occasionally, however, it is essential to speed-up even this kind of code.
Even though NumPy implements the calculations at compiled speeds, it is possible to get even faster code. This is mostly because NumPy needs to create temporary arrays to hold intermediate simple calculations in expressions like the average of adjacent cells shown above. If you were to implement such a calculation in C/C++ or Fortran, you would likely create a single loop with no intermediate temporary memory allocations and perform a more complex computation at each iteration of the loop.
In order to get an optimized version of the update function, we need a machine-code implementation that Python can call. Of course, we could do this manually by writing the inner call in a compilable language and using Python's extension facilities. More simply, we can use Cython and Weave which do most of the heavy lifting for us.
Cython solution
Cython is an extension-module writing language that looks a lot like Python except for optional type declarations for variables. These type declarations allow the Cython compiler to replace generic, highly dynamic Python code with specific and very fast compiled code that is then able to be loaded into the Python run-time dynamically. Here is the Cython code for the update function:
cimport numpy as np def cy_update(np.ndarray[double, ndim=2] u, double dx2, double dy2): cdef unsigned int i, j for i in xrange(1,u.shape[0]-1): for j in xrange(1, u.shape[1]-1): u[i,j] = ((u[i+1, j] + u[i-1, j]) * dy2 + (u[i, j+1] + u[i, j-1]) * dx2) / (2*(dx2+dy2))
This code looks very similar to the original Python-only implementation except for the additional type-declarations. Notice that even NumPy arrays can be declared with Cython and Cython will correctly translate Python element selection into fast memory-access macros in the generated C code. When this function was used for each iteration in the inner calculation loop, the 8000 iterations on a 100x100 grid took only 1.28 seconds.
For completeness, the following shows the contents of the setup.py file that was also created in order to produce a compiled-module where the cy_update function lived.
from distutils.core import setup from distutils.extension import Extension from Cython.Distutils import build_ext import numpy ext = Extension("laplace", ["laplace.pyx"], include_dirs = [numpy.get_include()]) setup(ext_modules=[ext], cmdclass = {'build_ext': build_ext})
The extension module was then built using the command: python setup.py build_ext --inplace
Weave solution
An older, but still useful, approach to speeding up code is to use weave to directly embed a C or C++ implementation of the algorithm into the Python program directly. Weave is a module that surrounds the bit of C or C++ code that you write with a template to on-the-fly create an extension module that is compiled and then dynamically loaded into the Python run-time. Weave has a caching mechanism so that different strings or different types of inputs lead to a new extension module being created, compiled, and loaded. The first time code using weave runs, the compilation has to take place. Subsequent runs of the same code will load the cached extension module and run the machine code.
For this particular case, an update routine using weave looks like:
def weave_update(u): code = """ int i, j; for (i=1; i<Nu[0]-1; i++) { for (j=1; j<Nu[1]-1; j++) { U2(i,j) = ((U2(i+1, j) + U2(i-1, j))*dy2 + \ (U2(i, j+1) + U2(i, j-1))*dx2) / (2*(dx2+dy2)); } } """ weave.inline(code, ['u', 'dx2', 'dy2'])
The inline function takes a string of C or C++ code plus a list of variable names that will be pushed from the Python namespace into the compiled code. The inline function takes this code and the list of variables and either loads and executes a function in a previously-created extension module (if the string and types of the variables have been previously created) or else creates a new extension module before compiling, loading, and executing the code.
Notice that weave defines special macros so that U2 allows referencing the elements of the 2-d array u using simple expressions. Weave also defines the special C-array of integers Nu to contain the shape of the u array. There are also special macros similarly defined to access the elements of array u if it would have been a 1-, 3-, or 4-dimensional array (U1, U3, and U4). Although not used in this snippet of code, the C-array Su containing the strides in each dimension and the integer Du defining the number of dimensions of the array are both also defined.
Using the weave_update function, 8000 iterations on a 100x100 grid took only 1.02 seconds. This was the fastest implementation of all of the methods used. Knowing a little C and having a compiler on hand can certainly speed up critical sections of code in a big way.
Faster Cython solution (Update)
After I originally published this post, I received some great feedback in the Comments section that encouraged me to add some parameters to the Cython solution in order to get an even faster solution. I was also reminded about pyximport and given example code to make it work more easily. Basically by adding some compiler directives to Cython to avoid some checks at each iteration of the loop, Cython generated even faster C-code. To the top of my previous Cython code, I added a few lines:
#cython: boundscheck=False #cython: wraparound=False
I then saved this new file as _laplace.pyx, and added the following lines to the top of the Python file that was running the examples:
import pyximport import numpy as np pyximport.install(setup_args={'include_dirs':[np.get_include()]}) from _laplace import cy_update as cy_update2
This provided an update function cy_update2 that resulted in the very fastest implementation (943 ms) for 8000 iterations of a 100x100 grid.
Summary
The following table summarizes the results which were all obtained on a 2.66 Ghz Intel Core i7 MacBook Pro with 8GB of 1067 Mhz DDR3 Memory. The relative speed column shows the speed relative to the NumPy implementation.
Method | Time (sec) | Relative Speed |
---|---|---|
Pure Python | 560 | 250 |
NumPy | 2.24 | 1 |
Cython | 1.28 | 0.57 |
Weave | 1.02 | 0.45 |
Faster Cython | 0.94 | 0.42 |
Clearly when it comes to doing a lot of heavy number crunching, Pure Python is not really an option. However, perhaps somewhat surprisingly, NumPy can get you most of the way to compiled speeds through vectorization. In situations where you still need the last ounce of speed in a critical section, or when it either requires a PhD in NumPy-ology to vectorize the solution or it results in too much memory overhead, you can reach for Cython or Weave. If you already know C/C++, then weave is a simple and speedy solution. If, however, you are not already familiar with C then you may find Cython to be exactly what you are looking for to get the speed you need out of Python.