Sunday, October 16, 2011

Thoughts on porting NumPy to PyPy

Last weekend, I attended GitHub's PyCodeConf in Miami Florida and had the opportunity to give a talk on array-oriented computing and Python.  I would like to thank Tom and Chris (GitHub founders) for allowing me to come speak.  I enjoyed my time there, but I have to admit I felt old and a bit out of place.   There were a lot of young people there who understood a lot more about making web-pages and working at cool web start-ups than solving partial differential equations with arrays.  Fortunately, Dave Beazley and Raymond Hettinger were there so I didn't feel completely ancient.   In addition, Wes McKinney and Peter Wang helped me represent the NumPy community.

At the conference I was reminded of PyPy's recent success at showing the speedups some said weren't possible from a dynamic language like Python --- speed-ups which make it possible to achieve C-like speeds using Python constructs.   This is very nice as it illustrates again that high-level languages can be compiled to low-level speeds.    I also became keenly aware of the enthusiasm that has cropped up in porting NumPy to PyPy.   I am happy for this enthusiasm as it illustrates the popularity of NumPy which pleases me.   On the other hand, in every discussion I have heard or read about this effort, I'm not convinced that anyone excited about this porting effort actually understands the complexity of what they are trying to do nor the dangers that it could create in splitting the small community of developers who are regularly contributing to NumPy and SciPy and causing confusion for the user base of Python in Science.

I'm hopeful that I can provide some perspective.   Before I do this, however, I want to congratulate the PyPy team and emphasize that I have the utmost respect for the PyPy developers and what they have achieved.   I am also a true-believer in the ability for high-level languages to achieve faster-than-C speeds.   In fact, I'm not satisfied with a Python JIT.  I want the NumPy constructs such as vectorization, fancy indexing, and reduction to be JIT compiled.    I also think that there are use-cases of NumPy all by itself that makes it somewhat interesting to do a NumPy-in-PyPy port.    I would also welcome the potential things that can be learned about how to improve NumPy that would come out of trying to write a version of NumPy in RPython.

However, to avoid detracting from the overall success of Python in Science, Statistics, and Data Analysis,  I think it is important that 3 things are completely clear to people interested in the NumPy on PyPy idea. 
  1. NumPy is just the beginning (SciPy, matplotlib, scikits, and 100s of other packages and legacy C/C++ and Fortran code are all very important)
  2. NumPy should be a lot faster than it is currently.
  3. NumPy has an ambitious roadmap and will be moving forward rather quickly over the coming years.  

    NumPy is just the beginning 

    Most of the people who use NumPy use it as an entry-point to the entire ecosystem of Scientific Packages available for Python.    This ecosystem is huge.   There are at least 1 million unique visitors to the http://www.scipy.org site every year and that is just an entry point to the very large and diverse community of technical computing users who rely on Python. 

    Most of the scientists and engineers who have come to Python over the past years have done so because it is so easy to integrate their legacy C/C++ and Fortran code into Python.   National laboratories, large oil companies, large banks and many other Fortune 50 companies all must integrate their code into Python in order for Python to be part of their story.   NumPy is part of the answer  that helps them seamlessly view large amounts of data as arrays in Python or as arrays in another compiled language without the non-starter of copying the memory back and forth. 

    Once the port of NumPy to PyPy has finished, are you going to port SciPy?  Are you going to port matplotlib?  Are you going to port scikits.learn, or scikits.statsmodels?   What about Sage?   Most of these rely on not just the Python C-API but also the NumPy C-API which you would have to have a story for to make a serious technical user of Python get excited about a NumPy port to PyPy.  

    To me it is much easier to think about taking the ideas of PyPy and pulling them into the Scientific Python ecosystem then going the other way around.    It's not to say that there isn't some value in re-writing NumPy in PyPy, it just shouldn't be over-sold and those who fund it should understand what they aren't getting in the transaction.  

    C-speed is the wrong target

    Several examples including my own previous blog post has shown that vectorized Fortran 90 can be 4-10 times faster than NumPy.   Thus, we know there is room for improvement even on current single-core machines.   This doesn't even take into account the optimizations that should be possible for multiple-cores, GPUs and even FPGAs all of which are in use today but are not being utilized to the degree they should be.   NumPy needs to adapt to make use of this kind of hardware and will adapt in time.

    NumPy will be evolving rapidly over the coming years

    The pace of NumPy development has leveled off in recent years, but this year has created a firestorm of new ideas that will be coming to fruition over the next 1-2 years and NumPy will be evolving fairly rapidly during that time.    I am committed to making this happen and will be working very hard in 2012 on the code-base itself to realize some of the ideas that have emerged.   Some of this work will require some re-factoring and re-writing as well.   I would honestly rather collaborate with PyPy than compete, but my constraints are that I care very much about backward compatibility and very much about the entire SciPy ecosystem.    I sacrificed a year of my life in 1999 (delaying my PhD graduation by at least 6-12 months) bringing SciPy to life.    I sacrificed my tenure-track position in academia bringing NumPy to life in 2005.   Constraints of keeping my family fed, clothed, and housed seem to keep me on this 6-7 year sabbatical-like cycle for SciPy/NumPy but it looks like next year I will finally be in a position to spend substantial time and take the next steps with NumPy to help it progress to the next stage.

    Some of the ideas that will be implemented include:
    • integration of non-contiguous memory chunks into the NumPy array structure (generalization of strides)
    • addition of labels to axes and dimensions (generalization of shape)
    • derived fields, enumerated data-types, reference data-types, and indices for structured arrays
    • improvements to the data-type infrastructure to make it easier to add new data-types
    • improvements to the calculation infrastructure (iterators and fast general looping constructs)
    • fancy-indexing as views
    • integration of Pandas group-by features
    • missing data bit-patterns
    • distributed arrays 
    Over conversations with many people this year, more ideas than room to talk about them have emerged and I am excited to start seeing these ideas come to fruition to make NumPy and Python the best solution for data-analysis.    Beginning next year, I will be pushing hard for their introduction into the NumPy/SciPy ecosystem --- with a careful eye on backward compatibility which has long been one of NumPy's strengths.

    A way forward

    I would love to see more scientific code written at a high-level without sacrificing run-time performance.   The high-level intent allows for the creation of faster machine code than lower-level translations of the intent often does. I know this is possible and I intend to do everything I can professionally to see that happen (but from within the context of the entire SciPy ecosystem). As this work emerges, I will encourage PyPy developers to join us using the hard-won knowledge and tools they have created.

    Even if PyPy continues as a separate eco-system, then there are points of collaboration that will benefit both groups. One of these is to continue the effort Microsoft initially funded to separate the C-parts of NumPy away from the CPython interface to NumPy. This work is now in a separate branch that has diverged from the main NumPy branch and needs to be re-visited. If people interested in NumPy on PyPy spent time on improving this refactoring into basically a NumPy C-library, then PyPy can call this independent library using its methods for making native calls just as CPython can call it using its extension approach. Then IronPython, Jython (and for that matter Ruby, Javascript, etc.) can all call the C-library and leverage the code. There is some effort to do this and it's not trivial. Perhaps, there is even a way for PyPy to generate C-libraries from Python source code --- now that would be an interesting way to collaborate.

    The second way forward is for PyPy to interact better with the Cython community. Support in PyPy for Cython extension modules would be a first step. There is wide agreement among NumPy developers that more of NumPy should be written at a high-level (probably using Cython). Cython already is used to implement many, many extension modules for the Sage project. William Stein's valiant efforts in that community have made Cython the de-facto standard for how most scientists and engineers are writing extensions modules for Python these days. This is a good thing for efforts like PyPy because it adds a layer of indirection that allows PyPy to make a Cython backend and avoid the Python C-API.  

    I was quite astonished that Cython never came up in the panel discussion at the last PyCon when representatives from CPython, PyPy, IronPython, and Jython all talked about the Python VMs.   To me that oversite was very troublesome.   I was left doubting the PyPy community after Cython was not mentioned at all --- even when the discussion about how to manage extensions to the language came up during the panel discussion.    It shows that pure Python developers on all fronts have lost sight of what the scientific Python community is doing.   This is dangerous. I encourage Python developers to come to a SciPy conference and take a peek at what is going on.   I hope to be able to contribute more to the discussion as well. 

    If you are a Python developer and want to extend an olive leaf, then put a matrix infix operator into the language.  It's way past time :-)

    19 comments:

    1. Hi Travis,

      I think it would definitely be beneficial for people with vested interests in both sides of this (Scientific and PyPy) to really sit down and try to figure out how we can best collaborate on this, because I don't think either one of us wants to have the issues of divergent codebases and communities.

      Alex

      ReplyDelete
    2. Are their any plans for numpy/scipy to use GPUs. Add OpenCL code backend to use all compute resources on a computer or server.

      ReplyDelete
    3. do you if people are cross shopping scientific python versus octave/scilab for their computational needs. matlab syntax is very different than the numpy syntax and infix operators will be welcomed addition to allow matlab/octave/scilab developers to seamlessly move to scientific python.

      ReplyDelete
    4. For SciPy on the GPU, check out AccelerEyes, http://accelereyes.com

      ReplyDelete
    5. Check out the discussion at http://news.ycombinator.com/item?id=3118620

      I feel guilty as being one of the "young people who understood a lot more about making web-pages and working at cool web start-ups than solving partial differential equations with arrays.", but I think there are plenty of students who are interested in scientific computation, even if only as a means to an end. Those people just aren't the ones who get the attention in media or at conferences.

      ReplyDelete
    6. Thank you for the thoughtful analysis of the situation, Travis. I've looked at the NumPy-on-PyPy effort with a mix of excitement and nervousness, excitement at the idea NumPy benefiting from a JIT (which is part of what we aim to do with Theano), nervousness at the prospects of community fragmentation and the same skepticism you mention about enthusiastic developers perhaps not quite grasping the scope of the problem. Working on Theano and diving on occasion into NumPy's internals has given me a great appreciation for just what a beast NumPy is, and with so much of the Python scientific computing ecosystem tied to CPython I'm not sure reimplementation is time well spent, particularly if NumPy is to become far more of a moving target than it has been for the past 2-3 years.

      I look forward to the coming era of sweeping change, and hope I can lend a hand. Speaking of which, there's a pull request that's been sitting there for about a month... ;)

      ReplyDelete
    7. I'd just like to point out, that in the last GSoC, there was a project to create a backend to Cython, that generates pure-python code with ctypes calls. In fact, it was created with Pypy in mind, but would benefit all kinds of interpreters, so Cython is not exactly forgotten.

      Unfortunately, the project didn't finish during the GSoC. I'm sure, that it will be finished some time in the future, though.

      ReplyDelete
    8. I think it is unrealistic to expect the (small) pypy team / community to contribute to the numpy C core! If the numpy community is concerned about fragmentation by ports to other implementations of Python then that is definitely an issue for *numpy* to address.

      Concerns like the numpy C API, wider package support (scipy etc) are genuine issues though of course.

      As Timo says, there is a work-in-progress to have cython support pypy. Hopefully this will address some of those issues. The pypy team can't do everything themselves though.

      Anyway, reassuring to hear that numpy continues to evolve and improve. Sad that the microsoft funded work may be lost if no-one picks it up though, it sounded like a great way forward.

      ReplyDelete
    9. Hi Travis, thanks for the detailed notes. I've linked this to the pypy-dev discussion:
      http://mail.python.org/pipermail/pypy-dev/2011-October/008601.html

      Ian.

      ReplyDelete
    10. Is there a PEP written for the infix operator? That is the first step to make if anyone wants to have it in cpython (and pypy for that matter).

      ReplyDelete
    11. Thanks Travis for this post. Here follows my point of view as a regular NumPy / SciPy consumer (for the scikit-learn project). As David, I am both very excited by the prospect of having a JIT for scientific computing but also afraid that having only a fraction of NumPy re-implemented in a PyPy compatible way (without most of the remainder packages from the SciPy ecosystem) would be pretty useless to us.

      IMHO one of the most important and tricky parts to get right for a PyPy re-implementation of Numpy (and maybe later SciPy) is the "{numpy|scipy}.linalg" package that is mostly backed by an optimized implementation of the blas / lapack API, either in Fortran or in C thanks to libraries like Atlas and MKL). In particular efficient and numerically stable routines implementing decomposition such as Cholesky, QR or SVD are critical for many numerical analysis tasks.

      For scikit-learn this is a very major service provided by the NumPy / SciPy dependency (besides the numpy.ndarray datastructure itself).

      Also eigen solvers / Singular value decomposition implementations for sparse matrices as provided by arpack in Scipy is also very important for us.

      Reimplementing an equivalent of libatlas and lapack in pure PyPy that is both as fast (with SSE2/3 vectorization) and numerically correct sounds like a daunting task. Are people really willing to embark this road?

      For the rest of the scikit, most of the compiled extensions are written in Cython that allow us to get the conciseness of a high level language and with the speed of C (and recently OpenMP for multicore parallelism).

      A PyPy <=> Cython integration would allow us to run most of the scikit-learn specific compiled extension on a PyPy runtime. That sounds like an interesting prospect although I am afraid that the underlying ctypes overhead when calling blas routines directly from cython would eat most of the speed up (in that case it might be beneficial to rewrite those few pieces of blas such as ddot of daxpy in Cython directly).

      ReplyDelete
    12. Leonardo: PEP211 proposed this over a decade ago, and there have been various revisions/other proposals. Still no movement though.

      ReplyDelete
    13. Travis:

      A very thoughtful post, and a timely one too I am sure. While I am not in close contact with the PyPy developers I have publicly supported their work many times. Fortunately, as Ales's earlier reply indicates, they are good people with the good of the Python language and community at heart, so I think we will see good cooperation to ensure that users continue to find NumPy the best way of meeting their numerical and scientific computing needs.

      It's also great to hear that you will be managing to escape back into development work, even if only for an all-too-brief sabbatical. I look forward to seeing the new developments in NumPy that brings.

      ReplyDelete
    14. We *don't* want to reimplement BLAS in numpy. We just want to reimplement the part that uses CPython C API.

      ReplyDelete
    15. The future of NumPy seems particularly exciting.
      Thanks!

      ReplyDelete
    16. One of the more impressive blogs Ive seen. Thanks so much for keeping the internet classy for a change. You’ve got style, class, bravado. I mean it. Please keep it up because without the internet is definitely lacking in intelligence.
      joomla extensions

      ReplyDelete
    17. thanks Travis! Very encouraging to hear the plans for NumPy from 'the horse's mouth' as it were, given that i, like so many depend so heavily on NumPy in our daily programming.

      ReplyDelete
    18. Thanks for sharing your info. I really appreciate your efforts and I will be waiting for your further write ups thanks once again.
      Web Shopping Cart | Free Ecommerce Software

      ReplyDelete
    19. There are certainly a lot more details to take into consideration, but thanks for sharing this post.
      source: www.wbupdates.com

      ReplyDelete