Wednesday, July 3, 2013

Thoughts after SciPy 2013 and a specific NumPy improvement

I attended a few days of SciPy 2013 and enjoyed interacting with the many old and new friends who participate in this conference.   I thought the program committee did an excellent job of selecting talks.   There were more attendees this year, which mirrors my experience with the PyData conference series, which sells out every time.   Andy Terrell, a NumFOCUS board member and researcher at the University of Texas, and Jonathan Rocher, an Enthought developer, were co-chairs of SciPy this year and did an excellent job of coordination.

Continuum Analytics, my new company, is the institutional sponsor of the PyData conference series and I know how much work it can be, so my thanks go out to Enthought for their efforts to sponsor the SciPy conference this year and in years past.   I'm really looking forward to the day when the SciPy conference, like the PyData conference series, directly benefits NumFOCUS, a non-profit organization with 501(c)(3) status that was started by the scientific Python community and is run by the same community behind so much of the SciPy stack.   It looks like steps are being taken in that direction, which is wonderful to see.   At the SciPy conference, Fernando Perez, of IPython fame, led the charge to improve the fiscal sponsorship documents so that people wanting to sponsor the great projects in the scientific Python stack (IPython, NumPy, SciPy, Pandas, SymPy, Matplotlib, etc.) have a much simpler vehicle for doing it.   This year, thanks to generous donors, NumFOCUS was able to sponsor the attendance of two students at the SciPy conference.   Right now, NumFOCUS is looking for help improving the look and feel of its website.   It's a great way to get involved with the community and help out.   Just send an email to the numfocus google group (a public group for all to get involved with):

Right now, a conversation involving graph-representations for Python compilation tools is happening on the numfocus mailing list among several interested parties from SymPy, Numba, Theano, Pythran, Parakeet, etc.     One of the highlights of the conference for me was meeting and interacting with other people interested in Python-for-science compiler technology as it looks like there is a healthy community developing around this topic.   I hope those interested in the topic check out and issue pull requests to that github-hosted page to describe their favorite tool.

I only attended some of the tutorial given by fellow Continuum team members Ben Zaitlen and Clayton Davis.   I was gratified to see that was useful for so many people during the tutorials, and appreciated the feedback on how we can continue to improve the tool.   I'm also grateful to see all the people able to productively use Anaconda which is our free, cross-platform, distribution for using Python for scientific work and data analysis.

It was nice to see David Cournapeau give a detailed discussion of NumPy internals in one of the tutorials.   There is much more that could be said about NumPy internals, but David gave a good introduction to the topic.   I like how he showed how it is possible to extend the NumPy dtype system --- especially with certain kinds of types.   In NumPy, I tried very hard to make the type-system more extensible.   It's nice to see it being used more and more.   Extending the type system more generally (to include things like variable-length strings, and infinite precision floats) is a bit harder and not very easy to do in current NumPy (especially while trying to keep the foundation stable).     In fact, one of the reasons Continuum is sponsoring the development of dynd is precisely to build a foundation with an easier to extend type-system.   Making it a C++ library should hopefully allow languages like Javascript, Ruby, Haskell, and others to also benefit from the dynamic type concepts as well.

I really enjoyed the talk on Spyder by Carlos Cordoba.   The Spyder IDE is a very nice tool and I was happy to see Carlos promoting it.   The Spyder IDE is featured in our Anaconda Launcher (part of the Anaconda 1.6 release) along with the IPython notebook and IPython console.   The Launcher allows anyone to publish their app to multiple platforms simply by making a conda package (with an icon and an entry-point) and uploading it to a repository that the Launcher is looking at.   All the dependencies can be specified and they will be installed via conda automatically when the app is selected.   The hope is to make it very easy for anyone to get their cool application based on Python in front of people quickly without having to make installers for every platform.

Besides the excellent keynote talks by Fernando Perez, William Schroeder, and Olivier Grisel, I also found the talks by Matthew Rocklin, Pat Marion, Ramalingam Saravanan, Serge Guelton, Samuel Skillman, Jake Vanderplas, and Joshua Warner very interesting.   It was especially nice to meet Joshua, who was coming from the Mayo Clinic where SciPy began.   I started writing the SciPy library (then called Multipack, special, and a bunch of other modules) in 1999 at the Mayo Clinic while I was a graduate student there.   It was very nice to meet someone from Mayo contributing again to this community with a very nice fuzzy logic package based on the work of an old professor of mine, Hal Otteson.   His work is now a new scikit.   The scikit concept has been a tremendous boon for development of the Scientific Python community as it allows more distributed development and more rapid expansion of the available tools.   If better packaging had existed at the time, I would very likely have kept my early modules independent so they could grow with their own developer bases.   What is now the SciPy library should most likely have been a SciPy distribution (with perhaps a smaller core).   But hindsight is 20/20, and given the state of the world at the time, the best option seemed to be to create the SciPy library with Eric Jones and Pearu Peterson.

Mark Wiebe did an excellent job in presenting dynd, a C++ library for dynamic multi-dimensional array manipulation with nice python bindings.   Mark's work, sponsored by Continuum Analytics,  is something that could lead to NumPy 2.0, although nobody has suggested exactly how that might work yet.    As dynd forms a foundation for Blaze, and Blaze and NumPy can co-exist for many years, I haven't been thinking much about how NumPy 2.0 could grow out of dynd until now.  I do now have some ideas about how NumPy could be improved that I think will help the space evolve more fluidly and productively with many interested people able to coordinate their varied efforts.   The most important of these is the introduction of multi-methods into NumPy which I'll outline below.

I participated on a panel about the future of Array Oriented Computing in Python.   Of course, I've been spending a lot of time over the past year working and thinking exactly about that, so I would have preferred a talk versus a panel with only a limited amount of time.    However, I have limited time to prepare talks and will be speaking at the upcoming PyData conference in Boston, so I was grateful for the chance to at least express some of the ideas we've been working on.    To be clear, I think that Blaze is the future of Array Oriented Computing in Python, though we have some work ahead to prove that out.   Exactly what the transition from NumPy to Blaze looks like for people will be a story I care quite a bit about and will be telling more and more in the coming months and years.    I take personal responsibility for anyone who adopted NumPy, and I will do everything I can to make sure their transition to using Blaze is as simple as possible.   Backward compatibility is very important to me.  I spent many hours making sure that NumPy was compatible with both Numarray and Numeric.   Fortunately, Blaze and NumPy can co-exist and so there is less of a story of either / or and more about which / when (especially during the transition phase).

There is also another possibility that will be interesting to see if it emerges:  retro-fitting NumPy with multi-methods (dispatching on python type and also on dtype).    I think this is the single-most important thing that can be done for NumPy.   If someone is motivated and has budget, I can work with her to do this in about 1-2 months (maybe even sooner depending on the experience).    This is not on my immediately funded road-map, however, so it would need outside funding and/or interest.

There are several different multi-method implementations for Python.   For those unfamiliar with the concept, here is a good essay by Guido on the general concept.   Multi-methods are also at the heart of Julia.   They are a simple concept.   Basically, a multi-method is an object that dispatches to a different implementation based on the number and types of the arguments.   The idea is that you can add new implementations of the underlying function quite easily without changing the function object itself.   So, for example, if a given NumPy function were a multi-method, then I could change its implementation for my new fancy array-object without directly changing its source code in NumPy, and all downstream functions and methods that use it in their implementation would automatically work with my new type of array.   Multi-methods allow extensibility in a manner similar to how operator overloading allows extensibility in object-oriented programming.   But it's a much more natural fit for operations where dispatching only on the first argument does not make a lot of sense.
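
As a rough illustration of the idea, here is a minimal pure-Python sketch of a multi-method that dispatches on the exact tuple of argument types. It is not NumPy's, Julia's, or any existing library's implementation, and every name in it (MultiMethod, total, mean, FancyArray) is hypothetical and chosen only for this example.

```python
class MultiMethod:
    """Toy multi-method: picks an implementation from the number and
    exact types of the positional arguments (illustrative sketch only)."""

    def __init__(self, name):
        self.name = name
        self._registry = {}  # maps a tuple of types to an implementation

    def register(self, *types):
        """Decorator that adds an implementation for the given types."""
        def decorator(func):
            self._registry[types] = func
            return func
        return decorator

    def __call__(self, *args):
        key = tuple(type(a) for a in args)
        try:
            impl = self._registry[key]
        except KeyError:
            names = ", ".join(t.__name__ for t in key)
            raise TypeError(f"{self.name}: no implementation for ({names})") from None
        return impl(*args)


# "Library" code: a multi-method plus a downstream function built on it.
total = MultiMethod("total")


@total.register(list)
def _total_list(values):
    return sum(values)


def mean(values):
    # Written once against the multi-method, never edited afterwards.
    return total(values) / len(values)


# "User" code: a hypothetical new array type defined outside the library.
class FancyArray:
    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)


@total.register(FancyArray)
def _total_fancy(arr):
    return sum(arr.data)


print(mean([1, 2, 3]))           # 2.0
print(mean(FancyArray([2, 4])))  # 3.0
```

Note that mean was never modified: registering a new implementation of total was enough for the downstream function to accept the new type. This sketch only handles exact type matches.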

In fact, at the heart of NumPy's ufuncs is a multi-method dispatch mechanism (on NumPy dtype, instead of Python type), so NumPy users have been using multi-methods for a long time.   If NumPy's ufuncs had been true multi-methods to begin with, then all the hassle with __array_wrap__, __array_prepare__, and so forth, which are hacks to compensate for the lack of true Python-type-based multi-methods, would not be necessary.   If you look at the implementation of NumPy's masked arrays, for example, you will see some of the ugliness that is caused by NumPy's lack of a better multi-method mechanism.   Numba's autojit also effectively creates a kind of multi-method, as it creates a new function to dispatch to whenever it encounters a new set of types for the arguments.   These are the ideas that we are building on and using in Blaze, as we learn from our experience with NumPy.
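
To make the autojit comparison concrete, here is a small sketch of "create a new function the first time a new set of argument types shows up, then dispatch to it afterwards". This is emphatically not how Numba is implemented: fake_compile is a hypothetical stand-in that compiles nothing, and lazy_specialize is a made-up name. Only the caching-by-type-signature pattern is the point.

```python
import functools


def fake_compile(func, arg_types):
    """Hypothetical stand-in for a compiler: returns a wrapper around
    func that simply remembers which types it was 'compiled' for."""
    def specialized(*args):
        return func(*args)
    specialized.arg_types = arg_types
    return specialized


def lazy_specialize(func):
    """Decorator: build one specialization per distinct tuple of
    argument types, on first use, and reuse it on later calls."""
    specializations = {}

    @functools.wraps(func)
    def dispatcher(*args):
        key = tuple(type(a) for a in args)
        impl = specializations.get(key)
        if impl is None:
            # First time this type signature is seen: create an entry.
            impl = specializations[key] = fake_compile(func, key)
        return impl(*args)

    dispatcher.specializations = specializations
    return dispatcher


@lazy_specialize
def add(x, y):
    return x + y


add(1, 2)      # creates the (int, int) specialization
add(1.5, 2.5)  # creates the (float, float) specialization
add(3, 4)      # reuses the (int, int) specialization
print(len(add.specializations))  # 2
```

The dictionary of specializations is exactly the kind of type-keyed table a multi-method carries around; the difference is that the entries are generated on demand rather than registered by hand.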

The biggest challenge for multi-methods is always what function to return if you don't find an exact match.   A simple multi-method is basically a dictionary whose key is a tuple of the types of the input arguments and whose value is the implementation.   But what do you do if the key has no entry in the dictionary?   How do you find a compatible function and use it instead?   There is a lot of theory on this and several approaches people have taken.   I'm not aware of a universal solution that everybody agrees should be used.   However, there are reasonable approaches that can be taken using the idea of typesets or type-hierarchies (for those interested, you can read more about contravariance and covariance for other approaches to resolving the type dispatch problem as well).
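
One possible type-hierarchy approach can be sketched as follows. When there is no exact match, the sketch looks for registered signatures whose types are base classes of the actual argument types, and it prefers the candidate with the smallest summed distance along each argument's method resolution order. That distance rule is just one heuristic picked for illustration, not an agreed-upon solution and not what any particular library does; the names FallbackMultiMethod and describe are hypothetical.

```python
class FallbackMultiMethod:
    """Toy multi-method with a fallback search over the class hierarchy."""

    def __init__(self, name):
        self.name = name
        self._registry = {}  # tuple of types -> implementation

    def register(self, *types):
        def decorator(func):
            self._registry[types] = func
            return func
        return decorator

    @staticmethod
    def _distance(arg_types, signature):
        """Summed MRO distance, or None if the signature is incompatible."""
        if len(arg_types) != len(signature):
            return None
        distance = 0
        for actual, declared in zip(arg_types, signature):
            if declared not in actual.__mro__:
                return None
            distance += actual.__mro__.index(declared)
        return distance

    def resolve(self, arg_types):
        exact = self._registry.get(arg_types)
        if exact is not None:
            return exact
        # No exact match: collect every compatible signature.
        candidates = []
        for signature, impl in self._registry.items():
            distance = self._distance(arg_types, signature)
            if distance is not None:
                candidates.append((distance, impl))
        if not candidates:
            raise TypeError(f"{self.name}: no compatible implementation")
        best = min(distance for distance, _ in candidates)
        winners = [impl for distance, impl in candidates if distance == best]
        if len(winners) > 1:
            raise TypeError(f"{self.name}: ambiguous dispatch")
        return winners[0]

    def __call__(self, *args):
        return self.resolve(tuple(type(a) for a in args))(*args)


describe = FallbackMultiMethod("describe")


@describe.register(object, object)
def _describe_generic(a, b):
    return "generic"


@describe.register(int, int)
def _describe_ints(a, b):
    return "int, int"


print(describe(1, 2))     # exact match
print(describe(True, 2))  # bool is a subclass of int, so the int version wins
print(describe(1.0, 2))   # only the (object, object) version is compatible
```

Raising on ties rather than guessing is a deliberate choice in this sketch: two equally close candidates are reported as ambiguous so that the user can register a more specific implementation.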

I'm confident that useful if not universal approaches to this problem can be found (several are already available for Python and in Julia, for example).   For NumPy, what is needed is a two-tiered dispatch mechanism.   My view is that all NumPy (and SciPy and Scikit) functions should be multi-methods that dispatch based on Python-type *and* then additionally for memory-view-like objects on the data-type of the elements.    The dispatch rules for each of these cases can and should be separate, I think.
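
A two-tiered mechanism along these lines might look like the sketch below, which uses only the standard library. The first tier dispatches on the Python type of the argument. For objects that expose a buffer, a second and separately registered tier dispatches on the element format code reported by memoryview, standing in here for a NumPy dtype. The class name TwoTierMethod, its register_type and register_format methods, and the total function are all hypothetical, and the ordering of the tiers is just one way to arrange them.

```python
import math
from array import array


class TwoTierMethod:
    """Toy two-tier dispatch: Python type first, element format second."""

    def __init__(self, name):
        self.name = name
        self._by_type = {}    # tier 1: Python type -> implementation
        self._by_format = {}  # tier 2: buffer format code -> implementation

    def register_type(self, pytype):
        def decorator(func):
            self._by_type[pytype] = func
            return func
        return decorator

    def register_format(self, fmt):
        def decorator(func):
            self._by_format[fmt] = func
            return func
        return decorator

    def __call__(self, obj):
        # Tier 1: dispatch on the Python type of the argument.
        impl = self._by_type.get(type(obj))
        if impl is not None:
            return impl(obj)
        # Tier 2: for memory-view-like objects, dispatch on element type.
        try:
            view = memoryview(obj)
        except TypeError:
            raise TypeError(
                f"{self.name}: unsupported type {type(obj).__name__}"
            ) from None
        impl = self._by_format.get(view.format)
        if impl is None:
            raise TypeError(f"{self.name}: unsupported format {view.format!r}")
        return impl(view)


total = TwoTierMethod("total")


@total.register_type(list)
def _total_list(values):
    return sum(values)


@total.register_format("d")  # C doubles: use a compensated float sum
def _total_double(view):
    return math.fsum(view)


@total.register_format("i")  # C ints: plain integer sum
def _total_int(view):
    return sum(view)


print(total([1, 2, 3]))               # tier 1, list
print(total(array("d", [0.5, 1.5])))  # tier 2, format 'd'
print(total(array("i", [1, 2, 3])))   # tier 2, format 'i'
```

Keeping the two registries separate mirrors the suggestion that the dispatch rules for Python types and for element data-types can be designed independently of each other.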

If you are interested in this problem and especially if you have money to fund it, feel free to contact me directly at travis at continuum dot io.

While I am spending more and more of my conference time with the PyData conference series, I still enjoy reconnecting with people I will always consider friends at the SciPy conference.   Fortunately, many speakers participate in both.     Having both conferences allows the community to grow and have bigger and better impact as I think can be witnessed by the increased attendance this year at SciPy.