Friday, December 6, 2013

Why I promote conda

Anaconda users have been enjoying the benefits of conda for quickly and easily
managing their binary Python packages for over a year.  During that time conda
has also been steadily improving as a general-purpose package manager.  I
have recently been promoting the very nice things that conda can do for Python
users generally --- especially with complex binary extensions to Python as
exist in the NumPy stack.   For example, It is very easy to create python 3
environments and python 2 environments on the same system and install
scikit-learn into them.   Normally, this process can be painful if you
do not have a suitable build environment, or don't want to wait for
compilation to succeed.

Naturally, I sometimes get asked, "Why did you promote/write another
python package manager (conda) instead of just contributing to the
standard pip and virtualenv?"  The python packaging story is older and
more personal to me than you might think.  Python packaging has been a thorn
in my side personally since 1998 when I released my first Python extension
(called numpyio actually).  Since then, I've written and personally released
many, many Python packages (Multipack which became SciPy, NumPy, llvmpy,
Numba, Blaze, etc.).   There is nothing you want more as a package author than
users.  So, to make Multipack (SciPy), then NumPy available, I had to become a
packaging expert by experiencing a lot of pain with the lack of
suitable tools for my (admittedly complex) task.

Along the way, I've suffered through believing that distutils,
setuptools, distribute, and pip/virtualenv would solve my actual
problem.  All of these tools provided some standardization (at least around what somebody
types at the command line to build a package) but no help in actually doing the
build and no real help in getting compatible binaries of things like SciPy
installed onto many users machines.

I've personally made terrible software engineering mistakes because of the lack of
good package management.  For example, I allowed the pressure of "no ABI
changes" to severely hamper the progress of the NumPy API.  Instead of pushing
harder and breaking the ABI when necessary to get improvements into NumPy, I
buckled under the pressure and agreed to the requests coming mostly from NumPy
windows users and froze the ABI.  I could empathize with people who would spend
days building their NumPy stack and literally become fearful of changing it.
From NumPy 1.4 to NumPy 1.7, the partial date-time addition caused various
degrees of broken-ness and is part of why missing data data-types have never
showed up in NumPy at all.   If conda had existed back then with standard
conda binaries released for different projects, there would have been almost
no problem at all.   That pressure would have largely disappeared.   Just
install the packages again --- problem solved for everybody (not just the
Linux users who had apt-get and yum).

Some of the problems with SciPy are also rooted in the lack of good packages
and package  management.  SciPy, when we first released it in 2001 was
basically a distribution of multiple modules from Multipack, some new BLAS /
LAPACK and linear algebra wrappers and nascent plotting tools.  It was a SciPy
distribution masquerading as a single library.  Most of the effort spent was
a packaging effort (especially on Windows).  Since then, the scikits effort
has done a great job of breaking up the domain of SciPy into more manageable
chunks and providing a space for the community to grow.   This kind of re-
factoring is only possible with good distributions and is really only
effective when you have good package management.   On Mac and Linux
package managers exist --- on Windows things like EPD, Anaconda or C.
Gohlke's collection of binaries have been the only solution.

Through all of this work, I've cut my fingers and toes and sometimes face on
compilers, shared and static libraries on all kinds of crazy systems (AIX,
Windows NT, etc.).  I still remember the night I learned what it meant to have
ABI incompatibilty between different compilers (try passing structs
such as complex-numbers between a file compiled with mingw and a library compiled with
Visual Studio).   I've been bitten more than once by unicode-width
incompatibilities, strange shared-library incompatibilities, and the vagaries
of how different compilers and run-times define the `FILE *` file pointer.

In fact, if you have not read "Linkers and Loaders", you should actually do
that right now as it will open your mind to that interesting limbo between
"developer-code" and "running process" overlooked by even experienced
developers.  I'm grateful Dave Beazley recommended it to me over 6 years ago.
Here is a link:

We in the scientific python community have had difficulty and a rocky
history with just waiting for the community to solve the
problem.  With distutils for example, we had to essentially re-write
most of it (as numpy.distutils) in order to support compilation of
extensions that needed Fortran-compiled libraries.  This was not an
easy task.  All kinds of other tools could have (and, in retrospect,
should have) been used.  Most of the design of distutils did not help
us in the NumPy stack at all.  In fact, numpy.distutils replaces most
of the innards of distutils but is still shackled by the architecture
and imperative approach to what should fundamentally be a declarative
problem.  We should have just used or written something like waf or
bento or cmake and encouraged its use everywhere.  However, we buckled
under the pressure of the distutils promise of "one right way to do
it" and "one-size fits all" solution that we all hoped for, but
ultimately did not get.  I appreciate the effort of the distutils
authors.  Their hearts were in the right place and they did provide a
useful solution for their use-cases.  It was just not useful for ours,
and we should not have tried to force the issue.  Not all code is
useful to everyone.  The real mistake was the Python community picking
a "standard" that was actually limiting for a sizeable set of users.
This was the real problem --- but it should be noted that this
"problem" is only because of the incredible success and therefore
influence of python developers and  With this influence, however,
comes a certain danger of limiting progress if all advances have to be
made via committee --- working out specifications instead of watching for
innovation and encouraging it.

David Cooke and many others finally wrestled numpy.distutils to the
point that the library does provide some useful functionality for
helping build extensions requiring NumPy.  Even after all that effort,
however, some in the Python community who seem to have no idea of the
history of how these things came about and simply claim that
files that need numpy.distutils are "broken" because they import numpy
before "requiring" them.  To this, I reply that what is actually
broken is the design that does not have a delcarative meta-data file
that describes dependencies and then a build process that creates the
environment needed before running any code to do the actual build.
This is what `conda build` does and it works beautifully to create any
kind of binary package you want from any list of dependencies you may
have.  Anything else is going to require all kinds of "bootstrap"
gyrations to fit into the square hole of a process that seems to
require that all things begin with the python incantation.

Therefore, you can't really address the problem of Python packaging without
addressing the core problems of trying to use distutils (at least for the
NumPy stack).  The problems for us in the NumPy stack started there and have
to be rooted out there as well.  This was confirmed for me at the first PyData
meetup at Google HQ, where several of us asked Guido what we can do to fix
Python packaging for the NumPy stack.   Guido's answer was to "solve the
problem ourselves".  We at Continuum took him at his word.  We looked at dpkg,
rpm, pip/virtualenv, brew, nixos, and 0installer, and used our past experience
with EPD.  We thought hard about the fundamental issues, and created the conda
package manager and conda environments.  We who have been working on this for
the past year have decades of Python packaging experience between us: me,
Peter Wang, Ilan Schnell, Bryan Van de Ven, Mark Wiebe, Trent Nelson, Aaron
Meurer, and now Andy Terrel are all helping improve things.  We welcome
contributions, improvements, and updates from anyone else as conda is BSD
licensed and completely open source and can be used and re-used by
anybody.  We've also recently made a mailing list which is open to anyone to join and participate:!forum/conda

Conda pkg files are similar to .whl files except they are Python-agnostic.  A
conda pkg file is a bzipped tar file with an 'info' directory, and then
whatever other directory structure is created by the install process in
"prefix".   It's the equivalent of taking a file-system diff pre and post-
install and then tarring the result up.  It's more general than .whl files and
can support any kind of binary file.    Making conda packages is as simple as making a recipe for it.   We make a growing collection of public-domain, example recipes available to everyone and also encourage attachment of a conda recipe directory to every project that needs binaries.

At the heart of conda package installation is the concept of environments.
Environments are like namespaces in Python -- but for binary packages.  Their
applicability is extensive.  We are using them within Anaconda and Wakari for
all kinds of purposes (from testing to application isolation to easy
reproducibility to supporting multiple versions of packages in different
scripts that are part of the same installation).  Truly, to borrow the famous
Tim Peters' quip: "Environments are one honking great idea -- let's do more of
those".  Rather than tacking this on after the fact like virtualenv does to
pip, OS-level environments are built-in from the beginning.  As a result,
every conda package is always installed into an environment.  There is a
default (root) environment if you don't explicitly specify another one.
Installation of a package is simply merging the unpacked binary into the union
of unpacked binaries already at the root-path of the environment.   If union
filesystems were better implemented in different operating systems, then each
environment would simply be a union of the untarred binary packages.  Instead
we accomplish the same thing with hard-linking, soft-linking, and (when
necessary) copying of files.

The design is simple, which helps it be easy to understand and easy to
mix with other ideas.  We don't see easily how to take these simple,
powerful ideas and adapt them to .whl and virtualenv which are trying
to fit-in to a world created by distutils and setuptools.  It was
actually much easier to just write our own solution and create
hundreds of packages and make them available and provide all the tools
to reproduce what we have done inside conda than to try and untangle
how to provide our solution in that world and potentially even not
quite get the result we want (which can be argued is what happened
with numpy.distutils).

You can use conda to build your own distribution of binaries that
compete with Anaconda if you like.  Please do.  I would be completely
thrilled if every other Python distribution (, EPD,
ActiveState, etc.) just used conda packages that they build and in so
doing helped improve the conda package manager.  I recognize that
conda emerged at the same time as the Anaconda distribution was
stabilizing and so there is natural confusion over the two.  So,
I will try to clarify: Conda is an open-source, general,
cross-platform package manager.  One could accurately describe it as a
cross-platform hombrew written in Python.  Anyone can use the tool and
related infrastructure to build and distribute whatever packages they

Anaconda is the collection of conda packages that we at Continuum provide for
free to everyone, based on a particular base Python we choose (which you can
download at as Miniconda).  In the past it has
been some work to get conda working outside Miniconda or Anaconda because our
first focus was creating a working solution for our users.  We have been
fixing those minor issues and have now released a version of conda that can be
'pip installed'.   As conda has significant overlap with virtualenv in
particular we are still working out kinks in the interop of these two
solutions.   But, it all can and should work together and we fix issues as
quickly as we can identify them.

We also provide a service called (register with beta-code
"binstar in beta") which allows you to host your own binary conda packages.
With this missing piece, you just tell people to point their conda
repositories to your collection -- and they can easily install everything you
want them to.  You can also build your own conda repositories and host them on
your own servers.  It all works, today, now -- for hundreds of thousands of
people.  In this context, Anaconda could be considered a "reference"
distribution and a proof of concept of how to use the conda package manager.
Wakari also uses the conda package manager at its core to share bundles.
Bundles are just conda packages (with a set of dependencies) and capture the
core problems associated with reproducible computing in a light-weight and
easily reproduced way.  We have made the tools available for *anyone* to re-
create this distribution pretty easily and compete with us.

It is very important to keep in mind that we created conda to solve
the problem of distributing an environment to end-users that allow
them do to advanced data analytics, scientific discovery, and general
engineering work.  Python has a chance to play a major role in this
space.  However, it is not the only player.  Other solutions exist in
the space we are targeting (SAS, Matlab, SPSS, and R).  We want Python
to dominate this space.  We could not wait for the packaging solution
we needed to evolve from the lengthy discussions that are on-going
which also have to untangle the history of distutils, setuptools,
easy_install, and distribute.  What we could do is solve our problem
and then look for interoperability and influence opportunities once we
had something that worked for our needs.   That the approach we took
and I'm glad we did.  We have a working solution now which benefits
hundreds of thousands of users (and could benefit millions more if
IT administrators recognized conda as an acceptable packaging approach
from others in the community).

We are going to keep improving conda until it becomes an obvious
solution for everyone: users, developers, and IT administrators alike.
We welcome additions and suggestions that allow it to interoperate
with anything else in the Python packaging space.   I do believe that the group of people working on Python packaging and Nick Coghlan in particular are doing a valuable service.  It's a very difficult job to take into account the history of Python packaging, fix all the little issues around it, *and* provide a binary distribution system that allows users to not have to think about packaging and distribution.    With our resources we did just the latter.   I admire those who are on the front lines of the former and look to provide as much context as I can to ensure that any future decisions take our use-cases into account.   I am looking forward to continuing to work with the community to reach future solutions that benefit everyone.

If you would like to see more detail about conda and how it can be used here are some

Talk at PyData NYC 2013:
 - Slides:
 - Video:

Blog Posts:

Mailing list:

Wednesday, July 3, 2013

Thoughts after SciPy 2013 and a specific NumPy improvement

I attended a few days of SciPy 2013 and enjoyed interacting with the many old friends and many new friends that participate in this conference.   I thought the program committee did an excellent job of selecting talks and there were more attendees this year which also mirrors my experience with the PyData conference series which sells out every time.     Andy Terrell, a NumFOCUS board member and researcher at the University of Texas, and Jonathan Rocher, an Enthought developer, were co-chairs of SciPy this year and did an excellent job of coordination.

Continuum Analytics, my new company, is the institutional sponsor of the PyData conference series and I know how much work it can be, so my thanks go out to Enthought for their efforts to sponsor the SciPy conference this year and in years past.   I'm really looking forward to the day when the SciPy conference, like the PyData conference series, directly benefits NumFOCUS which is a non-profit organization with 501(c)(3) status started by the scientific Python community and run by the same community behind so much of the SciPy stack.    It looks like steps are being taken in that direction which is wonderful to see.  At the SciPy conference, Fernando Perez, of IPython fame, led the charge to get fiscal sponsorship documents improved to make it much simpler for people wanting to sponsor the great projects on the scientific python stack (IPython, NumPy, SciPy, Pandas, SymPy, Matplotlib, etc.) to have a vehicle to do it.  This year, NumFOCUS was able to sponsor the attendance of two students to the SciPy conference because of generous donors.  Right now, NumFOCUS is looking for help for its website to improve the look and feel.   It's a great way to get involved with the community and help out.    Just send an email to the numfocus google group (a public group for all to get involved with):

Right now, a conversation involving graph-representations for Python compilation tools is happening on the numfocus mailing list among several interested parties from SymPy, Numba, Theano, Pythran, Parakeet, etc.     One of the highlights of the conference for me was meeting and interacting with other people interested in Python-for-science compiler technology as it looks like there is a healthy community developing around this topic.   I hope those interested in the topic check out and issue pull requests to that github-hosted page to describe their favorite tool.

I only attended some of the tutorial given by fellow Continuum team members Ben Zaitlen and Clayton Davis.   I was gratified to see that was useful for so many people during the tutorials, and appreciated the feedback on how we can continue to improve the tool.   I'm also grateful to see all the people able to productively use Anaconda which is our free, cross-platform, distribution for using Python for scientific work and data analysis.

It was nice to see David Cournapeau give a detailed discussion of NumPy internals in one of the tutorials.   There is much more that could be said about NumPy internals, but David gave a good introduction to the topic.   I like how he showed how it is possible to extend the NumPy dtype system --- especially with certain kinds of types.   In NumPy, I tried very hard to make the type-system more extensible.   It's nice to see it being used more and more.   Extending the type system more generally (to include things like variable-length strings, and infinite precision floats) is a bit harder and not very easy to do in current NumPy (especially while trying to keep the foundation stable).     In fact, one of the reasons Continuum is sponsoring the development of dynd is precisely to build a foundation with an easier to extend type-system.   Making it a C++ library should hopefully allow languages like Javascript, Ruby, Haskell, and others to also benefit from the dynamic type concepts as well.

I really enjoyed the talk on Spyder by Carlos Cordoba.   The Spyder IDE is a very nice tool and I was happy to see Carlos promoting it.   The Spyder IDE is featured in our Anaconda Launcher (part of the Anaconda 1.6 release) along with the IPython notebook and IPython console.   The Launcher allows anyone to publish their app to multiple platforms simply by making a conda package (with an icon and an entry-point) and upload it to a repository that the Launcher is looking at.   All the dependencies can be specified and they will be installed via conda automatically when the app is selected.   The hope is to make it very easy for anyone to get their cool application based on Python in front of people quickly without having to make installers for every platform.

Besides the excellent keynote talks, by Fernando Perez, William Schroeder, and Olivier Grisel, I also found the talks by Matthew Rocklin, Pat Marion, Ramalingam Saravanan, Serge Guelton, Samuel SkillmanJake Vanderplas, and Joshua Warner very interesting.   It was especially nice to meet Joshua who was coming from the Mayo Clinic where SciPy began.   I started writing the SciPy library in 1999 at the Mayo Clinic while I was a graduate student there (then called Multipack, special, and a bunch of other modules).    It was very nice to meet someone from Mayo contributing again to this community with a very nice fuzzy logic package based on the work of an old professor of mine Hal Otteson.    His work is now a new scikit.  The scikit concept has been a tremendous boon for development of the Scientific Python community as it allows more distributed development and more rapid expansion of the available tools.    If better packaging had existed at the time, I would very likely have kept my early modules independent so they could grow with their own developer bases.   What is now the SciPy library should most likely have been a SciPy distribution (with perhaps a smaller core).    But, hindsight is 20/20 and given the state of the world at the time, the best option seemed to be to create the SciPy library with Eric Jones and Pearu Peterson.

Mark Wiebe did an excellent job in presenting dynd, a C++ library for dynamic multi-dimensional array manipulation with nice python bindings.   Mark's work, sponsored by Continuum Analytics,  is something that could lead to NumPy 2.0, although nobody has suggested exactly how that might work yet.    As dynd forms a foundation for Blaze, and Blaze and NumPy can co-exist for many years, I haven't been thinking much about how NumPy 2.0 could grow out of dynd until now.  I do now have some ideas about how NumPy could be improved that I think will help the space evolve more fluidly and productively with many interested people able to coordinate their varied efforts.   The most important of these is the introduction of multi-methods into NumPy which I'll outline below.

I participated on a panel about the future of Array Oriented Computing in Python.   Of course, I've been spending a lot of time over the past year working and thinking exactly about that, so I would have preferred a talk versus a panel with only a limited amount of time.    However, I have limited time to prepare talks and will be speaking at the upcoming PyData conference in Boston, so I was grateful for the chance to at least express some of the ideas we've been working on.    To be clear, I think that Blaze is the future of Array Oriented Computing in Python, though we have some work ahead to prove that out.   Exactly what the transition from NumPy to Blaze looks like for people will be a story I care quite a bit about and will be telling more and more in the coming months and years.    I take personal responsibility for anyone who adopted NumPy, and I will do everything I can to make sure their transition to using Blaze is as simple as possible.   Backward compatibility is very important to me.  I spent many hours making sure that NumPy was compatible with both Numarray and Numeric.   Fortunately, Blaze and NumPy can co-exist and so there is less of a story of either / or and more about which / when (especially during the transition phase).

There is also another possibility that will be interesting to see if it emerges:  retro-fitting NumPy with multi-methods (dispatching on python type and also on dtype).    I think this is the single-most important thing that can be done for NumPy.   If someone is motivated and has budget, I can work with her to do this in about 1-2 months (maybe even sooner depending on the experience).    This is not on my immediately funded road-map, however, so it would need outside funding and/or interest.

There are several different multi-method implementations for Python.   For those unfamiliar with the concept, here is a good essay by Guido on the general concept.   Multi-methods are also at the heart of Julia.    They are a simple concept.    Basically, a multi-method is an object that dispatches to a different implementation based on the number and types of the arguments.   The idea is that you can add new implementations of the underlying function quite easily without changing the function object itself.   So, for example, if were a multi-method, then I could change the implementation of for my new fancy array-object without directly changing the source-code of in NumPy and all downstream functions and methods that use in their implementation would automatically work with my new type of array.    Multi-methods allow extensibility in a manner similar to how operator overloading allows extensibility in object-oriented programming.   But, it's a much more natural fit for operations where dispatching only on the first argument does not make a lot of sense.

In fact, at the heart of NumPy's ufuncs is a multi-method dispatch mechanism (on NumPy dtype, instead of Python type), so NumPy users have been using multi-methods for a long time.  In fact, if NumPy's ufuncs were true multi-methods to begin with, then all the hassle with __array_wrap__, __array_prepare__, and so forth which are hacks to compensate for the lack of true Python-type-based multi-methods would not be necessary.    If you look at the implementation of NumPy's masked array's for example you will see some of the ugliness that is caused by NumPy's lack of a better multi-method mechanism.    Numba's autojit also effectively creates a kind of multi-method as it creates a new function to dispatch to whenever it encounters a new set of types for the arguments.    These are the ideas that we are building on and using in Blaze, as we learn from our experience with NumPy.

The biggest challenge for multi-methods is always what function to return if you don't find an exact match.    A simple multi-method is basically a dictionary whose key is the a tuple of the types of the input arguments and whose value is the implementation.  But, what do you do if the key does not return an implementation?  How do you find a compatible function and use it instead?    There is a lot of theory on this and several approaches people have taken.  I'm not aware of a universal solution that everybody agrees should be used.      However, there are reasonable approaches that can be taken using  the idea of typesets or type-hierarchies (for those interested you can read more about contravariance and covariance for other approaches to resolving the type dispatch problem as well).

I'm confident that useful if not universal approaches to this problem can be found (several are already available for Python and in Julia, for example).   For NumPy, what is needed is a two-tiered dispatch mechanism.   My view is that all NumPy (and SciPy and Scikit) functions should be multi-methods that dispatch based on Python-type *and* then additionally for memory-view-like objects on the data-type of the elements.    The dispatch rules for each of these cases can and should be separate, I think.

If you are interested in this problem and especially if you have money to fund it, feel free to contact me directly at travis at continuum dot io.

While I am spending more and more of my conference time with the PyData conference series, I still enjoy reconnecting with people I will always consider friends at the SciPy conference.   Fortunately, many speakers participate in both.     Having both conferences allows the community to grow and have bigger and better impact as I think can be witnessed by the increased attendance this year at SciPy.