Friday, December 6, 2013

Why I promote conda

Anaconda users have been enjoying the benefits of conda for quickly and easily
managing their binary Python packages for over a year.  During that time conda
has also been steadily improving as a general-purpose package manager.  I
have recently been promoting the very nice things that conda can do for Python
users generally --- especially with complex binary extensions to Python as
exist in the NumPy stack.   For example, It is very easy to create python 3
environments and python 2 environments on the same system and install
scikit-learn into them.   Normally, this process can be painful if you
do not have a suitable build environment, or don't want to wait for
compilation to succeed.

Naturally, I sometimes get asked, "Why did you promote/write another
python package manager (conda) instead of just contributing to the
standard pip and virtualenv?"  The python packaging story is older and
more personal to me than you might think.  Python packaging has been a thorn
in my side personally since 1998 when I released my first Python extension
(called numpyio actually).  Since then, I've written and personally released
many, many Python packages (Multipack which became SciPy, NumPy, llvmpy,
Numba, Blaze, etc.).   There is nothing you want more as a package author than
users.  So, to make Multipack (SciPy), then NumPy available, I had to become a
packaging expert by experiencing a lot of pain with the lack of
suitable tools for my (admittedly complex) task.

Along the way, I've suffered through believing that distutils,
setuptools, distribute, and pip/virtualenv would solve my actual
problem.  All of these tools provided some standardization (at least around what somebody
types at the command line to build a package) but no help in actually doing the
build and no real help in getting compatible binaries of things like SciPy
installed onto many users machines.

I've personally made terrible software engineering mistakes because of the lack of
good package management.  For example, I allowed the pressure of "no ABI
changes" to severely hamper the progress of the NumPy API.  Instead of pushing
harder and breaking the ABI when necessary to get improvements into NumPy, I
buckled under the pressure and agreed to the requests coming mostly from NumPy
windows users and froze the ABI.  I could empathize with people who would spend
days building their NumPy stack and literally become fearful of changing it.
From NumPy 1.4 to NumPy 1.7, the partial date-time addition caused various
degrees of broken-ness and is part of why missing data data-types have never
showed up in NumPy at all.   If conda had existed back then with standard
conda binaries released for different projects, there would have been almost
no problem at all.   That pressure would have largely disappeared.   Just
install the packages again --- problem solved for everybody (not just the
Linux users who had apt-get and yum).

Some of the problems with SciPy are also rooted in the lack of good packages
and package  management.  SciPy, when we first released it in 2001 was
basically a distribution of multiple modules from Multipack, some new BLAS /
LAPACK and linear algebra wrappers and nascent plotting tools.  It was a SciPy
distribution masquerading as a single library.  Most of the effort spent was
a packaging effort (especially on Windows).  Since then, the scikits effort
has done a great job of breaking up the domain of SciPy into more manageable
chunks and providing a space for the community to grow.   This kind of re-
factoring is only possible with good distributions and is really only
effective when you have good package management.   On Mac and Linux
package managers exist --- on Windows things like EPD, Anaconda or C.
Gohlke's collection of binaries have been the only solution.

Through all of this work, I've cut my fingers and toes and sometimes face on
compilers, shared and static libraries on all kinds of crazy systems (AIX,
Windows NT, etc.).  I still remember the night I learned what it meant to have
ABI incompatibilty between different compilers (try passing structs
such as complex-numbers between a file compiled with mingw and a library compiled with
Visual Studio).   I've been bitten more than once by unicode-width
incompatibilities, strange shared-library incompatibilities, and the vagaries
of how different compilers and run-times define the `FILE *` file pointer.

In fact, if you have not read "Linkers and Loaders", you should actually do
that right now as it will open your mind to that interesting limbo between
"developer-code" and "running process" overlooked by even experienced
developers.  I'm grateful Dave Beazley recommended it to me over 6 years ago.
Here is a link:  http://www.iecc.com/linker/

We in the scientific python community have had difficulty and a rocky
history with just waiting for the Python.org community to solve the
problem.  With distutils for example, we had to essentially re-write
most of it (as numpy.distutils) in order to support compilation of
extensions that needed Fortran-compiled libraries.  This was not an
easy task.  All kinds of other tools could have (and, in retrospect,
should have) been used.  Most of the design of distutils did not help
us in the NumPy stack at all.  In fact, numpy.distutils replaces most
of the innards of distutils but is still shackled by the architecture
and imperative approach to what should fundamentally be a declarative
problem.  We should have just used or written something like waf or
bento or cmake and encouraged its use everywhere.  However, we buckled
under the pressure of the distutils promise of "one right way to do
it" and "one-size fits all" solution that we all hoped for, but
ultimately did not get.  I appreciate the effort of the distutils
authors.  Their hearts were in the right place and they did provide a
useful solution for their use-cases.  It was just not useful for ours,
and we should not have tried to force the issue.  Not all code is
useful to everyone.  The real mistake was the Python community picking
a "standard" that was actually limiting for a sizeable set of users.
This was the real problem --- but it should be noted that this
"problem" is only because of the incredible success and therefore
influence of python developers and python.org.  With this influence, however,
comes a certain danger of limiting progress if all advances have to be
made via committee --- working out specifications instead of watching for
innovation and encouraging it.

David Cooke and many others finally wrestled numpy.distutils to the
point that the library does provide some useful functionality for
helping build extensions requiring NumPy.  Even after all that effort,
however, some in the Python community who seem to have no idea of the
history of how these things came about and simply claim that setup.py
files that need numpy.distutils are "broken" because they import numpy
before "requiring" them.  To this, I reply that what is actually
broken is the design that does not have a delcarative meta-data file
that describes dependencies and then a build process that creates the
environment needed before running any code to do the actual build.
This is what `conda build` does and it works beautifully to create any
kind of binary package you want from any list of dependencies you may
have.  Anything else is going to require all kinds of "bootstrap"
gyrations to fit into the square hole of a process that seems to
require that all things begin with the python setup.py incantation.

Therefore, you can't really address the problem of Python packaging without
addressing the core problems of trying to use distutils (at least for the
NumPy stack).  The problems for us in the NumPy stack started there and have
to be rooted out there as well.  This was confirmed for me at the first PyData
meetup at Google HQ, where several of us asked Guido what we can do to fix
Python packaging for the NumPy stack.   Guido's answer was to "solve the
problem ourselves".  We at Continuum took him at his word.  We looked at dpkg,
rpm, pip/virtualenv, brew, nixos, and 0installer, and used our past experience
with EPD.  We thought hard about the fundamental issues, and created the conda
package manager and conda environments.  We who have been working on this for
the past year have decades of Python packaging experience between us: me,
Peter Wang, Ilan Schnell, Bryan Van de Ven, Mark Wiebe, Trent Nelson, Aaron
Meurer, and now Andy Terrel are all helping improve things.  We welcome
contributions, improvements, and updates from anyone else as conda is BSD
licensed and completely open source and can be used and re-used by
anybody.  We've also recently made a mailing list
conda@continuum.io which is open to anyone to join and participate:
https://groups.google.com/a/continuum.io/forum/#!forum/conda

Conda pkg files are similar to .whl files except they are Python-agnostic.  A
conda pkg file is a bzipped tar file with an 'info' directory, and then
whatever other directory structure is created by the install process in
"prefix".   It's the equivalent of taking a file-system diff pre and post-
install and then tarring the result up.  It's more general than .whl files and
can support any kind of binary file.    Making conda packages is as simple as making a recipe for it.   We make a growing collection of public-domain, example recipes available to everyone and also encourage attachment of a conda recipe directory to every project that needs binaries.

At the heart of conda package installation is the concept of environments.
Environments are like namespaces in Python -- but for binary packages.  Their
applicability is extensive.  We are using them within Anaconda and Wakari for
all kinds of purposes (from testing to application isolation to easy
reproducibility to supporting multiple versions of packages in different
scripts that are part of the same installation).  Truly, to borrow the famous
Tim Peters' quip: "Environments are one honking great idea -- let's do more of
those".  Rather than tacking this on after the fact like virtualenv does to
pip, OS-level environments are built-in from the beginning.  As a result,
every conda package is always installed into an environment.  There is a
default (root) environment if you don't explicitly specify another one.
Installation of a package is simply merging the unpacked binary into the union
of unpacked binaries already at the root-path of the environment.   If union
filesystems were better implemented in different operating systems, then each
environment would simply be a union of the untarred binary packages.  Instead
we accomplish the same thing with hard-linking, soft-linking, and (when
necessary) copying of files.

The design is simple, which helps it be easy to understand and easy to
mix with other ideas.  We don't see easily how to take these simple,
powerful ideas and adapt them to .whl and virtualenv which are trying
to fit-in to a world created by distutils and setuptools.  It was
actually much easier to just write our own solution and create
hundreds of packages and make them available and provide all the tools
to reproduce what we have done inside conda than to try and untangle
how to provide our solution in that world and potentially even not
quite get the result we want (which can be argued is what happened
with numpy.distutils).

You can use conda to build your own distribution of binaries that
compete with Anaconda if you like.  Please do.  I would be completely
thrilled if every other Python distribution (python.org, EPD,
ActiveState, etc.) just used conda packages that they build and in so
doing helped improve the conda package manager.  I recognize that
conda emerged at the same time as the Anaconda distribution was
stabilizing and so there is natural confusion over the two.  So,
I will try to clarify: Conda is an open-source, general,
cross-platform package manager.  One could accurately describe it as a
cross-platform hombrew written in Python.  Anyone can use the tool and
related infrastructure to build and distribute whatever packages they
want.

Anaconda is the collection of conda packages that we at Continuum provide for
free to everyone, based on a particular base Python we choose (which you can
download at http://continuum.io/downloads as Miniconda).  In the past it has
been some work to get conda working outside Miniconda or Anaconda because our
first focus was creating a working solution for our users.  We have been
fixing those minor issues and have now released a version of conda that can be
'pip installed'.   As conda has significant overlap with virtualenv in
particular we are still working out kinks in the interop of these two
solutions.   But, it all can and should work together and we fix issues as
quickly as we can identify them.

We also provide a service called http://binstar.org (register with beta-code
"binstar in beta") which allows you to host your own binary conda packages.
With this missing piece, you just tell people to point their conda
repositories to your collection -- and they can easily install everything you
want them to.  You can also build your own conda repositories and host them on
your own servers.  It all works, today, now -- for hundreds of thousands of
people.  In this context, Anaconda could be considered a "reference"
distribution and a proof of concept of how to use the conda package manager.
Wakari also uses the conda package manager at its core to share bundles.
Bundles are just conda packages (with a set of dependencies) and capture the
core problems associated with reproducible computing in a light-weight and
easily reproduced way.  We have made the tools available for *anyone* to re-
create this distribution pretty easily and compete with us.

It is very important to keep in mind that we created conda to solve
the problem of distributing an environment to end-users that allow
them do to advanced data analytics, scientific discovery, and general
engineering work.  Python has a chance to play a major role in this
space.  However, it is not the only player.  Other solutions exist in
the space we are targeting (SAS, Matlab, SPSS, and R).  We want Python
to dominate this space.  We could not wait for the packaging solution
we needed to evolve from the lengthy discussions that are on-going
which also have to untangle the history of distutils, setuptools,
easy_install, and distribute.  What we could do is solve our problem
and then look for interoperability and influence opportunities once we
had something that worked for our needs.   That the approach we took
and I'm glad we did.  We have a working solution now which benefits
hundreds of thousands of users (and could benefit millions more if
IT administrators recognized conda as an acceptable packaging approach
from others in the community).

We are going to keep improving conda until it becomes an obvious
solution for everyone: users, developers, and IT administrators alike.
We welcome additions and suggestions that allow it to interoperate
with anything else in the Python packaging space.   I do believe that the group of people working on Python packaging and Nick Coghlan in particular are doing a valuable service.  It's a very difficult job to take into account the history of Python packaging, fix all the little issues around it, *and* provide a binary distribution system that allows users to not have to think about packaging and distribution.    With our resources we did just the latter.   I admire those who are on the front lines of the former and look to provide as much context as I can to ensure that any future decisions take our use-cases into account.   I am looking forward to continuing to work with the community to reach future solutions that benefit everyone.

If you would like to see more detail about conda and how it can be used here are some
resources:

Documentation: http://docs.continuum.io/conda/index.html
Talk at PyData NYC 2013:
 - Slides: https://speakerdeck.com/teoliphant/packaging-and-deployment-with-conda
 - Video: http://vimeo.com/79862018

Blog Posts:
 - http://continuum.io/blog/anaconda-python-3
 - http://continuum.io/blog/new-advances-in-conda
 - http://continuum.io/blog/conda

Mailing list:
 - conda@continuum.io
 - https://groups.google.com/a/continuum.io/forum/#!forum/conda