Technical Discovery

Monday, March 5, 2018

Reflections on Anaconda as I start a new chapter with Quansight

Leaving the company you founded is always a tough decision and a tough process that involves many people. It requires a series of potentially emotional "crucial-conversations." It is actually not that uncommon in venture-backed companies for one or more of the original founders to leave at some point. There is a decent article on the topic here: https://hbswk.hbs.edu/item/the-founding-ceos-dilemma-stay-or-go.

Still it is extremely difficult to let go. You live and breathe the company you start. Years of working to connect as many people as possible to the dream gives you a feeling of "ownership" and connection that no stock certificate can replace. Starting a company is a lot of work. It takes a lot of effort. There are many decisions to make and many voices to incorporate. Hiring, firing, raising money, engaging customers, engaging employees, planning projects, organizing events, and aligning a pastiche of personalities while staying relevant in a rapidly evolving technology jungle is difficult.

As a founder over 40 with modest means, I had a family of 6 children who relied on me. That family had teenage children who needed my attention and pre-school and elementary-school children that I could not simply leave only in the hands of my wife. I look back and sometimes wonder how we pulled it off. The truth probably lies in the time we borrowed: time from exercise, time from sleep, time from vacations, and time from family. I'd like to say that this dissonance against "work-life-harmony" was always a bad choice, but honestly, I don't see how I could have made too may different choices and still have created Anaconda.

Several things drove me. I could not let the people associated with the company down. I would not lose the money for those that invested in us. I could not let down the people who worked their tail off to build manage, document, market, and sell the technology and products that we produced. Furthermore, I would not let the community of customers and users down that had enabled us to continue to thrive.

The only way you succeed as a founder is through your customers being served by the efforts of those who surround you. It is only the efforts of the talented people who joined us in our journey that has allowed Anaconda to succeed so far. It is critical to stay focused on what is in the best interests of those people.

Permit me to use the name Continuum to describe the angel-funded and bootstrapped early-stage company that Peter and I founded in 2012 and Anaconda to describe the venture-backed company that Continuum became (This company we called Continuum 2.0 internally that really got started in the summer of 2015 after we raised the first tranche of $22 million from VCs.)

Back in 2012, Peter and I knew a few things: 1) we had to connect Python to the Big Data movement; 2) we needed to help the scientific programmer, or a data-scientist developer build visualization-based applications quickly in the web; and 3) we needed to scale the stack of code around the PyData community to bigger hardware and multiple machines. We had big visions of an interconnected data-web, distributed schedulers, and data-structures that traversed the internet which could be analyzed across the cloud with simple Python scripts. We talked and talked about these things and grew misty-eyed in our enthusiasm for the potential of what was possible if we just built the right technology and sold just the right product to fund it.

We knew that we wanted to build a product-company -- though we didn't know exactly what those products would be at the outset. We had some ideas, only portions of which actually worked out. I knew how to run a consulting and training company around Python and open-source. Because of this, I felt comfortable raising money from family members. While consulting companies are not "high-growth" they can make real returns for investors. I was pretty confident that I would not lose their money.

We raised $2.25million from a few dozen investors consisting of Peter's family, my family, and a host of third-parties from our mutual networks. Peter's family was critical to this early stage because they basically "led the early round" and ensured that we could get off the ground. After they put their money in the bank, we could finish raising the rest of the seed round which took about 6 months to finish.

It is interesting (and somewhat embarrassing and so not detailed here) to go back and look at what products we thought we would be making. Some of the technologies we ended up building (like Excel integration, Numba, Bokeh, and Dask) were reflected in those early product dreams. However, the real products and commercial success that Anaconda has had so far are only a vague resemblance to what we thought we would do.

Building a Python distribution was the last thing on our minds. I had been building Python distributions since I released SciPy in 2001. As I have often repeated, SciPy was actually the first Python distribution masquerading as a library. The single biggest effort in releasing SciPy was building the binary installers and making sure everything compiled well. With Fortran compilers still more scarce than they should be, it can still be difficult to compile and build SciPy.

Fortunately, with conda, conda-forge, and Anaconda, along with the emergence of wheels, almost nobody needs to build SciPy anymore. It is so easy today to get started with a data-science project and get all the software you need to do amazing work fast. You still have to work to maintain your dependencies and keep that workflow reproducible. But, I'm so happy that Anaconda makes that relatively straightforward today.

This was only possible because General Catalyst and BuildGroup joined us in the journey in the spring of 2015 to really grow the Anaconda story. Their investment allowed us to 1) convert to a serious product-company from a bootstrapped consulting company with a few small products and 2) continue to invest heavily in conda, conda-forge, and Anaconda.

There is nothing like real-world experience as a teacher, and the challenge of converting to a serious product company was a tremendous experience that taught me a great deal. I'm grateful to all the people who brought their best to the company and taught me everyday. It was a privilege and an honor to be a part of their success. I am grateful for their patience with me as my "learning experiences" often led to real struggles for them.

There are many lasting learnings that I look forward to applying in future endeavors. The one that deserves mention in this post, however, is that building enterprise software that helps open-source communities should be done by selling a complementary product to the open-source. The "open-core" model does not work as well. I'm a firm believer that there will always be software to sell, but infrastructure should be and will be open-source --- sustained vibrantly from the companies that depend on it. Joel Spolsky has written about complementary products before. You should read his exposition.

Early on at Anaconda, Peter and I decided to be a board-led company. This board which includes Peter and I has the final say in company leadership and made the important decision to transition Anaconda from being founder-led to being led by a more experienced CEO. After this transition and through multiple conversations over many months we all concluded that the best course of action that would maximize my energy and passion while also allowing Anaconda to focus on its next chapter would be for me to spin-out of Anaconda and start a new services and open-source company where I could pursue a broader mission.

This new company is Quansight (short for Quantitative Insight). Our place-holder homepage is at http://www.quansight.com and we are @quansightai on Twitter. I'm excited to tell you more about the company in future blog-posts and announcements. A few paragraphs will suffice for now.

Our overall mission is to develop people, build technology, and discover products to empower people with knowledge and data to solve the world’s most challenging problems. We are doing that currently by connecting organizations sustainably with open source communities to solve their hardest problems by enabling teams to transparently apply science to their data.

One of the things we are doing is to help companies get started with AI and ML by applying the entire PyData stack to the fundamental data organization, data visualization, and model management problem that is required for practical success with ML and AI in business. We also help companies generally improve their data-science practice by leveraging all the power of the Python, PyData, and related ecoystems.

We are also hard at work on the sustainability problem by continuing the tradition we started at Continuum Analytics of building successful and sustainable open-source "practices" that synchronize company needs with open-source technology development. We have some innovative business approaches to this that we will be announcing in the coming weeks and months.

I'm excited that we have several devs working hard to help bring JupyterLab to 1.0 this year along with a vibrant community. There are many exciting extensions to this remarkable platform that remain to be written.

We also expect to continue to contribute to the PyViz activities that continue to explode in the Python ecosystem as visualization is a critical first step to understanding and using any data you care about.

Finally, Stefan Krah has joined us at Quansight. Stefan is an award-winning Python core developer who has been steadily working over the past 18 months on a small but powerful collection of projects collectively called Plures. These will be more broadly available in the next few months and published under the xnd brand. Xnd is a generic container concept in C with a Python binding that together with its siblings ndtypes and gumath allows building flexible array-computing pipelines over many kinds of data-types.

This technology will serve to underly any array-computing framework and be a glue between machine-learning and data-science frameworks of all kinds. Our plan is to use this tool to help reduce the data and computational silos that currently exist across the open-source ecosystem.

There is still much to work on and many more technologies to emerge. It's an exciting time to work in machine learning, data-science, and scientific computing. I'm thrilled that I continue to get the opportunity to be part of it. Let me know if you'd like to be a part of our journey.

Friday, February 3, 2017

NumFOCUS past and future.

NumFOCUS just finished its 5th year of operations, and I've lately been reflective on the early days and some of the struggles we went through to get the organization started. It once was just an idea in a few community-minded developer's heads and now exists as an important non-profit Foundation for Open Data Science, democratic and reproducible discovery, and a champion for technical progress through diversity.

When Peter Wang and I started Continuum in early 2012, I had already started the ball rolling to create NumFOCUS. I knew that we needed to create a non-profit that would create leadership and be a focus of community activity outside of any one company. I strongly believe that for open-source to thrive, full-time attention needs to be paid to it by many people. This requires money. With the tremendous interest in and explosion around the NumPy community, it was clear to me that this federation of loosely-coupled people needed some kind of organization that could be community-led and could be a rallying point for community activity and community-led financing. The potential also exists for NumFOCUS to act as community-based accountability to encourage positively re-inforcing behavior in the open-source communities it intersects with.

In late 2011, I started a new mailing list and invited anyone interested in discussing the idea of an independent community-run organization to the list. Over 100 people responded and so I knew there was interest. We debated on that list what to call the new concept for several weeks and Anthony Scopatz's name "NumFOCUS" stuck as the best alternative over several other names. As an acronym, NumFOCUS could mean Numerical Foundation for Open Code and Usable Science. I created a new mailing list, and then set about creating the legal organization called NumFOCUS and filing necessary paperwork.

Fernando Perez

John Hunter

In December of 2011, I coordinated with Fernando Perez, Perry Greenfield, John Hunter, and Jarrod Millman who had all expressed some interest in the idea and we incorporated in Texas (using LegalZoom) and became the first board of NumFOCUS. We had a very simple set of bylaws and purposes all centered around making Science more accessible. We decided to meet every-other week. We all knew we were creating something that would last a long time.

Perry Greenfield

Jarrod Millman

In early 2012, I wanted to ensure NumFOCUS success and knew that it needed a strong, full-time, Executive Director to make that happen. The problem was NumFOCUS didn't have a lot of money. A few of the board members had made donations, but Continuum with its own limited means was funding the majority of the costs for getting NumFOCUS started. With the legal organization started, I created bank-accounts and setup the ability for people to donate to NumFOCUS with help from Anthony Scoptatz who was the first treasurer of NumFOCUS.

Anthony Scopatz

I had met Leah Silen through other community interactions in Austin back in 2007. I knew her to be a very capable and committed person and thought she might be available. I asked her if she would come aboard and be employed by Continuum but work full-time for NumFOCUS and the new board. She accepted and the organization of NumFOCUS began to improve immediately.

With her help, we transitioned the organization from LegalZoom's beginnings to register directly with the secretary of state in Texas and started the application process to become a 501(c)3. She also quickly became involved in organizing the PyData conferences which Continuum initially spear-headed along with Julie Steele and Edd Wilder-James (at the time from O'Reilly). In 2012, we had our first successful PyData conference at the GooglePlex in Mountain View . It was clear that PyData could be used as a mechanism to provide revenue for NumFOCUS (at least to support Leah and other administrative help).

Leah Silen

We began working under that model through 2013 and 2014 with Continuum initially spending a lot of human resources and money organizing and running PyData with any proceeds going directly to NumFOCUS. There were no proceeds in those years except enough to help pay for Leah's salary. The rest of Leah's salary and PyData expenses came from Continuum which itself was still a small startup.

During these years of PyData growth in communities around the world, James Powell, became a drumbeat of consistency and community engagement. He has paid his own way to nearly every PyData event throughout the world. He has acted as emcee, volunteer extraordinaire, and popular speaker with his clever implementations and explanations of the Python stack.

James Powell
@dontusethiscode

Andy Terrel had been a friend of NumFOCUS and a member of the community and active with the board from its beginning. In 2014, while working at Continuum, he took over my board seat. In that capacity, he worked hard to gain financial independence for NumFOCUS. He was instrumental in moving PyData fully to NumFOCUS management. I was comfortable stepping back from the board and stepping down in my involvement around organizing and backing PyData from a financial perspective because I trusted Andy's leadership and non-profit management instincts. He, James Powell, Leah, and all the other local PyData meetups and organizations world-wide have done an impressive thing in self-organizing and growing the community. We should all be grateful for their efforts.

Andy Terrel

I am very proud of the work I did to help start NumFOCUS and PyData. I hope to remember it as one of the most useful things I've done professionally. I am very grateful for all the others who also helped to create NumFOCUS as well as PyData. So many have worked hard to ensure it can be a worldwide and community-governed organization to support Open Data Science for a long time to come. I'm proud of the funding and people-time that Continuum provided to get NumFOCUS and PyData started as well as the on-going support of NumFOCUS that Continuum and other industry partners continue to provide.

Now, as an adviser to the organization, I get to hear from time to time how things are going. I'm very impressed at the progress being made by the dedication of the current leadership behind Andy Terrel as President and Leah Silen as Executive Director and the rest of the current board.

If you use or appreciate any of the tools in the Open Data Science that NumFOCUS sponsors, I encourage you to join and/or make a supporting donation here: http://www.numfocus.org/support-numfocus.html. Help NumFOCUS continue its mission to support the tools and communities you rely on everyday.

Tuesday, March 29, 2016

Anaconda and Hadoop --- a story of the journey and where we are now.

Early Experience with Clusters

My first real experience with cluster computing came in 1999 during my graduate school days at the Mayo Clinic. These were wonderful times. My advisor was Dr. James Greenleaf. He was very patient with allowing me to pester a bunch of IT professionals throughout the hospital to collect their aging Mac Performa machines and build my own home-grown cluster. He also let me use a bunch of space in his ultrasound lab to host the cluster for about 6 months.

Building my own cluster

The form-factor for those Mac machines really made it easy to stack them. I ended up with 28 machines in two stacks with 14 machines in each stack (all plugged into a few power strips and a standard lab-quality outlet). With the recent release of Yellow-Dog Linux, I wiped the standard OS from all the machines and installed Linux on all those Macs to create a beautiful cluster of UNIX goodness I could really get excited about. I called my system "The Orchard" and thought it would be difficult to come up with 28 different kinds of apple varieties to name each machine after. It wasn't difficult. It turns out there are over 7,500 varieties of apples grown throughout the world.

Me smiling alongside by smoothly humming "Orchard" of interconnected Macs

The reason I put this cluster together was to simulate Magnetic Resonance Elastography (MRE) which is a technique to visualize motion using Magnetic Resonance Imaging (MRI). I wanted to simulate the Bloch equations with a classical model for how MRI images are produced. The goal was to create a simulation model for the MRE experiment that I could then use to both understand the data and perhaps eventually use this model to determine material properties directly from the measurements using Bayesian inversion (ambitiously bypassing the standard sequential steps of inverse FFT and local-frequency estimation).

Now I just had to get all these machines to talk to each other, and then I would be poised to do anything. I read up a bit on MPI, PVM, and anything else I could find about getting computers to talk to each other. My unfamiliarity with the field left me puzzled as I tried to learn these frameworks in addition to figuring out how to solve my immediate problem. Eventually, I just settled down with a trusted UNIX book by the late W. Richard Stevens. This book explained how the internet works. I learned enough about TCP/IP and sockets so that I could write my own C++ classes representing the model. These classes communicated directly with each other over raw sockets. While using sockets directly was perhaps not the best approach, it did work and helped me understand the internet so much better. It also makes me appreciate projects like tornado and zmq that much more.

Lessons Learned

I ended up with a system that worked reasonably well, and I could simulate MRE to some manner of fidelity with about 2-6 hours of computation. This little project didn't end up being critical to my graduation path and so it was abandoned after about 6 months. I still value what I learned about C++, how abstractions can ruin performance, how to guard against that, and how to get machines to communicate with each other.

Using Numeric, Python, and my recently-linked ODE library (early SciPy), I built a simpler version of the simulator that was actually faster on one machine than my cluster-version was in C++ on 20+ machines. I certainly could have optimized the C++ code, but I could have also optimized the Python code. The Python code took me about 4 days to write, the C++ code took me about 4 weeks. This experience has markedly influenced my thinking for many years about both pre-mature parallelization and pre-mature use of C++ and other compiled languages.

Fast forward over a decade. My computer efforts until 2012 were spent on sequential array-oriented programming, creating SciPy, writing NumPy, solving inverse problems, and watching a few parallel computing paradigms emerge while I worked on projects to provide for my family. I didn't personally get to work on parallel computing problems during that time, though I always dreamed of going back and implementing this MRE simulator using a parallel construct with NumPy and SciPy directly. When I needed to do the occassional parallel computing example during this intermediate period, I would either use IPython parallel or multi-processing.

Parallel Plans at Continuum

In 2012, Peter Wang and I started Continuum, created PyData, and released Anaconda. We also worked closely with members of the community to establish NumFOCUS as an independent organization. In order to give NumFOCUS the attention it deserved, we hired the indefatigable Leah Silen and donated her time entirely to the non-profit so she could work with the community to grow PyData and the Open Data Science community and ecosystem. It has been amazing to watch the community-based, organic, and independent growth of NumFOCUS. It took effort and resources to jump-start, but now it is moving along with a diverse community driving it. It is a great organization to join and contribute effort to.

A huge reason we started Continuum was to bring the NumPy stack to parallel computing --- for both scale-up (many cores) and scale-out (many nodes). We knew that we could not do this alone and it would require creating a company and rallying a community to pull it off. We worked hard to establish PyData as a conference and concept and then transitioned the effort to the community through NumFOCUS to rally the community behind the long-term mission of enabling data-, quantitative-, and computational-scientists with open-source software. To ensure everyone in the community could get the software they needed to do data science with Python quickly and painlessly, we also created Anaconda and made it freely available.

In addition to important community work, we knew that we would need to work alone on specific, hard problems to also move things forward. As part of our goals in starting Continuum we wanted to significantly improve the status of Python in the JVM-centric Hadoop world. Conda, Bokeh, Numba, and Blaze were the four technologies we started specifically related to our goals as a company beginning in 2012. Each had a relationship to parallel computing including Hadoop.

Conda enables easy creation and replication of environments built around deep and complex software dependencies that often exist in the data-scientist workflow. This is a problem on a single node --- it's an even bigger problem when you want that environment easily updated and replicated across a cluster.

Bokeh allows visualization-centric applications backed by quantitative-science to be built easily in the browser --- by non web-developers. With the release of Bokeh 0.11 it is extremely simple to create visualization-centric-web-applications and dashboards with simple Python scripts (or also R-scripts thanks to rBokeh).

With Bokeh, Python data scientists now have the power of both d3 and Shiny, all in one package. One of the driving use-cases of Bokeh was also easy visualization of large data. Connecting the visualization pipeline with large-scale cluster processing was always a goal of the project. Now, with datashader, this goal is now also being realized to visualize billions of points in seconds and display them in the browser.

Our scale-up computing efforts centered on the open-source Numba project as well as our Accelerate product. Numba has made tremendous progress in the past couple of years, and is in production use in multiple places. Many are taking advantage of numba.vectorize to create array-oriented solutions and program the GPU with ease. The CUDA Python support in Numba makes it the easiest way to program the GPU that I'm aware of. The CUDA simulator provided in Numba makes it much simpler to debug in Python the logic of CUDA-based GPU programming. The addition of parallel-contexts to numba.vectorize mean that any many-core architecture can now be exploited in Python easily. Early HSA support is also in Numba now meaning that Numba can be used to program novel hardware devices from many vendors.

Summarizing Blaze

The ambitious Blaze project will require another blog-post to explain its history and progress well. I will only try to summarize the project and where it's heading. Blaze came out of a combination of deep experience with industry problems in finance, oil&gas, and other quantitative domains that would benefit from a large-scale logical array solution that was easy to use and connected with the Python ecosystem. We observed that the MapReduce engine of Hadoop was definitely not what was needed. We were also aware of Spark and RDD's but felt that they too were also not general enough (nor flexible enough) for the demands of distributed array computing we encountered in those fields.

DyND, Datashape, and a vision for the future of Array-computing

After early work trying to extend the NumPy code itself led to struggles because of both the organic complexity of the code base and the stability needs of a mature project, the Blaze effort started with an effort to re-build the core functionality of NumPy and Pandas to fix some major warts of NumPy that had been on my mind for some time. With Continuum support, Mark Wiebe decided to continue to develop a C++ library that could then be used by Python and any-other data-science language (DyND). This necessitated defining a new data-description language (datashape) that generalizes NumPy's dtype to structures of arrays (column-oriented layout) as well as variable-length strings and categorical types. This work continues today and is making rapid progress which I will leave to others to describe in more detail. I do want to say, however, that dynd is implementing my "Pluribus" vision for the future of array-oriented computing in Python. We are factoring the core capability into 3 distinct parts: the type-system (or data-declaration system), a generalized function mechanism that can interact with any "typed" memory-view or "typed" buffer, and finally the container itself. We are nearing release of a separated type-library and are working on a separate C-API to the generalized function mechanism. This is where we are heading and it will allow maximum flexibility and re-use in the dynamic and growing world of Python and data-analysis. The DyND project is worth checking out right now (if you have desires to contribute) as it has made rapid progress in the past 6 months.

As we worked on the distributed aspects of Blaze it centered on the realization that to scale array computing to many machines you fundamentally have to move code and not data. To do this well means that how the computer actually sees and makes decisions about the data must be exposed. This information is usually part of the type system that is hidden either inside the compiler, in the specifics of the data-base schema, or implied as part of the runtime. To fundamentally solve the problem of moving code to data in a general way, a first-class and wide-spread data-description language must be created and made available. Python users will recognize that a subset of this kind of information is contained in the struct module (the struct "format" strings), in the Python 3 extended buffer protocol definition (PEP 3118), and in NumPy's dtype system. Extending these concepts to any language is the purpose of datashape.

In addition, run-times that understand this information and can execute instructions on variables that expose this information must be adapted or created for every system. This is part of the motivation for DyND and why very soon the datashape system and its C++ type library will be released independently from the rest of DyND and Blaze. This is fundamentally why DyND and datashape are such important projects to me. I see in them the long-term path to massive code-reuse, the breaking down of data-silos that currently cause so much analytics algorithm duplication and lack of cooperation.

Simple algorithms from data-munging scripts to complex machine-learning solutions must currently be re-built for every-kind of data-silo unless there is a common way to actually functionally bring code to data. Datashape and the type-library runtime from DyND (ndt) will allow this future to exist. I am eager to see the Apache Arrow project succeed as well because it has related goals (though more narrowly defined).

The next step in this direction is an on-disk and in-memory data-fabric that allows data to exist in a distributed file-system or a shared-memory across a cluster with a pointer to the head of that data along with a data-shape description of how to interpret that pointer so that any language that can understand the bytes in that layout can be used to execute analytics on those bytes. The C++ type run-time stands ready to support any language that wants to parse and understand data-shape-described pointers in this future data-fabric.

From one point of view, this DyND and data-fabric effort are a natural evolution of the efforts I started in 1998 that led to the creation of SciPy and NumPy. We built a system that allows existing algorithms in C/C++ and Fortran to be applied to any data in Python. The evolution of that effort will allow algorithms from many other languages to be applied to any data in memory across a cluster.

Blaze Expressions and Server

The key part of Blaze that is also important to mention is the notion of the Blaze server and user-facing Blaze expressions and functions. This is now what Blaze the project actually entails --- while other aspects of Blaze have been pushed into their respective projects. Functionally, the Blaze server allows the data-fabric concept on a machine or a cluster of machines to be exposed to the rest of the internet as a data-url (e.g. http://mydomain.com/catalog/datasource/slice). This data-url can then be consumed as a variable in a Blaze expression --- first across entire organizations and then across the world.

This is the truly exciting part of Blaze that would enable all the data in the world to be as accessible as an already-loaded data-frame or array. The logical expressions and transformations you can then write on those data to be your "logical computer" will then be translated at compute time to the actual run-time instructions as determined by the Blaze server which is mediating communication with various backends depending on where the data is actually located. We are realizing this vision on many data-sets and a certain set of expressions already with a growing collection of backends. It is allowing true "write-once and run anywhere" to be applied to data-transformations and queries and eventually data-analytics. Currently, the data-scientists finds herself to be in a situation similar to the assembly programmer in the 1960s who had to know what machine the code would run on before writing the code. Before beginning a data analytics task, you have to determine which data-silo the data is located in before tackling the task. SQL has provided a database-agnostic layer for years, but it is too limiting for advanced analytics --- and user-defined functions are still database specific.

Continuum's support of blaze development is currently taking place as defined by our consulting customers as well as by the demands of our Anaconda platform and the feature-set of an exciting new product for the Anaconda Platform that will be discussed in the coming weeks and months. This new product will provide a simplified graphical user-experience on top of Blaze expressions, and Bokeh visualizations for rapidly connecting quantitative analysts to their data and allowing explorations that retain provenance and governance. General availability is currently planned for August.

Blaze also spawned additional efforts around fast compressed storage of data (blz which formed the inspiration and initial basis for bcolz) and experiments with castra as well as a popular and straight-forward tool for quickly copying data from one data-silo kind to another (odo).

Developing dask the library and Dask the project

The most important development to come out of Blaze, however, will have tremendous impact in the short term well before the full Blaze vision is completed. This project is Dask and I'm excited for what Dask will bring to the community in 2016. It is helping us finally deliver on scaled-out NumPy / Pandas and making Anaconda a first-class citizen in Hadoop.

In 2014, Matthew Rocklin started working at Continuum on the Blaze team. Matthew is the well-known author of many functional tools for Python. He has a great blog you should read regularly. His first contribution to Blaze was to adapt a multiple-dispatch system he had built which formed the foundation of both odo and Blaze. He also worked with Andy Terrel and Phillip Cloud to clarify the Blaze library as a front-end to multiple backends like Spark, Impala, Mongo, and NumPy/Pandas.

With these steps taken, it was clear that the Blaze project needed its own first-class backend as well something that the community could rally around to ensure that Python remained a first-class participant in the scale-out conversation --- especially where systems that connected with Hadoop were being promoted. Python should not ultimately be relegated to being a mere front-end system that scripts Spark or Hadoop --- unable to talk directly to the underlying data. This is not how Python achieved its place as a de-facto data-science language. Python should be able to access and execute on the data directly inside Hadoop.

Getting there took time. The first version of dask was released in early 2015 and while distributed work-flows were envisioned, the first versions were focused on out-of-core work-flows --- allowing problem-sizes that were too big to fit in memory to be explored with simple pandas-like and numpy-like APIs.

When Matthew showed me his first version of dask, I was excited. I loved three things about it: 1) It was simple and could, therefore, be used as a foundation for parallel PyData. 2) It leveraged already existing code and infrastructure in NumPy and Pandas. 3) It had very clean separation between collections like arrays and data-frames, the directed graph representation, and the schedulers that executed those graphs. This was the missing piece we needed in the Blaze ecosystem. I immediately directed people on the Blaze team to work with Matt Rocklin on Dask and asked Matt to work full-time on it.

He and the team made great progress and by summer of 2015 had a very nice out-of-core system working with two functioning parallel-schedulers (multi-processing and multi-threaded). There was also a "synchronous" scheduler that could be used for debugging the graph and the system showed well enough throughout 2015 to start to be adopted by other projects (scikit-image and xarray).

In the summer of 2015, Matt began working on the distributed scheduler. By fall of 2015, he had a very nice core system leveraging the hard work of the Python community. He built the API around the concepts of asynchronous computing already being promoted in Python 3 (futures) and built dask.distributed on top of tornado. The next several months were spent improving the scheduler by exposing it to as many work-flows as possible from computational-science, quantitative-science and computational-science. By February of 2016, the system was ready to be used by a variety of people interested in distributed computing with Python. This process continues today.

Using dask.dataframes and dask.arrays you can quickly build array- and table-based work-flows with a Pandas-like and NumPy-like syntax respectively that works on data sitting across a cluster.

Anaconda and the PyData ecosystem now had another solution for the scale-out problem --- one whose design and implementation was something I felt could be a default run-time backend for Blaze. As a result, I could get motivated to support, market, and seek additional funding for this effort. Continuum has received some DARPA funding under the XDATA program. However, this money was now spread pretty thin among Bokeh, Numba, Blaze, and now Dask.

Connecting to Hadoop

With the distributed scheduler basically working and beginning to improve, two problems remained with respect to Hadoop interoperability: 1) direct access to the data sitting in HDFS and 2) interaction with the resource schedulers running most Hadoop clusters (YARN or mesos).

To see how important the next developments are, it is useful to describe an anecdote from early on in our XDATA experience. In the summer of 2013, when the DARPA XDATA program first kicked-off, the program organizers had reserved a large Hadoop cluster (which even had GPUs on some of the nodes). They loaded many data sets onto the cluster and communicated about its existence to all of the teams who had gathered to collaborate on getting insights out of "Big Data." However, a large number of the people collaborating were using Python, R, or C++. To them the Hadoop cluster was inaccessible as there was very little they could use to interact with the data stored in HDFS (beyond some high-latency and low-bandwidth streaming approaches) and nothing they could do to interact with the scheduler directly (without writing Scala or Java code). The Hadoop cluster sat idle for most of the summer while teams scrambled to get their own hardware to run their code on and deliver their results.

This same situation we encountered in 2013 exists in many organizations today. People have large Hadoop infrastructures, but are not connecting that infrastructure effectively to their data-scientists who are more comfortable in Python, R, or some-other high-level (non JVM language).

With dask working reasonably well, tackling this data-connection problem head on became an important part of our Anaconda for Hadoop story and so in December of 2015 we began two initiatives to connect Anaconda directly to Hadoop. Getting data from HDFS turned out to be much easier than we had initially expected because of the hard-work of many others. There had been quite a bit of work building a C++ interface to Hadoop at Pivotal that had culminated in a library called libhdfs3. Continuum wrote a Python interface to that library quickly, and it now exists as the hdfs3 library under the Dask organization on Github.

The second project was a little more involved as we needed to integrate with YARN directly. Continuum developers worked on this and produced a Python library that communicates directly to the YARN classes (using Scala) in order to allow the Python developer to control computing resources as well as spread files to the Hadoop cluster. This project is called knit, and we expect to connect it to mesos and other cluster resource managers in the near future (if you would like to sponsor this effort, please get in touch with me).

Early releases of hdfs3 and knit were available by the end of February 2015. At that time, these projects were joined with dask.distributed and the dask code-base into a new Github organization called Dask. The graduation of Dask into its own organization signified an important milestone that dask was now ready for rapid improvement and growth alongside Spark as a first-class execution engine in the Hadoop ecosystem.

Our initial goals for Dask are to build enough examples, capability, and awareness so that every PySpark user tries Dask to see if it helps them. We also want Dask to be a compatible and respected member of the growing Hadoop execution-framework community. We are also seeking to enable Dask to be used by scientists of all kinds who have both array and table data stored on central file-systems and distributed file-systems outside of the Hadoop ecosystem.

Anaconda as a first-class execution ecosystem for Hadoop

With Dask (including hdfs3 and knit), Anaconda is now able to participate on an equal footing with every other execution framework for Hadoop. Because of the vast reaches of Anaconda Python and Anaconda R communities, this means that a lot of native code can now be integrated to Hadoop much more easily, and any company that has stored their data in HDFS or other distributed file system (like s3fs or gpfs) can now connect that data easily to the entire Python and/or R computing stack.

This is exciting news! While we are cautious because these integrative technologies are still young, they are connected to and leveraging the very mature PyData ecosystem. While benchmarks can be misleading, we have a few benchmarks that I believe accurately reflect the reality of what parallel and distributed Anaconda can do and how it relates to other Hadoop systems. For array-based and table-based computing workflows, Dask will be 10x to 100x faster than an equivalent PySpark solution. For applications where you are not using arrays or tables (i.e. word-count using a dask.bag), Dask is a little bit slower than a similar PySpark solution. However, I would argue that Dask is much more Pythonic and easier to understand for someone who has learned Python.

It will be very interesting to see what the next year brings as more and more people realize what is now available to them in Anaconda. The PyData crowd will now have instant access to cluster computing at a scale that has previously been accessible only by learning complicated new systems based on the JVM or paying an unfortunate performance penalty. The Hadoop crowd will now have direct and optimized access to entire classes of algorithms from Python (and R) that they have not previously been used to.

It will take time for this news and these new capabilities to percolate, be tested, and find use-cases that resonate with the particular problems people actually encounter in practice. I look forward to helping many of you take the leap into using Anaconda at scale in 2016.

We will be showing off aspects of the new technology at Strata in San Jose in the Continuum booth #1336 (look for Anaconda logo and mark). We have already announced at a high-level some of the capabilities: Peter and I will both be at Strata along with several of the talented people at Continuum. If you are attending drop by and say hello.

We first came to Strata on behalf of Continuum in 2012 in Santa Clara. We announced that we were going to bring you scaled-out NumPy. We are now beginning to deliver on this promise with Dask. We brought you scaled-up NumPy with Numba. Blaze and Bokeh will continue to bring them together along with the rest of the larger data community to provide real insight on data --- where-ever it is stored. Try out Dask and join the new scaled-out PyData story which is richer than ever before, has a larger community than ever before, and has a brighter future than ever before.

Friday, December 6, 2013

Why I promote conda

Anaconda users have been enjoying the benefits of conda for quickly and easily
managing their binary Python packages for over a year. During that time conda
has also been steadily improving as a general-purpose package manager. I
have recently been promoting the very nice things that conda can do for Python
users generally --- especially with complex binary extensions to Python as
exist in the NumPy stack. For example, It is very easy to create python 3
environments and python 2 environments on the same system and install
scikit-learn into them. Normally, this process can be painful if you
do not have a suitable build environment, or don't want to wait for
compilation to succeed.

Naturally, I sometimes get asked, "Why did you promote/write another
python package manager (conda) instead of just contributing to the
standard pip and virtualenv?" The python packaging story is older and
more personal to me than you might think. Python packaging has been a thorn
in my side personally since 1998 when I released my first Python extension
(called numpyio actually). Since then, I've written and personally released
many, many Python packages (Multipack which became SciPy, NumPy, llvmpy,
Numba, Blaze, etc.). There is nothing you want more as a package author than
users. So, to make Multipack (SciPy), then NumPy available, I had to become a
packaging expert by experiencing a lot of pain with the lack of
suitable tools for my (admittedly complex) task.

Along the way, I've suffered through believing that distutils,
setuptools, distribute, and pip/virtualenv would solve my actual
problem. All of these tools provided some standardization (at least around what somebody
types at the command line to build a package) but no help in actually doing the
build and no real help in getting compatible binaries of things like SciPy
installed onto many users machines.

I've personally made terrible software engineering mistakes because of the lack of
good package management. For example, I allowed the pressure of "no ABI
changes" to severely hamper the progress of the NumPy API. Instead of pushing
harder and breaking the ABI when necessary to get improvements into NumPy, I
buckled under the pressure and agreed to the requests coming mostly from NumPy
windows users and froze the ABI. I could empathize with people who would spend
days building their NumPy stack and literally become fearful of changing it.
From NumPy 1.4 to NumPy 1.7, the partial date-time addition caused various
degrees of broken-ness and is part of why missing data data-types have never
showed up in NumPy at all. If conda had existed back then with standard
conda binaries released for different projects, there would have been almost
no problem at all. That pressure would have largely disappeared. Just
install the packages again --- problem solved for everybody (not just the
Linux users who had apt-get and yum).

Some of the problems with SciPy are also rooted in the lack of good packages
and package management. SciPy, when we first released it in 2001 was
basically a distribution of multiple modules from Multipack, some new BLAS /
LAPACK and linear algebra wrappers and nascent plotting tools. It was a SciPy
distribution masquerading as a single library. Most of the effort spent was
a packaging effort (especially on Windows). Since then, the scikits effort
has done a great job of breaking up the domain of SciPy into more manageable
chunks and providing a space for the community to grow. This kind of re-
factoring is only possible with good distributions and is really only
effective when you have good package management. On Mac and Linux
package managers exist --- on Windows things like EPD, Anaconda or C.
Gohlke's collection of binaries have been the only solution.

Through all of this work, I've cut my fingers and toes and sometimes face on
compilers, shared and static libraries on all kinds of crazy systems (AIX,
Windows NT, etc.). I still remember the night I learned what it meant to have
ABI incompatibilty between different compilers (try passing structs
such as complex-numbers between a file compiled with mingw and a library compiled with
Visual Studio). I've been bitten more than once by unicode-width
incompatibilities, strange shared-library incompatibilities, and the vagaries
of how different compilers and run-times define the `FILE *` file pointer.

In fact, if you have not read "Linkers and Loaders", you should actually do
that right now as it will open your mind to that interesting limbo between
"developer-code" and "running process" overlooked by even experienced
developers. I'm grateful Dave Beazley recommended it to me over 6 years ago.
Here is a link: http://www.iecc.com/linker/

We in the scientific python community have had difficulty and a rocky
history with just waiting for the Python.org community to solve the
problem. With distutils for example, we had to essentially re-write
most of it (as numpy.distutils) in order to support compilation of
extensions that needed Fortran-compiled libraries. This was not an
easy task. All kinds of other tools could have (and, in retrospect,
should have) been used. Most of the design of distutils did not help
us in the NumPy stack at all. In fact, numpy.distutils replaces most
of the innards of distutils but is still shackled by the architecture
and imperative approach to what should fundamentally be a declarative
problem. We should have just used or written something like waf or
bento or cmake and encouraged its use everywhere. However, we buckled
under the pressure of the distutils promise of "one right way to do
it" and "one-size fits all" solution that we all hoped for, but
ultimately did not get. I appreciate the effort of the distutils
authors. Their hearts were in the right place and they did provide a
useful solution for their use-cases. It was just not useful for ours,
and we should not have tried to force the issue. Not all code is
useful to everyone. The real mistake was the Python community picking
a "standard" that was actually limiting for a sizeable set of users.
This was the real problem --- but it should be noted that this
"problem" is only because of the incredible success and therefore
influence of python developers and python.org. With this influence, however,
comes a certain danger of limiting progress if all advances have to be
made via committee --- working out specifications instead of watching for
innovation and encouraging it.

David Cooke and many others finally wrestled numpy.distutils to the
point that the library does provide some useful functionality for
helping build extensions requiring NumPy. Even after all that effort,
however, some in the Python community who seem to have no idea of the
history of how these things came about and simply claim that setup.py
files that need numpy.distutils are "broken" because they import numpy
before "requiring" them. To this, I reply that what is actually
broken is the design that does not have a delcarative meta-data file
that describes dependencies and then a build process that creates the
environment needed before running any code to do the actual build.
This is what `conda build` does and it works beautifully to create any
kind of binary package you want from any list of dependencies you may
have. Anything else is going to require all kinds of "bootstrap"
gyrations to fit into the square hole of a process that seems to
require that all things begin with the python setup.py incantation.

Therefore, you can't really address the problem of Python packaging without
addressing the core problems of trying to use distutils (at least for the
NumPy stack). The problems for us in the NumPy stack started there and have
to be rooted out there as well. This was confirmed for me at the first PyData
meetup at Google HQ, where several of us asked Guido what we can do to fix
Python packaging for the NumPy stack. Guido's answer was to "solve the
problem ourselves". We at Continuum took him at his word. We looked at dpkg,
rpm, pip/virtualenv, brew, nixos, and 0installer, and used our past experience
with EPD. We thought hard about the fundamental issues, and created the conda
package manager and conda environments. We who have been working on this for
the past year have decades of Python packaging experience between us: me,
Peter Wang, Ilan Schnell, Bryan Van de Ven, Mark Wiebe, Trent Nelson, Aaron
Meurer, and now Andy Terrel are all helping improve things. We welcome
contributions, improvements, and updates from anyone else as conda is BSD
licensed and completely open source and can be used and re-used by
anybody. We've also recently made a mailing list
conda@continuum.io which is open to anyone to join and participate:
https://groups.google.com/a/continuum.io/forum/#!forum/conda

Conda pkg files are similar to .whl files except they are Python-agnostic. A
conda pkg file is a bzipped tar file with an 'info' directory, and then
whatever other directory structure is created by the install process in
"prefix". It's the equivalent of taking a file-system diff pre and post-
install and then tarring the result up. It's more general than .whl files and
can support any kind of binary file. Making conda packages is as simple as making a recipe for it. We make a growing collection of public-domain, example recipes available to everyone and also encourage attachment of a conda recipe directory to every project that needs binaries.

At the heart of conda package installation is the concept of environments.
Environments are like namespaces in Python -- but for binary packages. Their
applicability is extensive. We are using them within Anaconda and Wakari for
all kinds of purposes (from testing to application isolation to easy
reproducibility to supporting multiple versions of packages in different
scripts that are part of the same installation). Truly, to borrow the famous
Tim Peters' quip: "Environments are one honking great idea -- let's do more of
those". Rather than tacking this on after the fact like virtualenv does to
pip, OS-level environments are built-in from the beginning. As a result,
every conda package is always installed into an environment. There is a
default (root) environment if you don't explicitly specify another one.
Installation of a package is simply merging the unpacked binary into the union
of unpacked binaries already at the root-path of the environment. If union
filesystems were better implemented in different operating systems, then each
environment would simply be a union of the untarred binary packages. Instead
we accomplish the same thing with hard-linking, soft-linking, and (when
necessary) copying of files.

The design is simple, which helps it be easy to understand and easy to
mix with other ideas. We don't see easily how to take these simple,
powerful ideas and adapt them to .whl and virtualenv which are trying
to fit-in to a world created by distutils and setuptools. It was
actually much easier to just write our own solution and create
hundreds of packages and make them available and provide all the tools
to reproduce what we have done inside conda than to try and untangle
how to provide our solution in that world and potentially even not
quite get the result we want (which can be argued is what happened
with numpy.distutils).

You can use conda to build your own distribution of binaries that
compete with Anaconda if you like. Please do. I would be completely
thrilled if every other Python distribution (python.org, EPD,
ActiveState, etc.) just used conda packages that they build and in so
doing helped improve the conda package manager. I recognize that
conda emerged at the same time as the Anaconda distribution was
stabilizing and so there is natural confusion over the two. So,
I will try to clarify: Conda is an open-source, general,
cross-platform package manager. One could accurately describe it as a
cross-platform hombrew written in Python. Anyone can use the tool and
related infrastructure to build and distribute whatever packages they
want.

Anaconda is the collection of conda packages that we at Continuum provide for
free to everyone, based on a particular base Python we choose (which you can
download at http://continuum.io/downloads as Miniconda). In the past it has
been some work to get conda working outside Miniconda or Anaconda because our
first focus was creating a working solution for our users. We have been
fixing those minor issues and have now released a version of conda that can be
'pip installed'. As conda has significant overlap with virtualenv in
particular we are still working out kinks in the interop of these two
solutions. But, it all can and should work together and we fix issues as
quickly as we can identify them.

We also provide a service called http://binstar.org (register with beta-code
"binstar in beta") which allows you to host your own binary conda packages.
With this missing piece, you just tell people to point their conda
repositories to your collection -- and they can easily install everything you
want them to. You can also build your own conda repositories and host them on
your own servers. It all works, today, now -- for hundreds of thousands of
people. In this context, Anaconda could be considered a "reference"
distribution and a proof of concept of how to use the conda package manager.
Wakari also uses the conda package manager at its core to share bundles.
Bundles are just conda packages (with a set of dependencies) and capture the
core problems associated with reproducible computing in a light-weight and
easily reproduced way. We have made the tools available for *anyone* to re-
create this distribution pretty easily and compete with us.

It is very important to keep in mind that we created conda to solve
the problem of distributing an environment to end-users that allow
them do to advanced data analytics, scientific discovery, and general
engineering work. Python has a chance to play a major role in this
space. However, it is not the only player. Other solutions exist in
the space we are targeting (SAS, Matlab, SPSS, and R). We want Python
to dominate this space. We could not wait for the packaging solution
we needed to evolve from the lengthy discussions that are on-going
which also have to untangle the history of distutils, setuptools,
easy_install, and distribute. What we could do is solve our problem
and then look for interoperability and influence opportunities once we
had something that worked for our needs. That the approach we took
and I'm glad we did. We have a working solution now which benefits
hundreds of thousands of users (and could benefit millions more if
IT administrators recognized conda as an acceptable packaging approach
from others in the community).

We are going to keep improving conda until it becomes an obvious
solution for everyone: users, developers, and IT administrators alike.
We welcome additions and suggestions that allow it to interoperate
with anything else in the Python packaging space. I do believe that the group of people working on Python packaging and Nick Coghlan in particular are doing a valuable service. It's a very difficult job to take into account the history of Python packaging, fix all the little issues around it, *and* provide a binary distribution system that allows users to not have to think about packaging and distribution. With our resources we did just the latter. I admire those who are on the front lines of the former and look to provide as much context as I can to ensure that any future decisions take our use-cases into account. I am looking forward to continuing to work with the community to reach future solutions that benefit everyone.

If you would like to see more detail about conda and how it can be used here are some
resources:

Documentation: http://docs.continuum.io/conda/index.html
Talk at PyData NYC 2013:
- Slides: https://speakerdeck.com/teoliphant/packaging-and-deployment-with-conda
- Video: http://vimeo.com/79862018

Blog Posts:
- http://continuum.io/blog/anaconda-python-3
- http://continuum.io/blog/new-advances-in-conda
- http://continuum.io/blog/conda

Mailing list:
- conda@continuum.io
- https://groups.google.com/a/continuum.io/forum/#!forum/conda

Wednesday, July 3, 2013

Thoughts after SciPy 2013 and a specific NumPy improvement

I attended a few days of SciPy 2013 and enjoyed interacting with the many old friends and many new friends that participate in this conference. I thought the program committee did an excellent job of selecting talks and there were more attendees this year which also mirrors my experience with the PyData conference series which sells out every time. Andy Terrell, a NumFOCUS board member and researcher at the University of Texas, and Jonathan Rocher, an Enthought developer, were co-chairs of SciPy this year and did an excellent job of coordination.

Continuum Analytics, my new company, is the institutional sponsor of the PyData conference series and I know how much work it can be, so my thanks go out to Enthought for their efforts to sponsor the SciPy conference this year and in years past. I'm really looking forward to the day when the SciPy conference, like the PyData conference series, directly benefits NumFOCUS which is a non-profit organization with 501(c)(3) status started by the scientific Python community and run by the same community behind so much of the SciPy stack. It looks like steps are being taken in that direction which is wonderful to see. At the SciPy conference, Fernando Perez, of IPython fame, led the charge to get fiscal sponsorship documents improved to make it much simpler for people wanting to sponsor the great projects on the scientific python stack (IPython, NumPy, SciPy, Pandas, SymPy, Matplotlib, etc.) to have a vehicle to do it. This year, NumFOCUS was able to sponsor the attendance of two students to the SciPy conference because of generous donors. Right now, NumFOCUS is looking for help for its website to improve the look and feel. It's a great way to get involved with the community and help out. Just send an email to the numfocus google group (a public group for all to get involved with): mailto:numfocus+subscribe@googlegroups.com?subject=Subscribe.

Right now, a conversation involving graph-representations for Python compilation tools is happening on the numfocus mailing list among several interested parties from SymPy, Numba, Theano, Pythran, Parakeet, etc. One of the highlights of the conference for me was meeting and interacting with other people interested in Python-for-science compiler technology as it looks like there is a healthy community developing around this topic. I hope those interested in the topic check out compilers.pydata.org and issue pull requests to that github-hosted page to describe their favorite tool.

I only attended some of the tutorial given by fellow Continuum team members Ben Zaitlen and Clayton Davis. I was gratified to see that wakari.io was useful for so many people during the tutorials, and appreciated the feedback on how we can continue to improve the tool. I'm also grateful to see all the people able to productively use Anaconda which is our free, cross-platform, distribution for using Python for scientific work and data analysis.

It was nice to see David Cournapeau give a detailed discussion of NumPy internals in one of the tutorials. There is much more that could be said about NumPy internals, but David gave a good introduction to the topic. I like how he showed how it is possible to extend the NumPy dtype system --- especially with certain kinds of types. In NumPy, I tried very hard to make the type-system more extensible. It's nice to see it being used more and more. Extending the type system more generally (to include things like variable-length strings, and infinite precision floats) is a bit harder and not very easy to do in current NumPy (especially while trying to keep the foundation stable). In fact, one of the reasons Continuum is sponsoring the development of dynd is precisely to build a foundation with an easier to extend type-system. Making it a C++ library should hopefully allow languages like Javascript, Ruby, Haskell, and others to also benefit from the dynamic type concepts as well.

I really enjoyed the talk on Spyder by Carlos Cordoba. The Spyder IDE is a very nice tool and I was happy to see Carlos promoting it. The Spyder IDE is featured in our Anaconda Launcher (part of the Anaconda 1.6 release) along with the IPython notebook and IPython console. The Launcher allows anyone to publish their app to multiple platforms simply by making a conda package (with an icon and an entry-point) and upload it to a repository that the Launcher is looking at. All the dependencies can be specified and they will be installed via conda automatically when the app is selected. The hope is to make it very easy for anyone to get their cool application based on Python in front of people quickly without having to make installers for every platform.

Besides the excellent keynote talks, by Fernando Perez, William Schroeder, and Olivier Grisel, I also found the talks by Matthew Rocklin, Pat Marion, Ramalingam Saravanan, Serge Guelton, Samuel Skillman, Jake Vanderplas, and Joshua Warner very interesting. It was especially nice to meet Joshua who was coming from the Mayo Clinic where SciPy began. I started writing the SciPy library in 1999 at the Mayo Clinic while I was a graduate student there (then called Multipack, special, and a bunch of other modules). It was very nice to meet someone from Mayo contributing again to this community with a very nice fuzzy logic package based on the work of an old professor of mine Hal Otteson. His work is now a new scikit. The scikit concept has been a tremendous boon for development of the Scientific Python community as it allows more distributed development and more rapid expansion of the available tools. If better packaging had existed at the time, I would very likely have kept my early modules independent so they could grow with their own developer bases. What is now the SciPy library should most likely have been a SciPy distribution (with perhaps a smaller core). But, hindsight is 20/20 and given the state of the world at the time, the best option seemed to be to create the SciPy library with Eric Jones and Pearu Peterson.

Mark Wiebe did an excellent job in presenting dynd, a C++ library for dynamic multi-dimensional array manipulation with nice python bindings. Mark's work, sponsored by Continuum Analytics, is something that could lead to NumPy 2.0, although nobody has suggested exactly how that might work yet. As dynd forms a foundation for Blaze, and Blaze and NumPy can co-exist for many years, I haven't been thinking much about how NumPy 2.0 could grow out of dynd until now. I do now have some ideas about how NumPy could be improved that I think will help the space evolve more fluidly and productively with many interested people able to coordinate their varied efforts. The most important of these is the introduction of multi-methods into NumPy which I'll outline below.

I participated on a panel about the future of Array Oriented Computing in Python. Of course, I've been spending a lot of time over the past year working and thinking exactly about that, so I would have preferred a talk versus a panel with only a limited amount of time. However, I have limited time to prepare talks and will be speaking at the upcoming PyData conference in Boston, so I was grateful for the chance to at least express some of the ideas we've been working on. To be clear, I think that Blaze is the future of Array Oriented Computing in Python, though we have some work ahead to prove that out. Exactly what the transition from NumPy to Blaze looks like for people will be a story I care quite a bit about and will be telling more and more in the coming months and years. I take personal responsibility for anyone who adopted NumPy, and I will do everything I can to make sure their transition to using Blaze is as simple as possible. Backward compatibility is very important to me. I spent many hours making sure that NumPy was compatible with both Numarray and Numeric. Fortunately, Blaze and NumPy can co-exist and so there is less of a story of either / or and more about which / when (especially during the transition phase).

There is also another possibility that will be interesting to see if it emerges: retro-fitting NumPy with multi-methods (dispatching on python type and also on dtype). I think this is the single-most important thing that can be done for NumPy. If someone is motivated and has budget, I can work with her to do this in about 1-2 months (maybe even sooner depending on the experience). This is not on my immediately funded road-map, however, so it would need outside funding and/or interest.

There are several different multi-method implementations for Python. For those unfamiliar with the concept, here is a good essay by Guido on the general concept. Multi-methods are also at the heart of Julia. They are a simple concept. Basically, a multi-method is an object that dispatches to a different implementation based on the number and types of the arguments. The idea is that you can add new implementations of the underlying function quite easily without changing the function object itself. So, for example, if numpy.dot were a multi-method, then I could change the implementation of numpy.dot for my new fancy array-object without directly changing the source-code of numpy.dot in NumPy and all downstream functions and methods that use numpy.dot in their implementation would automatically work with my new type of array. Multi-methods allow extensibility in a manner similar to how operator overloading allows extensibility in object-oriented programming. But, it's a much more natural fit for operations where dispatching only on the first argument does not make a lot of sense.

In fact, at the heart of NumPy's ufuncs is a multi-method dispatch mechanism (on NumPy dtype, instead of Python type), so NumPy users have been using multi-methods for a long time. In fact, if NumPy's ufuncs were true multi-methods to begin with, then all the hassle with __array_wrap__, __array_prepare__, and so forth which are hacks to compensate for the lack of true Python-type-based multi-methods would not be necessary. If you look at the implementation of NumPy's masked array's for example you will see some of the ugliness that is caused by NumPy's lack of a better multi-method mechanism. Numba's autojit also effectively creates a kind of multi-method as it creates a new function to dispatch to whenever it encounters a new set of types for the arguments. These are the ideas that we are building on and using in Blaze, as we learn from our experience with NumPy.

The biggest challenge for multi-methods is always what function to return if you don't find an exact match. A simple multi-method is basically a dictionary whose key is the a tuple of the types of the input arguments and whose value is the implementation. But, what do you do if the key does not return an implementation? How do you find a compatible function and use it instead? There is a lot of theory on this and several approaches people have taken. I'm not aware of a universal solution that everybody agrees should be used. However, there are reasonable approaches that can be taken using the idea of typesets or type-hierarchies (for those interested you can read more about contravariance and covariance for other approaches to resolving the type dispatch problem as well).

I'm confident that useful if not universal approaches to this problem can be found (several are already available for Python and in Julia, for example). For NumPy, what is needed is a two-tiered dispatch mechanism. My view is that all NumPy (and SciPy and Scikit) functions should be multi-methods that dispatch based on Python-type *and* then additionally for memory-view-like objects on the data-type of the elements. The dispatch rules for each of these cases can and should be separate, I think.

If you are interested in this problem and especially if you have money to fund it, feel free to contact me directly at travis at continuum dot io.

While I am spending more and more of my conference time with the PyData conference series, I still enjoy reconnecting with people I will always consider friends at the SciPy conference. Fortunately, many speakers participate in both. Having both conferences allows the community to grow and have bigger and better impact as I think can be witnessed by the increased attendance this year at SciPy.