Tuesday, March 29, 2016

Anaconda and Hadoop --- a story of the journey and where we are now.

Early Experience with Clusters

My first real experience with cluster computing came in 1999 during my graduate school days at the Mayo Clinic.  These were wonderful times.   My advisor was Dr. James Greenleaf.   He was very patient with allowing me to pester a bunch of IT professionals throughout the hospital to collect their aging Mac Performa machines and build my own home-grown cluster.   He also let me use a bunch of space in his ultrasound lab to host the cluster for about 6 months.

Building my own cluster

The form-factor for those Mac machines really made it easy to stack them.   I ended up with 28 machines in two stacks with 14 machines in each stack (all plugged into a few power strips and a standard lab-quality outlet).  With the recent release of Yellow-Dog Linux, I wiped the standard OS from all the machines  and installed Linux on all those Macs to create a beautiful cluster of UNIX goodness I could really get excited about.   I called my system "The Orchard" and thought it would be difficult to come up with 28 different kinds of apple varieties to name each machine after.  It wasn't difficult. It turns out there are over 7,500 varieties of apples grown throughout the world.

Me smiling alongside by smoothly humming "Orchard" of interconnected Macs

The reason I put this cluster together was to simulate Magnetic Resonance Elastography (MRE) which is a technique to visualize motion using Magnetic Resonance Imaging (MRI).  I wanted to simulate the Bloch equations with a classical model for how MRI images are produced.  The goal was to create a simulation model for the MRE experiment that I could then use to both understand the data and perhaps eventually use this model to determine material properties directly from the measurements using Bayesian inversion (ambitiously bypassing the standard sequential steps of inverse FFT and local-frequency estimation).

Now I just had to get all these machines to talk to each other, and then I would be poised to do anything.  I read up a bit on MPI, PVM, and anything else I could find about getting computers to talk to each other.  My unfamiliarity with the field left me puzzled as I tried to learn these frameworks in addition to figuring out how to solve my immediate problem.  Eventually, I just settled down with a trusted UNIX book by the late W. Richard Stevens.    This book explained how the internet works.   I learned enough about TCP/IP and sockets so that I could write my own C++ classes representing the model.  These classes communicated directly with each other over raw sockets.   While using sockets directly was perhaps not the best approach, it did work and helped me understand the internet so much better.  It also makes me appreciate projects like tornado and zmq that much more.

Lessons Learned

I ended up with a system that worked reasonably well, and I could simulate MRE to some manner of fidelity with about 2-6 hours of computation. This little project didn't end up being critical to my graduation path and so it was abandoned after about 6 months.  I still value what I learned about C++, how abstractions can ruin performance, how to guard against that, and how to get machines to communicate with each other.

Using Numeric, Python, and my recently-linked ODE library (early SciPy), I built a simpler version of the simulator that was actually faster on one machine than my cluster-version was in C++ on 20+ machines.  I certainly could have optimized the C++ code, but I could have also optimized the Python code.   The Python code took me about 4 days to write, the C++ code took me about 4 weeks.  This experience has markedly influenced my thinking for many years about both pre-mature parallelization and pre-mature use of C++ and other compiled languages.

Fast forward over a decade.   My computer efforts until 2012 were spent on sequential array-oriented programming, creating SciPy, writing NumPy, solving inverse problems, and watching a few parallel computing paradigms emerge while I worked on projects to provide for my family.  I didn't personally get to work on parallel computing problems during that time, though I always dreamed of going back and implementing this MRE simulator using a parallel construct with NumPy and SciPy directly.   When I needed to do the occassional parallel computing example during this intermediate period, I would either use IPython parallel or multi-processing.

Parallel Plans at Continuum

In 2012, Peter Wang and I started Continuum, created PyData, and released Anaconda.   We also worked closely with members of the community to establish NumFOCUS as an independent organization.  In order to give NumFOCUS the attention it deserved, we hired the indefatigable Leah Silen and donated her time entirely to the non-profit so she could work with the community to grow PyData and the Open Data Science community and ecosystem.  It has been amazing to watch the community-based, organic, and independent growth of NumFOCUS.    It took effort and resources to jump-start,  but now it is moving along with a diverse community driving it.   It is a great organization to join and contribute effort to.

A huge reason we started Continuum was to bring the NumPy stack to parallel computing --- for both scale-up (many cores) and scale-out (many nodes).   We knew that we could not do this alone and it would require creating a company and rallying a community to pull it off.   We worked hard to establish PyData as a conference and concept and then transitioned the effort to the community through NumFOCUS to rally the community behind the long-term mission of enabling data-, quantitative-, and computational-scientists with open-source software.  To ensure everyone in the community could get the software they needed to do data science with Python quickly and painlessly, we also created Anaconda and made it freely available.

In addition to important community work, we knew that we would need to work alone on specific, hard problems to also move things forward.   As part of our goals in starting Continuum we wanted to significantly improve the status of Python in the JVM-centric Hadoop world.   Conda, Bokeh, Numba, and Blaze were the four technologies we started specifically related to our goals as a company beginning in 2012.   Each had a relationship to parallel computing including Hadoop.

Conda enables easy creation and replication of environments built around deep and complex software dependencies that often exist in the data-scientist workflow.   This is a problem on a single node --- it's an even bigger problem when you want that environment easily updated and replicated across a cluster.

Bokeh  allows visualization-centric applications backed by quantitative-science to be built easily in the browser --- by non web-developers.   With the release of Bokeh 0.11 it is extremely simple to create visualization-centric-web-applications and dashboards with simple Python scripts (or also R-scripts thanks to rBokeh).

With Bokeh, Python data scientists now have the power of both d3 and Shiny, all in one package. One of the driving use-cases of Bokeh was also easy visualization of large data.  Connecting the visualization pipeline with large-scale cluster processing was always a goal of the project.   Now, with datashader, this goal is now also being realized to visualize billions of points in seconds and display them in the browser.

Our scale-up computing efforts centered on the open-source Numba project as well as our Accelerate product.  Numba has made tremendous progress in the past couple of years, and is in production use in multiple places.   Many are taking advantage of numba.vectorize to create array-oriented solutions and program the GPU with ease.   The CUDA Python support in Numba makes it the easiest way to program the GPU that I'm aware of.  The CUDA simulator provided in Numba makes it much simpler to debug in Python the logic of CUDA-based GPU programming.  The addition of parallel-contexts to numba.vectorize mean that any many-core architecture can now be exploited in Python easily.   Early HSA support is also in Numba now meaning that Numba can be used to program novel hardware devices from many vendors.

Summarizing Blaze 

The ambitious Blaze project will require another blog-post to explain its history and progress well. I will only try to summarize the project and where it's heading.  Blaze came out of a combination of deep experience with industry problems in finance, oil&gas, and other quantitative domains that would benefit from a large-scale logical array solution that was easy to use and connected with the Python ecosystem.    We observed that the MapReduce engine of Hadoop was definitely not what was needed.  We were also aware of Spark and RDD's but felt that they too were also not general enough (nor flexible enough) for the demands of distributed array computing we encountered in those fields.

DyND, Datashape, and a vision for the future of Array-computing 

After early work trying to extend the NumPy code itself led to struggles because of both the organic complexity of the code base and the stability needs of a mature project, the Blaze effort started with an effort to re-build the core functionality of NumPy and Pandas to fix some major warts of NumPy that had been on my mind for some time.   With Continuum support, Mark Wiebe decided to continue to develop a C++ library that could then be used by Python and any-other data-science language (DyND).   This necessitated defining a new data-description language (datashape) that generalizes NumPy's dtype to structures of arrays (column-oriented layout) as well as variable-length strings and categorical types.   This work continues today and is making rapid progress which I will leave to others to describe in more detail.  I do want to say, however, that dynd is implementing my "Pluribus" vision for the future of array-oriented computing in Python.   We are factoring the core capability into 3 distinct parts:  the type-system (or data-declaration system), a generalized function mechanism that can interact with any "typed" memory-view or "typed" buffer, and finally the container itself.   We are nearing release of a separated type-library and are working on a separate C-API to the generalized function mechanism.   This is where we are heading and it will allow maximum flexibility and re-use in the dynamic and growing world of Python and data-analysis.   The DyND project is worth checking out right now (if you have desires to contribute) as it has made rapid progress in the past 6 months.

As we worked on the distributed aspects of Blaze it centered on the realization that to scale array computing to many machines you fundamentally have to move code and not data.   To do this well means that how the computer actually sees and makes decisions about the data must be exposed.  This information is usually part of the type system that is hidden either inside the compiler, in the specifics of the data-base schema, or implied as part of the runtime.   To fundamentally solve the problem of moving code to data in a general way, a first-class and wide-spread data-description language must be created and made available.   Python users will recognize that a subset of this kind of information is contained in the struct module (the struct "format" strings), in the Python 3 extended buffer protocol definition (PEP 3118), and in NumPy's dtype system.   Extending these concepts to any language is the purpose of datashape.

In addition, run-times that understand this information and can execute instructions on variables that expose this information must be adapted or created for every system.  This is part of the motivation for DyND and why very soon the datashape system and its C++ type library will be released independently from the rest of DyND and Blaze.   This is fundamentally why DyND and datashape are such important projects to me.  I see in them the long-term path to massive code-reuse, the breaking down of data-silos that currently cause so much analytics algorithm duplication and lack of cooperation.

Simple algorithms from data-munging scripts to complex machine-learning solutions must currently be re-built for every-kind of data-silo unless there is a common way to actually functionally bring code to data.  Datashape and the type-library runtime from DyND (ndt) will allow this future to exist.   I am eager to see the Apache Arrow project succeed as well because it has related goals (though more narrowly defined).

The next step in this direction is an on-disk and in-memory data-fabric that allows data to exist in a distributed file-system or a shared-memory across a cluster with a pointer to the head of that data along with a data-shape description of how to interpret that pointer so that any language that can understand the bytes in that layout can be used to execute analytics on those bytes.  The C++ type run-time stands ready to support any language that wants to parse and understand data-shape-described pointers in this future data-fabric.

From one point of view, this DyND and data-fabric effort are a natural evolution of the efforts I started in 1998 that led to the creation of SciPy and NumPy.  We built a system that allows existing algorithms in C/C++ and Fortran to be applied to any data in Python.   The evolution of that effort will allow algorithms from many other languages to be applied to any data in memory across a cluster.

Blaze Expressions and Server

The key part of Blaze that is also important to mention is the notion of the Blaze server and user-facing Blaze expressions and functions.   This is now what Blaze the project actually entails --- while other aspects of Blaze have been pushed into their respective projects.  Functionally, the Blaze server allows the data-fabric concept on a machine or a cluster of machines to be exposed to the rest of the internet as a data-url (e.g. http://mydomain.com/catalog/datasource/slice).   This data-url can then be consumed as a variable in a Blaze expression --- first across entire organizations and then across the world.

This is the truly exciting part of Blaze that would enable all the data in the world to be as accessible as an already-loaded data-frame or array.  The logical expressions and transformations you can then write on those data to be your "logical computer" will then be translated at compute time to the actual run-time instructions as determined by the Blaze server which is mediating communication with various backends depending on where the data is actually located.   We are realizing this vision on many data-sets and a certain set of expressions already with a growing collection of backends.   It is allowing true "write-once and run anywhere" to be applied to data-transformations and queries and eventually data-analytics.     Currently, the data-scientists finds herself to be in a situation similar to the assembly programmer in the 1960s who had to know what machine the code would run on before writing the code.   Before beginning a data analytics task, you have to determine which data-silo the data is located in before tackling the task.  SQL has provided a database-agnostic layer for years, but it is too limiting for advanced analytics --- and user-defined functions are still database specific.

Continuum's support of blaze development is currently taking place as defined by our consulting customers as well as by the demands of our Anaconda platform and the feature-set of an exciting new product for the Anaconda Platform that will be discussed in the coming weeks and months. This new product will provide a simplified graphical user-experience on top of Blaze expressions, and Bokeh visualizations for rapidly connecting quantitative analysts to their data and allowing explorations that retain provenance and governance.  General availability is currently planned for August.

Blaze also spawned additional efforts around fast compressed storage of data (blz which formed the inspiration and initial basis for bcolz) and experiments with castra as well as a popular and straight-forward tool for quickly copying data from one data-silo kind to another (odo).

Developing dask the library and Dask the project

The most important development to come out of Blaze, however, will have tremendous impact in the short term well before the full Blaze vision is completed.  This project is Dask and I'm excited for what Dask will bring to the community in 2016.   It is helping us finally deliver on scaled-out NumPy / Pandas and making Anaconda a first-class citizen in Hadoop.

In 2014, Matthew Rocklin started working at Continuum on the Blaze team.   Matthew is the well-known author of many functional tools for Python.  He has a great blog you should read regularly.   His first contribution to Blaze was to adapt a multiple-dispatch system he had built which formed the foundation of both odo and Blaze.  He also worked with Andy Terrel and Phillip Cloud to clarify the Blaze library as a front-end to multiple backends like Spark, Impala, Mongo, and NumPy/Pandas.

With these steps taken, it was clear that the Blaze project needed its own first-class backend as well something that the community could rally around to ensure that Python remained a first-class participant in the scale-out conversation --- especially where systems that connected with Hadoop were being promoted.  Python should not ultimately be relegated to being a mere front-end system that scripts Spark or Hadoop --- unable to talk directly to the underlying data.    This is not how Python achieved its place as a de-facto data-science language.  Python should be able to access and execute on the data directly inside Hadoop.

Getting there took time.  The first version of dask was released in early 2015 and while distributed work-flows were envisioned, the first versions were focused on out-of-core work-flows --- allowing problem-sizes that were too big to fit in memory to be explored with simple pandas-like and numpy-like APIs.

When Matthew showed me his first version of dask, I was excited.  I loved three things about it:  1) It was simple and could, therefore, be used as a foundation for parallel PyData.  2) It leveraged already existing code and infrastructure in NumPy and Pandas.  3) It had very clean separation between collections like arrays and data-frames, the directed graph representation, and the schedulers that executed those graphs.   This was the missing piece we needed in the Blaze ecosystem.   I immediately directed people on the Blaze team to work with Matt Rocklin on Dask and asked Matt to work full-time on it.

He and the team made great progress and by summer of 2015 had a very nice out-of-core system working with two functioning parallel-schedulers (multi-processing and multi-threaded).  There was also a "synchronous" scheduler that could be used for debugging the graph and the system showed well enough throughout 2015 to start to be adopted by other projects (scikit-image and xarray).

In the summer of 2015, Matt began working on the distributed scheduler.  By fall of 2015, he had a very nice core system leveraging the hard work of the Python community.   He built the API around the concepts of asynchronous computing already being promoted in Python 3 (futures) and built dask.distributed on top of tornado.   The next several months were spent improving the scheduler by exposing it to as many work-flows as possible from computational-science, quantitative-science and computational-science.   By February of 2016, the system was ready to be used by a variety of people interested in distributed computing with Python.   This process continues today.

Using dask.dataframes and dask.arrays you can quickly build array- and table-based work-flows with a Pandas-like and NumPy-like syntax respectively that works on data sitting across a cluster.

Anaconda and the PyData ecosystem now had another solution for the scale-out problem --- one whose design and implementation was something I felt could be a default run-time backend for Blaze.  As a result, I could get motivated to support, market, and seek additional funding for this effort.  Continuum has received some DARPA funding under the XDATA program.  However, this money was now spread pretty thin among Bokeh, Numba, Blaze, and now Dask.

Connecting to Hadoop

With the distributed scheduler basically working and beginning to improve, two problems remained with respect to Hadoop interoperability: 1) direct access to the data sitting in HDFS and 2) interaction with the resource schedulers running most Hadoop clusters (YARN or mesos).

To see how important the next developments are, it is useful to describe an anecdote from early on in our XDATA experience.  In the summer of 2013, when the DARPA XDATA program first kicked-off, the program organizers had reserved a large Hadoop cluster (which even had GPUs on some of the nodes).  They loaded many data sets onto the cluster and communicated about its existence to all of the teams who had gathered to collaborate on getting insights out of "Big Data."    However, a large number of the people collaborating were using Python, R, or C++.  To them the Hadoop cluster was inaccessible as there was very little they could use to interact with the data stored in HDFS (beyond some high-latency and low-bandwidth streaming approaches) and nothing they could do to interact with the scheduler directly (without writing Scala or Java code). The Hadoop cluster sat idle for most of the summer while teams scrambled to get their own hardware to run their code on and deliver their results.

This same situation we encountered in 2013 exists in many organizations today.  People have large Hadoop infrastructures, but are not connecting that infrastructure effectively to their data-scientists who are more comfortable in Python, R, or some-other high-level (non JVM language).

With dask working reasonably well, tackling this data-connection problem head on became an important part of our Anaconda for Hadoop story and so in December of 2015 we began two initiatives to connect Anaconda directly to Hadoop.   Getting data from HDFS turned out to be much easier than we had initially expected because of the hard-work of many others.    There had been quite a bit of work building a C++ interface to Hadoop at Pivotal that had culminated in a library called libhdfs3.   Continuum wrote a Python interface to that library quickly, and it now exists as the hdfs3 library under the Dask organization on Github.

The second project was a little more involved as we needed to integrate with YARN directly.   Continuum developers worked on this and produced a Python library that communicates directly to the YARN classes (using Scala) in order to allow the Python developer to control computing resources as well as spread files to the Hadoop cluster.   This project is called knit, and we expect to connect it to mesos and other cluster resource managers in the near future (if you would like to sponsor this effort, please get in touch with me).

Early releases of hdfs3 and knit were available by the end of February 2015.  At that time, these projects were joined with dask.distributed and the dask code-base into a new Github organization called Dask.   The graduation of Dask into its own organization signified an important milestone that dask was now ready for rapid improvement and growth alongside Spark as a first-class execution engine in the Hadoop ecosystem.

Our initial goals for Dask are to build enough examples, capability, and awareness so that every PySpark user tries Dask to see if it helps them.    We also want Dask to be a compatible and respected member of the growing Hadoop execution-framework community.   We are also seeking to enable Dask to be used by scientists of all kinds who have both array and table data stored on central file-systems and distributed file-systems outside of the Hadoop ecosystem.

Anaconda as a first-class execution ecosystem for Hadoop

With Dask (including hdfs3 and knit), Anaconda is now able to participate on an equal footing with every other execution framework for Hadoop.  Because of the vast reaches of Anaconda Python and Anaconda R communities, this means that a lot of native code can now be integrated to Hadoop much more easily, and any company that has stored their data in HDFS or other distributed file system (like s3fs or gpfs) can now connect that data easily to the entire Python and/or R computing stack.

This is exciting news!    While we are cautious because these integrative technologies are still young, they are connected to and leveraging the very mature PyData ecosystem.    While benchmarks can be misleading, we have a few benchmarks that I believe accurately reflect the reality of what parallel and distributed Anaconda can do and how it relates to other Hadoop systems.  For array-based and table-based computing workflows, Dask will be 10x to 100x faster than an equivalent PySpark solution.   For applications where you are not using arrays or tables (i.e. word-count using a dask.bag), Dask is a little bit slower than a similar PySpark solution.  However, I would argue that Dask is much more Pythonic and easier to understand for someone who has learned Python.

It will be very interesting to see what the next year brings as more and more people realize what is now available to them in Anaconda.  The PyData crowd will now have instant access to cluster computing at a scale that has previously been accessible only by learning complicated new systems based on the JVM or paying an unfortunate performance penalty.   The Hadoop crowd will now have direct and optimized access to entire classes of algorithms from Python (and R) that they have not previously been used to.

It will take time for this news and these new capabilities to percolate, be tested, and find use-cases that resonate with the particular problems people actually encounter in practice.  I look forward to helping many of you take the leap into using Anaconda at scale in 2016.

We will be showing off aspects of the new technology at Strata in San Jose in the Continuum booth #1336 (look for Anaconda logo and mark).  We have already announced at a high-level some of the capabilities:   Peter and I will both be at Strata along with several of the talented people at Continuum.    If you are attending drop by and say hello.

We first came to Strata on behalf of Continuum in 2012 in Santa Clara.  We announced that we were going to bring you scaled-out NumPy.  We are now beginning to deliver on this promise with Dask.   We brought you scaled-up NumPy with Numba.   Blaze and Bokeh will continue to bring them together along with the rest of the larger data community to provide real insight on data --- where-ever it is stored.   Try out Dask and join the new scaled-out PyData story which is richer than ever before, has a larger community than ever before, and has a brighter future than ever before.