Sunday, December 16, 2012

Passing the torch of NumPy and moving on to Blaze

I wrote this letter tonight to the NumPy mailing list --- a list I have been actively participating in for nearly 15 years.

Hello all, 

There is a lot happening in my life right now and I am spread quite thin among the various projects that I take an interest in.  In particular, I am thrilled to publicly announce on this list that Continuum Analytics has received DARPA funding (to the tune of at least $3 million) for Blaze, Numba, and Bokeh, which we are writing to take NumPy, SciPy, and visualization into the domain of very large data sets.  This is part of the XDATA program, and I will be taking an active role in it.  You can read more about Blaze and about the XDATA program online.

I personally think Blaze is the future of array-oriented computing in Python, and I will be putting efforts and resources next year behind making that case.  How it interacts with future incarnations of NumPy, Pandas, or other projects is an interesting and open question.  I have no doubt the future will be a rich ecosystem of interoperating array-oriented data-structures.  I invite anyone interested in Blaze to participate in the discussions and development on the blaze-dev Google Group or to watch the project on our public GitHub repo.  Blaze is being incubated under the ContinuumIO GitHub organization for now, but I hope it will receive its own GitHub project page later next year.  Development of Blaze is at an early stage, but we are moving rapidly with it (and have deliverable deadlines --- thus, while we welcome input and pull requests, we won't have a ton of time to respond to simple queries until at least May or June).  There is more that we are working on behind the scenes with respect to Blaze that will be coming out next year as well but isn't quite ready to show yet.

As I look at the coming months and years, my time for direct involvement in NumPy development is therefore only going to get smaller.  As a result, it is not appropriate that I remain "head steward" of the NumPy project (a term I prefer to BDFL or anything else).  I'm sure it is apparent that while I've tried to help personally where I can this year on the NumPy project, my role has been more one of coordination, seeking funding, and providing expert advice on certain sections of code.  I fundamentally agree with Fernando Perez that the responsibility of caretaking open source projects is one of stewardship --- something akin to public service.  I have tried to emulate that belief this year --- even while not always succeeding.

It is time for me to make official what is already becoming apparent to observers of this community, namely, that I am stepping down as someone who might be considered "head steward" for the NumPy project and officially leaving the development of the project in the hands of others in the community.   I don't think the project actually needs a new "head steward" --- especially from a development perspective.     Instead I see a lot of strong developers offering key opinions for the project as well as a great set of new developers offering pull requests.  

My strong suggestion is that development discussions of the project continue on this list, with consensus among the active participants being the goal for development.  I don't think 100% consensus is a rigid requirement --- but certainly a super-majority should be the goal, and serious changes should not be made without a clear consensus.  I would pay special attention to under-represented people (users with intense usage of NumPy but small voices on this list).  There are many of them.  If you push me for specifics, then at this point in NumPy's history I would say that if Chuck, Nathaniel, and Ralf agree on a course of action, it will likely be a good thing for the project.  I suspect that even if only 2 of the 3 agree at one time it might still be a good thing (but I would expect more detail and discussion).  There are others whose opinion should be sought as well: Ondrej Certik, Perry Greenfield, Stefan van der Walt, David Warde-Farley, Pauli Virtanen, Robert Kern, David Cournapeau, Francesc Alted, and Mark Wiebe, to name a few (there are many other people as well whose opinions can only help NumPy).  For some questions, I might even seek input from people like Konrad Hinsen and Paul Dubois --- if they have time to give it.  I will still be willing to offer my view from time to time, if I am asked.

Greg Wilson (of Software Carpentry fame) asked me recently what letter I would have written to myself 5 years ago.   What would I tell myself to do given the knowledge I have now?     I've thought about that for a bit, and I have some answers.   I don't know if these will help anyone, but I offer them as hopefully instructive:   

1) Do not promise to not break the ABI of NumPy --- and in fact emphasize that it will be broken at least once in the 1.X series.    NumPy was designed to add new data-types --- but not without breaking the ABI.    NumPy has needed more data-types and still needs even more.   While it's not beautifully simple to add new data-types, it can be done.   But, it is impossible to add them without breaking the ABI in some fashion.   The desire to add new data-types *and* keep ABI compatibility has led to significant pain.   I think the ABI non-breakage goal has been amplified by the poor state of package management in Python.   The fact that it's painful for someone to update their downstream packages when an upstream ABI breaks (on Windows and Mac in particular) has put a lot of unfortunate pressure on this community.    Pressure that was not envisioned or understood when I was writing NumPy.

(As an aside:  This is one reason Continuum has invested resources in building the conda tool and a completely free set of binary packages called Anaconda CE, which is becoming more and more usable thanks to the efforts of Bryan Van de Ven and Ilan Schnell and our testing team at Continuum.  The conda tool is open source and BSD licensed, and the next release will provide the ability to build packages, build indexes on package repositories, and interface with pip.  Expect a blog-post in the near future about how cool conda is!)

2) Don't create array-scalars.  Instead, make the data-type object a meta-type object whose instances are the items returned from NumPy arrays.   There is no need for a separate array-scalar object and in fact it's confusing to the type-system.    I understand that now.  I did not understand that 5 years ago.   
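To make point 2 concrete, here is a small illustration (a sketch using today's NumPy) of the array-scalar machinery in question: indexing an array returns an `np.float64` array scalar, an object distinct from both a plain Python `float` and the `dtype` object that describes the array.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])

# Indexing returns an "array scalar" (np.float64), not a plain Python float.
item = a[0]
print(type(item))               # <class 'numpy.float64'>

# np.float64 happens to subclass Python's float, but most array scalars
# (np.int32, for instance) have no such relationship to a built-in type.
print(isinstance(item, float))  # True

# The dtype is yet another kind of object describing the array's items, so
# three related notions (dtype, array scalar, Python scalar) coexist.
print(a.dtype)                  # float64
```

Under the alternative sketched above, the dtype would be a (meta-)type whose instances are the items themselves, collapsing these parallel hierarchies into one.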

3) Special-case small arrays to avoid the memory indirection and look at PDL so that generalized ufuncs are supported from the beginning.

4) Define missing-value data-types and labels on the dimensions and arrays

5) Define a standard "dictionary of NumPy arrays" interface as the basic "structure of arrays" concept to go with the "array of structures" that structured arrays provide.

6) Start work on SQL interface to NumPy arrays *now*
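As a sketch of what point 5 means: NumPy's structured arrays already give the "array of structures" layout, while the proposed counterpart is essentially a dictionary mapping field names to same-length arrays (the variable names below are mine, for illustration only).

```python
import numpy as np

# "Array of structures": one structured array, each element a full record.
aos = np.array([(1, 2.0), (3, 4.0)],
               dtype=[('id', 'i4'), ('val', 'f8')])
print(aos['val'])    # extracting a field pulls out a column: [2. 4.]

# "Structure of arrays": a plain dict of parallel arrays, one per field.
soa = {'id': np.array([1, 3], dtype='i4'),
       'val': np.array([2.0, 4.0])}
print(soa['val'])    # [2. 4.]
```

The two layouts answer the same queries, but the dict-of-arrays form keeps each column contiguous in memory, which is often what columnar workloads want.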

Additional comments I would make to someone today: 

1) Most of NumPy should be written in Python with Numba used as the compiler (particularly as soon as Numba gets the ability to create Python extension modules which is in the next release).  
2) There are still many, many optimizations that can be made in NumPy run-time (especially in the face of modern hardware). 

I will continue to be available to answer questions and I may chime in here and there on pull requests.    However, most of my time for NumPy will be on administrative aspects of the project where I will continue to take an active interest.    To help make sure that this happens in a transparent way,  I would like to propose that "administrative" support of the project be left to the NumFOCUS board of which I am currently 1 of 9 members.   The other board members are currently:  Ralf Gommers, Anthony Scopatz, Andy Terrel, Prabhu Ramachandran, Fernando Perez, Emmanuelle Gouillart, Jarrod Millman, and Perry Greenfield.      While NumFOCUS basically seeks to promote and fund the entire scientific Python stack,   I think it can also play a role in helping to administer some of the core projects which the board members themselves have a personal interest in. 

By administrative support, I mean decisions like "what should be done with any NumPy IP or web-domains", "what kind of commercially-related ads or otherwise should go on the NumPy home page", or "what should be done with the NumPy GitHub account" --- basically anything that requires an executive decision that is not directly development related.  I don't expect there to be many of these decisions.  But, when they show up, I would like them to be made in as transparent and public a way as possible.  In practice, the way I see this working is that members of the NumPy community who are (like me) particularly interested in admin-related questions serve on a NumPy team in the NumFOCUS organization.  I just know I'll be attending NumFOCUS board meetings, and I would like to help move administrative decisions forward with NumPy as part of the time I spend thinking about NumFOCUS.

If people on this list would like to play an active role in those admin discussions, then I would heartily welcome them into NumFOCUS membership where they would work with interested members of the NumFOCUS board (like me and Ralf) to direct that organization.    I would really love to have someone from this list volunteer to serve on the NumPy team as part of the NumFOCUS project.   I am certainly going to be interested in the opinions of people who are active participants on this list and on GitHub pages for NumPy on anything admin related to NumPy, and I expect Ralf would also be very interested in those views.

One admin discussion that I will bring up in another email (as this one is already too long) is about making 2 or 3 mailing lists for NumPy, such as numpy-users@numpy.org.

Just because I'll be spending more time on Blaze, Numba, Bokeh, and the PyData ecosystem does not mean that I won't be around for NumPy.  I will continue to promote NumPy.  My involvement with Continuum connects me to NumPy as Continuum continues to offer commercial support contracts for NumPy (and SciPy and other open source projects).  Continuum will also continue to maintain its GitHub NumPy project, which will contain pull requests from our company that we are working to get into the mainline branch.  Continuum will also continue to provide resources for release-management of NumPy (we have been funding Ondrej in this role for the past 6 months --- though I would like to see this happen through NumFOCUS in the future even if Continuum provides much of the money).  We also offer optimized versions of NumPy in our commercial Anaconda distribution (Anaconda CE is free and open source).

Also, I will still be available for questions and help (I'm not disappearing --- just making it clear that I'm stepping back into an occasional NumPy developer role).   It has been extremely gratifying to see the number of pull-requests, GitHub-conversations, and code contributions increase this year.   Even though the 1.7 release has taken a long time to stabilize, there have been a lot of people participating in the discussion and in helping to track down the problems, figure out what to do, and fix them.    It even makes it possible for people to think about 1.7 as a long-term release.  

I will continue to hope that the spirit of openness, tolerance, respect, and gratitude continue to permeate this mailing list, and that we continue to seek to resolve any differences with trust and mutual respect.    I know I have offended people in the past with quick remarks and actions made sometimes in haste without fully realizing how they might be taken.   But, I also know that like many of you I have always done the very best I could for moving Python for scientific computing forward in the best way I know how.    

Thank you for the great memories.  If you will forgive a little sentiment:  my daughter, who is in college now, was 3 years old when I began working with this community and went down a road that would lead to my involvement with SciPy and NumPy.  I have marked the growth of my family and the passage of time by where the Python scientific-computing community stood.  Like many of you, I have given a great deal of attention and time to building this community.  That sacrifice and time have led me to love what we have created.  I know that I leave this segment of the community with the tools in better hands than mine.  I am hopeful that NumPy will continue to be a useful array library for the Python community for many years to come, even as we all continue to build new tools for the future.

Very best regards,


Wednesday, October 10, 2012

Continuum and Open Source

As an avid open source contributor for nearly 15 years --- and a father with children to provide for --- I've observed intently the discussions about how to monetize open source.   As a young PhD student, I even spent hours avoiding my dissertation by reading about philosophy and economics to try and make sense of how an open-source economy might work.

I love creating and contributing to open source code --- particularly code that has the potential to influence and touch for the better millions of lives.  I really enjoy spending as much time as I can on that activity.   On the other hand, the wider economy wants money from me for things like college expenses, housing, utilities, and the "camp champions" that I get to attend this week with my 11 year old son.   So, I have thought and read a lot about how to make money from open source.

There are a lot of indirect ways to make money from open source, which all amount to giving away the code and then making money doing "something else": training, support, consulting, documentation, etc.  These are all ways you can sell the expertise that results from open source.  Ultimately, however, under all these models open source is a marketing expense, and you end up needing to focus your real attention on the thing you actually get paid for --- the service itself.  As a result, the open source code you care about tends to receive less attention than you had originally hoped, and you can only spend your "free time" on it.  I've seen this play out over several years in multiple ways.

I still believe that a model that is patterned after the original copyright/patent compromise of "limited-time" protection is actually a good one --- especially for certain kinds of software.   Under this model, there are two code-bases: an open source one and a proprietary one.   People pay for the software they want and use (and therefore developers get paid to write it) while premium features migrate from the paid-for branch to the free-and-open-source code base as the developers get paid.  

While this model would not work for every project, it does have some nice features:

  • it allows developers to work full-time on code that benefits users (as evidenced by those users' willingness to pay for the software)
  • developers have a livelihood directly writing code that "will become" open source as people pay for it
  • users only pay for software that they are getting "premium benefits" from and those premium benefits are lifting the state of open-source software over time
It is a wonderful thing for developers to have a user-base of satisfied customers.  For all the benefits of open source, I've also seen first-hand the difficulty of supporting a large user-base with no customers directly paying for continued support of the code-base --- which eventually leads to less satisfied users.

I am thrilled to be part of a forward-thinking company like Continuum Analytics that is committed enough to open source software to both sponsor directly open source projects (like NumPy and Numba) as well as seek to move features from its premium products into open-source.   You can read more about Continuum's Open Source philosophy here: Continuum and Open Source

For example, we recently moved a feature from our premium product, NumbaPro, into the open-source project Numba which allows you to compile a Python file directly to a shared library.  You can read about that feature here: Compiling Python code to Shared Library.

We will continue to develop Numba in the open --- in conjunction with others who wish to participate in the development of that project.    Our ability to spend time on this, of course, will be directly impacted by how many licenses of NumbaPro we can sell (along with our other products and services).   So, if computing on GPUs, creating NumPy ufuncs and generalized ufuncs easily, or taking advantage of multiple-cores in your Python computations is something that would benefit you, take a look at NumbaPro and see if it makes sense for you to purchase it.   Hopefully, in addition to great software you appreciate, you will also recognize that you are contributing directly to the development of Numba.

Sunday, September 2, 2012

John Hunter 1968-2012

It was a shock to hear the news from Fernando that John Hunter needed chemotherapy to fight the cancer that had attacked him.  Just days before the news, we had been talking at the SciPy conference about how to take NumFOCUS to the next level.  Together with the other members of NumFOCUS, we have ambitious plans for the Foundation: scholarships and post-doc funds for students and early professionals contributing to open source, conference sponsorship, packaging and continuous-integration sponsorships, etc.  We had been meeting via phone in board meetings every other week, and he was planning to send a message to the matplotlib mailing list encouraging people to donate to our efforts with NumFOCUS.  Working with John in person on a mutual project was gratifying.  His intelligence, enthusiasm, humility, and pragmatism were a perfect complement to our board discussions.

He had also just spoken at SciPy 2012 and gave a great talk discussing his observations and lessons learned from Matplotlib.  If you haven't seen the talk, stop reading this and go watch it here --- you will see a great and humble man describe a labor of love (and not give himself enough credit for what he accomplished).

When I heard the news, I wrote a quick note to John expressing my support and appreciation for all he had done for Python --- not only because I truly feel that matplotlib is a major reason that projects I have invested so heavily in (NumPy and SciPy) have become so popular, but also because I knew that I had not shared enough with him how much I think of him.  A sinking feeling in my heart was telling me that I may not have much time.

This is what I sent him:
Hey John,

I am so sorry to hear the news of your diagnosis.    I will be praying for you and your family.   I understand if you cannot respond.   Please let me know if there is anything I can do to help.   

I have so much respect for you and what you have done to make Python viable as a language for technical computing.  I also just think you are an amazing human being with so much to give.  

All the best for a speedy recovery. 

This is the response I received.

Thanks so much Travis. We're moving full speed ahead with a treatment plan -- chemo may start Tues.  As unpleasant as it can be, I'm looking forward to the start of the fight against this bastard.

Thanks so much for your other kind words. You've always been a hero to me and they mean a lot. I have great respect for what you are doing for numpy and NUMFOCUS, and even though I am stepping back from work and MPL and everything non-essential right now, I want to continue supporting NF while I'm able.  
All the best,

I had no idea how much I would come to appreciate this small but meaningful exchange -- my last communication with John.  Only a few weeks later, Fernando Perez (author of IPython and a great friend to John) sent word that our mutual friend had an unexpected but terrible reaction to his initial treatment, and it had placed him in critical condition and the prognosis was not good.

I ached when literally hours later, John died.   I thought of his 3 daughters (each only about 3 years younger than my own 3 daughters) and how they would miss their father.   I thought of the time he did not spend with them because he was writing matplotlib.   I know exactly what that means because of the time I have sacrificed with my own little girls (and boys) bringing SciPy to life, merging Numarray and Numeric into NumPy, resurrecting llvmpy, and bringing Numba to life.   I thought of the future time I would not get to spend with him building NumFOCUS into a foundation worthy of the software it promotes.    I have not lost many of my loved ones to death yet.  Perhaps this is why I have been so affected by his death.  Not since my mother died 2 years ago (August 31, 2010), has the passing of another driven me so.

When I thought of John's girls, I thought immediately of what we could do to show love and appreciation.  What would I want for my own children if I were no longer here to care for them?  My oldest daughter had just started college and was experiencing that first transformative week.  Perhaps this was why I thought that, more than anything, if I were not around I would want my girls to have enough money for their education.  After speaking with Fernando and with approval from John's wife, Miriam, we set up the John Hunter Memorial Fund.  Anthony Scopatz, Leah Holdridge, and I have spent several hours since then making sure the site stays operational (mainly overcoming some unexpected difficulties caused by Google on Friday).

My personal goal is to raise at least $100,000 for John's girls.  This will not cover their entire education, but it will be a good start and a symbolic expression of appreciation for all those who work tirelessly on open source software for the benefit of many.  After a few days we are at about $20,000 total (from about 450 donors).  This is a great start and will be greatly appreciated by John's family --- but I know that all those who benefit from the free use of a high-quality plotting library can do better than that.  If you have already given, thank you!  If you haven't given something yet, please consider what John has done for you personally, and give your most generous donation.

There are fees associated with using online payment networks.    We will find a way to get those fees waived or covered by specific corporate donations, so don't let concern of the fees stop you from helping.    We've worked hard to make sure you have as many options to pay as possible.  You can use PayPal or WePay (which both have fees of 2.9% + $0.30), you can use an inexpensive payment network like Dwolla (only $0.25 for sending more than $10 and free for sending less --- but you have to have a Dwolla account and put money into it), or you can do as David Beazley suggested and just send a check to one of the addresses listed on the memorial page.

Whatever you decide to do, just remember that it is time to give back!

John has always been supportive of my work in open source.  It was his voice that was one of the few positive voices that kept me going in the early days of NumPy when other voices were more discouraging.    He has also consistently been a calming and supportive voice on the mailing lists when others have been less considerate and sometimes even hostile.    I'm very sorry he will not be able to see even more results of his tireless efforts.  I'm very sorry we won't get to feel more of his influence in the world.   The world has lost one who truly recognized that great things require cooperation of many people.   Obtaining that cooperation takes sacrifice, trust, humility, a willingness to listen, a willingness to speak out with respect, and a willingness to forgive.   He exemplified those characteristics.   I am truly saddened that I will not be able to learn more from him.

When SciPy was emerging from my collection of modules in 2001, one of the things Eric Jones and I wanted was an integrated plotting package.  We spent time on a couple of plotting tools in early SciPy (a simple WX plotting widget; xplot, based on Yorick's gist).  These early steps were not going to get us what users needed.  Fortunately, John Hunter came along around 2001 and started a new project called Matplotlib, which steadily grew in popularity until it literally exploded in about 2004 with funding from Perry Greenfield and the Space Telescope Science Institute and the efforts of the current principal developer of Matplotlib, Michael Droettboom.

I learned from John's project many important things about open source development.   A few of them:

  • Examples, documentation, and ease of use matter -- a lot
  • Large efforts like Python for Science need a lot of people and a distributed, independent development environment (not everything belongs in a single namespace).
    • SciPy needed to be a modular "library" not a replacement for Matlab all by itself. 
    • The community needed a unifying installation to make it easy for the end-user to get everything, but we did not need a single namespace. 
    • Open source projects can only cover as much space as a team of about 5-7 active developers can understand.   Then, they need to be organized into a larger integration and distribution projects --- a hierarchical federation of projects. 
    • The only way large projects can survive is by separating concerns, having well defined interfaces, and groups that work on individual pieces they have expertise in. 
  • Backwards compatibility matters a great deal to an open source project (he created numerix for Matplotlib to ease end-users' migration from Numeric through Numarray to NumPy)
I'm sure if John were here, he could improve my rough outline and make it much better.  From improving plotting libraries to making good use of record arrays, he was always doing that.  In fact, one of John's last contributions to the world was improving the mission statement of NumFOCUS.  In a recent board meeting, he suggested adding the word "accessible" to the mission statement:  The purpose of NumFOCUS is to promote the use of accessible and reproducible computing in science and technology.

His life's work has indeed been to make science and technology computing more accessible through making Python the de facto standard for doing science with his excellent plotting tool.  Let's continue to improve the legacy he has left us by working together to make computing even more accessible.  We have a long way to go, but by standing on the shoulders of giants like John we can see just that much farther and continue the journey.  

Besides helping his daughters there is nothing more fitting that we can do to honor John's memory than continuing to promote the other work he spent so many hours of his life pushing by contributing to open source projects and/or supporting financially the foundation he wanted to see successful.  

Great people lift us both in life and death.   In life they are gracious contributors to our well being and encourage us to grow.  In death they cause us to reflect on the precious qualities they reflected.  They make us want to improve.  When we think of them, we want to hold our children close, give an encouraging word to a colleague, feel gratitude for our friends and family, and forgive someone who has hurt us.  John Hunter (1968 - 2012) was truly a great man!

Wednesday, August 15, 2012

Numba and LLVMPy

It's been a busy year so far.  All the time spent on starting a new company, starting new open source projects, and keeping up with the open source projects that I have interest in has meant that I haven't written nearly as many blog-posts as I planned on.  But this is probably a good thing, at least if you follow the wisdom attributed to Solomon --- wisdom that has also been paraphrased in a quote attributed to Abraham Lincoln.

One of the things that has been on my mind for the past year is promoting array-oriented computing as a fundamental concept more developers need exposure to.    This is one reason that I am so excited that I've been able to find great people to work on Numba (which intends to be an array-oriented compiler for Python code).      I have given a few talks trying to convey what is meant by array-oriented computing, but the essence is captured by the difference between the example in the Python code-base and a NumPy version of the same code.

I have seen many, many real world examples of very complicated code that could be simplified and sped up (especially on modern hardware) by just thinking about the problem differently using array-oriented concepts. 
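A minimal (hypothetical) illustration of that shift in thinking: computing Euclidean distances with an element-by-element Python loop versus a single whole-array expression.

```python
import numpy as np

xs = [3.0, 5.0, 8.0]
ys = [4.0, 12.0, 15.0]

# Element-at-a-time thinking: an interpreted Python loop.
def hypot_loop(xs, ys):
    out = []
    for x, y in zip(xs, ys):
        out.append((x * x + y * y) ** 0.5)
    return out

# Array-oriented thinking: one expression over whole arrays, with the
# inner loops running inside NumPy's compiled ufuncs.
def hypot_array(xs, ys):
    return np.sqrt(xs * xs + ys * ys)

print(hypot_loop(xs, ys))                       # [5.0, 13.0, 17.0]
print(hypot_array(np.array(xs), np.array(ys)))  # [ 5. 13. 17.]
```

The array version is not just faster on large inputs; it states the *what* (a distance over arrays) rather than the *how* (a loop), which is the essence of the array-oriented style described above.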

One of the goals for Numba is to make it possible to write more vectorized code easily in Python without relying just on the pre-compiled loops that NumPy provides.  In order to write Numba, though, we first needed to resurrect the llvm-py project, which provides easy access to the LLVM C++ libraries from Python.  This project is interesting in its own right: in addition to forming a base tool-chain for Numba, it allows you to do very interesting things, like instrument C code compiled with Clang down to bitcode, build a compiler, or import bitcode directly into Python (a la bitey).

While the documentation for llvm-py left me frustrated early on, I have to admit that llvm-py re-kindled some of the joy I experienced when being first exposed to Python.    Over the past several weeks we have worked to create the llvmpy project from llvm-py.   We now have a domain, a GitHub repository, a website served from GitHub, and sphinx-based documents that can be edited via a pull request.    The documentation still needs a lot of improvement (even to get it to the state that the old llvm-py project was in), and contributions are welcome.  

I'm grateful to Fernando Perez, author of IPython, for explaining the 4-repository approach to managing an open source web-site and documentation via GitHub.  We are using the same pattern that IPython uses for both numba and llvmpy.  It took a bit of work to get set up, but it's a nice approach that should make it easier for the community to maintain the documentation and web-site of both of these projects.  The idea is simple.  Use a project-page repo to serve the web-site, but generate that repo from another repo (llvmpy-webpage) which contains the actual sources.  I borrowed the scripts from the IPython project to build the page from the sources, check out the repo, copy the built pages into it, and then push the updates back to GitHub, which actually updates the site.  The same process (slightly modified) is used for the documentation, except the sources for the docs live in the llvmpy repo under the docs directory and the built pages are pushed to the gh-pages branch of the llvmpy-doc repo.  If you are editing sources, you only modify llvmpy/docs and llvmpy-webpage files.  The other repos are generated and pushed via scripts.

We are using the same general scheme to host the numba pages (although there I couldn't get the domain name I wanted).  With llvmpy on a relatively solid footing, attention could be shifted to getting a Numba release out.  Today, we finally released Numba 0.1.  It took longer than expected after the SciPy conference, mainly because we were hoping that some of the changes (still currently in a devel branch) to use an AST-based code-generator could be merged into the mainline before the release.

Jon Riehl did the lion's share of the work to transform Numba from my early prototype to a functioning system in 0.1, with funding from Continuum Analytics, Inc.  Thanks to him, I can proudly say that Numba is ready to be tried and used.  It is still early software --- but it is ready for wider testing.  One of the problems you will have with Numba right now is error reporting.  If you make a mistake in the Python code that you are decorating, the error you get will not be informative --- so test the Python code before decorating it with Numba.  But, if you get things right, Numba can speed up your Python code by 200 times or more.  It is really pretty fun to be able to write image-processing routines in Python.  PyPy can do this too, of course, but with Numba you have full integration with the CPython stack, and you don't have to wait for someone to port the library you also want to use to PyPy.
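To give a flavor of what decorating code with Numba looks like, here is a sketch (the decorator import path has moved around between Numba versions, and the no-op fallback is mine so the example runs even where Numba is not installed):

```python
import numpy as np

try:
    from numba import jit     # compile with Numba when it is available
except ImportError:
    def jit(func):            # no-op fallback: run as plain Python
        return func

@jit
def threshold(img, cutoff):
    # A toy image-processing kernel: explicit loops over a 2-D array are
    # exactly the kind of code Numba can specialize to machine code.
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = 1.0 if img[i, j] > cutoff else 0.0
    return out

img = np.array([[0.2, 0.9],
                [0.7, 0.1]])
print(threshold(img, 0.5))    # each pixel becomes 0.0 or 1.0
```

The point is that the decorated function is ordinary Python operating on NumPy arrays; the same source runs interpreted or compiled.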

Numba's road-map is being defined right now by the people involved in the project.  On the horizon are support for NumPy index expressions (slices, etc.), merging of the devel branch (which uses the AST and Mark Florisson's minivect compiler), improved error checking, emitting calls to the Python C-API for code that cannot be type-specialized, and improved complex-number support.  Your suggestions are welcome.

Monday, July 30, 2012

More PyPy discussions

I'm very glad that my co-founder of Continuum Analytics,  Peter Wang, has published his recent follow-up blog-post that hopefully clarifies his perspective on the on-going dialogue about CPython and PyPy.

Peter is a fundamentally good-natured person, and he is a lot of fun to be around --- even when he is disagreeing with you.   I'm very fortunate to be working with him on a daily basis.   He can be opinionated, but his ability to connect deeply to a wide variety of subjects means that you come away from a dialogue with him having learned something (even if you still remain unconvinced by his views).

Peter is also one of the smartest people I've ever met.   One of my great memories in life is sitting at dinner with Peter and Eric Weinstein while those two great minds treated me, Wes McKinney, and Adam Klein to the most impressive display of metaphor ping-pong I've ever seen, covering a wide variety of topics from social justice to string theory.  I could follow the dialogue, but not well enough to really participate meaningfully --- and the other two Ivy-League-educated dinner partners were in the same boat.

I fundamentally agree with Peter's perspective that CPython-the-runtime is and will remain the centerpiece of the Python conversation.    In fact, I would say that even more focus needs to be on CPython-the-runtime.   It is great to see improvements in Python 3.3 like the completion of the memory-view implementation and the fixing of the internal string (Unicode) representation, but there are many other improvements that could be made.

It is a wonderful and inspiring thing to see great developers think outside the box with novel projects like Jython, IronPython, and PyPy.   Nonetheless, from my perspective we still have a long way to go to really connect the average developer with the ideas of array-oriented computing that could help with the continuing onslaught of parallel-devices-in-search-of-software.   As a result, it feels like those wanting Java, .NET, and machine-code integration would be better served by more attention on JPype, Python.NET, LLVMPy, and even CorePy.   Such efforts would also be better for the entire user-base of Python --- especially the majority of industry users of Python.

But regardless of my perspective, I'm encouraged by the PyPy developers' enthusiasm, and I do want to encourage dialogue.   As a result, I am very happy to report that NumFOCUS and Continuum Analytics recently joined forces to sponsor Maciej Fijalkowski on a small project to create an embedded version of PyPy --- a "PyPy-in-a-Box."  This is an integration of PyPy into the CPython run-time (so that you can speed up a particular CPython function by calling out to a library version of PyPy).   This is proof-of-concept code, so it is not appropriate for production --- but it is a good example of what is possible when we all work together to promote the Python ecosystem.

The online project is here:  and you can get a binary version that works on 64-bit Linux here:

This approach needs more development to be a viable tool in the CPython ecosystem, but one of my suggestions to the PyPy community is that they focus on "shedding-tools" like this one for the CPython world --- so that everyone can benefit from their innovations.   With an integration effort like embedded PyPy, one can also make better comparisons with tools like Numba --- another dynamic-compilation run-time that uses LLVM and llvmpy.     Numba has made a lot of progress in the last few months.   In fact, I recently gave a talk on the project at the well-attended SciPy2012 conference in Austin.   You can view my slides that outline and motivate the project online.   An actual release of the project is imminent, but you can already use Numba to very easily write significant Python code using NumPy arrays that executes at "C-speeds."  But that is worth another blog post of its own....

Saturday, January 7, 2012

Transition to Continuum

Our lives are punctuated by transformational events:  the birth of a child, finishing school, the passing of a loved one, meeting someone special.   Even without the regular beating of celestial rhythms to provide opportunities for renewal, we would have these moments to measure our lives by.   Once in a while, the rhythmic and the asynchronous coincide, providing a particularly poignant opportunity for change.   Jan 1, 2012 was just such a time for me, as I left my position as President of Enthought to start a new venture with Peter Wang (author of Chaco) and others.    Our new company is Continuum Analytics, Inc. (or just Continuum).  Our nascent website, initially targeted only at the Python initiate, is

While I am ecstatic about the new venture, I will definitely miss the team we've built around the world --- a team that has delivered Enthought's second consecutive record year.   This team of exceptional individuals has been very successful at improving and expanding the Python story in a few targeted companies inside the Fortune 50, as well as making it easy to install Python for the masses.    Those who have taken the time to first install and then learn Traits, TraitsUI, Chaco, MayaVi, and the rest of the Enthought Tool Suite have had their efforts rewarded with increased productivity in creating rich client UIs and improved pluggable, scriptable, and component-based architectures.     It has been a highly educational experience to participate with Enthought.  There is much you learn about business, people, and the world when a software consulting company grows from one office with fewer than 17 people to four offices around the world and nearly 50 people.   I will always be grateful to the Enthought founders, employees (past and present), and customers (past and present) for the relationships, the trust, and the thoughtful times we shared in learning, growing, and serving each other.

My heart, however, has always been and continues to be with NumPy and SciPy, which need more support than Enthought can currently provide --- so I must move on.   It took a lot of trust from my wife when (with 3 small children at home) she patiently waited for me while I spent all of 1999 writing Multipack (which in 2001 formed the bulk of SciPy).  It also took trust when in 2005 (with now 5 children at home) she watched me sacrifice my tenure-track position by writing NumPy instead of more papers.   In 2012 (with now 6 children at home), I'm asking her to trust me one more time as I leave a comfortable salary with a good company to put full-time effort into helping take NumPy and SciPy to the next level.

Over the past 4 1/2 years consulting with large companies, I have learned a great deal about what NumPy (and SciPy) can and should be.  These and related tools in the Python ecosystem need to become significant pieces of real solutions to the data-analytics challenges that face us.   R, Hadoop, and other (proprietary) solutions are already staking their claim on the space that Python should be dominating.   Python has significant traction in science and analysis but too little visibility in the nascent nomenclature of data analytics.    In order to accelerate the processing capabilities of Python and related tools, much progress needs to be made.   My New Year's resolution this year is to begin contributing more substantially to that progress by organizing a new company that will hopefully allow many people to spend significant time directly on NumPy and SciPy during working hours.  I also hope to assist any public, non-profit efforts toward that mutual goal, and to spend more time myself on NumPy and SciPy.

To realize my hopes long term, the company must succeed.   For the company to succeed it must find customers --- people willing to buy something that it sells.   People are appropriately particular about what they buy.  Making products that delight will require a lot of work from Continuum, but I am excited to help organize and work alongside the best team we can put together to do it.   This may also mean different business models and licensing around some of the NumPy-related code that the company writes.   I recognize this may cause some raised eyebrows.   I deeply value making code freely available.    I'm a Jeffersonian at heart and believe that ideas (including code) should be shared freely.   Six years ago I experimented by selling my "Guide to NumPy" long enough to make sufficient money to justify the effort.   The book ended up in the public domain and contributed substantially to the current NumPy documentation.   This is an illustration of how resources can be allocated to full-time attention and then later made available for all to enjoy.   Of course there are other models that also work to accomplish similar ends, and we will be actively exploring a few of them.

Despite my ideals, my wife is grateful that I'm a pragmatist with children to provide for.   In addition, I have watched wearily as it has proven difficult to find volunteer labor (including my own) to turn NumPy into the data-management and data-analytics substrate that it should already be.   All of this happens while huge sums of money are wasted at companies large and small inefficiently transforming raw but inaccessible data into something closer to information that can be used for decisions by domain experts.    The information available is not what it can be.  The effort it takes to transform data into actionable information is not where it can be.  The widespread understanding about how to program parallel and distributed machines is not where it can be.  We can and must do better in figuring out how to get full-time attention on NumPy and related tools while still making them widely available.

At Continuum, we have a vision for significantly changing how people manipulate, transform, and uncover their data.   We also have customer-driven plans to achieve it, and we are going to put our full energy into it.   So far, the development team consists of Peter Wang, me, Mark Wiebe, Francesc Alted (PyTables), and Bryan Van de Ven.   We will also be getting part-time but important development help from Hugo Shi and Andy Terrel.  In addition, we are building an initial support/business staff to help us build and grow the business.     We plan to continue to collaborate with others in the community both commercial (e.g. Wes McKinney in his new startup: Lambda Foundry) and open (e.g. Fernando Perez, Brian Granger, Min Ragan-Kelley of IPython fame). If you are interested in either joining us or collaborating with us, please send us an email at  Also, please follow us on Twitter @ContinuumIO or Like us on Facebook.  

We are actively looking for customer partners as well.  If you are interested in learning more about where we are heading and how that might help you, please drop us a line, or come see us at PyCon this year.   We will also be at Strata, and afterwards we will be hosting a Python Data Workshop ("PyData") at the Googleplex.   Please sign up for the PyData workshop wait-list at (we could only find room for 50 people at the Googleplex).  However, given that the event is free of charge, I'm expecting that some people who have reserved their spot may not actually be able to attend, so signing up on the wait-list is still worthwhile.

This year will be an exciting one for us.  When I get a spare moment, I still hope to finish a few of the blog posts that I've started, and possibly add more describing what I've learned over the past several years as a scientist/engineer-turned-software-developer: lessons about running a software company, more of where we are headed at Continuum, reflections on open source, and other more technical ramblings.