Reflections on Anaconda as I start a new chapter with Quansight<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<div style="text-align: left;">
<span style="font-size: large;">Leaving the company you founded is always a tough decision and a tough process that involves many people. It requires a series of potentially emotional "crucial-conversations." It is actually not that uncommon in venture-backed companies for one or more of the original founders to leave at some point. There is a decent article on the topic here: <a href="https://hbswk.hbs.edu/item/the-founding-ceos-dilemma-stay-or-go">https://hbswk.hbs.edu/item/the-founding-ceos-dilemma-stay-or-go</a>.</span><br />
<span style="font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhppCNGjU3fRhNoHLRZG5ayngqnef6SGb2BOT0lqJ3PAtbw5ioI7XxkViGwGgpx1a7BgtMXUmAX5M0ce4tQQqYjO6qw3xKnv2k_x0_pJjzdfjHoWqnbal59ayAHKE0iqb3GoFQz8SCV20/s1600/image.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="286" data-original-width="500" height="183" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhppCNGjU3fRhNoHLRZG5ayngqnef6SGb2BOT0lqJ3PAtbw5ioI7XxkViGwGgpx1a7BgtMXUmAX5M0ce4tQQqYjO6qw3xKnv2k_x0_pJjzdfjHoWqnbal59ayAHKE0iqb3GoFQz8SCV20/s320/image.jpeg" width="320" /></a></div>
</div>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">Still it is extremely difficult to let go. You live and breathe the company you start. Years of working to connect as many people as possible to the dream gives you a feeling of "ownership" and connection that no stock certificate can replace. Starting a company is a lot of work. It takes a lot of effort. There are many decisions to make and many voices to incorporate. Hiring, firing, raising money, engaging customers, engaging employees, planning projects, organizing events, and aligning a pastiche of personalities while staying relevant in a rapidly evolving technology jungle is difficult.</span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">As a founder over 40 with modest means, I had a family of 6 children who relied on me. That family had teenage children who needed my attention and pre-school and elementary-school children that I could not simply leave only in the hands of my wife. I look back and sometimes wonder how we pulled it off. The truth probably lies in the time we borrowed: time from exercise, time from sleep, time from vacations, and time from family. I'd like to say that this dissonance against "work-life-harmony" was always a bad choice, but honestly, I don't see how I could have made too may different choices and still have created Anaconda.</span><br />
<span style="font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://www.drclareallen.com/work-life-harmony-shop/"><span style="font-size: large;"><img alt="Work life harmony" border="0" data-original-height="276" data-original-width="800" height="137" src="https://www.drclareallen.com/wp-content/uploads/2018/01/WLHlogo.jpg" width="400" /></span></a></div>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">Several things drove me. I could not let the people associated with the company down. I would not lose the money for those that invested in us. I could not let down the people who worked their tail off to build manage, document, market, and sell the technology and products that we produced. Furthermore, I would not let the community of customers and users down that had enabled us to continue to thrive.</span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">The only way you succeed as a founder is through your customers being served by the efforts of those who surround you. It is only the efforts of the talented people who joined us in our journey that has allowed Anaconda to succeed so far. It is critical to stay focused on what is in the best interests of those people.</span><br />
<span style="font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVOZd2XT98qDVkabWpJGdvNOGlovwpyztuoahcnyICQySrYPU6sfUmCAWNY5b2h0gzIU17ueWrvYDQY0PS1YP2PmT4t79GN2jICzrIHV1EEpkOsHyTOd2W4qShGjNJ0aH2bhfHDKTJLGc/s1600/Anaconda-group.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-size: large;"><img border="0" data-original-height="1026" data-original-width="1600" height="256" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVOZd2XT98qDVkabWpJGdvNOGlovwpyztuoahcnyICQySrYPU6sfUmCAWNY5b2h0gzIU17ueWrvYDQY0PS1YP2PmT4t79GN2jICzrIHV1EEpkOsHyTOd2W4qShGjNJ0aH2bhfHDKTJLGc/s400/Anaconda-group.png" width="400" /></span></a></div>
<span style="font-size: large;"><br /></span><span style="font-size: large;">Permit me to use the name Continuum to describe the angel-funded and bootstrapped early-stage company that Peter and I founded in 2012 and Anaconda to describe the venture-backed company that Continuum became (This company we called Continuum 2.0 internally that really got started in the summer of 2015 after we raised the first tranche of $22 million from VCs.)</span><br />
<span style="font-size: large;"><br /></span><span style="font-size: large;">Back in 2012, Peter and I knew a few things: 1) we had to connect Python to the Big Data movement; 2) we needed to help the scientific programmer, or a data-scientist developer build visualization-based applications quickly in the web; and 3) we needed to scale the stack of code around the PyData community to bigger hardware and multiple machines. We had big visions of an interconnected data-web, distributed schedulers, and data-structures that traversed the internet which could be analyzed across the cloud with simple Python scripts. We talked and talked about these things and grew misty-eyed in our enthusiasm for the potential of what was possible if we just built the right technology and sold just the right product to fund it.</span><br />
<span style="font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://i1.wp.com/www.jcount.com/wp-content/uploads/2015/07/continuum_analytics_logo.png?fit=716%2C346&ssl=1" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-size: large;"><img border="0" data-original-height="346" data-original-width="716" height="154" src="https://i1.wp.com/www.jcount.com/wp-content/uploads/2015/07/continuum_analytics_logo.png?fit=716%2C346&ssl=1" width="320" /></span></a></div>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">We knew that we wanted to build a product-company -- though we didn't know exactly what those products would be at the outset. We had some ideas, only portions of which actually worked out. I knew how to run a consulting and training company around Python and open-source. Because of this, I felt comfortable raising money from family members. While consulting companies are not "high-growth" they can make real returns for investors. I was pretty confident that I would not lose their money.</span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">We raised $2.25million from a few dozen investors consisting of Peter's family, my family, and a host of third-parties from our mutual networks. Peter's family was critical to this early stage because they basically "led the early round" and ensured that we could get off the ground. After they put their money in the bank, we could finish raising the rest of the seed round which took about 6 months to finish.</span><br />
<span style="font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhN3Vs0zKuoj2MC1OCTNPryho4tJEFJtg65d69preRBlqiRCe5h0neRlSBDpUUa1ZDTBnXehBMh0eE0pfeZO2VGCpY8lQF9Ep_tgsLlCRL2YKygGgWqHv4l0pwUGIBPBIDgwlrxW0uAytk/s1600/continuum1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-size: large;"><img border="0" data-original-height="974" data-original-width="1600" height="242" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhN3Vs0zKuoj2MC1OCTNPryho4tJEFJtg65d69preRBlqiRCe5h0neRlSBDpUUa1ZDTBnXehBMh0eE0pfeZO2VGCpY8lQF9Ep_tgsLlCRL2YKygGgWqHv4l0pwUGIBPBIDgwlrxW0uAytk/s400/continuum1.png" width="400" /></span></a></div>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">It is interesting (and somewhat embarrassing and so not detailed here) to go back and look at what products we thought we would be making. Some of the technologies we ended up building (like Excel integration, Numba, Bokeh, and Dask) were reflected in those early product dreams. However, the real products and commercial success that Anaconda has had so far are only a vague resemblance to what we thought we would do.</span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">Building a Python distribution was the last thing on our minds. I had been building Python distributions since I released SciPy in 2001. As I have often repeated, SciPy was actually the first Python distribution masquerading as a library. The single biggest effort in releasing SciPy was building the binary installers and making sure everything compiled well. With Fortran compilers still more scarce than they should be, it can still be difficult to compile and build SciPy.</span><br />
<span style="font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://awsmp-logos.s3.amazonaws.com/0000Continuum.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-size: large;"><img border="0" data-original-height="178" data-original-width="800" height="71" src="https://awsmp-logos.s3.amazonaws.com/0000Continuum.PNG" width="320" /></span></a></div>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">Fortunately, with conda, conda-forge, and Anaconda, along with the emergence of wheels, almost nobody needs to build SciPy anymore. It is so easy today to get started with a data-science project and get all the software you need to do amazing work fast. You still have to work to maintain your dependencies and keep that workflow reproducible. But, I'm so happy that Anaconda makes that relatively straightforward today.</span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">This was only possible because General Catalyst and BuildGroup joined us in the journey in the spring of 2015 to really grow the Anaconda story. Their investment allowed us to 1) convert to a serious product-company from a bootstrapped consulting company with a few small products and 2) continue to invest heavily in conda, conda-forge, and Anaconda.</span><br />
<span style="font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://c93fea60bb98e121740fc38ff31162a8.s3.amazonaws.com/wp-content/uploads/2016/02/GeneralCatalyst-logo.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-size: large;"><img border="0" data-original-height="254" data-original-width="800" height="126" src="https://c93fea60bb98e121740fc38ff31162a8.s3.amazonaws.com/wp-content/uploads/2016/02/GeneralCatalyst-logo.jpg" width="400" /></span></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://crunchbase-production-res.cloudinary.com/image/upload/c_lpad,h_256,w_256,f_auto,q_auto:eco/v1490688337/rrdm5gpdfnsgyrj3pglm.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-size: large;"><img border="0" data-original-height="256" data-original-width="256" height="200" src="https://crunchbase-production-res.cloudinary.com/image/upload/c_lpad,h_256,w_256,f_auto,q_auto:eco/v1490688337/rrdm5gpdfnsgyrj3pglm.png" width="200" /></span></a></div>
<span style="font-size: large;">There is nothing like real-world experience as a teacher, and the challenge of converting to a serious product company was a tremendous experience that taught me a great deal. I'm grateful to all the people who brought their best to the company and taught me everyday. It was a privilege and an honor to be a part of their success. I am grateful for their patience with me as my "learning experiences" often led to real struggles for them.</span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">There are many lasting learnings that I look forward to applying in future endeavors. The one that deserves mention in this post, however, is that building enterprise software that helps open-source communities should be done by selling a complementary product to the open-source. The "open-core" model does not work as well. I'm a firm believer that there will always be software to sell, but infrastructure should be and will be open-source --- sustained vibrantly from the companies that depend on it. Joel Spolsky has written about complementary products before. You should read <a href="https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/">his exposition.</a></span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">Early on at Anaconda, Peter and I decided to be a board-led company. This board which includes Peter and I has the final say in company leadership and made the important decision to transition Anaconda from being founder-led to being led by a more experienced CEO. After this transition and through multiple conversations over many months we all concluded that the best course of action that would maximize my energy and passion while also allowing Anaconda to focus on its next chapter would be for me to spin-out of Anaconda and start a new services and open-source company where I could pursue a broader mission.</span><br />
<span style="font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvBaztVW4wHPjg25hP_uQWxor4rO69mNPFe31KWhA25lFNJ46PftBCx2Q2jEW1qLdETELsBCmX4FIPcVqrtPL_byV-6Y8ewz7HyZznfktwpCA05cWuC2ayfduWNznmcDHaylstFKUuSI8/s1600/side-by-side-logo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-size: large;"><img border="0" data-original-height="372" data-original-width="1600" height="91" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvBaztVW4wHPjg25hP_uQWxor4rO69mNPFe31KWhA25lFNJ46PftBCx2Q2jEW1qLdETELsBCmX4FIPcVqrtPL_byV-6Y8ewz7HyZznfktwpCA05cWuC2ayfduWNznmcDHaylstFKUuSI8/s400/side-by-side-logo.png" width="400" /></span></a></div>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">This new company is Quansight (short for Quantitative Insight). Our place-holder homepage is at <a href="http://www.quansight.com/">http://www.quansight.com</a> and we are @quansightai on Twitter. I'm excited to tell you more about the company in future blog-posts and announcements. A few paragraphs will suffice for now.</span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">Our overall mission is to develop people, build technology, and discover products to empower people with knowledge and data to solve the world’s most challenging problems. We are doing that currently by connecting organizations sustainably with open source communities to solve their hardest problems by enabling teams to transparently apply science to their data.</span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">One of the things we are doing is to help companies get started with AI and ML by applying the entire PyData stack to the fundamental data organization, data visualization, and model management problem that is required for practical success with ML and AI in business. We also help companies generally improve their data-science practice by leveraging all the power of the Python, PyData, and related ecoystems.</span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">We are also hard at work on the sustainability problem by continuing the tradition we started at Continuum Analytics of building successful and sustainable open-source "practices" that synchronize company needs with open-source technology development. We have some innovative business approaches to this that we will be announcing in the coming weeks and months.</span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">I'm excited that we have several devs working hard to help bring<a href="https://blog.jupyter.org/jupyterlab-is-ready-for-users-5a6f039b8906"> JupyterLab</a> to 1.0 this year along with a vibrant community. There are many exciting extensions to this remarkable platform that remain to be written.</span><br />
<span style="font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://jupyterlab.readthedocs.io/en/stable/_images/interface_jupyterlab.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-size: large;"><img border="0" data-original-height="450" data-original-width="800" height="225" src="https://jupyterlab.readthedocs.io/en/stable/_images/interface_jupyterlab.png" width="400" /></span></a></div>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">We also expect to continue to contribute to the <a href="http://www.pyviz.org/">PyViz</a> activities that continue to explode in the Python ecosystem as visualization is a critical first step to understanding and using any data you care about.</span><br />
<span style="font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<span style="font-size: large;"><a href="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQKyE0I3FuO2Fo280pcL-R8km5KExQQZDX1YsRTVtTF5hHnx2BD" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="209" data-original-width="209" height="200" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQKyE0I3FuO2Fo280pcL-R8km5KExQQZDX1YsRTVtTF5hHnx2BD" width="200" /></a><a href="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTGyPFuAnsW3lL2M-3H3plUx2GiiyZdK4IAzfrtQ1ejNFE7KcRczQ" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="166" data-original-width="304" height="218" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTGyPFuAnsW3lL2M-3H3plUx2GiiyZdK4IAzfrtQ1ejNFE7KcRczQ" width="400" /></a></span></div>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">Finally, Stefan Krah has joined us at Quansight. Stefan is an <a href="http://pyfound.blogspot.com/2012/12/stefan-krah-chosen-for-q4-community.html">award-winning</a> Python core developer who has been steadily working over the past 18 months on a small but powerful collection of projects collectively called <a href="https://github.com/plures">Plures</a>. These will be more broadly available in the next few months and published under the xnd brand. Xnd is a generic container concept in C with a Python binding that together with its siblings ndtypes and gumath allows building flexible array-computing pipelines over many kinds of data-types.</span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">This technology will serve to underly any array-computing framework and be a glue between machine-learning and data-science frameworks of all kinds. Our plan is to use this tool to help reduce the data and computational silos that currently exist across the open-source ecosystem.</span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">There is still much to work on and many more technologies to emerge. It's an exciting time to work in machine learning, data-science, and scientific computing. I'm thrilled that I continue to get the opportunity to be part of it. <a href="mailto:info@quansight.com"> Let me know</a> if you'd like to be a part of our journey.</span></div>
NumFOCUS past and future.<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://www.numfocus.org/">NumFOCUS</a> just finished its 5th year of operations, and I've lately been reflective on the early days and some of the struggles we went through to get the organization started. It once was just an idea in a few community-minded developer's heads and now exists as an important non-profit Foundation for Open Data Science, democratic and reproducible discovery, and a champion for technical progress through diversity.<br />
<br />
When <a href="https://www.continuum.io/people/peter-wang">Peter Wang</a> and I started <a href="https://www.continuum.io/">Continuum</a> in early 2012, I had already started the ball rolling to create NumFOCUS. I knew that we needed a non-profit that would provide leadership and be a focus of community activity outside of any one company. I strongly believe that for open-source to thrive, full-time attention needs to be paid to it by many people. This requires money. With the tremendous interest in and explosion around the NumPy community, it was clear to me that this federation of loosely-coupled people needed some kind of organization that could be community-led and could be a rallying point for community activity and community-led financing. The potential also exists for NumFOCUS to act as a source of community-based accountability that encourages positively reinforcing behavior in the open-source communities it intersects with.<br />
<br />
In late 2011, I started a new mailing list and invited anyone interested in discussing the idea of an independent community-run organization to the list. Over 100 people responded, so I knew there was interest. We debated on that list what to call the new concept for several weeks, and Anthony Scopatz's name "NumFOCUS" stuck as the best alternative over several other names. As an acronym, NumFOCUS could mean Numerical Foundation for Open Code and Usable Science. I created a new mailing list, and then set about creating the legal organization called NumFOCUS and filing the necessary paperwork.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://pbs.twimg.com/profile_images/1864199033/fperez_photo2_sm.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://pbs.twimg.com/profile_images/1864199033/fperez_photo2_sm.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Fernando Perez</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://raw.githubusercontent.com/fperez/blog/master/fig/johnhunter-head.jpg" imageanchor="1" style="clear: left; display: inline; margin-bottom: 1em; margin-left: auto; margin-right: auto; text-align: center;"><img border="0" height="200" src="https://raw.githubusercontent.com/fperez/blog/master/fig/johnhunter-head.jpg" width="148" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">John Hunter</td></tr>
</tbody></table>
In December of 2011, I coordinated with Fernando Perez, Perry Greenfield, John Hunter, and Jarrod Millman, who had all expressed some interest in the idea, and we incorporated in Texas (using LegalZoom) and became the first board of NumFOCUS. We had a very simple set of bylaws and purposes, all centered around making science more accessible. We decided to meet every other week. We all knew we were creating something that would last a long time.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqE8pj53z5uOZ9F_ltMeeU0sfnK-eIHpVBlyxE9MewQN-GNbYEid2Rev2VFvoO6lEwH0PJz-7xzYdG7wwEw6QHi36tr4uJbREKpJeBhr0j64zlSridg3FlNNVukOdVWQl4QJfctpKZRss/s200/perry.jpg" style="margin-left: auto; margin-right: auto;" width="150" /></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Perry Greenfield</td></tr>
</tbody></table>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://avatars3.githubusercontent.com/u/123428?v=3&s=460" imageanchor="1" style="clear: right; display: inline !important; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://avatars3.githubusercontent.com/u/123428?v=3&s=460" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Jarrod Millman</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqE8pj53z5uOZ9F_ltMeeU0sfnK-eIHpVBlyxE9MewQN-GNbYEid2Rev2VFvoO6lEwH0PJz-7xzYdG7wwEw6QHi36tr4uJbREKpJeBhr0j64zlSridg3FlNNVukOdVWQl4QJfctpKZRss/s1600/perry.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"></a></div>
<br />
In early 2012, I wanted to ensure NumFOCUS's success and knew that it needed a strong, full-time Executive Director to make that happen. The problem was that NumFOCUS didn't have a lot of money. A few of the board members had made donations, but Continuum, with its own limited means, was funding the majority of the costs of getting NumFOCUS started. With the legal organization started, I created bank accounts and set up the ability for people to donate to NumFOCUS with help from Anthony Scopatz, who was the first treasurer of NumFOCUS.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://sc.edu/study/colleges_schools/engineering_and_computing/study/areas_of_study/mechanical_engineering/medepartment_coe/people/images/faculty_for_read/scopatz_read.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://sc.edu/study/colleges_schools/engineering_and_computing/study/areas_of_study/mechanical_engineering/medepartment_coe/people/images/faculty_for_read/scopatz_read.jpg" width="170" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Anthony Scopatz</td></tr>
</tbody></table>
I had met Leah Silen through other community interactions in Austin back in 2007. I knew her to be a very capable and committed person and thought she might be available. I asked her if she would come aboard and be employed by Continuum but work full-time for NumFOCUS and the new board. She accepted and the organization of NumFOCUS began to improve immediately.<br />
<br />
With her help, we transitioned the organization from its LegalZoom beginnings to register directly with the Secretary of State in Texas and started the application process to become a 501(c)3. She also quickly became involved in organizing the <a href="http://pydata.org/">PyData</a> conferences, which Continuum initially spear-headed along with <a href="https://www.linkedin.com/in/juliesteele">Julie Steele</a> and <a href="https://www.linkedin.com/in/wilder-james/">Edd Wilder-James</a> (at the time from O'Reilly). In 2012, we had our first successful PyData conference at the <a href="https://careers.google.com/locations/mountain-view/">GooglePlex</a> in Mountain View. It was clear that PyData could be used as a mechanism to provide revenue for NumFOCUS (at least to support Leah and other administrative help).<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkGxvvgnqwBgZObjEovSTPAeJFF8MZGco4UWyJBsN_U99jhnnpIU0lUPecyf39CO-FUXBK16P6cNRHUitWx-CMm6woQSxG25TCpbwNWMjICDJiaSyhqpWA4npyZ0-oQv4WkR6oqjEElx4/s1600/_5738167.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkGxvvgnqwBgZObjEovSTPAeJFF8MZGco4UWyJBsN_U99jhnnpIU0lUPecyf39CO-FUXBK16P6cNRHUitWx-CMm6woQSxG25TCpbwNWMjICDJiaSyhqpWA4npyZ0-oQv4WkR6oqjEElx4/s1600/_5738167.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Leah Silen</td></tr>
</tbody></table>
<br />
We worked under that model through 2013 and 2014, with Continuum spending a lot of human resources and money organizing and running PyData and any proceeds going directly to NumFOCUS. In those years the proceeds were only enough to help pay for part of Leah's salary. The rest of Leah's salary and the PyData expenses came from Continuum, which itself was still a small startup.<br />
<br />
During these years of PyData growth in communities around the world, James Powell became a drumbeat of consistency and community engagement. He has paid his own way to nearly every PyData event throughout the world. He has acted as emcee, volunteer extraordinaire, and popular speaker with his clever implementations and explanations of the Python stack.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://pbs.twimg.com/profile_images/378800000701658550/21ba45b6323debfe3d9dd60ca9b35483_400x400.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://pbs.twimg.com/profile_images/378800000701658550/21ba45b6323debfe3d9dd60ca9b35483_400x400.jpeg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">James Powell<br />
@dontusethiscode</td></tr>
</tbody></table>
<br />
Andy Terrel had been a friend of NumFOCUS, a member of the community, and active with the board from its beginning. In 2014, while working at Continuum, he took over my board seat. In that capacity, he worked hard to gain financial independence for NumFOCUS. He was instrumental in moving PyData fully to NumFOCUS management. I was comfortable stepping back from the board and stepping down in my involvement around organizing and backing PyData from a financial perspective because I trusted Andy's leadership and non-profit management instincts. He, James Powell, Leah, and all the other local PyData meetups and organizations worldwide have done an impressive thing in self-organizing and growing the community. We should all be grateful for their efforts.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXFrU_S-KCfoqOg_0DlntNdOU2lm-G__x4LmH7gWygNGoXivx8XxDgu3n1NqpCjmLcPcEaFrQ59wq_7nutUKCCuBpIhyx8OHqGV6qzmy_Q_yc_W0Rg3xqxwU665tQvH_YDclaEXxtrAMI/s1600/756-speaker.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="212" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXFrU_S-KCfoqOg_0DlntNdOU2lm-G__x4LmH7gWygNGoXivx8XxDgu3n1NqpCjmLcPcEaFrQ59wq_7nutUKCCuBpIhyx8OHqGV6qzmy_Q_yc_W0Rg3xqxwU665tQvH_YDclaEXxtrAMI/s320/756-speaker.jpeg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Andy Terrel</td></tr>
</tbody></table>
<br />
I am very proud of the work I did to help start NumFOCUS and PyData. I hope to remember it as one of the most useful things I've done professionally. I am very grateful for all the others who also helped to create NumFOCUS as well as PyData. So many have worked hard to ensure it can be a worldwide and community-governed organization to support Open Data Science for a long time to come. I'm proud of the funding and people-time that Continuum provided to get NumFOCUS and PyData started as well as the on-going support of NumFOCUS that Continuum and other industry partners continue to provide.<br />
<br />
Now, as an adviser to the organization, I get to hear from time to time how things are going. I'm very impressed at the progress being made by the dedication of the current leadership behind Andy Terrel as President and Leah Silen as Executive Director and the rest of the <a href="http://www.numfocus.org/board.html">current board</a>.<br />
<br />
If you use or appreciate any of the tools in the Open Data Science ecosystem that NumFOCUS sponsors, I encourage you to join and/or make a supporting donation here: <a href="http://www.numfocus.org/support-numfocus.html">http://www.numfocus.org/support-numfocus.html</a>. Help NumFOCUS continue its mission to support the tools and communities you rely on every day.</div>
Anaconda and Hadoop --- a story of the journey and where we are now.<div dir="ltr" style="text-align: left;" trbidi="on">
<h2 style="text-align: left;">
Early Experience with Clusters</h2>
My first real experience with cluster computing came in 1999 during my graduate school days at the Mayo Clinic. These were wonderful times. My advisor was <a href="http://www.mayo.edu/research/faculty/greenleaf-james-f-ph-d/bio-00077056">Dr. James Greenleaf</a>. He was very patient in allowing me to pester a bunch of IT professionals throughout the hospital to collect their aging <a href="https://upload.wikimedia.org/wikipedia/commons/c/c9/Macintosh_Performa_6300.jpg">Mac Performa</a> machines and build my own home-grown cluster. He also let me use a bunch of space in his ultrasound lab to host the cluster for about 6 months.<br />
<br />
<h4 style="text-align: left;">
Building my own cluster</h4>
The form-factor of those Mac machines really made it easy to stack them. I ended up with 28 machines in two stacks of 14 (all plugged into a few power strips and a standard lab-quality outlet). With the recent release of <a href="https://en.wikipedia.org/wiki/Yellow_Dog_Linux">Yellow Dog Linux</a>, I wiped the standard OS from all those Macs and installed Linux to create a beautiful cluster of UNIX goodness I could really get excited about. I called my system "The Orchard" and thought it would be difficult to come up with 28 different apple varieties to name each machine after. It wasn't difficult. It turns out there are over <a href="https://extension.illinois.edu/apples/facts.cfm">7,500 varieties</a> of apples grown throughout the world.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDHo8W_iNOD5cVNW9Z86kVMnN9unaahJfJ8qF2lDcGmvVtu6-WvPEzdDptomhwidshSIy3-E0EPCizxcSXg4FixjExjQnS1Xo-l2yJHYfTavTcPS4_TytelKn-Q2RESaqZ_geYNExCezE/s1600/Resized_20160525_203000.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="257" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDHo8W_iNOD5cVNW9Z86kVMnN9unaahJfJ8qF2lDcGmvVtu6-WvPEzdDptomhwidshSIy3-E0EPCizxcSXg4FixjExjQnS1Xo-l2yJHYfTavTcPS4_TytelKn-Q2RESaqZ_geYNExCezE/s320/Resized_20160525_203000.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Me smiling alongside by smoothly humming "Orchard" of interconnected Macs</td></tr>
</tbody></table>
<br />
The reason I put this cluster together was to simulate Magnetic Resonance Elastography (<a href="https://en.wikipedia.org/wiki/Magnetic_resonance_elastography">MRE</a>) which is a technique to visualize motion using Magnetic Resonance Imaging (MRI). I wanted to simulate the <a href="https://en.wikipedia.org/wiki/Bloch_equations">Bloch equations</a> with a classical model for how MRI images are produced. The goal was to create a simulation model for the MRE experiment that I could then use to both understand the data and perhaps eventually use this model to determine material properties directly from the measurements using Bayesian inversion (ambitiously bypassing the standard sequential steps of inverse FFT and local-frequency estimation).<br />
<br />
Now I just had to get all these machines to talk to each other, and then I would be poised to do anything. I read up a bit on MPI, PVM, and anything else I could find about getting computers to talk to each other. My unfamiliarity with the field left me puzzled as I tried to learn these frameworks in addition to figuring out how to solve my immediate problem. Eventually, I just settled down with a trusted <a href="http://www.amazon.com/gp/product/013490012X/ref=nav_timeline_asin?ie=UTF8&psc=1">UNIX book</a> by the late <a href="https://en.wikipedia.org/wiki/W._Richard_Stevens">W. Richard Stevens</a>. This book explained how the internet works. I learned enough about TCP/IP and sockets so that I could write my own C++ classes representing the model. These classes communicated directly with each other over raw sockets. While using sockets directly was perhaps not the best approach, it did work and helped me understand the internet so much better. It also makes me appreciate projects like tornado and zmq that much more.<br />
<br />
<h4 style="text-align: left;">
Lessons Learned</h4>
I ended up with a system that worked reasonably well, and I could simulate MRE to some degree of fidelity with about 2-6 hours of computation. This little project didn't end up being critical to my graduation path, and so it was abandoned after about 6 months. I still value what I learned about C++, how abstractions can ruin performance, how to guard against that, and how to get machines to communicate with each other.<br />
<br />
Using Numeric, Python, and my recently-linked ODE library (early SciPy), I built a simpler version of the simulator that was actually faster on one machine than my cluster-version in C++ was on 20+ machines. I certainly could have optimized the C++ code, but I could have also optimized the Python code. The Python code took me about 4 days to write; the C++ code took me about 4 weeks. This experience has markedly influenced my thinking for many years about both premature parallelization and premature use of C++ and other compiled languages.<br />
<br />
Fast forward over a decade. My computing efforts until 2012 were spent on sequential array-oriented programming, creating SciPy, writing NumPy, solving inverse problems, and watching a few parallel computing paradigms emerge while I worked on projects to provide for my family. I didn't personally get to work on parallel computing problems during that time, though I always dreamed of going back and implementing this MRE simulator using a parallel construct with NumPy and SciPy directly. When I needed to do the occasional parallel computing example during this intermediate period, I would use either IPython parallel or multi-processing.<br />
<br />
<h2 style="text-align: left;">
Parallel Plans at Continuum</h2>
In 2012, <a href="http://twitter.com/pwang">Peter Wang</a> and I started <a href="http://www.continuum.io/">Continuum</a>, created <a href="http://www.pydata.org/">PyData</a>, and released <a href="http://continuum.io/downloads">Anaconda</a>. We also worked closely with members of the community to establish <a href="http://www.numfocus.org/">NumFOCUS</a> as an independent organization. In order to give NumFOCUS the attention it deserved, we hired the indefatigable <a href="http://www.numfocus.org/staff.html">Leah Silen</a> and donated her time entirely to the non-profit so she could work with the community to grow PyData and the Open Data Science community and ecosystem. It has been amazing to watch the community-based, organic, and independent growth of NumFOCUS. It took effort and resources to jump-start, but now it is moving along with a <a href="http://www.numfocus.org/board.html">diverse community driving it</a>. It is a great organization to join and contribute effort to.<br />
<br />
A huge reason we started Continuum was to bring the NumPy stack to parallel computing --- for both scale-up (many cores) and scale-out (many nodes). We knew that we could not do this alone and it would require creating a company and rallying a community to pull it off. We worked hard to establish PyData as a conference and concept and then transitioned the effort to the community through NumFOCUS to rally the community behind the long-term mission of enabling data-, quantitative-, and computational-scientists with open-source software. To ensure everyone in the community could get the software they needed to do data science with Python quickly and painlessly, we also created Anaconda and made it freely available.<br />
<br />
In addition to important community work, we knew that we would need to work alone on specific, hard problems to also move things forward. As part of our goals in starting Continuum we wanted to significantly improve the status of Python in the JVM-centric Hadoop world. Conda, Bokeh, Numba, and Blaze were the four technologies we started specifically related to our goals as a company beginning in 2012. Each had a relationship to parallel computing including Hadoop.<br />
<br />
<a href="http://conda.pydata.org/">Conda</a> enables easy creation and replication of environments built around deep and complex software dependencies that often exist in the data-scientist workflow. This is a problem on a single node --- it's an even bigger problem when you want that environment easily updated and replicated across a cluster.<br />
<br />
<a href="http://bokeh.pydata.org/">Bokeh</a> allows visualization-centric applications backed by quantitative-science to be built easily in the browser --- by non web-developers. With the release of Bokeh 0.11 it is extremely simple to create <a href="http://demo.bokehplots.com/">visualization-centric-web-applications and dashboards</a> with simple Python scripts (or also R-scripts thanks to <a href="http://hafen.github.io/rbokeh/">rBokeh</a>).<br />
<br />
With Bokeh, Python data scientists now have the power of both d3 and Shiny, all in one package. One of the driving use-cases of Bokeh was easy visualization of large data. Connecting the visualization pipeline with large-scale cluster processing was always a goal of the project. Now, with <a href="https://github.com/bokeh/datashader">datashader</a>, this goal is also being realized to <a href="http://go.continuum.io/datashader/">visualize billions of points in seconds and display them in the browser</a>.<br />
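<br />
As a minimal sketch of the Bokeh plotting interface (the file name and data are hypothetical), a few lines of Python produce a standalone interactive plot in the browser:<br />
<pre>
from bokeh.plotting import figure, output_file, show

# Write a standalone HTML file containing the interactive plot
output_file("lines.html")

p = figure(title="A minimal Bokeh plot", x_axis_label="x", y_axis_label="y")
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)

show(p)  # opens the plot in the browser
</pre>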
<br />
Our scale-up computing efforts centered on the open-source <a href="http://numba.pydata.org/">Numba</a> project as well as our <a href="https://docs.continuum.io/accelerate/index">Accelerate product</a>. Numba has made tremendous progress in the past couple of years and is in production use in multiple places. Many are taking advantage of <a href="http://numba.pydata.org/numba-doc/dev/user/vectorize.html">numba.vectorize</a> to create array-oriented solutions and program the GPU with ease. The <a href="http://numba.pydata.org/numba-doc/dev/cuda/index.html">CUDA Python</a> support in Numba makes it the easiest way to program the GPU that I'm aware of. The <a href="http://numba.pydata.org/numba-doc/dev/cuda/simulator.html">CUDA simulator</a> provided in Numba makes it much simpler to debug the logic of CUDA-based GPU programming in Python. The addition of parallel contexts to numba.vectorize means that any many-core architecture can now be exploited in Python easily. Early <a href="http://numba.pydata.org/numba-doc/dev/hsa/index.html">HSA support</a> is also in Numba now, meaning that Numba can be used to program novel hardware devices from many vendors.<br />
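<br />
For example, here is a small sketch of numba.vectorize creating a parallel ufunc (the function and array sizes are hypothetical):<br />
<pre>
import numpy as np
from numba import vectorize

# Compile an element-wise ufunc; target='parallel' spreads the work
# across all available cores (target='cuda' would target the GPU).
@vectorize(['float64(float64, float64)'], target='parallel')
def rel_diff(x, y):
    return 2.0 * (x - y) / (x + y)

a = np.arange(1.0, 1e6)
b = a + 1.0
result = rel_diff(a, b)  # behaves like any NumPy ufunc, but runs in parallel
</pre>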
<br />
<h3 style="text-align: left;">
Summarizing Blaze </h3>
The ambitious <a href="http://blaze.pydata.org/">Blaze project</a> will require another blog-post to explain its history and progress well. I will only try to summarize the project and where it's heading. Blaze came out of a combination of deep experience with industry problems in finance, oil & gas, and other quantitative domains that would benefit from a large-scale logical array solution that was easy to use and connected with the Python ecosystem. We observed that the MapReduce engine of Hadoop was definitely not what was needed. We were also aware of Spark and RDDs but felt that they, too, were not general enough (nor flexible enough) for the demands of distributed array computing we encountered in those fields.<br />
<br />
<h4 style="text-align: left;">
DyND, Datashape, and a vision for the future of Array-computing </h4>
After early work trying to extend the NumPy code itself led to struggles, because of both the organic complexity of the code base and the stability needs of a mature project, the Blaze effort started with an attempt to re-build the core functionality of NumPy and Pandas and fix some major warts of NumPy that had been on my mind for some time. With Continuum support, Mark Wiebe decided to continue developing a C++ library that could then be used by Python and any other data-science language (<a class="gr-progress" href="https://github.com/libdynd">DyND</a>). This necessitated defining a new data-description language (<a href="https://github.com/blaze/datashape">datashape</a>) that generalizes NumPy's dtype to structures of arrays (column-oriented layout) as well as variable-length strings and categorical types. This work continues today and is making rapid progress, which I will leave to others to <a href="https://www.continuum.io/blog/developer-blog/dynd-callables-speed-and-flexibility">describe in more detail</a>. I do want to say, however, that DyND is implementing my "Pluribus" vision for the future of array-oriented computing in Python. We are factoring the core capability into 3 distinct parts: the type-system (or data-declaration system), a generalized function mechanism that can interact with any "typed" memory-view or "typed" buffer, and finally the container itself. We are nearing release of a separated type-library and are working on a separate C-API to the generalized function mechanism. This is where we are heading, and it will allow maximum flexibility and re-use in the dynamic and growing world of Python and data-analysis. The DyND project is worth checking out right now (if you have a desire to contribute) as it has made rapid progress in the past 6 months.<br />
<br />
As we worked on the distributed aspects of Blaze, we came to the realization that to scale array computing to many machines you fundamentally have to move code and not data. To do this well means that how the computer actually sees and makes decisions about the data must be exposed. This information is usually part of the type system that is hidden either inside the compiler, in the specifics of the data-base schema, or implied as part of the runtime. To fundamentally solve the problem of moving code to data in a general way, a first-class and widespread data-description language must be created and made available. Python users will recognize that a subset of this kind of information is contained in the struct module (the struct <a href="https://docs.python.org/3.0/library/struct.html">"format" strings</a>), in the Python 3 extended buffer protocol definition (<a href="https://www.python.org/dev/peps/pep-3118/">PEP 3118</a>), and in NumPy's <a href="http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.dtype.html">dtype system</a>. Extending these concepts to any language is the purpose of datashape.<br />
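<br />
As a small sketch of what this data-description language looks like in the datashape library itself:<br />
<pre>
from datashape import dshape

# A fixed-size array type: 5 rows of 3 float64 values
arr = dshape("5 * 3 * float64")

# A table-like type: a variable-length collection of records
# with a variable-length string field and an integer field
table = dshape("var * {name: string, amount: int64}")
print(table)  # var * {name: string, amount: int64}
</pre>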
<br />
In addition, run-times that understand this information and can execute instructions on variables that expose this information must be adapted or created for every system. This is part of the motivation for DyND and why very soon the datashape system and its C++ type library will be released independently from the rest of DyND and Blaze. This is fundamentally why DyND and datashape are such important projects to me. I see in them the long-term path to massive code-reuse, the breaking down of data-silos that currently cause so much analytics algorithm duplication and lack of cooperation.<br />
<br />
Simple algorithms, from data-munging scripts to complex machine-learning solutions, must currently be re-built for every kind of data-silo unless there is a common way to actually bring code to data functionally. Datashape and the type-library runtime from DyND (ndt) will allow this future to exist. I am eager to see the <a href="https://arrow.apache.org/">Apache Arrow</a> project succeed as well because it has related goals (though more narrowly defined).<br />
<br />
The next step in this direction is an on-disk and in-memory <a href="https://github.com/blaze/datafabric">data-fabric</a> that allows data to exist in a distributed file-system or shared memory across a cluster, with a pointer to the head of that data along with a data-shape description of how to interpret that pointer, so that any language that can understand the bytes in that layout can be used to execute analytics on those bytes. The C++ type run-time stands ready to support any language that wants to parse and understand data-shape-described pointers in this future data-fabric.<br />
<br />
From one point of view, this DyND and data-fabric effort are a natural evolution of the efforts I started in 1998 that led to the creation of SciPy and NumPy. We built a system that allows existing algorithms in C/C++ and Fortran to be applied to any data in Python. The evolution of that effort will allow algorithms from many other languages to be applied to any data in memory across a cluster.<br />
<h4 style="text-align: left;">
</h4>
<h4 style="text-align: left;">
Blaze Expressions and Server</h4>
The key part of Blaze that is also important to mention is the notion of the Blaze server and user-facing Blaze expressions and functions. This is now what Blaze the project actually entails --- while other aspects of Blaze have been pushed into their respective projects. Functionally, the Blaze server allows the data-fabric concept on a machine or a cluster of machines to be exposed to the rest of the internet as a data-url (e.g. http://mydomain.com/catalog/datasource/slice). This data-url can then be consumed as a variable in a Blaze expression --- first across entire organizations and then across the world.<br />
<br />
This is the truly exciting part of Blaze: it would enable all the data in the world to be as accessible as an already-loaded data-frame or array. The logical expressions and transformations you write on those data become your "logical computer," translated at compute time into the actual run-time instructions as determined by the Blaze server, which mediates communication with various backends depending on where the data is actually located. We are realizing this vision on many data-sets, and for a certain set of expressions, already with a growing collection of backends. It is allowing true "write-once and run anywhere" to be applied to data-transformations and queries and eventually data-analytics. Currently, the data-scientist finds herself in a situation similar to the assembly programmer in the 1960s who had to know what machine the code would run on before writing the code. Before beginning a data analytics task, you have to determine which data-silo the data is located in before tackling the task. SQL has provided a database-agnostic layer for years, but it is too limiting for advanced analytics --- and user-defined functions are still database-specific.<br />
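<br />
A brief sketch of what Blaze expressions look like (the file name is hypothetical; the same expressions can target a SQL database, Spark, or a Blaze server data-url):<br />
<pre>
from blaze import data, by

# Point Blaze at a data source; expressions are backend-agnostic
accounts = data('accounts.csv')

# Build "logical" queries; nothing executes until results are requested
large = accounts[accounts.amount > 1000].name
totals = by(accounts.name, total=accounts.amount.sum())
</pre>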
<br />
Continuum's support of Blaze development is currently shaped by our consulting customers as well as by the demands of our Anaconda platform and the feature-set of an exciting new product for the Anaconda Platform that will be discussed in the coming weeks and months. This new product will provide a simplified graphical user-experience on top of Blaze expressions and Bokeh visualizations for rapidly connecting quantitative analysts to their data and allowing explorations that retain provenance and governance. General availability is currently planned for August.<br />
<br />
Blaze also spawned additional efforts around fast compressed storage of data (blz, which formed the inspiration and initial basis for <a href="https://github.com/Blosc/bcolz">bcolz</a>), experiments with <a href="https://github.com/blaze/castra">castra</a>, and a popular and straightforward tool for quickly copying data from one kind of data-silo to another (<a href="https://github.com/blaze/odo">odo</a>).<br />
<br />
<h3 style="text-align: left;">
Developing dask the library and Dask the project</h3>
The most important development to come out of Blaze, however, will have tremendous impact in the short term well before the full Blaze vision is completed. This project is Dask and I'm excited for what Dask will bring to the community in 2016. It is helping us finally deliver on scaled-out NumPy / Pandas and making Anaconda a first-class citizen in Hadoop.<br />
<br />
In 2014, Matthew Rocklin started working at Continuum on the Blaze team. Matthew is the well-known author of many functional tools for Python. He has a <a href="http://matthewrocklin.com/blog/">great blog</a> you should read regularly. His first contribution to Blaze was to adapt a multiple-dispatch system he had built which formed the foundation of both odo and Blaze. He also worked with Andy Terrel and Phillip Cloud to clarify the Blaze library as a front-end to multiple backends like Spark, Impala, Mongo, and NumPy/Pandas.<br />
<br />
With these steps taken, it was clear that the Blaze project needed its own first-class backend as well something that the community could rally around to ensure that Python remained a first-class participant in the scale-out conversation --- especially where systems that connected with Hadoop were being promoted. Python should not ultimately be relegated to being a mere front-end system that scripts Spark or Hadoop --- unable to talk directly to the underlying data. This is not how Python achieved its place as a <i>de-facto</i> data-science language. Python should be able to access and execute on the data directly inside Hadoop.<br />
<br />
Getting there took time. The first version of dask was released in early 2015 and while distributed work-flows were envisioned, the first versions were focused on out-of-core work-flows --- allowing problem-sizes that were too big to fit in memory to be explored with simple pandas-like and numpy-like APIs.<br />
<br />
When Matthew showed me his first version of dask, I was excited. I loved three things about it: 1) It was simple and could, therefore, be used as a foundation for parallel PyData. 2) It leveraged already existing code and infrastructure in NumPy and Pandas. 3) It had very clean separation between collections like arrays and data-frames, the directed graph representation, and the schedulers that executed those graphs. This was the missing piece we needed in the Blaze ecosystem. I immediately directed people on the Blaze team to work with Matt Rocklin on Dask and asked Matt to work full-time on it.<br />
<br />
He and the team made great progress, and by the summer of 2015 they had a very nice out-of-core system working with two functioning parallel schedulers (multi-processing and multi-threaded). There was also a "synchronous" scheduler that could be used for debugging the graph, and the system demonstrated its value well enough throughout 2015 to be adopted by other projects (scikit-image and xarray).<br />
<br />
In the summer of 2015, Matt began working on the distributed scheduler. By fall of 2015, he had a very nice core system leveraging the hard work of the Python community. He built the API around the concepts of asynchronous computing already being promoted in Python 3 (<a href="https://docs.python.org/3.2/library/concurrent.futures.html">futures</a>) and built dask.distributed on top of tornado. The next several months were spent improving the scheduler by exposing it to as many work-flows as possible from computational science and quantitative science. By February of 2016, the system was ready to be used by a variety of people interested in distributed computing with Python. This process continues today.<br />
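<br />
A short sketch of that futures-style interface (the scheduler address below is illustrative, and note that the entry point was named Executor in the earliest releases and Client later on):<br />
<pre>
from distributed import Client    # named Executor in early releases

client = Client('127.0.0.1:8786') # address of a running scheduler

def square(x):
    return x * x

# submit work; get concurrent.futures-style Future objects back
futures = client.map(square, range(10))
total = client.submit(sum, futures)
print(total.result())             # 285
</pre>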
<br />
Using dask.dataframe and dask.array you can quickly build table-based and array-based work-flows with a Pandas-like and NumPy-like syntax, respectively, that work on data sitting across a cluster.<br />
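<br />
For example (the file path and column names here are hypothetical):<br />
<pre>
import dask.array as da
import dask.dataframe as dd

# NumPy-like: a large array cut into 1000x1000 in-memory blocks
x = da.random.random((10000, 10000), chunks=(1000, 1000))
print((x + x.T).mean(axis=0).compute())

# Pandas-like: treat a pile of CSV files as one logical dataframe
df = dd.read_csv('data/2016-*.csv')           # hypothetical path
print(df.groupby('user_id').amount.sum().compute())
</pre>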
<br />
Anaconda and the PyData ecosystem now had another solution for the scale-out problem --- one whose design and implementation I felt could be a default run-time backend for Blaze. As a result, I could get motivated to support, market, and seek additional funding for this effort. Continuum had received some DARPA funding under the <a href="http://opencatalog.darpa.mil/XDATA.html">XDATA program</a>, but that money was now spread pretty thin among Bokeh, Numba, Blaze, and now Dask.<br />
<br />
<h4 style="text-align: left;">
Connecting to Hadoop</h4>
With the distributed scheduler basically working and beginning to improve, two problems remained with respect to Hadoop interoperability: 1) direct access to the data sitting in HDFS and 2) interaction with the resource schedulers running most Hadoop clusters (YARN or Mesos).<br />
<br />
To see how important the next developments are, it is useful to relate an anecdote from early in our XDATA experience. In the summer of 2013, when the DARPA XDATA program first kicked off, the program organizers had reserved a large Hadoop cluster (which even had GPUs on some of the nodes). They loaded many data sets onto the cluster and announced its existence to all of the teams who had gathered to collaborate on getting insights out of "Big Data." However, a large number of the people collaborating were using Python, R, or C++. To them the Hadoop cluster was inaccessible, as there was very little they could use to interact with the data stored in HDFS (beyond some high-latency and low-bandwidth streaming approaches) and nothing they could do to interact with the scheduler directly (without writing Scala or Java code). The Hadoop cluster sat idle for most of the summer while teams scrambled to get their own hardware to run their code on and deliver their results.<br />
<br />
The same situation we encountered in 2013 exists in many organizations today. People have large Hadoop infrastructures but are not connecting that infrastructure effectively to their data-scientists, who are more comfortable in Python, R, or some other high-level (non-JVM) language.<br />
<br />
With dask working reasonably well, tackling this data-connection problem head-on became an important part of our Anaconda for Hadoop story, and so in December of 2015 we began two initiatives to connect Anaconda directly to Hadoop. Getting data from HDFS turned out to be much easier than we had initially expected because of the hard work of many others. There had been quite a bit of work building a C++ interface to Hadoop at Pivotal that had culminated in a library called <a href="http://pivotalrd.github.io/libhdfs3/">libhdfs3</a>. Continuum wrote a Python interface to that library quickly, and it now exists as the <a href="http://hdfs3.readthedocs.org/en/latest/">hdfs3</a> library under the Dask organization on Github.<br />
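<br />
The resulting interface is deliberately file-system-like (the host, port, and paths below are illustrative):<br />
<pre>
from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host='namenode', port=8020)   # illustrative

print(hdfs.ls('/user/data'))              # browse HDFS directly
with hdfs.open('/user/data/file.csv', 'rb') as f:
    header = f.readline()                 # native-speed byte reads
</pre>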
<br />
The second project was a little more involved, as we needed to integrate with YARN directly. Continuum developers worked on this and produced a Python library that communicates directly with the YARN classes (using Scala) in order to allow the Python developer to control computing resources as well as spread files to the Hadoop cluster. This project is called <a href="http://knit.readthedocs.org/en/latest/">knit</a>, and we expect to connect it to Mesos and other cluster resource managers in the near future (if you would like to sponsor this effort, please get in touch with me).<br />
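<br />
A rough sketch of what launching Python processes on YARN looks like with knit --- treat the constructor flag and method parameters here as assumptions from memory rather than a definitive interface:<br />
<pre>
from knit import Knit

# locate the YARN ResourceManager (assumed flag)
k = Knit(autodetect=True)

# ask YARN for two containers, each running the given command
app_id = k.start("python -c 'print(1 + 1)'", num_containers=2)
</pre>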
<br />
Early releases of hdfs3 and knit were available by the end of February 2016. At that time, these projects were joined with dask.distributed and the dask code-base into a new <a href="https://github.com/dask">Github organization called Dask</a>. The graduation of Dask into its own organization signified an important milestone: dask was now ready for rapid improvement and growth alongside Spark as a first-class execution engine in the Hadoop ecosystem.<br />
<br />
Our initial goals for Dask are to build enough examples, capability, and awareness that every PySpark user tries Dask to see if it helps them. We also want Dask to be a compatible and respected member of the growing Hadoop execution-framework community. Finally, we are seeking to enable Dask to be used by scientists of all kinds who have both array and table data stored on central and distributed file-systems outside of the Hadoop ecosystem.<br />
<br />
<h3 style="text-align: left;">
Anaconda as a first-class execution ecosystem for Hadoop</h3>
With Dask (including hdfs3 and knit), Anaconda is now able to participate on an equal footing with every other execution framework for Hadoop. Because of the vast reach of the Anaconda Python and Anaconda R communities, this means that a lot of native code can now be integrated into Hadoop much more easily, and any company that has stored its data in HDFS or another distributed file system (such as S3, via s3fs, or GPFS) can now connect that data easily to the entire Python and/or R computing stack.<br />
<br />
This is exciting news! While we are cautious because these integrative technologies are still young, they are connected to and leverage the very mature PyData ecosystem. While benchmarks can be misleading, we have a few that I believe accurately reflect what parallel and distributed Anaconda can do and how it relates to other Hadoop systems. For array-based and table-based computing workflows, Dask will be 10x to 100x faster than an equivalent PySpark solution. For applications that do not use arrays or tables (e.g., word-count using a dask.bag), Dask is a little slower than a similar PySpark solution. However, I would argue that Dask is much more Pythonic and easier to understand for someone who has learned Python.<br />
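<br />
For reference, the word-count work-flow mentioned above looks like this in dask.bag (the file glob is illustrative):<br />
<pre>
import dask.bag as db

b = db.read_text('hdfs:///data/*.txt')    # illustrative glob
counts = (b.map(str.split)
           .flatten()
           .frequencies()                 # (word, count) pairs
           .topk(10, key=lambda pair: pair[1]))
print(counts.compute())
</pre>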
<br />
It will be very interesting to see what the next year brings as more and more people realize what is now available to them in Anaconda. The PyData crowd will now have instant access to cluster computing at a scale that has previously been accessible only by learning complicated new systems based on the JVM or by paying an unfortunate performance penalty. The Hadoop crowd will now have direct and optimized access from Python (and R) to entire classes of algorithms that they have not previously been able to use.<br />
<br />
It will take time for this news and these new capabilities to percolate, be tested, and find use-cases that resonate with the particular problems people actually encounter in practice. I look forward to helping many of you take the leap into using Anaconda at scale in 2016.<br />
<br />
We will be showing off aspects of the new technology at Strata in San Jose at the Continuum booth #1336 (look for the Anaconda logo and mark). We have already announced some of <a href="https://www.continuum.io/blog/news/continuum-analytics-brings-serious-analytics-hadoop">the capabilities</a> at a high level. Peter and I will both be at Strata along with several of the talented people at Continuum. If you are attending, drop by and say hello.<br />
<br />
We first came to Strata on behalf of Continuum in 2012 in Santa Clara and announced that we were going to bring you scaled-out NumPy. We are now beginning to deliver on that promise with Dask; we already brought you scaled-up NumPy with Numba. Blaze and Bokeh will continue to bring them together, along with the rest of the larger data community, to provide real insight on data --- wherever it is stored. <a href="http://matthewrocklin.com/blog/work/2016/02/22/dask-distributed-part-2">Try out Dask</a> and join the new scaled-out PyData story, which is richer than ever before, has a larger community than ever before, and has a brighter future than ever before.<br />
<br /></div>
Travis Oliphanthttp://www.blogger.com/profile/04514536132317233988noreply@blogger.com36tag:blogger.com,1999:blog-68730239358084672.post-72857137272222135322013-12-06T01:44:00.001-08:002013-12-09T20:54:16.662-08:00Why I promote conda<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://www.continuum.io/downloads">Anaconda</a> users have been enjoying the benefits of conda for quickly and easily<br />
managing their binary Python packages for over a year. During that time conda<br />
has also been steadily improving as a general-purpose package manager. I<br />
have recently been promoting the very nice things that conda can do for Python<br />
users generally --- especially with complex binary extensions to Python as<br />
exist in the NumPy stack. For example, it is very easy to create Python 3<br />
environments and Python 2 environments on the same system and install<br />
scikit-learn into them. Normally, this process can be painful if you<br />
do not have a suitable build environment, or don't want to wait for<br />
compilation to succeed.<br />
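<br />
Concretely, it is a couple of one-line commands (the environment names are just examples):<br />
<pre>
# two side-by-side environments, each with a binary scikit-learn
conda create -n sklearn-py3 python=3 scikit-learn
conda create -n sklearn-py2 python=2 scikit-learn

source activate sklearn-py3    # switch between them at will
</pre>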
<br />
Naturally, I sometimes get asked, "Why did you promote/write another<br />
python package manager (conda) instead of just contributing to the<br />
standard pip and virtualenv?" The python packaging story is older and<br />
more personal to me than you might think. Python packaging has been a thorn<br />
in my side personally since 1998 when I released my first Python extension<br />
(called numpyio actually). Since then, I've written and personally released<br />
many, many Python packages (Multipack which became SciPy, NumPy, llvmpy,<br />
Numba, Blaze, etc.). There is nothing you want more as a package author than<br />
users. So, to make Multipack (SciPy), then NumPy available, I had to become a<br />
packaging expert by experiencing a lot of pain with the lack of<br />
suitable tools for my (admittedly complex) task.<br />
<br />
Along the way, I've suffered through believing that distutils,<br />
setuptools, distribute, and pip/virtualenv would solve my actual<br />
problem. All of these tools provided some standardization (at least around what somebody<br />
types at the command line to build a package) but no help in actually doing the<br />
build and no real help in getting compatible binaries of things like SciPy<br />
installed onto many users' machines.<br />
<br />
I've personally made terrible software engineering mistakes because of the lack of<br />
good package management. For example, I allowed the pressure of "no ABI<br />
changes" to severely hamper the progress of the NumPy API. Instead of pushing<br />
harder and breaking the ABI when necessary to get improvements into NumPy, I<br />
buckled under the pressure and agreed to the requests coming mostly from NumPy<br />
windows users and froze the ABI. I could empathize with people who would spend<br />
days building their NumPy stack and literally become fearful of changing it.<br />
From NumPy 1.4 to NumPy 1.7, the partial date-time addition caused various<br />
degrees of brokenness and is part of why missing-data data-types have never<br />
shown up in NumPy at all. If conda had existed back then with standard<br />
conda binaries released for different projects, there would have been almost<br />
no problem at all. That pressure would have largely disappeared. Just<br />
install the packages again --- problem solved for everybody (not just the<br />
Linux users who had apt-get and yum).<br />
<br />
Some of the problems with SciPy are also rooted in the lack of good packages<br />
and package management. SciPy, when we first released it in 2001, was<br />
basically a distribution of multiple modules from Multipack, some new BLAS /<br />
LAPACK and linear algebra wrappers and nascent plotting tools. It was a SciPy<br />
<b>distribution</b> masquerading as a single library. Most of the effort spent was<br />
a packaging effort (especially on Windows). Since then, the scikits effort<br />
has done a great job of breaking up the domain of SciPy into more manageable<br />
chunks and providing a space for the community to grow. This kind of<br />
refactoring is only possible with good distributions and is really only<br />
effective when you have good package management. On Mac and Linux,<br />
package managers exist --- on Windows, things like EPD, Anaconda, or C.<br />
Gohlke's collection of binaries have been the only solution.<br />
<br />
Through all of this work, I've cut my fingers and toes and sometimes face on<br />
compilers, shared and static libraries on all kinds of crazy systems (AIX,<br />
Windows NT, etc.). I still remember the night I learned what it meant to have<br />
ABI incompatibility between different compilers (try passing structs<br />
such as complex-numbers between a file compiled with mingw and a library compiled with<br />
Visual Studio). I've been bitten more than once by unicode-width<br />
incompatibilities, strange shared-library incompatibilities, and the vagaries<br />
of how different compilers and run-times define the `FILE *` file pointer.<br />
<br />
In fact, if you have not read "Linkers and Loaders", you should actually do<br />
that right now as it will open your mind to that interesting limbo between<br />
"developer-code" and "running process" overlooked by even experienced<br />
developers. I'm grateful Dave Beazley recommended it to me over 6 years ago.<br />
Here is a link: <a href="http://www.iecc.com/linker/">http://www.iecc.com/linker/</a><br />
<br />
We in the scientific python community have had difficulty and a rocky<br />
history with just waiting for the Python.org community to solve the<br />
problem. With distutils for example, we had to essentially re-write<br />
most of it (as numpy.distutils) in order to support compilation of<br />
extensions that needed Fortran-compiled libraries. This was not an<br />
easy task. All kinds of other tools could have (and, in retrospect,<br />
should have) been used. Most of the design of distutils did not help<br />
us in the NumPy stack at all. In fact, numpy.distutils replaces most<br />
of the innards of distutils but is still shackled by the architecture<br />
and imperative approach to what should fundamentally be a declarative<br />
problem. We should have just used or written something like waf or<br />
bento or cmake and encouraged its use everywhere. However, we buckled<br />
under the pressure of the distutils promise of "one right way to do<br />
it" and "one-size fits all" solution that we all hoped for, but<br />
ultimately did not get. I appreciate the effort of the distutils<br />
authors. Their hearts were in the right place and they did provide a<br />
useful solution for their use-cases. It was just not useful for ours,<br />
and we should not have tried to force the issue. Not all code is<br />
useful to everyone. The real mistake was the Python community picking<br />
a "standard" that was actually limiting for a sizeable set of users.<br />
This was the real problem --- but it should be noted that this<br />
"problem" is only because of the incredible success and therefore<br />
influence of python developers and python.org. With this influence, however,<br />
comes a certain danger of limiting progress if all advances have to be<br />
made via committee --- working out specifications instead of watching for<br />
innovation and encouraging it.<br />
<br />
David Cooke and many others finally wrestled numpy.distutils to the<br />
point that the library does provide some useful functionality for<br />
helping build extensions requiring NumPy. Even after all that effort,<br />
however, some in the Python community, who seem to have no idea of the<br />
history of how these things came about, simply claim that setup.py<br />
files that need numpy.distutils are "broken" because they import numpy<br />
before "requiring" it. To this, I reply that what is actually<br />
broken is the design that does not have a declarative meta-data file<br />
that describes dependencies and then a build process that creates the<br />
environment needed <b>before</b> running <b>any</b> code to do the actual build.<br />
This is what `<a href="http://docs.continuum.io/conda/build.html">conda build</a>` does and it works beautifully to create any<br />
kind of binary package you want from any list of dependencies you may<br />
have. Anything else is going to require all kinds of "bootstrap"<br />
gyrations to fit into the square hole of a process that seems to<br />
require that all things begin with the python setup.py incantation.<br />
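<br />
The declarative approach is easy to see in a minimal conda recipe (a sketch --- the package name and version are placeholders):<br />
<pre>
# meta.yaml --- declarative metadata read *before* any build code runs
package:
  name: mypkg              # placeholder
  version: "0.1.0"

requirements:
  build:
    - python
    - numpy                # present in the build environment first
  run:
    - python
    - numpy

build:
  script: python setup.py install
</pre>
Running conda build on a directory containing this file creates the environment described under requirements and only then invokes the build script.<br />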
<br />
Therefore, you can't really address the problem of Python packaging without<br />
addressing the core problems of trying to use distutils (at least for the<br />
NumPy stack). The problems for us in the NumPy stack started there and have<br />
to be rooted out there as well. This was confirmed for me at the first PyData<br />
meetup at Google HQ, where several of us asked Guido what we can do to fix<br />
Python packaging for the NumPy stack. Guido's answer was to "solve the<br />
problem ourselves". We at Continuum took him at his word. We looked at dpkg,<br />
rpm, pip/virtualenv, brew, nixos, and 0install, and used our past experience<br />
with EPD. We thought hard about the fundamental issues, and created the conda<br />
package manager and conda environments. We who have been working on this for<br />
the past year have decades of Python packaging experience between us: me,<br />
Peter Wang, Ilan Schnell, Bryan Van de Ven, Mark Wiebe, Trent Nelson, Aaron<br />
Meurer, and now Andy Terrel are all helping improve things. We welcome<br />
contributions, improvements, and updates from anyone else as conda is <a href="https://github.com/ContinuumIO/conda/blob/master/LICENSE.txt">BSD</a><br />
<a href="https://github.com/ContinuumIO/conda/blob/master/LICENSE.txt">licensed</a> and completely open source and can be used and re-used by<br />
anybody. We've also recently made a mailing list<br />
conda@continuum.io which is open to anyone to join and participate:<br />
<a href="https://groups.google.com/a/continuum.io/forum/#!forum/conda">https://groups.google.com/a/continuum.io/forum/#!forum/conda</a><br />
<br />
Conda pkg files are similar to .whl files except they are Python-agnostic. A<br />
conda pkg file is a bzipped tar file with an 'info' directory, and then<br />
whatever other directory structure is created by the install process in<br />
"prefix". It's the equivalent of taking a file-system diff pre and post-<br />
install and then tarring the result up. It's more general than .whl files and<br />
can support any kind of binary file. Making conda packages is as simple as making a recipe for it. We make a growing collection of public-domain, <a href="https://github.com/ContinuumIO/conda-recipes">example recipes</a> available to everyone and also encourage attachment of a conda recipe directory to every project that needs binaries.<br />
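<br />
You can see this structure by listing the contents of any conda package (the file name below is illustrative):<br />
<pre>
$ tar -tjf numpy-1.8.0-py27_0.tar.bz2
info/index.json      # name, version, build string, dependencies
info/files           # manifest of every file in the package
lib/python2.7/site-packages/numpy/__init__.py
...                  # the payload mirrors the install prefix
</pre>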
<br />
At the heart of conda package installation is the concept of environments.<br />
Environments are like namespaces in Python -- but for binary packages. Their<br />
applicability is extensive. We are using them within Anaconda and Wakari for<br />
all kinds of purposes (from testing to application isolation to easy<br />
reproducibility to supporting multiple versions of packages in different<br />
scripts that are part of the same installation). Truly, to adapt Tim Peters'<br />
famous quip: "Environments are one honking great idea -- let's do more of<br />
those". Rather than tacking this on after the fact like virtualenv does to<br />
pip, OS-level environments are built-in from the beginning. As a result,<br />
every conda package is <b>always</b> installed into an environment. There is a<br />
default (root) environment if you don't explicitly specify another one.<br />
Installation of a package is simply merging the unpacked binary into the union<br />
of unpacked binaries already at the root-path of the environment. If union<br />
filesystems were better implemented in different operating systems, then each<br />
environment would simply be a union of the untarred binary packages. Instead<br />
we accomplish the same thing with hard-linking, soft-linking, and (when<br />
necessary) copying of files.<br />
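<br />
The hard-linking is easy to observe on Mac or Linux (the paths below assume a default Anaconda install with Python 2.7 packages):<br />
<pre>
$ conda create -n envA numpy
$ conda create -n envB numpy

# identical inode numbers: one copy on disk, linked into both
$ ls -i ~/anaconda/envs/envA/lib/python2.7/site-packages/numpy/version.py
$ ls -i ~/anaconda/envs/envB/lib/python2.7/site-packages/numpy/version.py
</pre>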
<br />
The design is simple, which helps it be easy to understand and easy to<br />
mix with other ideas. We don't easily see how to take these simple,<br />
powerful ideas and adapt them to .whl and virtualenv which are trying<br />
to fit-in to a world created by distutils and setuptools. It was<br />
actually <b>much</b> easier to just write our own solution and create<br />
hundreds of packages and make them available and provide all the tools<br />
to reproduce what we have done inside conda than to try and untangle<br />
how to provide our solution in that world and potentially even not<br />
quite get the result we want (which, it can be argued, is what happened<br />
with numpy.distutils).<br />
<br />
You can use conda to build your own distribution of binaries that<br />
compete with Anaconda if you like. Please do. I would be completely<br />
thrilled if every other Python distribution (python.org, EPD,<br />
ActiveState, etc.) just used conda packages that they build and in so<br />
doing helped improve the conda package manager. I recognize that<br />
conda emerged at the same time as the Anaconda distribution was<br />
stabilizing and so there is natural confusion over the two. So,<br />
I will try to clarify: <a href="https://github.com/ContinuumIO/conda">Conda</a> is an open-source, general,<br />
cross-platform package manager. One could accurately describe it as a<br />
cross-platform Homebrew written in Python. Anyone can use the tool and<br />
related infrastructure to build and distribute whatever packages they<br />
want.<br />
<br />
Anaconda is the collection of conda packages that we at Continuum provide for<br />
free to everyone, based on a particular base Python we choose (which you can<br />
download at <a href="http://continuum.io/downloads">http://continuum.io/downloads</a> as Miniconda). In the past it has<br />
been some work to get conda working outside Miniconda or Anaconda because our<br />
first focus was creating a working solution for our users. We have been<br />
fixing those minor issues and have now released a version of conda that can be<br />
'pip installed'. As conda has significant overlap with virtualenv in<br />
particular, we are still working out kinks in the interop of these two<br />
solutions. But, it all <b>can</b> and <b>should</b> work together and we fix issues as<br />
quickly as we can identify them.<br />
<br />
We also provide a service called <a href="http://binstar.org/">http://binstar.org</a> (register with beta-code<br />
"binstar in beta") which allows you to host your own binary conda packages.<br />
With this missing piece, you just tell people to point their conda<br />
repositories to your collection -- and they can easily install everything you<br />
want them to. You can also build your own conda repositories and host them on<br />
your own servers. It all works, today, now -- for hundreds of thousands of<br />
people. In this context, Anaconda could be considered a "reference"<br />
distribution and a proof of concept of how to use the conda package manager.<br />
<a href="http://wakari.io/">Wakari</a> also uses the conda package manager at its core to share bundles.<br />
Bundles are just conda packages (with a set of dependencies) and capture the<br />
core problems associated with reproducible computing in a light-weight and<br />
easily reproduced way. We have made the tools available for *anyone* to<br />
re-create this distribution pretty easily and compete with us.<br />
<br />
It is very important to keep in mind that we created conda to solve<br />
the problem of distributing an environment to end-users that allows<br />
them to do advanced data analytics, scientific discovery, and general<br />
engineering work. Python has a chance to play a major role in this<br />
space. However, it is not the only player. Other solutions exist in<br />
the space we are targeting (SAS, Matlab, SPSS, and R). We want Python<br />
to dominate this space. We could not wait for the packaging solution<br />
we needed to evolve from the lengthy discussions that are on-going<br />
which also have to untangle the history of distutils, setuptools,<br />
easy_install, and distribute. What we <b>could</b> do is solve our problem<br />
and then look for interoperability and influence opportunities once we<br />
had something that worked for our needs. That is the approach we took,<br />
and I'm glad we did. We have a working solution now which benefits<br />
hundreds of thousands of users (and could benefit millions more if<br />
IT administrators recognized conda as an acceptable approach to packaging<br />
software from others in the community).<br />
<br />
We are going to keep improving conda until it becomes an obvious<br />
solution for everyone: users, developers, and IT administrators alike.<br />
We welcome additions and suggestions that allow it to interoperate<br />
with anything else in the Python packaging space. I do believe that the group of people working on Python packaging and Nick Coghlan in particular are doing a valuable service. It's a very difficult job to take into account the history of Python packaging, fix all the little issues around it, *and* provide a binary distribution system that allows users to not have to think about packaging and distribution. With our resources we did just the latter. I admire those who are on the front lines of the former and look to provide as much context as I can to ensure that any future decisions take our use-cases into account. I am looking forward to continuing to work with the community to reach future solutions that benefit everyone.<br />
<br />
If you would like to see more detail about conda and how it can be used, here are some<br />
resources:<br />
<br />
Documentation: <a href="http://docs.continuum.io/conda/index.html">http://docs.continuum.io/conda/index.html</a><br />
Talk at PyData NYC 2013:<br />
- Slides: <a href="https://speakerdeck.com/teoliphant/packaging-and-deployment-with-conda">https://speakerdeck.com/teoliphant/packaging-and-deployment-with-conda</a><br />
- Video: <a href="http://vimeo.com/79862018">http://vimeo.com/79862018</a><br />
<br />
Blog Posts:<br />
- <a href="http://continuum.io/blog/anaconda-python-3">http://continuum.io/blog/anaconda-python-3</a><br />
- <a href="http://continuum.io/blog/new-advances-in-conda">http://continuum.io/blog/new-advances-in-conda</a><br />
- <a href="http://continuum.io/blog/conda">http://continuum.io/blog/conda</a><br />
<br />
Mailing list:<br />
- conda@continuum.io<br />
- <a href="https://groups.google.com/a/continuum.io/forum/#!forum/conda">https://groups.google.com/a/continuum.io/forum/#!forum/conda</a><br />
<br /></div>
Travis Oliphanthttp://www.blogger.com/profile/04514536132317233988noreply@blogger.com12tag:blogger.com,1999:blog-68730239358084672.post-5675651923129575122013-07-03T11:32:00.002-07:002013-07-09T14:24:11.095-07:00Thoughts after SciPy 2013 and a specific NumPy improvement<div dir="ltr" style="text-align: left;" trbidi="on">
I attended a few days of SciPy 2013 and enjoyed interacting with the many old friends and many new friends that participate in this conference. I thought the program committee did an excellent job of selecting talks, and there were more attendees this year, which mirrors my experience with the PyData conference series, which sells out every time. Andy Terrell, a NumFOCUS board member and researcher at the University of Texas, and Jonathan Rocher, an Enthought developer, were co-chairs of SciPy this year and did an excellent job of coordination.<br />
<br />
<a href="http://www.continuum.io/">Continuum Analytics</a>, my new company, is the institutional sponsor of the PyData conference series and I know how much work it can be, so my thanks go out to Enthought for their efforts to sponsor the SciPy conference this year and in years past. I'm really looking forward to the day when the SciPy conference, like the PyData conference series, directly benefits <a href="http://www.numfocus.org/">NumFOCUS</a> which is a non-profit organization with 501(c)(3) status started by the scientific Python community and run by the same community behind so much of the SciPy stack. It looks like steps are being taken in that direction which is wonderful to see. At the SciPy conference, Fernando Perez, of IPython fame, led the charge to get fiscal sponsorship documents improved to make it much simpler for people wanting to sponsor the great projects on the scientific python stack (IPython, NumPy, SciPy, Pandas, SymPy, Matplotlib, etc.) to have a vehicle to do it. This year, NumFOCUS was able to sponsor the attendance of two students to the SciPy conference because of generous donors. Right now, NumFOCUS is looking for help for its website to improve the look and feel. It's a great way to get involved with the community and help out. Just send an email to the numfocus google group (a public group for all to get involved with): <a href="mailto:numfocus+subscribe@googlegroups.com?subject=Subscribe">mailto:numfocus+subscribe@googlegroups.com?subject=Subscribe</a>. <br />
<br />
Right now, a conversation involving graph-representations for Python compilation tools is happening on the numfocus mailing list among several interested parties from SymPy, Numba, Theano, Pythran, Parakeet, etc. One of the highlights of the conference for me was meeting and interacting with other people interested in Python-for-science compiler technology as it looks like there is a healthy community developing around this topic. I hope those interested in the topic check out <a href="http://compilers.pydata.org/">compilers.pydata.org</a> and issue pull requests to that <a href="https://github.com/pydata/compilers-webpage">github-hosted page</a> to describe their favorite tool.<br />
<br />
I only attended some of the tutorial given by fellow Continuum team members Ben Zaitlen and Clayton Davis. I was gratified to see that <a href="http://wakari.io/">wakari.io</a> was useful for so many people during the tutorials, and appreciated the feedback on how we can continue to improve the tool. I'm also grateful to see all the people able to productively use <a href="https://store.continuum.io/cshop/anaconda/">Anaconda</a> which is our free, cross-platform, distribution for using Python for scientific work and data analysis.<br />
<br />
It was nice to see David Cournapeau give a detailed discussion of NumPy internals in one of the tutorials. There is much more that could be said about NumPy internals, but David gave a good introduction to the topic. I like how he showed that it is possible to extend the NumPy dtype system --- especially with certain kinds of types. In NumPy, I tried very hard to make the type-system more extensible. It's nice to see it being used more and more. Extending the type system more generally (to include things like variable-length strings and infinite-precision floats) is harder and not very easy to do in current NumPy (especially while trying to keep the foundation stable). In fact, one of the reasons Continuum is sponsoring the development of dynd is precisely to build a foundation with an easier-to-extend type-system. Making it a C++ library should allow languages like Javascript, Ruby, Haskell, and others to benefit from the dynamic type concepts as well.<br />
<br />
I really enjoyed the talk on Spyder by <a href="http://conference.scipy.org/scipy2013/presentation_detail.php?id=172">Carlos Cordoba</a>. The Spyder IDE is a very nice tool and I was happy to see Carlos promoting it. The Spyder IDE is featured in our Anaconda Launcher (part of the Anaconda 1.6 release) along with the IPython notebook and IPython console. The Launcher allows anyone to publish their app to multiple platforms simply by making a conda package (with an icon and an entry-point) and upload it to a repository that the Launcher is looking at. All the dependencies can be specified and they will be installed via conda automatically when the app is selected. The hope is to make it very easy for anyone to get their cool application based on Python in front of people quickly without having to make installers for every platform.<br />
<br />
Besides the excellent keynote talks by Fernando Perez, William Schroeder, and Olivier Grisel, I also found the talks by <a href="http://conference.scipy.org/scipy2013/presentation_detail.php?id=197">Matthew Rocklin</a>, <a href="http://conference.scipy.org/scipy2013/presentation_detail.php?id=129">Pat Marion</a>, <a href="http://conference.scipy.org/scipy2013/presentation_detail.php?id=135">Ramalingam Saravanan</a>, <a href="http://conference.scipy.org/scipy2013/presentation_detail.php?id=136">Serge Guelton</a>, <a href="http://conference.scipy.org/scipy2013/presentation_detail.php?id=147">Samuel Skillman</a>, <a href="http://conference.scipy.org/scipy2013/presentation_detail.php?id=131">Jake Vanderplas</a>, and <a href="http://conference.scipy.org/scipy2013/presentation_detail.php?id=161">Joshua Warner</a> very interesting. It was especially nice to meet Joshua, who was coming from the Mayo Clinic where SciPy began. I started writing the SciPy library in 1999 at the Mayo Clinic while I was a graduate student there (the code was then called Multipack, special, and a bunch of other modules). It was very nice to meet someone from Mayo contributing again to this community with a very nice fuzzy-logic package based on the work of an old professor of mine, Hal Otteson. His work is now a new <a href="https://github.com/scikit-fuzzy/scikit-fuzzy">scikit</a>. The scikit concept has been a tremendous boon for the development of the Scientific Python community, as it allows more distributed development and more rapid expansion of the available tools. If better packaging had existed at the time, I would very likely have kept my early modules independent so they could grow with their own developer bases. What is now the SciPy library should most likely have been a SciPy distribution (with perhaps a smaller core). But hindsight is 20/20, and given the state of the world at the time, the best option seemed to be to create the SciPy library with Eric Jones and Pearu Peterson. <br />
<br />
Mark Wiebe did an excellent job in presenting <a href="https://github.com/ContinuumIO/libdynd">dynd</a>, a C++ library for dynamic multi-dimensional array manipulation with nice <a href="https://github.com/ContinuumIO/dynd-python">python bindings</a>. Mark's work, sponsored by Continuum Analytics, is something that could lead to NumPy 2.0, although nobody has suggested exactly how that might work yet. As dynd forms a foundation for Blaze, and Blaze and NumPy can co-exist for many years, I haven't been thinking much about how NumPy 2.0 could grow out of dynd until now. I do now have some ideas about how NumPy could be improved that I think will help the space evolve more fluidly and productively with many interested people able to coordinate their varied efforts. The most important of these is the introduction of multi-methods into NumPy which I'll outline below. <br />
<br />
I participated on a panel about the future of Array Oriented Computing in Python. Of course, I've been spending a lot of time over the past year working and thinking exactly about that, so I would have preferred a talk versus a panel with only a limited amount of time. However, I have limited time to prepare talks and will be speaking at the upcoming <a href="http://www.pydata.org/bos2013">PyData conference in Boston</a>, so I was grateful for the chance to at least express some of the ideas we've been working on. To be clear, I think that Blaze is the future of Array Oriented Computing in Python, though we have some work ahead to prove that out. Exactly what the transition from NumPy to Blaze looks like for people will be a story I care quite a bit about and will be telling more and more in the coming months and years. I take personal responsibility for anyone who adopted NumPy, and I will do everything I can to make sure their transition to using Blaze is as simple as possible. Backward compatibility is very important to me. I spent many hours making sure that NumPy was compatible with both Numarray and Numeric. Fortunately, Blaze and NumPy can co-exist and so there is less of a story of either / or and more about which / when (especially during the transition phase). <br />
<br />
There is also another possibility that will be interesting to see if it emerges: retro-fitting NumPy with multi-methods (dispatching on python type and also on dtype). I think this is the single-most important thing that can be done for NumPy. If someone is motivated and has budget, I can work with her to do this in about 1-2 months (maybe even sooner depending on the experience). This is not on my immediately funded road-map, however, so it would need outside funding and/or interest. <br />
<br />
There are several different multi-method implementations for Python. For those unfamiliar with the concept, <a href="http://www.artima.com/weblogs/viewpost.jsp?thread=101605">here </a>is a good essay by Guido on the general concept. Multi-methods are also at the heart of <a href="http://docs.julialang.org/en/latest/manual/methods/">Julia</a>. They are a simple concept. Basically, a multi-method is an object that dispatches to a different implementation based on the number and types of the arguments. The idea is that you can add new implementations of the underlying function quite easily without changing the function object itself. So, for example, if <span style="font-family: Courier New, Courier, monospace;"><b>numpy.dot</b></span> were a multi-method, then I could change the implementation of <span style="font-family: Courier New, Courier, monospace;"><b>numpy.dot</b></span> for my new fancy array-object without directly changing the source-code of <span style="font-family: Courier New, Courier, monospace;"><b>numpy.dot</b></span> in NumPy and all downstream functions and methods that use <span style="font-family: Courier New, Courier, monospace;"><b>numpy.dot</b></span> in their implementation would automatically work with my new type of array. Multi-methods allow extensibility in a manner similar to how operator overloading allows extensibility in object-oriented programming. But, it's a much more natural fit for operations where dispatching only on the first argument does not make a lot of sense. <br />
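<br />
Here is a toy sketch of the idea (illustrative only --- real multi-method libraries also handle caching, subclasses, and ambiguity):<br />
<pre>
class MultiMethod(object):
    """Dispatch on the types of all arguments, not just the first."""
    def __init__(self, name):
        self.name = name
        self.registry = {}

    def register(self, *types):
        def decorator(func):
            self.registry[types] = func
            return func
        return decorator

    def __call__(self, *args):
        impl = self.registry.get(tuple(type(a) for a in args))
        if impl is None:
            raise TypeError("no implementation for these types")
        return impl(*args)

dot = MultiMethod('dot')

@dot.register(list, list)
def dot_lists(a, b):
    return sum(x * y for x, y in zip(a, b))

# a new container type adds its own implementation without
# touching any existing code
@dot.register(tuple, tuple)
def dot_tuples(a, b):
    return sum(x * y for x, y in zip(a, b))

print(dot([1, 2], [3, 4]))   # 11
</pre>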
<br />
In fact, at the heart of NumPy's ufuncs is a multi-method dispatch mechanism (on NumPy dtype, instead of Python type), so NumPy users have been using multi-methods for a long time. Indeed, if NumPy's ufuncs were true multi-methods to begin with, then all the hassle with __array_wrap__, __array_prepare__, and so forth, which are hacks to compensate for the lack of true Python-type-based multi-methods, would not be necessary. If you look at the implementation of NumPy's masked arrays, for example, you will see some of the ugliness that is caused by NumPy's lack of a better multi-method mechanism. <a href="http://numba.pydata.org/">Numba's</a> autojit also effectively creates a kind of multi-method, as it creates a new function to dispatch to whenever it encounters a new set of types for the arguments. These are the ideas that we are building on and using in Blaze, as we learn from our experience with NumPy.<br />
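<br />
You can see the existing dtype-based dispatch by inspecting the registered inner loops of any ufunc:<br />
<pre>
import numpy as np

# each entry is a type signature for a separate compiled inner loop,
# e.g. 'dd->d' for double-precision inputs and a double output
print(np.add.types)
print(np.add(np.float32(1), np.float64(2)).dtype)  # float64 loop wins
</pre>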
<br />
The biggest challenge for multi-methods is always what function to return if you don't find an exact match. A simple multi-method is basically a dictionary whose key is a tuple of the types of the input arguments and whose value is the implementation. But what do you do if the key does not return an implementation? How do you find a compatible function and use it instead? There is a lot of theory on this and several approaches people have taken. I'm not aware of a universal solution that everybody agrees should be used. However, there are reasonable approaches that can be taken using the idea of typesets or type-hierarchies (for those interested, you can read more about <a href="http://en.wikipedia.org/wiki/Covariance_and_contravariance_(computer_science)">contravariance and covariance</a> for other approaches to resolving the type-dispatch problem as well).<br />
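<br />
One straightforward fallback --- a sketch under simplifying assumptions, since it ignores ambiguity between equally specific matches --- is to walk the method resolution order of each argument's type:<br />
<pre>
import itertools

def resolve(registry, args):
    """Try exact types first, then every combination drawn from
    each argument's MRO, more specific combinations first."""
    mros = [type(a).__mro__ for a in args]
    for combo in itertools.product(*mros):
        if combo in registry:
            return registry[combo]
    return None              # no compatible implementation found
</pre>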
<br />
I'm confident that useful, if not universal, approaches to this problem can be found (several are already available for Python and in Julia, for example). For NumPy, what is needed is a two-tiered dispatch mechanism. My view is that all NumPy (and SciPy and Scikit) functions should be multi-methods that dispatch first on Python type *and* then, additionally for memory-view-like objects, on the data-type of the elements. The dispatch rules for each of these cases can and should be separate, I think. <br />
<br />
If you are interested in this problem and especially if you have money to fund it, feel free to contact me directly at travis at continuum dot io.<br />
<br />
While I am spending more and more of my conference time with the PyData conference series, I still enjoy reconnecting with people I will always consider friends at the SciPy conference. Fortunately, many speakers participate in both. Having both conferences allows the community to grow and have bigger and better impact as I think can be witnessed by the increased attendance this year at SciPy. </div>
Travis Oliphanthttp://www.blogger.com/profile/04514536132317233988noreply@blogger.com4tag:blogger.com,1999:blog-68730239358084672.post-10599559023215176882012-12-16T21:39:00.000-08:002012-12-17T06:25:40.461-08:00Passing the torch of NumPy and moving on to Blaze<div dir="ltr" style="text-align: left;" trbidi="on">
I wrote this letter tonight to the NumPy mailing list --- a list I have been actively participating in for nearly 15 years. <br />
<div>
<br />
<br />
<div class="p1">
Hello all, </div>
<div class="p2">
<br /></div>
<div class="p1">
There is a lot happening in my life right now and I am spread quite thin among the various projects that I take an interest in. In particular, I am thrilled to publicly announce on this list that Continuum Analytics has received DARPA funding (to the tune of at least $3 million) for Blaze, Numba, and Bokeh which we are writing to take NumPy, SciPy, and visualization into the domain of very large data sets. This is part of the XDATA program, and I will be taking an active role in it. You can read more about Blaze here: http://blaze.pydata.org. You can read more about XDATA here: http://www.darpa.mil/Our_Work/I2O/Programs/XDATA.aspx </div>
<div class="p2">
<br /></div>
<div class="p1">
I personally think Blaze is the future of array-oriented computing in Python. I will be putting efforts and resources next year behind making that case. How it interacts with future incarnations of NumPy, Pandas, or other projects is an interesting and open question. I have no doubt the future will be a rich ecosystem of interoperating array-oriented data-structures. I invite anyone interested in Blaze to participate in the discussions and development at https://groups.google.com/a/continuum.io/forum/#!forum/blaze-dev or watch the project on our public GitHub repo: https://github.com/ContinuumIO/blaze. Blaze is being incubated under the ContinuumIO GitHub project for now, but eventually I hope it will receive its own GitHub project page later next year. Development of Blaze is early but we are moving rapidly with it (and have deliverable deadlines --- thus while we will welcome input and pull requests we won't have a ton of time to respond to simple queries until at least May or June). There is more that we are working on behind the scenes with respect to Blaze that will be coming out next year as well but isn't quite ready to show yet.</div>
<div class="p2">
<br /></div>
<div class="p1">
As I look at the coming months and years, my time for direct involvement in NumPy development is therefore only going to get smaller. As a result it is not appropriate that I remain as "head steward" of the NumPy project (a term I prefer to BDFL or anything else). I'm sure that it is apparent that while I've tried to help personally where I can this year on the NumPy project, my role has been more one of coordination, seeking funding, and providing expert advice on certain sections of code. I fundamentally agree with Fernando Perez that the responsibility of care-taking open source projects is one of stewardship --- something akin to public service. I have tried to emulate that belief this year --- even while not always succeeding. </div>
<div class="p2">
<br /></div>
<div class="p1">
It is time for me to make official what is already becoming apparent to observers of this community, namely, that I am stepping down as someone who might be considered "head steward" for the NumPy project and officially leaving the development of the project in the hands of others in the community. I don't think the project actually needs a new "head steward" --- especially from a development perspective. Instead I see a lot of strong developers offering key opinions for the project as well as a great set of new developers offering pull requests. </div>
<div class="p2">
<br /></div>
<div class="p1">
My strong suggestion is that development discussions of the project continue on this list with consensus among the active participants being the goal for development. I don't think 100% consensus is a rigid requirement --- but certainly a super-majority should be the goal, and serious changes should not be made without a clear consensus. I would pay special attention to under-represented people (users with intense usage of NumPy but small voices on this list). There are many of them. If you push me for specifics, then at this point in NumPy's history, I would say that if Chuck, Nathaniel, and Ralf agree on a course of action, it will likely be a good thing for the project. I suspect that even if only 2 of the 3 agree at one time it might still be a good thing (but I would expect more detail and discussion). There are others whose opinion should be sought as well: Ondrej Certik, Perry Greenfield, Stefan van der Walt, David Warde-Farley, Pauli Virtanen, Robert Kern, David Cournapeau, Francesc Alted, and Mark Wiebe to name a few (there are many other people as well whose opinions can only help NumPy). For some questions, I might even seek input from people like Konrad Hinsen and Paul Dubois --- if they have time to give it. I will still be willing to offer my view from time to time, if I am asked. </div>
<div class="p2">
<br /></div>
<div class="p1">
Greg Wilson (of Software Carpentry fame) asked me recently what letter I would have written to myself 5 years ago. What would I tell myself to do given the knowledge I have now? I've thought about that for a bit, and I have some answers. I don't know if these will help anyone, but I offer them as hopefully instructive: </div>
<div class="p2">
<br /></div>
<div class="p1">
<span class="Apple-tab-span"> </span>1) Do not promise to not break the ABI of NumPy --- and in fact emphasize that it will be broken at least once in the 1.X series. NumPy was designed to add new data-types --- but not without breaking the ABI. NumPy has needed more data-types and still needs even more. While it's not beautifully simple to add new data-types, it can be done. But, it is impossible to add them without breaking the ABI in some fashion. The desire to add new data-types *and* keep ABI compatibility has led to significant pain. I think the ABI non-breakage goal has been amplified by the poor state of package management in Python. The fact that it's painful for someone to update their downstream packages when an upstream ABI breaks (on Windows and Mac in particular) has put a lot of unfortunate pressure on this community. Pressure that was not envisioned or understood when I was writing NumPy.</div>
<div class="p2">
<br /></div>
<div class="p1">
(As an aside: This is one reason Continuum has invested resources in building the conda tool and a completely free set of binary packages called Anaconda CE which is becoming more and more usable thanks to the efforts of Bryan Van de Ven and Ilan Schnell and our testing team at Continuum. The conda tool: http://docs.continuum.io/conda/index.html is open source and BSD licensed and the next release will provide the ability to build packages, build indexes on package repositories and interface with pip. Expect a blog-post in the near future about how cool conda is!). </div>
<div class="p2">
<br /></div>
<div class="p1">
<span class="Apple-tab-span"> </span>2) Don't create array-scalars. Instead, make the data-type object a meta-type object whose instances are the items returned from NumPy arrays. There is no need for a separate array-scalar object and in fact it's confusing to the type-system. I understand that now. I did not understand that 5 years ago. </div>
<div class="p2">
<br /></div>
<div class="p1">
<span class="Apple-tab-span"> </span>3) Special-case small arrays to avoid the memory indirection and look at PDL so that generalized ufuncs are supported from the beginning.</div>
<div class="p2">
<br /></div>
<div class="p1">
<span class="Apple-tab-span"> </span>4) Define missing-value data-types and labels on the dimensions and arrays</div>
<div class="p2">
<br /></div>
<div class="p1">
<span class="Apple-tab-span"> </span>5) Define a standard "dictionary of NumPy arrays" interface as the basic "structure of arrays" concept to go with the "array of structures" that structured arrays provide.</div>
<div class="p2">
<br /></div>
<div class="p1">
<span class="Apple-tab-span"> </span>6) Start work on SQL interface to NumPy arrays *now*</div>
<div class="p2">
<br /></div>
<div class="p1">
Additional comments I would make to someone today: </div>
<div class="p2">
<br /></div>
<div class="p1">
<span class="Apple-tab-span"> </span>1) Most of NumPy should be written in Python with Numba used as the compiler (particularly as soon as Numba gets the ability to create Python extension modules which is in the next release). </div>
<div class="p1">
<span class="Apple-tab-span"> </span>2) There are still many, many optimizations that can be made in NumPy run-time (especially in the face of modern hardware). </div>
<div class="p2">
<br /></div>
<div class="p1">
I will continue to be available to answer questions and I may chime in here and there on pull requests. However, most of my time for NumPy will be on administrative aspects of the project where I will continue to take an active interest. To help make sure that this happens in a transparent way, I would like to propose that "administrative" support of the project be left to the NumFOCUS board of which I am currently 1 of 9 members. The other board members are currently: Ralf Gommers, Anthony Scopatz, Andy Terrel, Prabhu Ramachandran, Fernando Perez, Emmanuelle Gouillart, Jarrod Millman, and Perry Greenfield. While NumFOCUS basically seeks to promote and fund the entire scientific Python stack, I think it can also play a role in helping to administer some of the core projects which the board members themselves have a personal interest in. </div>
<div class="p2">
<br /></div>
<div class="p1">
By administrative support, I mean decisions like "what should be done with any NumPy IP or web-domains" or "what kind of commercially-related ads or otherwise should go on the NumPy home page", or "what should be done with the NumPy github account", etc. --- basically anything that requires an executive decision that is not directly development related. I don't expect there to be many of these decisions. But, when they show up, I would like them to be made in as transparent and public of a way as possible. In practice, the way I see this working is that there are members of the NumPy community who are (like me) particularly interested in admin-related questions and serve on a NumPy team in the NumFOCUS organization. I just know I'll be attending NumFOCUS board meetings, and I would like to help move administrative decisions forward with NumPy as part of the time I spend thinking about NumFOCUS. </div>
<div class="p2">
<br /></div>
<div class="p1">
If people on this list would like to play an active role in those admin discussions, then I would heartily welcome them into NumFOCUS membership where they would work with interested members of the NumFOCUS board (like me and Ralf) to direct that organization. I would really love to have someone from this list volunteer to serve on the NumPy team as part of the NumFOCUS project. I am certainly going to be interested in the opinions of people who are active participants on this list and on GitHub pages for NumPy on anything admin related to NumPy, and I expect Ralf would also be very interested in those views.</div>
<div class="p2">
<br /></div>
<div class="p1">
One admin discussion that I will bring up in another email (as this one is already too long) is about making 2 or 3 lists for NumPy such as numpy-admin@numpy.org, numpy-dev@numpy.org, and numpy-users@numpy.org. </div>
<div class="p2">
<br /></div>
<div class="p1">
Just because I'll be spending more time on Blaze, Numba, Bokeh, and the PyData ecosystem does not mean that I won't be around for NumPy. I will continue to promote NumPy. My involvement with Continuum connects me to NumPy as Continuum continues to offer commercial support contracts for NumPy (and SciPy and other open source projects). Continuum will also continue to maintain its Github NumPy project which will contain pull requests from our company that we are working to get into the mainline branch. Continuum will also continue to provide resources for release-management of NumPy (we have been funding Ondrej in this role for the past 6 months --- though I would like to see this happen through NumFOCUS in the future even if Continuum provides much of the money). We also offer optimized versions of NumPy in our commercial Anaconda distribution (Anaconda CE is free and open source). </div>
<div class="p2">
<br /></div>
<div class="p1">
Also, I will still be available for questions and help (I'm not disappearing --- just making it clear that I'm stepping back into an occasional NumPy developer role). It has been extremely gratifying to see the number of pull-requests, GitHub-conversations, and code contributions increase this year. Even though the 1.7 release has taken a long time to stabilize, there have been a lot of people participating in the discussion and in helping to track down the problems, figure out what to do, and fix them. It even makes it possible for people to think about 1.7 as a long-term release. </div>
<div class="p2">
<br /></div>
<div class="p1">
I will continue to hope that the spirit of openness, tolerance, respect, and gratitude continue to permeate this mailing list, and that we continue to seek to resolve any differences with trust and mutual respect. I know I have offended people in the past with quick remarks and actions made sometimes in haste without fully realizing how they might be taken. But, I also know that like many of you I have always done the very best I could for moving Python for scientific computing forward in the best way I know how. </div>
<div class="p2">
<br /></div>
<div class="p1">
Thank you for the great memories. If you will forgive a little sentiment: My daughter, who is in college now, was 3 years old when I began working with this community and went down a road that would lead to my involvement with SciPy and NumPy. I have marked the building of my family and the passage of time with where the Python for Scientific Computing Community was at. Like many of you, I have given a great deal of attention and time to building this community. That sacrifice and time has led me to love what we have created. I know that I leave this segment of the community with the tools in better hands than mine. I am hopeful that NumPy will continue to be a useful array library for the Python community for many years to come even as we all continue to build new tools for the future. </div>
<div class="p2">
<br /></div>
<div class="p1">
Very best regards,</div>
<div class="p2">
<br /></div>
<div class="p1">
-Travis </div>
<br />
<br /></div>
</div>
Travis Oliphanthttp://www.blogger.com/profile/04514536132317233988noreply@blogger.com9tag:blogger.com,1999:blog-68730239358084672.post-37421456893387334782012-10-10T00:28:00.002-07:002012-10-10T00:28:30.263-07:00Continuum and Open Source<div dir="ltr" style="text-align: left;" trbidi="on">
As an avid open source contributor for nearly 15 years --- and a father with children to provide for --- I've observed intently the discussions about how to monetize open source. As a young PhD student, I even spent hours avoiding my dissertation by reading about philosophy and economics to try to make sense of how an open-source economy might work. <br />
<br />
I love creating and contributing to open source code --- particularly code that has the potential to influence and touch millions of lives for the better. I really enjoy spending as much time as I can on that activity. On the other hand, the wider economy wants money from me for things like college expenses, housing, utilities, and the "camp champions" that I get to attend this week with my 11-year-old son. So, I have thought and read a lot about how to make money from open source.<br />
<br />
There are a lot of indirect ways to make money from open source which all amount to giving away the code and then making money doing "something else": training, support, consulting, documentation, etc. These are all ways you can sell the expertise that results from open source. Ultimately, however, under all these models open source is a marketing expense and you end up needing to focus your real attention on the thing you actually get paid for --- the service itself. As a result, the open source code you care about tends to receive less attention than you had originally hoped and you can only spend your "free time" on it. I've seen this play out over several years in multiple ways.<br />
<br />
I still believe that a model that is patterned after the original copyright/patent compromise of "limited-time" protection is actually a good one --- especially for certain kinds of software. Under this model, there are two code-bases: an open source one and a proprietary one. People pay for the software they want and use (and therefore developers get paid to write it) while premium features migrate from the paid-for branch to the free-and-open-source code base as that revenue comes in. <br />
<br />
While this model would not work for every project, it does have some nice features:<br />
<br />
<ul style="text-align: left;">
<li>it allows developers to work full-time on code that benefits users (as evidenced by those users' willingness to pay for the software)</li>
<li>developers have a livelihood directly writing code that "will become" open source as people pay for it</li>
<li>users only pay for software that they are getting "premium benefits" from and those premium benefits are lifting the state of open-source software over time</li>
</ul>
<div>
It is a wonderful thing for developers to have a user-base of satisfied customers. For all the benefits of open-source, I've also seen firsthand the difficulty of supporting a large user-base with no customers who are directly paying for continued support of the code-base, which eventually leads to less satisfied users. </div>
<div>
<br /></div>
<div>
I am thrilled to be part of a forward-thinking company like Continuum Analytics that is committed enough to open source software both to directly sponsor open source projects (like NumPy and Numba) and to move features from its premium products into open source. You can read more about Continuum's Open Source philosophy here: <a href="http://www.continuum.io/selling-open-source.html">Continuum and Open Source</a>. </div>
<div>
<br /></div>
<div>
For example, we recently moved a feature from our premium product, NumbaPro, into the open-source project Numba: the ability to compile a Python file directly to a shared library. You can read about that feature here: <a href="http://numba.pydata.org/numba-doc/dev/doc/pycc.html">Compiling Python code to Shared Library</a>.</div>
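<div>
<br /></div>
<div>
As a rough sketch of what that feature enables (the module name, function, and exact invocation below are made up for illustration --- see the linked documentation for the real details):</div>
<pre># mymodule.py --- ordinary Python code to be compiled ahead of time.
def mult(a, b):
    return a * b

# At a shell prompt (illustrative):
#   $ pycc mymodule.py      # produces a shared library, e.g. mymodule.so
# The result can then be imported from CPython like any extension module:
#   >>> import mymodule
#   >>> mymodule.mult(3.0, 4.0)
#   12.0
</pre>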
<div>
<br /></div>
<div>
We will continue to develop Numba in the open --- in conjunction with others who wish to participate in the development of that project. Our ability to spend time on this, of course, will be directly impacted by how many licenses of NumbaPro we can sell (along with our other products and services). So, if computing on GPUs, creating NumPy ufuncs and generalized ufuncs easily, or taking advantage of multiple cores in your Python computations is something that would benefit you, take a look at <a href="https://store.continuum.io/cshop/numbapro">NumbaPro</a> and see if it makes sense for you to purchase it. Hopefully, in addition to great software you appreciate, you will also recognize that you are contributing directly to the development of Numba.</div>
</div>
Travis Oliphanthttp://www.blogger.com/profile/04514536132317233988noreply@blogger.com5tag:blogger.com,1999:blog-68730239358084672.post-52743621659697168962012-09-02T02:37:00.002-07:002012-09-02T02:38:23.878-07:00John Hunter 1968-2012<div dir="ltr" style="text-align: left;" trbidi="on">
It was a shock to hear the news from Fernando that John Hunter needed chemotherapy to fight the cancer that had attacked him. Literally days before the news, we had been talking at the SciPy conference about how to take NumFOCUS to the next level. Together with the other members of NumFOCUS we have ambitious plans for the Foundation: scholarships and post-doc funds for students and early professionals contributing to open-source, conference sponsorship, packaging and continuous integration sponsorships, etc. We had been meeting via phone in board meetings every other week, and he was planning to send a message to the matplotlib mailing list encouraging people to donate to our efforts with NumFOCUS. Working with John in person on a mutual project was gratifying. His intelligence, enthusiasm, humility, and pragmatism were a perfect complement to our board discussions.<br />
<br />
He had also just spoken at SciPy 2012 and gave a great talk discussing his observations and lessons learned from Matplotlib. If you haven't seen the talk, stop reading this and go watch it <a href="http://www.youtube.com/watch?v=e3lTby5RI54">here</a> --- you will see a great and humble man describe a labor of love (and not give himself enough credit for what he accomplished).<br />
<br />
When I heard the news, I wrote a quick note to John expressing my support and appreciation for all he had done for Python --- not only because I truly feel that matplotlib is a major reason that projects I have invested so heavily in (NumPy and SciPy) have become so popular, but also because I knew that I had not shared enough with him how much I think of him. A sinking feeling in my heart was telling me that I may not have much time. <br />
<br />
This is what I sent him:<br />
<blockquote class="tr_bq">
<div class="p1">
Hey John,</div>
<div class="p2">
<br /></div>
<div class="p1">
I am so sorry to hear the news of your diagnosis. I will be praying for you and your family. I understand if you cannot respond. Please let me know if there is anything I can do to help. </div>
<div class="p2">
<br /></div>
<div class="p1">
I have so much respect for you and what you have done to make Python viable as a language for technical computing. I also just think you are an amazing human being with so much to give. </div>
<div class="p2">
<br /></div>
<div class="p1">
All the best for a speedy recovery. </div>
</blockquote>
<blockquote class="tr_bq">
-Travis </blockquote>
<br />
This is the response I received.<br />
<br />
<blockquote>
Thanks so much Travis. We're moving full speed ahead with a treatment plan -- chemo may start Tues. As unpleasant as it can be, I'm looking forward to the start of the fight against this bastard.<br />
<br />
Thanks so much for your other kind words. You've always been a hero to me and they mean a lot. I have great respect for what you are doing for numpy and NUMFOCUS, and even though I am stepping back from work and MPL and everything non-essential right now, I want to continue supporting NF while I'm able. </blockquote>
<blockquote>
All the best,<br />
JDH</blockquote>
<br />
I had no idea how much I would come to appreciate this small but meaningful exchange --- my last communication with John. Only a few weeks later, Fernando Perez (author of IPython and a great friend to John) sent word that our mutual friend had suffered an unexpected and terrible reaction to his initial treatment, which had placed him in critical condition; the prognosis was not good.<br />
<br />
I ached when, literally hours later, John died. I thought of his 3 daughters (each only about 3 years younger than my own 3 daughters) and how they would miss their father. I thought of the time he did not spend with them because he was writing matplotlib. I know exactly what that means because of the time I have sacrificed with my own little girls (and boys) bringing SciPy to life, merging Numarray and Numeric into NumPy, resurrecting llvmpy, and bringing Numba to life. I thought of the future time I would not get to spend with him building NumFOCUS into a foundation worthy of the software it promotes. I have not lost many of my loved ones to death yet. Perhaps this is why I have been so affected by his death. Not since my mother died 2 years ago (<a href="https://www.facebook.com/groups/145650038808004/">August 31, 2010</a>) has the passing of another driven me so.<br />
<br />
When I thought of John's girls, I thought immediately of what we could do to show love and appreciation. What would I want for my own children if I were no longer here to care for them? My oldest daughter had just started college and was experiencing that first transformative week. Perhaps this was why I thought that, more than anything, if I were not around I would want my girls to have enough money for their education. After speaking with Fernando and with approval from John's wife, Miriam, we set up the <a href="http://numfocus.org/johnhunter/">John Hunter Memorial Fund</a>. Anthony Scopatz, Leah Holdridge, and I have spent several hours since then making sure the site stays operational (mainly overcoming some unexpected difficulties caused by Google on Friday).<br />
<br />
My personal goal is to raise at least $100,000 for John's girls. This will not cover their entire education, but it will be a good start and a symbolic expression of appreciation for all those who work tirelessly on open source software for the benefit of many. After a few days we are at about $20,000 total (from about 450 donors). This is a great start and will be greatly appreciated by John's family --- but I know that all those who benefit from the free use of a high-quality plotting library can do better than that. If you have already given, thank you! If you haven't given something yet, please consider what John has done for you personally, and give your most generous donation. <br />
<br />
There are fees associated with using online payment networks. We will find a way to get those fees waived or covered by specific corporate donations, so don't let concern of the fees stop you from helping. We've worked hard to make sure you have as many options to pay as possible. You can use PayPal or WePay (which both have fees of 2.9% + $0.30), you can use an inexpensive payment network like <a href="https://www.dwolla.com/">Dwolla</a> (only $0.25 for sending more than $10 and free for sending less --- but you have to have a Dwolla account and put money into it), or you can do as <a href="https://twitter.com/dabeaz/status/241837488023425024">David Beazley suggested</a> and just send a check to one of the addresses listed on <a href="http://numfocus.org/johnhunter/">the memorial page</a>.<br />
<br />
Whatever you decide to do, just remember that it is time to give back!<br />
<br />
John has always been supportive of my work in open source. It was his voice that was one of the few positive voices that kept me going in the early days of NumPy when other voices were more discouraging. He has also consistently been a calming and supportive voice on the mailing lists when others have been less considerate and sometimes even hostile. I'm very sorry he will not be able to see even more results of his tireless efforts. I'm very sorry we won't get to feel more of his influence in the world. The world has lost one who truly recognized that great things require cooperation of many people. Obtaining that cooperation takes sacrifice, trust, humility, a willingness to listen, a willingness to speak out with respect, and a willingness to forgive. He exemplified those characteristics. I am truly saddened that I will not be able to learn more from him.<br />
<br />
When SciPy was emerging from my collection of modules in 2001, one of the things Eric Jones and I wanted was an integrated plotting package. We spent time on a couple of plotting tools in early SciPy (a simple WX plotting widget, xplot based on Yorick's gist). These early steps were not going to get us what users needed. Fortunately, John Hunter came along around 2001 and started a new project called Matplotlib, which steadily grew in popularity until it literally exploded in about 2004 with funding from Perry Greenfield and the <a href="http://www.stsci.edu/portal/">Space Telescope Science Institute</a> and the efforts of the current principal developer of Matplotlib: <a href="http://matplotlib.1069221.n5.nabble.com/ANN-Michael-Droettboom-matplotlib-lead-developer-td5037.html">Michael Droettboom</a>.<br />
<br />
I learned from John's project many important things about open source development. A few of them:<br />
<br />
<ul style="text-align: left;">
<li>Examples, documentation, and ease of use matter -- a lot</li>
<li>Large efforts like Python for Science need a lot of people and a distributed, independent development environment (not everything belongs in a single namespace).</li>
<ul>
<li>SciPy needed to be a modular "library" not a replacement for Matlab all by itself. </li>
<li>The community needed a unifying installation to make it easy for the end-user to get everything, but we did not need a single namespace. </li>
<li>Open source projects can only cover as much space as a team of about 5-7 active developers can understand. Then, they need to be organized into larger integration and distribution projects --- a hierarchical federation of projects. </li>
<li>The only way large projects can survive is by separating concerns, having well defined interfaces, and groups that work on individual pieces they have expertise in. </li>
</ul>
<li>Backwards compatibility matters a great deal to an open source project (he created numerix for Matplotlib to ease end-users' migration from Numeric through Numarray to NumPy)</li>
</ul>
<div>
I'm sure if John were here, he could improve my rough outline and make it much better. From improving plotting libraries to making good use of record arrays, he was always doing that. In fact, one of John's last contributions to the world was improving the mission statement of NumFOCUS. In a recent board meeting, he suggested adding the word "accessible" to the mission statement: <span style="background-color: white; color: #2e3e4d; font-family: Arial, Helvetica, sans-serif; font-size: 14px; line-height: 21px;"><b>The purpose of NumFOCUS is to promote the use of accessible and reproducible computing in science and technology. </b></span></div>
<div>
<br /></div>
<div>
His life's work has indeed been to make science and technology computing more accessible through making Python the <i>de facto</i> standard for doing science with his excellent plotting tool. Let's continue to improve the legacy he has left us by working together to make computing even more accessible. We have a long way to go, but by standing on the shoulders of giants like John we can see just that much farther and continue the journey. </div>
<div>
<br /></div>
<div>
Besides helping his daughters, there is nothing more fitting we can do to honor John's memory than to continue promoting the work he spent so many hours of his life advancing: contributing to open source projects and/or <a href="http://numfocus.org/donatejoin/">financially supporting</a> the foundation he wanted to see succeed. </div>
<div>
<br /></div>
<div>
Great people lift us both in life and in death. In life they are gracious contributors to our well-being and encourage us to grow. In death they cause us to reflect on the precious qualities they embodied. They make us want to improve. When we think of them, we want to hold our children close, give an encouraging word to a colleague, feel gratitude for our friends and family, and forgive someone who has hurt us. John Hunter (1968 - 2012) was truly a great man!</div>
<br /></div>
Travis Oliphanthttp://www.blogger.com/profile/04514536132317233988noreply@blogger.com3tag:blogger.com,1999:blog-68730239358084672.post-59247290608562467162012-08-15T17:23:00.001-07:002012-08-15T20:56:46.331-07:00Numba and LLVMPy<div dir="ltr" style="text-align: left;" trbidi="on">
It's been a busy year so far. All the time spent on starting a new company, starting new open source projects, and keeping up with the open source projects that I have interest in has meant that I haven't written nearly as many blog-posts as I planned on. But this is probably a good thing, at least if you follow the wisdom attributed to <a href="http://www.biblegateway.com/passage/?search=Proverbs+17%3A27-28&version=NIV">Solomon</a> --- which has been paraphrased in <a href="http://www.brainyquote.com/quotes/quotes/a/abrahamlin109276.html">this quote</a> attributed to Abraham Lincoln.<br />
<div>
<br /></div>
<div>
One of the things that has been on my mind for the past year is promoting array-oriented computing as a fundamental concept more developers need exposure to. This is one reason that I am so excited that I've been able to find great people to work on Numba (which intends to be an array-oriented compiler for Python code). I have given a few talks trying to convey what is meant by array-oriented computing, but the essence is captured by the difference between the <a href="http://www.opensource.apple.com/source/python/python-3/python/Demo/curses/life.py">life.py</a> example in the Python code-base and a NumPy version of the <a href="https://gist.github.com/3353411">same code</a>. <br />
<br />
I have seen many, many real world examples of very complicated code that could be simplified and sped up (especially on modern hardware) by just thinking about the problem differently using array-oriented concepts. </div>
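<div>
<br /></div>
<div>
A small illustration of that style of thinking (this is not the linked gist --- just a sketch of one array-oriented way to express a Game-of-Life update with NumPy):<br />
<pre>import numpy as np

def life_step(grid):
    # grid is a 2-d NumPy array of 0s and 1s.  Count each cell's 8
    # neighbors all at once by summing shifted copies of the whole grid
    # (np.roll gives wrap-around boundaries) --- no explicit loops.
    neighbors = sum(np.roll(np.roll(grid, i, axis=0), j, axis=1)
                    for i in (-1, 0, 1) for j in (-1, 0, 1)
                    if (i, j) != (0, 0))
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)
</pre>
</div>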
<div>
<br /></div>
<div>
One of the goals for Numba is to make it possible to write more vectorized code easily in Python without relying just on the pre-compiled loops that NumPy provides. In order to write Numba, though, we first needed to resurrect the llvm-py project, which provides easy access to the LLVM C++ libraries from Python. This project is interesting in its own right: in addition to forming a base tool chain for Numba, it allows you to instrument C code compiled with Clang into bitcode, build a compiler, or import bitcode directly into Python (a la <a href="https://github.com/dabeaz/bitey/">bitey</a>). <br />
<br />
While the documentation for llvm-py left me frustrated early on, I have to admit that llvm-py re-kindled some of the joy I experienced when I was first exposed to Python. Over the past several weeks we have worked to create the llvmpy project from <a href="http://www.mdevan.org/llvm-py/">llvm-py</a>. We now have a domain <a href="http://www.llvmpy.org/">http://www.llvmpy.org</a>, a GitHub repository, a website served from GitHub, and Sphinx-based documentation that can be edited via a pull request. The documentation still needs a lot of improvement (even to get it to the state that the old llvm-py project was in), and contributions are welcome. <br />
<br />
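To give a flavor of the library, here is a from-memory sketch in the style of the llvm-py examples (class and method names varied between versions, so treat the exact spellings as approximate rather than authoritative):<br />
<pre>from llvm.core import Module, Type, Builder

# Build LLVM IR for: int add(int a, int b) { return a + b; }
mod = Module.new('demo')
fnty = Type.function(Type.int(), [Type.int(), Type.int()])
fn = mod.add_function(fnty, 'add')
builder = Builder.new(fn.append_basic_block('entry'))
builder.ret(builder.add(fn.args[0], fn.args[1]))

print mod   # dumps the module as human-readable LLVM assembly
</pre><br />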
I'm grateful to Fernando Perez, author of IPython, for explaining the 4-repository approach to managing an open source web-site and documentation via GitHub. We are using the same pattern that IPython uses for both numba and llvmpy. It took a bit of work to get set up, but it's a nice approach that should make it easier for the community to maintain the documentation and web-site of both of these projects. The idea is simple. Use a project page (repo llvmpy.github.com) to be the web-site, but generate this repo from another repo (llvmpy-webpage) which contains the actual sources. I borrowed the scripts from the IPython project to build the page from the sources, check out the llvmpy.github.com repo, copy the built pages to the repo, and then push the updates back to GitHub, which actually updates the site. The same process (slightly modified) is used for the documentation, except the sources for the docs live in the llvmpy repo under the docs directory and the built pages are pushed to the gh-pages branch of the llvmpy-doc repo. If you are editing sources you only modify llvmpy/docs and llvmpy-webpage files. The other repos are generated and pushed via scripts.<br />
<br />
We are using the same general scheme to host the numba pages (although there I couldn't get the numba.org domain name and so I am using <a href="http://numba.pydata.org/">http://numba.pydata.org</a>). With llvmpy on a relatively solid footing, attention could be shifted to getting a Numba release out. Today, we finally released <a href="http://www.py2llvm.org/">Numba 0.1</a>. It took longer than expected after the SciPy conference mainly because we were hoping that some of the changes (still currently in a devel branch) to use an AST-based code-generator could be merged into the main-line before the release. <br />
<br />
Jon Riehl did the lion's share of the work to transform Numba from my early prototype to a functioning system in 0.1 with funding from <a href="http://www.continuum.io/">Continuum Analytics, Inc.</a> Thanks to him, I can proudly say that Numba is ready to be tried and used. It is still early software --- but it is ready for wider testing. One of the problems you will have with Numba right now is error reporting. If you make a mistake in the Python code that you are decorating, the error you get will not be informative --- so test the Python code before decorating it with Numba. But, if you get things right, Numba can speed up your Python code by 200 times or more. It is really pretty fun to be able to write image-processing routines in Python. PyPy can do this too, of course, but with Numba you have full integration with the CPython stack, and you don't have to wait for someone to port the library you also want to use to PyPy.<br />
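As a taste of the programming model (the decorator spelling below is the later numba.jit form rather than the 0.1-era API, so treat the details as illustrative):<br />
<pre>from numba import jit

@jit
def smooth(img, out):
    # A 3x3 box filter written as plain nested loops --- exactly the
    # kind of code Numba compiles down to fast machine code.
    ny, nx = img.shape
    for i in range(1, ny - 1):
        for j in range(1, nx - 1):
            s = 0.0
            for di in range(-1, 2):
                for dj in range(-1, 2):
                    s += img[i + di, j + dj]
            out[i, j] = s / 9.0
</pre><br />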
<br />
Numba's road-map is being defined right now by the people involved in the project. On the horizon is support for NumPy index expressions (slices, etc.), merging of the devel branch which uses the AST and Mark Florisson's minivect compiler, improving support for error checking, emitting calls to the Python C-API for code that cannot be type-specialized, and improving complex-number support. Your suggestions are welcome.<br />
<br /></div>
</div>
Travis Oliphanthttp://www.blogger.com/profile/04514536132317233988noreply@blogger.com6tag:blogger.com,1999:blog-68730239358084672.post-25078008935101467792012-07-30T22:21:00.003-07:002012-08-14T15:07:14.115-07:00More PyPy discussions<div dir="ltr" style="text-align: left;" trbidi="on">
I'm very glad that my co-founder of <a href="http://www.continuum.io/">Continuum Analytics</a>, Peter Wang, has published his recent <a href="http://pwang.wordpress.com/2012/07/30/compilers-runtimes-and-users-oh-my/">follow-up blog-post</a> that hopefully clarifies his perspective on the on-going dialogue about CPython and PyPy.<br />
<br />
Peter is a fundamentally good-natured person, and he is a lot of fun to be around --- even when he is disagreeing with you. I'm very fortunate to be working with him on a daily basis. He can be opinionated, but his ability to connect deeply to a wide variety of subjects means that you come away from a dialogue with him having learned something (even if you still remain unconvinced by his views). <br />
<br />
Peter is also one of the smartest people I've ever met. One of my great memories in life is sitting at dinner with Peter and <a href="http://en.wikipedia.org/wiki/Eric_Weinstein">Eric Weinstein</a> while those two great minds treated me, <a href="http://twitter.com/wesmckinn">Wes McKinney</a>, and <a href="http://www.linkedin.com/profile/view?id=14998834&locale=en_US&trk=tyah">Adam Klein</a> to the most impressive display of metaphor ping-pong I've ever seen, covering a wide variety of topics from social justice to string theory. I could keep up with the dialogue, but not enough to really participate meaningfully --- and the other two Ivy-league-educated dinner partners were in the same boat.<br />
<br />
I fundamentally agree with Peter's perspective that CPython-the-runtime is and will remain the centerpiece of the Python conversation. In fact, I would say that even more focus needs to be on CPython-the-runtime. It is great to see improvements <a href="http://docs.python.org/dev/whatsnew/3.3.html">in Python 3.3</a> like the completion of the memory-view implementation and the fixing of the internal string (Unicode) representation, but there are many other improvements that could be made. <br />
<br />
It is a wonderful and inspiring thing to see great developers think out of the box with novel projects like Jython, IronPython, and PyPy. Nonetheless from my perspective we still have a long way to go to really connect the average developer with ideas of <a href="http://www.slideshare.net/pycontw/largescale-arrayoriented-computing-with-python">array-oriented computing</a> that could really help the continuing onslaught of parallel-devices-in-search-of-software. As a result, it feels like those wanting Java, .NET, and machine-code integration would be better served by more attention on <a href="http://jpype.sourceforge.net/index.html">JPype</a>, <a href="http://pythonnet.sourceforge.net/">Python.NET</a>, <a href="http://numba.github.com/llvm-py/">LLVMPy</a>, and even <a href="http://www.corepy.org/">CorePy</a>. Such efforts would also be better for the entire user-base of Python --- especially a majority of industry uses of Python. <br />
<br />
But regardless of my perspective, I'm encouraged by the PyPy developers' enthusiasm, and I do want to encourage dialogue whatever my own views. As a result, I am very happy to report that both <a href="http://www.numfocus.org/">NumFOCUS</a> and <a href="http://www.continuum.io/">Continuum Analytics</a> recently joined forces to sponsor <a href="http://twitter.com/fijall">Maciej Fijalkowski</a> on a small project to create an embedded version of PyPy --- a "PyPy-in-a-Box." This is an integration of PyPy into the CPython run-time (so that you can speed up a particular CPython function by calling out to a library version of PyPy). This is proof-of-concept code, so it is not appropriate for production --- but it is a good example of what is possible when we all work together to promote the Python ecosystem. <br />
<br />
The online project is here: <a href="https://bitbucket.org/fijal/hack2/src/default/pypyembed">https://bitbucket.org/fijal/hack2/src/default/pypyembed</a> and you can get a binary version that works on 64-bit Linux here: <a href="http://baroquesoftware.com/~fijal/pypy-1.9-in-a-box-linux64.tar.bz2">http://baroquesoftware.com/~fijal/pypy-1.9-in-a-box-linux64.tar.bz2</a>. <br />
<br />
This approach needs more development to be a viable tool in the CPython ecosystem, but one of my suggestions to the PyPy community is that they focus on "shedding-tools" like this one for the CPython world --- so that everyone can benefit from their innovations. With an integration effort like embedded PyPy, one can also make better comparisons with tools like <a href="http://numba.github.com/numba">Numba</a> --- another dynamic-compilation run-time that uses <a href="http://www.llvm.org/">LLVM</a> and <a href="http://numba.github.com/llvm-py/">LLVM-py</a>. Numba has made a lot of progress in the last few months. In fact, I recently gave a talk on the project at the well-attended <a href="http://conference.scipy.org/scipy2012/schedule/conf_schedule_1.php">SciPy2012</a> conference in Austin. You can view <a href="http://www.slideshare.net/teoliphant/numba">my slides</a> that outline and motivate the project online. An actual release of the project is imminent, but you can already use <a href="http://numba.github.com/numba/">Numba</a> to very easily write significant Python code using <a href="http://www.numpy.org/">NumPy</a> arrays that executes at "C-speeds." But, that is worth another blog-post of its own....<br />
<br /></div>
Travis Oliphanthttp://www.blogger.com/profile/04514536132317233988noreply@blogger.com2tag:blogger.com,1999:blog-68730239358084672.post-15173136851499805132012-01-07T21:40:00.000-08:002012-01-08T12:32:34.772-08:00Transition to Continuum<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
Our lives are punctuated by transformational events: the birth of a child, finishing school, the passing of a loved one, meeting someone special. Even without the regular beating of celestial rhythms to provide opportunities for renewal we would have these moments to measure our lives by. Once in a while, the rhythmic and the asynchronous coincide, providing a particularly poignant opportunity for change. Jan 1, 2012 was just such a time for me as I left my position as President of Enthought to start a new venture with Peter Wang (author of Chaco) and others. Our new company is Continuum Analytics, Inc. (or just Continuum). Our nascent website, initially aimed only at the Python initiate, is <a href="http://www.continuum.io/">http://www.continuum.io</a>. <br />
<br />
While I am ecstatic about the new venture, I will definitely miss the team that we've built around the world that has delivered Enthought's second consecutive record year. This team of exceptional individuals has been very successful at improving and expanding the Python story in a few targeted companies inside of the Fortune 50 as well as making it easy to install Python for the masses. Those who have taken the time to first install and then learn Traits, TraitsUI, Chaco, MayaVi, and the rest of the Enthought Tool Suite have had their efforts rewarded with increased productivity in the creation of rich client UIs and improved pluggable, scriptable, and component-based architecture. It has been a highly educational experience to participate with Enthought. There is much you learn about business, people, and the world when a software consulting company grows from 1 office with fewer than 17 people to 4 offices around the world and nearly 50 people. I will always be grateful to the Enthought founders, employees (past and present), and customers (past and present) for the relationships, the trust, and the thoughtful times we shared in learning, growing, and serving each other.<br />
<br />
My heart, however, has always been and continues to be with NumPy and SciPy which need more support than Enthought can currently provide --- so I must move on. It took a lot of trust from my wife when (with 3 small children at home) she patiently waited for me while I spent all of 1999 writing Multipack (which in 2001 formed the bulk of SciPy). It also took trust when in 2005 (with now 5 children at home) she watched me sacrifice my tenure-track position by writing NumPy instead of more papers. In 2012 (with now 6 children at home), I'm asking her to trust me one more time while I leave a comfortable salary with a good company to put more effort full time in helping take NumPy and SciPy to the next level.<br />
<br />
Over the past 4 1/2 years consulting with large companies I have learned a great deal about what NumPy (and SciPy) can and should be. These and related tools in the Python ecosystem need to become significant pieces of real solutions to the data analytics challenges that face us. R, Hadoop, and other (proprietary) solutions are already staking their claim on the space that Python should be dominating. Python has significant traction in science and analysis but too little publicity in the nascent nomenclature of data analytics. In order to accelerate the processing capabilities of Python and related tools, much progress needs to be made. My New Year's resolution this year is to begin to contribute more substantially to that progress by organizing a new company that will hopefully allow many people to spend significant time directly on NumPy and SciPy during working hours. I also hope to assist any public, non-profit efforts toward that mutual goal, and to spend more time myself on NumPy and SciPy.<br />
<br />
To realize my hopes long term, the company must succeed. For the company to succeed it must find customers --- people willing to buy something that it sells. People are appropriately particular about what they buy. Making products that delight will require a lot of work from Continuum, but I am excited to help organize and work alongside the best team we can put together to do it. This may also mean different business models and licensing around some of the NumPy-related code that the company writes. I recognize this may cause some raised eye-brows. I deeply value making code freely available. I'm a Jeffersonian at heart and believe that ideas (including code) should be shared freely. Six years ago I experimented by selling my "Guide to NumPy" long enough to make sufficient money to justify the effort. The book ended up in the public domain and contributed substantially to the current NumPy documentation. This is an illustration of how resources can be allocated to full-time attention and then later made available for all to enjoy. Of course there are other models that also work to accomplish similar ends and we will be actively exploring a few of them.<br />
<br />
Despite my ideals, my wife thanks me that I'm a pragmatist with children to provide for. In addition, I have watched wearily as it's been difficult to find volunteer labor (including my own) to turn NumPy into the data-management and data-analytics substrate that it should already be. All of this happens while huge sums of money are wasted at companies large and small inefficiently transforming raw, but inaccessible data into something closer to information that can be used for decisions by the domain experts. The information available is not what it can be. The amount of effort it takes to transform the data to actionable information is not where it can be. The wide-spread understanding about how to program parallel and distributed machines is not where it can be. We can and must do better in figuring out how to get full-time attention on NumPy and related tools while still making them widely available. <br />
<br />
At Continuum, we have a vision for significantly changing how people manipulate, transform, and uncover their data. We also have customer-driven plans to achieve it, and we are going to put our full energy into it. So far, the development team consists of Peter Wang, me, Mark Wiebe, Francesc Alted (PyTables), and Bryan Van de Ven. We will also be getting part-time but important development help from Hugo Shi and Andy Terrel. In addition, we are building an initial support/business staff to help us build and grow the business. We plan to continue to collaborate with others in the community both commercial (e.g. Wes McKinney in his new startup: <a href="http://lamdafoundry.com/rapidquant/">Lambda Foundry</a>) and open (e.g. Fernando Perez, Brian Granger, Min Ragan-Kelley of <a href="http://ipython.org/">IPython</a> fame). If you are interested in either joining us or collaborating with us, please send us an email at <a href="mailto:info@continuum.io">info@continuum.io</a>. Also, please follow us on Twitter <a href="http://twitter.com/ContinuumIO">@ContinuumIO</a> or Like us on <a href="https://www.facebook.com/ContinuumAnalytics">Facebook</a>. <br />
<br />
We are actively looking for customer partners, as well. If you are interested in learning more about where we are heading and how that might help you, please <a href="mailto:info@continuum.io">drop us a line</a>, or come see us at PyCon this year. We will also be at <a href="http://strataconf.com/">Strata</a>, and afterwards we will be hosting a Python Data Workshop ("PyData") at the Googleplex. Please sign up for the PyData workshop wait-list at <a href="http://pydataworkshop.eventbrite.com/">http://pydataworkshop.eventbrite.com/</a> (we could only find room for 50 people at the Googleplex). However, given that the event is free of charge, I'm expecting some people who have reserved their spot may not actually be able to attend. So, signing up on the wait-list is still worthwhile. <br />
<br />
This year will be an exciting one for us. When I get a spare moment, I still hope to finish a few of the blogs that I've started and possibly include some more that describe more of what I've learned over the past several years as a scientist/engineer-turned-software developer, lessons about running a software company, more of where we are headed at Continuum, reflections on open source, and other more technical ramblings.<br />
<div>
<br /></div>
</div>Travis Oliphanthttp://www.blogger.com/profile/04514536132317233988noreply@blogger.com10tag:blogger.com,1999:blog-68730239358084672.post-67761443611724723682011-10-16T16:15:00.000-07:002011-10-16T16:15:36.252-07:00Thoughts on porting NumPy to PyPy<div dir="ltr" style="text-align: left;" trbidi="on">Last weekend, I attended GitHub's PyCodeConf in Miami, Florida and had the opportunity to give a talk on array-oriented computing and Python. I would like to thank Tom and Chris (GitHub founders) for allowing me to come speak. I enjoyed my time there, but I have to admit I felt old and a bit out of place. There were a lot of young people there who understood a lot more about making web-pages and working at cool web start-ups than solving partial differential equations with arrays. Fortunately, Dave Beazley and Raymond Hettinger were there so I didn't feel completely ancient. In addition, Wes McKinney and Peter Wang helped me represent the NumPy community. <br />
<br />
At the conference I was reminded of PyPy's recent success at showing the speedups some said weren't possible from a dynamic language like Python --- speedups that make it possible to achieve C-like speeds using Python constructs. This is very nice as it illustrates again that high-level languages can be compiled to low-level speeds. I also became keenly aware of the enthusiasm that has cropped up around porting NumPy to PyPy. I am happy about this enthusiasm, as it illustrates the popularity of NumPy. On the other hand, in every discussion I have heard or read about this effort, I'm not convinced that anyone excited about this porting effort actually understands the complexity of what they are trying to do or the dangers it could create: splitting the small community of developers who regularly contribute to NumPy and SciPy and causing confusion for the user base of Python in science. <br />
<br />
I'm hopeful that I can provide some perspective. Before I do this, however, I want to congratulate the PyPy team and emphasize that I have the utmost respect for the PyPy developers and what they have achieved. I am also a true-believer in the ability for high-level languages to achieve faster-than-C speeds. In fact, I'm not satisfied with a Python JIT. I want the NumPy constructs such as vectorization, fancy indexing, and reduction to be JIT compiled. I also think that there are use-cases of NumPy all by itself that makes it somewhat interesting to do a NumPy-in-PyPy port. I would also welcome the potential things that can be learned about how to improve NumPy that would come out of trying to write a version of NumPy in RPython. <br />
<br />
However, to avoid detracting from the overall success of Python in Science, Statistics, and Data Analysis, I think it is important that 3 things are completely clear to people interested in the NumPy on PyPy idea. <br />
<ol style="text-align: left;"><li>NumPy is just the beginning (SciPy, matplotlib, scikits, and 100s of other packages and legacy C/C++ and Fortran code are all very important)</li>
<li> NumPy should be a lot faster than it is currently.</li>
<li>NumPy has an ambitious roadmap and will be moving forward rather quickly over the coming years. </li>
</ol><h2>NumPy is just the beginning </h2>Most of the people who use NumPy use it as an entry-point to the entire ecosystem of Scientific Packages available for Python. This ecosystem is huge. There are at least 1 million unique visitors to the <a href="http://www.scipy.org/">http://www.scipy.org</a> site every year and that is just an entry point to the very large and diverse community of technical computing users who rely on Python. <br />
<br />
Most of the scientists and engineers who have come to Python over the past years have done so because it is so easy to integrate their legacy C/C++ and Fortran code into Python. National laboratories, large oil companies, large banks and many other Fortune 50 companies all must integrate their code into Python in order for Python to be part of their story. NumPy is part of the answer that helps them seamlessly view large amounts of data as arrays in Python or as arrays in another compiled language without the non-starter of copying the memory back and forth. <br />
<br />
Once the port of NumPy to PyPy has finished, are you going to port SciPy? Are you going to port matplotlib? Are you going to port scikits.learn, or scikits.statsmodels? What about Sage? Most of these rely on not just the Python C-API but also the NumPy C-API which you would have to have a story for to make a serious technical user of Python get excited about a NumPy port to PyPy. <br />
<br />
To me it is much easier to think about taking the ideas of PyPy and pulling them into the Scientific Python ecosystem than going the other way around. That's not to say there isn't some value in re-writing NumPy in PyPy; it just shouldn't be over-sold, and those who fund it should understand what they aren't getting in the transaction. <br />
<h2>C-speed is the wrong target</h2>Several examples, including my own previous blog post, have shown that vectorized Fortran 90 can be 4-10 times faster than NumPy. Thus, we know there is room for improvement even on current single-core machines. This doesn't even take into account the optimizations that should be possible for multiple cores, GPUs, and even FPGAs, all of which are in use today but are not being utilized to the degree they should be. NumPy needs to adapt to make use of this kind of hardware and will adapt in time. <br />
<h2>NumPy will be evolving rapidly over the coming years</h2>The pace of NumPy development has leveled off in recent years, but this year has created a firestorm of new ideas that will be coming to fruition over the next 1-2 years and NumPy will be evolving fairly rapidly during that time. I am committed to making this happen and will be working very hard in 2012 on the code-base itself to realize some of the ideas that have emerged. Some of this work will require some re-factoring and re-writing as well. I would honestly rather collaborate with PyPy than compete, but my constraints are that I care very much about backward compatibility and very much about the entire SciPy ecosystem. I sacrificed a year of my life in 1999 (delaying my PhD graduation by at least 6-12 months) bringing SciPy to life. I sacrificed my tenure-track position in academia bringing NumPy to life in 2005. Constraints of keeping my family fed, clothed, and housed seem to keep me on this 6-7 year sabbatical-like cycle for SciPy/NumPy but it looks like next year I will finally be in a position to spend substantial time and take the next steps with NumPy to help it progress to the next stage.<br />
<br />
Some of the ideas that will be implemented include:<br />
<ul style="text-align: left;"><li>integration of non-contiguous memory chunks into the NumPy array structure (generalization of strides)</li>
<li> addition of labels to axes and dimensions (generalization of shape)</li>
<li>derived fields, enumerated data-types, reference data-types, and indices for structured arrays </li>
<li>improvements to the data-type infrastructure to make it easier to add new data-types</li>
<li>improvements to the calculation infrastructure (iterators and fast general looping constructs)</li>
<li>fancy-indexing as views</li>
<li>integration of Pandas group-by features</li>
<li>missing data bit-patterns</li>
<li>distributed arrays </li>
</ul>In conversations with many people this year, more ideas have emerged than there is room to talk about, and I am excited to start seeing these ideas come to fruition to make NumPy and Python the best solution for data-analysis. Beginning next year, I will be pushing hard for their introduction into the NumPy/SciPy ecosystem --- with a careful eye on backward compatibility, which has long been one of NumPy's strengths. <br />
<h2>A way forward</h2>I would love to see more scientific code written at a high level without sacrificing run-time performance. The high-level intent allows for the creation of faster machine code than lower-level translations of that intent often do. I know this is possible, and I intend to do everything I can professionally to see that happen (but from within the context of the entire SciPy ecosystem). As this work emerges, I will encourage PyPy developers to join us using the hard-won knowledge and tools they have created. <br />
<br />
Even if PyPy continues as a separate eco-system, there are points of collaboration that will benefit both groups. One of these is to continue the effort Microsoft initially funded to separate the C-parts of NumPy away from the CPython interface to NumPy. This work is now in a separate branch that has diverged from the main NumPy branch and needs to be re-visited. If people interested in NumPy on PyPy spent time improving this refactoring into basically a NumPy C-library, then PyPy could call this independent library using its methods for making native calls, just as CPython can call it using its extension approach. Then IronPython, Jython (and for that matter Ruby, Javascript, etc.) could all call the C-library and leverage the code. There is some effort to do this and it's not trivial. Perhaps there is even a way for PyPy to generate C-libraries from Python source code --- now that would be an interesting way to collaborate. <br />
<br />
The second way forward is for PyPy to interact better with the Cython community. Support in PyPy for Cython extension modules would be a first step. There is wide agreement among NumPy developers that more of NumPy should be written at a high level (probably using Cython). Cython already is used to implement many, many extension modules for the Sage project. William Stein's valiant efforts in that community have made Cython the de-facto standard for how most scientists and engineers are writing extension modules for Python these days. This is a good thing for efforts like PyPy because it adds a layer of indirection that allows PyPy to make a Cython backend and avoid the Python C-API. <br />
<br />
I was quite astonished that Cython never came up in the panel discussion at the last PyCon when representatives from CPython, PyPy, IronPython, and Jython all talked about the Python VMs. To me that oversight was very troublesome. I was left doubting the PyPy community after Cython was not mentioned at all --- even when the discussion about how to manage extensions to the language came up during the panel discussion. It shows that pure Python developers on all fronts have lost sight of what the scientific Python community is doing. This is dangerous. I encourage Python developers to come to a SciPy conference and take a peek at what is going on. I hope to be able to contribute more to the discussion as well. <br />
<br />
If you are a Python developer and want to extend an olive branch, then put a matrix infix operator into the language. It's way past time :-)<br />
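<br />
For the curious, here is the kind of difference such an operator would make (the @ spelling below is purely hypothetical --- no such operator exists in the language at this writing):<br />
<pre>import numpy as np

A = np.random.rand(3, 3)
B = np.random.rand(3, 3)

C = np.dot(np.dot(A, B), A)   # today: nested function calls for matrix products
# C = A @ B @ A               # the wished-for infix spelling (hypothetical)
</pre>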
<br />
</div>Travis Oliphanthttp://www.blogger.com/profile/04514536132317233988noreply@blogger.com19tag:blogger.com,1999:blog-68730239358084672.post-40599932785459989362011-07-04T23:15:00.000-07:002011-07-04T23:15:28.558-07:00Speeding up Python Again<div dir="ltr" style="text-align: left;" trbidi="on"><br />
After getting a few great comments on my <a href="http://technicaldiscovery.blogspot.com/2011/06/speeding-up-python-numpy-cython-and.html">recent post</a> --- especially regarding using PyPy and Fortran 90 to speed up Python --- I decided my simple comparison needed an update. <br />
<br />
The big news is that my tests for this problem actually showed PyPy quite favorably (even a bit faster than the CPython NumPy solution). This is very interesting indeed! I knew PyPy was improving, but this shows it has really come a long way. <br />
<br />
Also, I updated the Python-only comparison to not use NumPy arrays at all. It is well-known that NumPy arrays are not very efficient containers for doing element-by-element calculations in Python syntax. There is both more overhead for getting and setting elements than there is for simple lists, and the NumPy scalars that are returned when specific elements of NumPy arrays are selected can be a bit slow when doing scalar math computations on the Python side. <br />
<br />
Finally, I included a Fortran 90 example based on the code and comments provided by SymPy author Ondrej Certik. Fortran 77 was part of the original comparison that Prabhu Ramachandran put together several years ago. Fortran 90 includes some nice constructs for vectorization that make its update code very similar to the NumPy update solution. Apparently, gfortran can optimize this kind of code very well. In fact, the Fortran 90 solution was the very best of all of the approaches I took (about 4x faster than the NumPy solution and 2x faster than the other compiled approaches). <br />
<br />
At Prabhu's suggestion, I made the code available at <a href="https://github.com/scipy/speed">github</a> under a new GitHub repository in the SciPy project so that others could contribute and provide additional comparisons. <br />
<br />
The new results are summarized in the following table, which I updated by running on a 150x150 grid, again with 8000 iterations. <br />
<br />
<table border="1"><tbody>
<tr><th>Method</th> <th>Time (sec)</th> <th>Relative Speed</th> </tr>
<tr> <td>Pure Python</td> <td>202</td> <td>36.3</td> </tr>
<tr> <td>NumExpr</td> <td>8.04</td> <td>1.45</td> </tr>
<tr> <td>NumPy</td> <td>5.56</td> <td>1</td> </tr>
<tr> <td><b>PyPy</b></td> <td>4.71</td> <td><b>0.85</b></td> </tr>
<tr> <td>Weave</td> <td>2.42</td> <td>0.44</td> </tr>
<tr> <td>Cython</td> <td>2.21</td> <td>0.40</td> </tr>
<tr> <td>Looped Fortran</td> <td>2.19</td> <td>0.39</td> </tr>
<tr> <td><b>Vectorized Fortran</b></td> <td>1.42</td> <td><b>0.26</b></td> </tr>
</tbody></table><br />
The code for both the Pure Python and the PyPy solution is <a href="https://github.com/scipy/speed/blob/master/laplace/laplace2.py">laplace2.py</a>. This code uses a list-of-lists for the storage of the values. The same code produces the Pure Python solution and the PyPy solution. The only difference is that one is run with the standard CPython and the other with the PyPy binary. Here is sys.version from the PyPy binary used to obtain these results: <br />
<br />
<pre>'2.7.1 (b590cf6de419, Apr 30 2011, 03:30:00)\n[PyPy 1.5.0-alpha0 with GCC 4.0.1]'
</pre><br />
This is a pretty impressive achievement for the PyPy team. Kudos!<br />
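The heart of that list-of-lists code looks roughly like the following (a sketch, not the exact contents of laplace2.py):<br />
<pre>def update(u, dx2, dy2):
    # u is a list of lists of floats; update the interior points in
    # place with the weighted average of the four neighboring points.
    nx, ny = len(u), len(u[0])
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            u[i][j] = ((u[i-1][j] + u[i+1][j]) * dy2 +
                       (u[i][j-1] + u[i][j+1]) * dx2) / (2 * (dx2 + dy2))
</pre><br />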
<br />
For the other solutions, the code that was executed is located at <a href="https://github.com/scipy/speed/blob/master/laplace/laplace.py">laplace.py</a>. The Fortran 90 module compiled and made available to Python with f2py is located at <a href="https://github.com/scipy/speed/blob/master/laplace/_laplace.f90">_laplace.f90</a>. The single Cython solution is located at <a href="https://github.com/scipy/speed/blob/master/laplace/_laplace.pyx">_laplace.pyx</a>.<br />
<br />
It may be of interest to some to see what the actual calculated potential field looks like. Here is an image of the 150x150 grid after 8000 iterations: <br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5i8r99pvpzyKuFEoHWKG98hF7TixGOsNT9LJed9e57NPtMxmcD_kQ9cHow5zRiLPQDqX02eWz_LphDdBVSSL4XbFiL7iJ4O8t70wt1T6bD2dTNbSJSJGBFH_Bv2FdJ2TBna2-mwZJiaI/s1600/image.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5i8r99pvpzyKuFEoHWKG98hF7TixGOsNT9LJed9e57NPtMxmcD_kQ9cHow5zRiLPQDqX02eWz_LphDdBVSSL4XbFiL7iJ4O8t70wt1T6bD2dTNbSJSJGBFH_Bv2FdJ2TBna2-mwZJiaI/s400/image.png" width="400" /></a></div><br />
Here is a plot showing three lines from the image (at columns 30, 80, 130 respectively): <br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCxgLnPNY88cgdIHLGFxhmzI-npRAKZO_9nuUeJCkBWHejo6pQzX1HlOZ7UnQ2BMUWvoiJGxhslkT4Uymh5vSmo0K-h74J45W4pK2l0hmeusLecqbSAHHodPTdeb28EGx1ThxfBsY2wRY/s1600/plots.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCxgLnPNY88cgdIHLGFxhmzI-npRAKZO_9nuUeJCkBWHejo6pQzX1HlOZ7UnQ2BMUWvoiJGxhslkT4Uymh5vSmo0K-h74J45W4pK2l0hmeusLecqbSAHHodPTdeb28EGx1ThxfBsY2wRY/s400/plots.png" width="400" /></a></div><br />
<br />
It would be interesting to add more results (from IronPython, Jython, pure C++, etc.). Feel free to check out the code from GitHub and experiment. Alternatively, add additional problems to the speed project on SciPy and make more comparisons. It is clear that you can squeeze that last ounce of speed out of Python by linking to machine code. It also seems clear that there is enough information in the vectorized NumPy expression to be able to produce fast machine code automatically --- even faster than is possible with an explicit loop. The PyPy project shows that generally-available JIT technology for Python is here, and the scientific computing community should grapple with how we will make use of it (and improve upon it). My prediction is that we can look forward to more of that in the coming months and years. <br />
<br />
</div>Travis Oliphanthttp://www.blogger.com/profile/04514536132317233988noreply@blogger.com14tag:blogger.com,1999:blog-68730239358084672.post-10844295303439476502011-06-20T23:23:00.000-07:002011-07-04T11:34:37.445-07:00Speeding up Python (NumPy, Cython, and Weave)<div dir="ltr" style="text-align: left;" trbidi="on">The high-level nature of Python makes it very easy to program, read, and reason about code. Many programmers report being more productive in Python. For example, Robert Kern once told me that "Python gets out of my way" when I asked him why he likes Python. Others express it as "Python fits your brain." My experience resonates with both of these comments. <br />
<br />
It is not rare, however, to need to do many calculations over a lot of data. No matter how fast computers get, there will always be cases where you still need the code to be as fast as you can get it. In those cases, I first reach for <a href="http://www.numpy.org/">NumPy</a> which provides high-level expressions of fast low-level calculations over large arrays. With NumPy's rich slicing and broadcasting capabilities, as well as its full suite of vectorized calculation routines, I can quite often do the number crunching I am trying to do with very little effort. <br />
<br />
Even with NumPy's fast vectorized calculations, however, there are still times when either the vectorization is too complex, or it uses too much memory. It is also sometimes just easier to express the calculation with a simple loop. For those parts of the application, there are two general approaches that work really well to get you back to compiled speeds: weave or Cython. <br />
<br />
<a href="http://www.scipy.org/Weave">Weave</a> is a sub-package of SciPy and allows you to inline arbitrary C or C++ code into an extension module that is dynamically loaded into Python and executed in-line with the rest of your Python code. The code is compiled and linked at run-time the very first time the code is executed. The compiled code is then cached on-disk and made available for immediate later use if it is called again. <br />
<br />
<a href="http://cython.org/">Cython</a> is an extension-module generator for Python that allows you to write Python-looking code (Python syntax with type declarations) that is then pre-compiled to an extension module for later dynamic linking into the Python run-time. Cython translates Python-looking code into "not-for-human-eyes" C-code that compiles to reasonably fast C-code. Cython has been gaining a lot of momentum in recent years as people who have never learned C, can use Cython to get C-speeds exactly where they want them starting from working Python code. Even though I feel quite comfortable in C, my appreciation for Cython has been growing over the past few years, and I know am an avid supporter of the Cython community and like to help it whenever I can. <br />
<br />
Recently I re-did the same example that Prabhu Ramachandran first created several years ago which is reported <a href="http://www.scipy.org/PerformancePython">here</a>. This example solves Laplace's equation over a 2-d rectangular grid using a simple iterative method. The code finds a two-dimensional function, u, where ∇<sup>2</sup> u = 0, given some fixed boundary conditions.<br />
<br />
<h2>Pure Python Solution</h2><br />
The pure Python solution is the following:<br />
<br />
<pre style="background: #f6f8ff; color: #000020;"><span style="color: #200080; font-weight: bold;">from</span> numpy <span style="color: #200080; font-weight: bold;">import</span> zeros
<span style="color: #200080; font-weight: bold;">from</span> scipy <span style="color: #200080; font-weight: bold;">import</span> weave
dx <span style="color: #308080;">=</span> <span style="color: green;">0.1</span>
dy <span style="color: #308080;">=</span> <span style="color: green;">0.1</span>
dx2 <span style="color: #308080;">=</span> dx<span style="color: #308080;">*</span>dx
dy2 <span style="color: #308080;">=</span> dy<span style="color: #308080;">*</span>dy
<span style="color: #200080; font-weight: bold;">def</span> py_update<span style="color: #308080;">(</span>u<span style="color: #308080;">)</span><span style="color: #308080;">:</span>
nx<span style="color: #308080;">,</span> ny <span style="color: #308080;">=</span> u<span style="color: #308080;">.</span>shape
<span style="color: #200080; font-weight: bold;">for</span> i <span style="color: #200080; font-weight: bold;">in</span> <span style="color: #e34adc;">xrange</span><span style="color: #308080;">(</span><span style="color: #008c00;">1</span><span style="color: #308080;">,</span>nx<span style="color: #308080;">-</span><span style="color: #008c00;">1</span><span style="color: #308080;">)</span><span style="color: #308080;">:</span>
<span style="color: #200080; font-weight: bold;">for</span> j <span style="color: #200080; font-weight: bold;">in</span> <span style="color: #e34adc;">xrange</span><span style="color: #308080;">(</span><span style="color: #008c00;">1</span><span style="color: #308080;">,</span> ny<span style="color: #308080;">-</span><span style="color: #008c00;">1</span><span style="color: #308080;">)</span><span style="color: #308080;">:</span>
u<span style="color: #308080;">[</span>i<span style="color: #308080;">,</span>j<span style="color: #308080;">]</span> <span style="color: #308080;">=</span> <span style="color: #308080;">(</span><span style="color: #308080;">(</span>u<span style="color: #308080;">[</span>i<span style="color: #308080;">+</span><span style="color: #008c00;">1</span><span style="color: #308080;">,</span> j<span style="color: #308080;">]</span> <span style="color: #308080;">+</span> u<span style="color: #308080;">[</span>i<span style="color: #308080;">-</span><span style="color: #008c00;">1</span><span style="color: #308080;">,</span> j<span style="color: #308080;">]</span><span style="color: #308080;">)</span> <span style="color: #308080;">*</span> dy2 <span style="color: #308080;">+</span>
<span style="color: #308080;">(</span>u<span style="color: #308080;">[</span>i<span style="color: #308080;">,</span> j<span style="color: #308080;">+</span><span style="color: #008c00;">1</span><span style="color: #308080;">]</span> <span style="color: #308080;">+</span> u<span style="color: #308080;">[</span>i<span style="color: #308080;">,</span> j<span style="color: #308080;">-</span><span style="color: #008c00;">1</span><span style="color: #308080;">]</span><span style="color: #308080;">)</span> <span style="color: #308080;">*</span> dx2<span style="color: #308080;">)</span> <span style="color: #308080;">/</span> <span style="color: #308080;">(</span><span style="color: #008c00;">2</span><span style="color: #308080;">*</span><span style="color: #308080;">(</span>dx2<span style="color: #308080;">+</span>dy2<span style="color: #308080;">)</span><span style="color: #308080;">)</span>
<span style="color: #200080; font-weight: bold;">def</span> calc<span style="color: #308080;">(</span>N<span style="color: #308080;">,</span> Niter<span style="color: #308080;">=</span><span style="color: #008c00;">100</span><span style="color: #308080;">,</span> func<span style="color: #308080;">=</span>py_update<span style="color: #308080;">,</span> args<span style="color: #308080;">=</span><span style="color: #308080;">(</span><span style="color: #308080;">)</span><span style="color: #308080;">)</span><span style="color: #308080;">:</span>
u <span style="color: #308080;">=</span> zeros<span style="color: #308080;">(</span><span style="color: #308080;">[</span>N<span style="color: #308080;">,</span> N<span style="color: #308080;">]</span><span style="color: #308080;">)</span>
u<span style="color: #308080;">[</span><span style="color: #008c00;">0</span><span style="color: #308080;">]</span> <span style="color: #308080;">=</span> <span style="color: #008c00;">1</span>
<span style="color: #200080; font-weight: bold;">for</span> i <span style="color: #200080; font-weight: bold;">in</span> <span style="color: #e34adc;">range</span><span style="color: #308080;">(</span>Niter<span style="color: #308080;">)</span><span style="color: #308080;">:</span>
func<span style="color: #308080;">(</span>u<span style="color: #308080;">,</span><span style="color: #308080;">*</span>args<span style="color: #308080;">)</span>
<span style="color: #200080; font-weight: bold;">return</span> u
</pre><br />
This code takes a very long time to run in order to converge to the correct solution. For a 100x100 grid, visually-indistinguishable convergence occurs after about 8000 iterations. The pure Python solution took an estimated 560 seconds (about 9 minutes) to finish (timed with <a href="http://scienceoss.com/test-the-speed-of-your-code-interactively-in-ipython/">IPython's %timeit</a> magic command). <br />
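<br />
For the curious, here is a minimal sketch of how such a timing can be gathered with the standard library instead of IPython. The module name <span style="font-family: "Courier New",Courier,monospace;">laplace_driver</span> is hypothetical --- save the code above under that name first: <br />
<br />
<pre style="background: #f6f8ff; color: #000020;">import timeit

# One full run of 8000 iterations on a 100x100 grid; this is the
# quantity reported in the timings throughout this post.
t = timeit.timeit("calc(100, Niter=8000, func=py_update)",
                  setup="from laplace_driver import calc, py_update",
                  number=1)
print("pure Python: %.1f seconds" % t)
</pre><br />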
<br />
<h2>NumPy Solution</h2><br />
Using NumPy, we can speed this code up significantly by using slicing and vectorized (automatic looping) calculations that replace the explicit loops in the Python-only solution. The NumPy update code is:<br />
<br />
<pre style="background: #f6f8ff; color: #000020;"><span style="color: #200080; font-weight: bold;">def</span> num_update<span style="color: #308080;">(</span>u<span style="color: #308080;">)</span><span style="color: #308080;">:</span>
u<span style="color: #308080;">[</span><span style="color: #008c00;">1</span><span style="color: #308080;">:</span><span style="color: #308080;">-</span><span style="color: #008c00;">1</span><span style="color: #308080;">,</span><span style="color: #008c00;">1</span><span style="color: #308080;">:</span><span style="color: #308080;">-</span><span style="color: #008c00;">1</span><span style="color: #308080;">]</span> <span style="color: #308080;">=</span> <span style="color: #308080;">(</span><span style="color: #308080;">(</span>u<span style="color: #308080;">[</span><span style="color: #008c00;">2</span><span style="color: #308080;">:</span><span style="color: #308080;">,</span><span style="color: #008c00;">1</span><span style="color: #308080;">:</span><span style="color: #308080;">-</span><span style="color: #008c00;">1</span><span style="color: #308080;">]</span><span style="color: #308080;">+</span>u<span style="color: #308080;">[</span><span style="color: #308080;">:</span><span style="color: #308080;">-</span><span style="color: #008c00;">2</span><span style="color: #308080;">,</span><span style="color: #008c00;">1</span><span style="color: #308080;">:</span><span style="color: #308080;">-</span><span style="color: #008c00;">1</span><span style="color: #308080;">]</span><span style="color: #308080;">)</span><span style="color: #308080;">*</span>dy2 <span style="color: #308080;">+</span>
<span style="color: #308080;">(</span>u<span style="color: #308080;">[</span><span style="color: #008c00;">1</span><span style="color: #308080;">:</span><span style="color: #308080;">-</span><span style="color: #008c00;">1</span><span style="color: #308080;">,</span><span style="color: #008c00;">2</span><span style="color: #308080;">:</span><span style="color: #308080;">]</span> <span style="color: #308080;">+</span> u<span style="color: #308080;">[</span><span style="color: #008c00;">1</span><span style="color: #308080;">:</span><span style="color: #308080;">-</span><span style="color: #008c00;">1</span><span style="color: #308080;">,</span><span style="color: #308080;">:</span><span style="color: #308080;">-</span><span style="color: #008c00;">2</span><span style="color: #308080;">]</span><span style="color: #308080;">)</span><span style="color: #308080;">*</span>dx2<span style="color: #308080;">)</span> <span style="color: #308080;">/</span> <span style="color: #308080;">(</span><span style="color: #008c00;">2</span><span style="color: #308080;">*</span><span style="color: #308080;">(</span>dx2<span style="color: #308080;">+</span>dy2<span style="color: #308080;">)</span><span style="color: #308080;">)</span>
</pre><br />
Using <b><span style="font-family: "Courier New",Courier,monospace;">num_update</span></b> as the calculation function reduced the time for 8000 iterations on a 100x100 grid to only 2.24 seconds (a 250x speed-up). Such speed-ups are not uncommon when using NumPy to replace Python loops where the inner loop is doing simple math on basic data-types.<br />
<br />
Quite often it is sufficient to stop there and move on to another part of the code-base. Even though you might be able to speed up this section of code more, it may not be the critical path anymore in your over-all problem. Programmer effort should be spent where more benefit will be obtained. Occasionally, however, it is essential to speed-up even this kind of code. <br />
<br />
Even though NumPy implements the calculations at compiled speeds, it is possible to get even faster code. This is mostly because NumPy needs to create temporary arrays to hold intermediate simple calculations in expressions like the average of adjacent cells shown above. If you were to implement such a calculation in C/C++ or Fortran, you would likely create a single loop with no intermediate temporary memory allocations and perform a more complex computation at each iteration of the loop.<br />
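<br />
As an aside, the numexpr package attacks exactly this temporary-array problem by compiling the expression string and evaluating it in cache-sized blocks. Here is a minimal sketch, assuming numexpr is installed (this variant was not part of my timings): <br />
<br />
<pre style="background: #f6f8ff; color: #000020;">import numexpr as ne

def ne_update(u):
    # Views into u (no copies); numexpr evaluates the expression
    # block-by-block, avoiding the full-size temporaries that plain
    # NumPy would allocate.  dx2 and dy2 are found in the enclosing scope.
    a, b = u[2:, 1:-1], u[:-2, 1:-1]
    c, d = u[1:-1, 2:], u[1:-1, :-2]
    u[1:-1, 1:-1] = ne.evaluate("((a + b)*dy2 + (c + d)*dx2) / (2*(dx2 + dy2))")
</pre><br />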
<br />
In order to get an optimized version of the update function, we need a machine-code implementation that Python can call. Of course, we could do this manually by writing the inner call in a compilable language and using Python's <a href="http://docs.python.org/extending/index.html#extending-index">extension facilities</a>. More simply, we can use Cython and Weave which do most of the heavy lifting for us. <br />
<br />
<h2>Cython solution</h2><br />
Cython is an extension-module writing language that looks a lot like Python except for optional type declarations for variables. These type declarations allow the Cython compiler to replace generic, highly dynamic Python code with specific and very fast compiled code that is then able to be loaded into the Python run-time dynamically. Here is the Cython code for the update function:<br />
<br />
<pre style="background: #f6f8ff; color: #000020;">cimport numpy as np
<span style="color: #200080; font-weight: bold;">def</span> cy_update<span style="color: #308080;">(</span>np<span style="color: #308080;">.</span>ndarray<span style="color: #308080;">[</span>double<span style="color: #308080;">,</span> ndim<span style="color: #308080;">=</span><span style="color: #008c00;">2</span><span style="color: #308080;">]</span> u<span style="color: #308080;">,</span> double dx2<span style="color: #308080;">,</span> double dy2<span style="color: #308080;">)</span><span style="color: #308080;">:</span>
cdef unsigned <span style="color: #e34adc;">int</span> i<span style="color: #308080;">,</span> j
<span style="color: #200080; font-weight: bold;">for</span> i <span style="color: #200080; font-weight: bold;">in</span> <span style="color: #e34adc;">xrange</span><span style="color: #308080;">(</span><span style="color: #008c00;">1</span><span style="color: #308080;">,</span>u<span style="color: #308080;">.</span>shape<span style="color: #308080;">[</span><span style="color: #008c00;">0</span><span style="color: #308080;">]</span><span style="color: #308080;">-</span><span style="color: #008c00;">1</span><span style="color: #308080;">)</span><span style="color: #308080;">:</span>
<span style="color: #200080; font-weight: bold;">for</span> j <span style="color: #200080; font-weight: bold;">in</span> <span style="color: #e34adc;">xrange</span><span style="color: #308080;">(</span><span style="color: #008c00;">1</span><span style="color: #308080;">,</span> u<span style="color: #308080;">.</span>shape<span style="color: #308080;">[</span><span style="color: #008c00;">1</span><span style="color: #308080;">]</span><span style="color: #308080;">-</span><span style="color: #008c00;">1</span><span style="color: #308080;">)</span><span style="color: #308080;">:</span>
u<span style="color: #308080;">[</span>i<span style="color: #308080;">,</span>j<span style="color: #308080;">]</span> <span style="color: #308080;">=</span> <span style="color: #308080;">(</span><span style="color: #308080;">(</span>u<span style="color: #308080;">[</span>i<span style="color: #308080;">+</span><span style="color: #008c00;">1</span><span style="color: #308080;">,</span> j<span style="color: #308080;">]</span> <span style="color: #308080;">+</span> u<span style="color: #308080;">[</span>i<span style="color: #308080;">-</span><span style="color: #008c00;">1</span><span style="color: #308080;">,</span> j<span style="color: #308080;">]</span><span style="color: #308080;">)</span> <span style="color: #308080;">*</span> dy2 <span style="color: #308080;">+</span>
<span style="color: #308080;">(</span>u<span style="color: #308080;">[</span>i<span style="color: #308080;">,</span> j<span style="color: #308080;">+</span><span style="color: #008c00;">1</span><span style="color: #308080;">]</span> <span style="color: #308080;">+</span> u<span style="color: #308080;">[</span>i<span style="color: #308080;">,</span> j<span style="color: #308080;">-</span><span style="color: #008c00;">1</span><span style="color: #308080;">]</span><span style="color: #308080;">)</span> <span style="color: #308080;">*</span> dx2<span style="color: #308080;">)</span> <span style="color: #308080;">/</span> <span style="color: #308080;">(</span><span style="color: #008c00;">2</span><span style="color: #308080;">*</span><span style="color: #308080;">(</span>dx2<span style="color: #308080;">+</span>dy2<span style="color: #308080;">)</span><span style="color: #308080;">)</span>
</pre><br />
This code looks very similar to the original Python-only implementation except for the additional type-declarations. Notice that even NumPy arrays can be declared with Cython and Cython will correctly translate Python element selection into fast memory-access macros in the generated C code. When this function was used for each iteration in the inner calculation loop, the 8000 iterations on a 100x100 grid took only 1.28 seconds.<br />
<br />
For completeness, the following shows the contents of the setup.py file that was also created in order to produce a compiled-module where the cy_update function lived.<br />
<br />
<pre style="background: #f6f8ff; color: #000020;"><span style="color: #200080; font-weight: bold;">from</span> distutils<span style="color: #308080;">.</span>core <span style="color: #200080; font-weight: bold;">import</span> setup
<span style="color: #200080; font-weight: bold;">from</span> distutils<span style="color: #308080;">.</span>extension <span style="color: #200080; font-weight: bold;">import</span> Extension
<span style="color: #200080; font-weight: bold;">from</span> Cython<span style="color: #308080;">.</span>Distutils <span style="color: #200080; font-weight: bold;">import</span> build_ext
<span style="color: #200080; font-weight: bold;">import</span> numpy
ext <span style="color: #308080;">=</span> Extension<span style="color: #308080;">(</span><span style="color: #1060b6;">"laplace"</span><span style="color: #308080;">,</span> <span style="color: #308080;">[</span><span style="color: #1060b6;">"laplace.pyx"</span><span style="color: #308080;">]</span><span style="color: #308080;">,</span>
include_dirs <span style="color: #308080;">=</span> <span style="color: #308080;">[</span>numpy<span style="color: #308080;">.</span>get_include<span style="color: #308080;">(</span><span style="color: #308080;">)</span><span style="color: #308080;">]</span><span style="color: #308080;">)</span>
setup<span style="color: #308080;">(</span>ext_modules<span style="color: #308080;">=</span><span style="color: #308080;">[</span>ext<span style="color: #308080;">]</span><span style="color: #308080;">,</span>
cmdclass <span style="color: #308080;">=</span> <span style="color: #406080;">{</span><span style="color: #1060b6;">'build_ext'</span><span style="color: #308080;">:</span> build_ext<span style="color: #406080;">}</span><span style="color: #308080;">)</span>
</pre><br />
The extension module was then built using the command: <b style="font-family: "Courier New",Courier,monospace;">python setup.py build_ext --inplace</b><br />
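<br />
Once built, the compiled function can be dropped straight into the <b style="font-family: "Courier New",Courier,monospace;">calc</b> driver from the pure Python section. A minimal sketch (not part of the original timing script): <br />
<br />
<pre style="background: #f6f8ff; color: #000020;">from laplace import cy_update

# cy_update takes dx2 and dy2 explicitly, so thread them through
# the args parameter of calc:
u = calc(100, Niter=8000, func=cy_update, args=(dx2, dy2))
</pre><br />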
<br />
<h2>Weave solution</h2><br />
An older, but still useful, approach to speeding up code is to use Weave to embed a C or C++ implementation of the algorithm directly into the Python program. Weave is a module that surrounds the bit of C or C++ code you write with a template to create, on the fly, an extension module that is compiled and then dynamically loaded into the Python run-time. Weave has a caching mechanism, so different strings or different types of inputs lead to a new extension module being created, compiled, and loaded. The first time code using Weave runs, the compilation has to take place. Subsequent runs of the same code will load the cached extension module and run the machine code.<br />
<br />
For this particular case, an update routine using weave looks like:<br />
<br />
<pre style="background: #f6f8ff; color: #000020;"><span style="color: #200080; font-weight: bold;">def</span> weave_update<span style="color: #308080;">(</span>u<span style="color: #308080;">)</span><span style="color: #308080;">:</span>
code <span style="color: #308080;">=</span> <span style="color: #595979;">"""</span>
<span style="color: #595979;"> int i, j;</span>
<span style="color: #595979;"> for (i=1; i<Nu[0]-1; i++) {</span>
<span style="color: #595979;"> for (j=1; j<Nu[1]-1; j++) {</span>
<span style="color: #595979;"> U2(i,j) = ((U2(i+1, j) + U2(i-1, j))*dy2 + \</span>
<span style="color: #595979;"> (U2(i, j+1) + U2(i, j-1))*dx2) / (2*(dx2+dy2));</span>
<span style="color: #595979;"> }</span>
<span style="color: #595979;"> }</span>
<span style="color: #595979;"> """</span>
weave<span style="color: #308080;">.</span>inline<span style="color: #308080;">(</span>code<span style="color: #308080;">,</span> <span style="color: #308080;">[</span><span style="color: #1060b6;">'u'</span><span style="color: #308080;">,</span> <span style="color: #1060b6;">'dx2'</span><span style="color: #308080;">,</span> <span style="color: #1060b6;">'dy2'</span><span style="color: #308080;">]</span><span style="color: #308080;">)</span>
</pre><br />
The inline function takes a string of C or C++ code plus a list of variable names that will be pushed from the Python namespace into the compiled code. With this code string and list of variables, inline either loads and executes a function in a previously-created extension module (if the same string and variable types have been seen before) or else creates a new extension module before compiling, loading, and executing the code.<br />
<br />
Notice that weave defines special macros so that <b><span style="font-family: "Courier New",Courier,monospace;">U2</span></b> allows referencing the elements of the 2-d array <b style="font-family: "Courier New",Courier,monospace;">u</b> using simple expressions. Weave also defines the special C-array of integers <b><span style="font-family: "Courier New",Courier,monospace;">Nu</span></b> to contain the shape of the <b><span style="font-family: "Courier New",Courier,monospace;">u</span></b> array. There are also special macros similarly defined to access the elements of array u had it been a 1-, 3-, or 4-dimensional array (<b><span style="font-family: "Courier New",Courier,monospace;">U1</span></b>, <b style="font-family: "Courier New",Courier,monospace;">U3</b>, and <b style="font-family: "Courier New",Courier,monospace;">U4</b>). Although not used in this snippet of code, the C-array <b style="font-family: "Courier New",Courier,monospace;">Su</b> containing the strides in each dimension and the integer <b style="font-family: "Courier New",Courier,monospace;">Du</b> defining the number of dimensions of the array are both also defined. <br />
<br />
Using the <b style="font-family: "Courier New",Courier,monospace;">weave_update</b> function, 8000 iterations on a 100x100 grid took only 1.02 seconds. This was the fastest implementation of all of the methods used. Knowing a little C and having a compiler on hand can certainly speed up critical sections of code in a big way.<br />
<br />
<h2>Faster Cython solution (Update)</h2><br />
After I originally published this post, I received some great feedback in the comments section that encouraged me to add some compiler directives to the Cython solution in order to get an even faster result. I was also reminded about pyximport and given example code to make it work more easily. By telling Cython to skip bounds checking and negative-index wraparound at each iteration of the loop, it generated even faster C code. To the top of my previous Cython code, I added a few lines: <br />
<br />
<pre style="background: #f6f8ff; color: #000020;"><span style="color: #595979;">#cython: boundscheck=False</span>
<span style="color: #595979;">#cython: wraparound=False</span>
</pre><br />
I then saved this new file as _laplace.pyx, and added the following lines to the top of the Python file that was running the examples: <br />
<br />
<pre style="background: #f6f8ff; color: #000020;"><span style="color: #200080; font-weight: bold;">import</span> pyximport
<span style="color: #200080; font-weight: bold;">import</span> numpy as np
pyximport<span style="color: #308080;">.</span>install<span style="color: #308080;">(</span>setup_args<span style="color: #308080;">=</span><span style="color: #406080;">{</span><span style="color: #1060b6;">'include_dirs'</span><span style="color: #308080;">:</span><span style="color: #308080;">[</span>np<span style="color: #308080;">.</span>get_include<span style="color: #308080;">(</span><span style="color: #308080;">)</span><span style="color: #308080;">]</span><span style="color: #406080;">}</span><span style="color: #308080;">)</span>
<span style="color: #200080; font-weight: bold;">from</span> _laplace <span style="color: #200080; font-weight: bold;">import</span> cy_update as cy_update2
</pre><br />
<br />
This provided an update function <b style="font-family: "Courier New",Courier,monospace;">cy_update2</b> that resulted in the very fastest implementation (943 ms) for 8000 iterations of a 100x100 grid. <br />
<br />
<h2>Summary</h2><br />
The following table summarizes the results, which were all obtained on a 2.66 GHz Intel Core i7 MacBook Pro with 8 GB of 1067 MHz DDR3 memory. The last column shows each run time relative to the NumPy implementation, so values below 1 are faster than NumPy. <br />
<br />
<table border="1"><tbody>
<tr><th>Method</th> <th>Time (sec)</th> <th>Time Relative to NumPy</th> </tr>
<tr> <td>Pure Python</td> <td>560</td> <td>250</td> </tr>
<tr> <td>NumPy</td> <td>2.24</td> <td>1</td> </tr>
<tr> <td>Cython</td> <td>1.28</td> <td>0.57</td> </tr>
<tr> <td>Weave</td> <td>1.02</td> <td>0.45</td> </tr>
<tr> <td>Faster Cython</td> <td>0.94</td> <td>0.42</td> </tr>
</tbody></table><br />
Clearly, when it comes to heavy number crunching, pure Python is not really an option. However, perhaps somewhat surprisingly, NumPy can get you most of the way to compiled speeds through vectorization. In situations where you still need the last ounce of speed in a critical section, or where vectorizing the solution either requires a PhD in NumPy-ology or costs too much memory overhead, you can reach for Cython or Weave. If you already know C/C++, then Weave is a simple and speedy solution. If, however, you are not already familiar with C, then you may find Cython to be exactly what you are looking for to get the speed you need out of Python. <br />
<br />
</div>Travis Oliphanthttp://www.blogger.com/profile/04514536132317233988noreply@blogger.com147tag:blogger.com,1999:blog-68730239358084672.post-26028981532204961142011-06-18T16:34:00.000-07:002011-06-18T21:42:28.663-07:00Python Enhancement Proposals I Wish I Had Time to Champion<div dir="ltr" style="text-align: left;" trbidi="on"><div dir="ltr" style="text-align: left;" trbidi="on">Today I was trying to make progress on a few different NumPy enhancement proposals and ended up frustrated knowing that come Monday morning, I would not have any time to follow up on them. Managing a growing consulting company takes a lot of time (<a href="http://www.enthought.com">Enthought</a> is over 30 people now and growing to near 50 by the end of the year). There are countless meetings devoted to new hires, program development, project reviews, customer relations, budgeting, and sales. I also take a direct role in delivering on training and select consulting projects. Someday I may get a chance to write something of use about things I've learned along the way, but that is for another day (and likely another blog). This post is to get a few ideas I've been sitting on written down in the hopes that somebody might read them and get excited about contributing. At the very least, anybody who reads this post will know (at least some of) my current opinion about a few technical proposals. <br />
<br />
About a month ago, I had the privilege of organizing a "data-array" summit in which several people in the NumPy and SciPy community came together at the Enthought offices to discuss some ideas related to how to improve data analysis with the NumPy and SciPy stack. We spent 3-days thinking and brainstorming which led to many fruitful discussions. I expect that some of the ideas generated will result in important and interesting changes to NumPy and SciPy over the coming months and years. More information about the summit can be learned by listening to the <a href="http://inscight.org/2011/05/18/episode_13/">relevant inSCIght podcast</a>. <br />
<br />
It's actually a very exciting time to get involved in the SciPy community as Python takes its place as one of the approaches people will be using to analyze all the data that we are generating. In that spirit, I wanted to express a few of what I consider to be important enhancements that are needed to Python and NumPy. <br />
<br />
I will start with Python and leave NumPy to another post. There are really three big missing features that would benefit those of us who use Python for technical computing. Unfortunately, I don't think there is enough representation of the Python-for-science crowd in the current batch of Python developers. This is not due to any exclusion by the Python developers, who have always been very accommodating. It is simply due to the scarcity of people who understand the SciPy perspective and use-cases and are also willing to engage with developers in the Python world. Those (like Mark Dickinson) who cross the chasm are real gems. <br />
<br />
If anyone has an interest in shepherding a PEP in any of the following directions, you will have my '+1' support (and any community-organizing that I can muster to help you). Honestly, if these things were put into Python 3, there would be a serious motivation to move to Python 3 for the scientific community (which is otherwise going to lag in the great migration). <br />
<br />
<h1>Python Enhancements I Want </h1><br />
<h2>Adding additional operators</h2><br />
We need additional operators to easily represent at least matrix multiplication, matrix power, and matrix solve. I could possibly back off on the last two if we at least had matrix multiplication. This should have been done a long time ago. If I had been able to spare the time, I would have pushed to hold off porting NumPy to Python 3 until we got matrix multiplication operators. Yes, I know that blackmail usually backfires, and thankfully Pauli Virtanen and Charles Harris acted before I even had a chance to suggest such a thing :-). But, seriously, we need this support in the language. <br />
<br />
The reasons are quite simple: <br />
<ul style="text-align: left;"><li>Syntax matters: writing <b><span style="font-family: "Courier New",Courier,monospace;">d = numpy.solve(numpy.dot(numpy.dot(a,b),c), x)</span></b> is a whole lot more ugly than something like <b><span style="font-family: "Courier New",Courier,monospace;">d = (a*b*c) \ x</span></b>. If the former is fine, then we should all just go back to writing LISP. The point of having nice syntax is to minimize the line-noise and mental overhead of mapping the mental idea to working code. For Python to be used with mental efficiency in technical computing you need to write expressions involving higher-order operations like this all the time. </li>
<li>Right now, the recommended way to do this is to convert a, b, c, and x to "matrices", perform the computation in a nice expression, and then convert back to arrays. This is clunky at best (see the sketch after this list).</li>
</ul>I've been back and forth on this for 13 years and can definitively say that we would be much better off in Python if we had a matrix multiplication operator. Please, please, can we get one! The relevant PEPS where this has been discussed are: <a href="http://www.python.org/dev/peps/pep-0211/">PEP 211</a> and <a href="http://www.python.org/dev/peps/pep-0225/">PEP 225</a>. I think I like having more than just one operator added (ala PEP 225, but the subject would have to be re-visited by a brave soul).<br />
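<br />
To make the second bullet concrete, here is a sketch of the two workarounds available today (array contents are arbitrary): <br />
<br />
<pre style="background: #f6f8ff; color: #000020;">import numpy as np

a = np.random.rand(5, 5)
b = np.random.rand(5, 5)
c = np.random.rand(5, 5)
x = np.random.rand(5)

# Functional style: correct, but the intent is buried in the nesting.
d = np.linalg.solve(np.dot(np.dot(a, b), c), x)

# Matrix round-trip style: infix operators, but clunky conversions
# on the way in and on the way out.
A, B, C = np.matrix(a), np.matrix(b), np.matrix(c)
d2 = np.asarray(np.linalg.solve(A * B * C, np.matrix(x).T)).ravel()
</pre><br />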
<br />
<h2>Overloadable Boolean Operations</h2><br />
<a href="http://www.python.org/dev/peps/pep-0335/">PEP 335</a> was a fantastic idea. I really wish we had the ability to overload <b><span style="font-family: "Courier New",Courier,monospace;">and</span></b>, <b><span style="font-family: "Courier New",Courier,monospace;">or</span></b>, and <b style="font-family: "Courier New",Courier,monospace;">not</b>. Among other things, this would allow the very nice syntax so that <b><span style="font-family: "Courier New",Courier,monospace;">mask = 2<a<10</span></b> would generate an array of True and False values when a is an array. Currently, to generate this same mask you have to do <b><span style="font-family: "Courier New",Courier,monospace;">(2<a)&(a<4)</span></b>. The PEP has other important use-cases as well. It would be excellent if this PEP were re-visited, championed, and put into Python 3. <br />
<br />
<h2>Allowing slice object literals outside of []</h2><br />
Python's syntax allows construction of a slice object inside brackets so that one can write <span style="font-family: "Courier New",Courier,monospace;">a[1:3]</span>, which is equivalent to <span style="font-family: "Courier New",Courier,monospace;">a.__getitem__(slice(1,3))</span>. Many times over the years, I have wanted to be able to specify a slice object using the start:stop:step syntax outside of getitem. Even if Python's parser were extended only to allow a slice literal as the input to a function, it would be an improvement. The biggest wart this would remove is the (ab)use of getitem to return new ranges and grids in NumPy (see <span style="font-family: "Courier New",Courier,monospace;">mgrid</span> and <span style="font-family: "Courier New",Courier,monospace;">r_</span> in NumPy, sketched below, for what I mean). I would prefer that these were functions, but I would need <span style="font-family: "Courier New",Courier,monospace;">mgrid(1:5, 1:5)</span> to work. <br />
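<br />
For the curious, a quick sketch of the bracket (ab)use in question: <br />
<br />
<pre style="background: #f6f8ff; color: #000020;">import numpy as np

# Slice syntax only works inside brackets, so these helpers are
# implemented as objects with a custom __getitem__:
grid_x, grid_y = np.mgrid[1:5, 1:5]   # two 4x4 coordinate arrays
r = np.r_[0:10:2]                     # array([0, 2, 4, 6, 8])

# What I would prefer (not valid Python today):
#   grid_x, grid_y = mgrid(1:5, 1:5)
</pre><br />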
<br />
There was a PEP for range literals (<a href="http://www.python.org/dev/peps/pep-0204/">PEP 204</a>) once upon a time. There were some interesting aspects about that proposal, but frankly I don't want the slice syntax to produce ranges. I would just be content for it always to produce slice objects --- just allow it outside of brackets.<br />
<br />
While I started by lamenting my lack of time to implement NumPy enhancements, I will leave the discussion of the NumPy enhancements I'm dreaming about to another post. I would be thrilled if somebody took up the charge to push any of these Python enhancements in Python 3. If Python 3 ends up with any of them, it would be a huge motivation for me to migrate to Python 3 entirely. </div></div>Travis Oliphanthttp://www.blogger.com/profile/04514536132317233988noreply@blogger.com7tag:blogger.com,1999:blog-68730239358084672.post-73469153533738178892011-02-11T23:54:00.000-08:002011-06-18T21:25:32.602-07:00MVSDIST in SciPyMy pathway to probability theory was a little tortured. Like most people, I sat through my first college-level "Statistics" class fairly befuddled. I was good at math and understood calculus pretty well. As a result, I did well in the course, but didn't feel that I really understood what was going on. I took a course that used as its text <a href="http://www.amazon.com/Probability-Random-Variables-Stochastic-Processes/dp/0070484775">this book by Papoulis</a>. Now the text is a great reference for me, but at the time I didn't really understand the point of most of the more theoretical ideas. It wasn't until later, after I had studied <a href="http://en.wikipedia.org/wiki/Measure_%28mathematics%29">measure theory</a> and understood more of the implications of the set-theoretic studies of <a href="http://en.wikipedia.org/wiki/Georg_Cantor">Georg Cantor</a>, that I began to see the significance of a Borel algebra and why some of the complexity was necessary from a foundational perspective. <br />
<br />
I still believe, however, that diving into the details of measure theory is over-kill for introducing probability theory. I've been convinced by <a href="http://omega.albany.edu:8008/JaynesBook.html">E.T. Jaynes</a> that probability theory is an essential component of any education and as such should be presented in multiple ways at multiple times and certainly not thrown at you as "just an application of measure theory" the way it sometimes is in math courses. I think this is improving, but there is still work to do. <br />
<br />
What typically still happens is that people get their "taste" of probability theory (or worse, their taste of "statistics") and then move on not ever really assimilating the lessons in their life. The trouble is everyone must deal with uncertainty. Our brains are <a href="http://en.wikipedia.org/wiki/Confirmation_bias">hard-wired to deal with it</a> --- often in ways that can be counter-productive. At its core, probability theory is just a consistent and logical way to deal with uncertainty using real numbers. In fact, it can <a href="http://en.wikipedia.org/wiki/Cox%27s_theorem">be argued</a> that it is the <b>only</b> way to deal with uncertainty. <br />
<br />
I've done a bit of experimentation over the years and dealt with a lot of data (MRI, ultrasound, electrical impedance data). In probability theory, I found a framework for understanding what the data really tells me which led me to spend several years studying inverse problems. There are a lot of problems that can be framed as inverse problems. Basically, inverse problem theory can be applied to any problem where you have data and you want to understand what the data tells you. To apply probability theory to solve an inverse problem you have to have some model that determines how what you want to know leads to the data you've got. Then, you basically invert the model. <a href="http://en.wikipedia.org/wiki/Bayes%27_theorem">Bayes' theorem</a> provides a beautiful framework for this inversion. <br />
<br />
The result of this Bayesian approach to inverse problems, though, is not just a number. It is explicitly a probability density function (or probability mass function). In other words, the result of a proper solution to an inverse problem is a random variable, or probability distribution. Seeing the result of any inverse problem as a random variable changes the way you think about drawing conclusions from data. <br />
<br />
Think about the standard problem of fitting a line to data. You plug-and-chug using a calculator or a spreadsheet (or a function call in <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html">NumPy</a>), and you get two numbers (the slope and intercept). If you properly understand inverse problems as requiring the production of a random variable, then you will not be satisfied with just these numbers. You will want to know: how certain am I about these numbers? How much should I trust them? What if I am going to make a decision on the basis of these numbers? (Speaking of making decisions, someday I would like to write about how probability theory is also under-utilized in standard business financial projections and business decisions.) <br />
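<br />
For instance, a minimal sketch of the standard plug-and-chug fit (the data values here are made up): <br />
<br />
<pre style='color:#000000;background:#ffffff;'>import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

# polyfit hands back point estimates only:
slope, intercept = np.polyfit(x, y, 1)

# Nothing in the result says how much these two numbers
# should be trusted.
</pre><br />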
<br />
Some statisticians when faced with this regression problem will report the "goodness" of fit and feel satisfied, but as one who sees the power and logical simplicity of Bayesian inverse theory, I'm not satisfied by such an answer. What I want is the joint probability distribution for slope and intercept based on the data. A lot of common regression techniques do not provide this. I'm not going to go into details regarding the historical reasons for why this is. You can <a href="http://www.google.com/search?q=bayesian+vs+frequentist">use google</a> to explore some of the details if you are interested. A lot of it comes down to the myth of objectivity and the desire to eliminate the need for a prior which Bayesian inverse theory exposes. <br />
<br />
As a once very active contributor to SciPy (now an occasional contributor who is still very interested in its progress), I put a little utility into the scipy.stats package a few years ago for estimating the mean, standard deviation, and variance from data; it expresses my worldview a little bit. I recently updated this utility and created a function called mvsdist. This function finally returns random variable objects (as any good inverse-problem solution should!) for the <b>M</b>ean, <b>V</b>ariance, and <b>S</b>tandard deviation derived from a vector of data. The assumptions are 1) the data were all sampled from a random variable with the same mean and variance, 2) the standard deviation and variance are "scale" parameters, and 3) non-informative (improper) priors. <br />
<br />
The details of the derivation are recorded in <a href="http://hdl.handle.net/1877/438">this paper</a>. Any critiques of this paper are welcome as I never took the time to try and get formal review for it (I'm not sure where I would have submitted it for one --- and I'm pretty sure there is a paper out there that already expresses all of this, anyway). <br />
<br />
It is pretty simple to get started playing with mvsdist (assuming you have SciPy 0.9 installed). This function is meant to be called any time you have a bunch of data and you want to "compute the mean" or "find the standard deviation." You collect the data into a list or NumPy array of numbers and pass this into the mvsdist function: <br />
<br />
<pre style='color:#000000;background:#ffffff;'><span style='color:#808030; '>></span><span style='color:#808030; '>></span><span style='color:#808030; '>></span> <span style='color:#800000; font-weight:bold; '>from</span> scipy<span style='color:#808030; '>.</span>stats <span style='color:#800000; font-weight:bold; '>import</span> mvsdist
<span style='color:#808030; '>></span><span style='color:#808030; '>></span><span style='color:#808030; '>></span> data <span style='color:#808030; '>=</span> <span style='color:#808030; '>[</span><span style='color:#008c00; '>9</span><span style='color:#808030; '>,</span> <span style='color:#008c00; '>12</span><span style='color:#808030; '>,</span> <span style='color:#008c00; '>10</span><span style='color:#808030; '>,</span> <span style='color:#008c00; '>8</span><span style='color:#808030; '>,</span> <span style='color:#008c00; '>6</span><span style='color:#808030; '>,</span> <span style='color:#008c00; '>11</span><span style='color:#808030; '>,</span> <span style='color:#008c00; '>7</span><span style='color:#808030; '>]</span>
<span style='color:#808030; '>></span><span style='color:#808030; '>></span><span style='color:#808030; '>></span> mean<span style='color:#808030; '>,</span> var<span style='color:#808030; '>,</span> std <span style='color:#808030; '>=</span> mvsdist<span style='color:#808030; '>(</span>data<span style='color:#808030; '>)</span>
</pre><br />
This returns three distribution objects which I have intentionally named mean, var, and std because they represent the estimates of mean, variance, and standard-deviation of the data. Because they are estimates, they are not just numbers, but instead are (frozen) <a href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.html">probability distribution objects</a>. These objects have methods that let you evaluate the probability density function: <font face="courier">.pdf(x)</font>, compute the cumulative density function: <font face="courier">.cdf(x)</font>, generate random samples drawn from the distribution: <font face="courier">.rvs(size=N)</font>, determine an interval that contains some percentage of the random draws from this distribution: <font face="courier">.interval(alpha)</font>, and calculate simple statistics: <font face="courier">.stats(), .mean(), .std(), .var()</font>. <br />
<br />
In this case, consider the following example: <br />
<br />
<pre style='color:#000000;background:#ffffff;'><span style='color:#808030; '>></span><span style='color:#808030; '>></span><span style='color:#808030; '>></span> mean<span style='color:#808030; '>.</span>interval<span style='color:#808030; '>(</span><span style='color:#008000; '>0.90</span><span style='color:#808030; '>)</span>
<span style='color:#808030; '>(</span><span style='color:#008000; '>7.4133999449331132</span><span style='color:#808030; '>,</span> <span style='color:#008000; '>10.586600055066887</span><span style='color:#808030; '>)</span>
<span style='color:#808030; '>></span><span style='color:#808030; '>></span><span style='color:#808030; '>></span> mean<span style='color:#808030; '>.</span>mean<span style='color:#808030; '>(</span><span style='color:#808030; '>)</span>
<span style='color:#008000; '>9.0</span>
<span style='color:#808030; '>></span><span style='color:#808030; '>></span><span style='color:#808030; '>></span> mean<span style='color:#808030; '>.</span>std<span style='color:#808030; '>(</span><span style='color:#808030; '>)</span>
<span style='color:#008000; '>0.99999999999999989</span>
<span style='color:#808030; '>></span><span style='color:#808030; '>></span><span style='color:#808030; '>></span> std<span style='color:#808030; '>.</span>interval<span style='color:#808030; '>(</span><span style='color:#008000; '>0.90</span><span style='color:#808030; '>)</span>
<span style='color:#808030; '>(</span><span style='color:#008000; '>1.4912098929401241</span><span style='color:#808030; '>,</span> <span style='color:#008000; '>4.137798046658852</span><span style='color:#808030; '>)</span>
<span style='color:#808030; '>></span><span style='color:#808030; '>></span><span style='color:#808030; '>></span> std<span style='color:#808030; '>.</span>mean<span style='color:#808030; '>(</span><span style='color:#808030; '>)</span>
<span style='color:#008000; '>2.4869681414837035</span>
<span style='color:#808030; '>></span><span style='color:#808030; '>></span><span style='color:#808030; '>></span> std<span style='color:#808030; '>.</span>std<span style='color:#808030; '>(</span><span style='color:#808030; '>)</span>
<span style='color:#008000; '>0.90276766847572409</span>
</pre><br />
Notice that once we have the probability distribution, we can report many things about the estimate: not only the estimate itself, but also the answer to any question we might have regarding its uncertainty. Often, we may want to visualize the probability density function, as is shown below for the standard deviation estimate and the mean estimate. <br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3nWeQ65oZmd8VfCZTck4Mmxx0GvulXGokDUf0kjRZguQozVzuN4gve6BotGXZzhK5dsxTYdw8M7699Th85cFj-Kbxl7G6Inh3zO9oPCn_NI9kWjv2mYCr9CtCVvSx0XK_hVoNenX6qrk/s1600/myfig.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="300" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3nWeQ65oZmd8VfCZTck4Mmxx0GvulXGokDUf0kjRZguQozVzuN4gve6BotGXZzhK5dsxTYdw8M7699Th85cFj-Kbxl7G6Inh3zO9oPCn_NI9kWjv2mYCr9CtCVvSx0XK_hVoNenX6qrk/s400/myfig.png" /></a></div><br />
<br />
<br />
It is not always easy to solve an inverse problem by providing the full probability distribution object (especially in multiple dimensions). But, when it's possible, it really does provide for a more thorough understanding of the problem. I'm very interested in SciPy growing more of these kinds of estimator approaches where possible.Travis Oliphanthttp://www.blogger.com/profile/04514536132317233988noreply@blogger.com0tag:blogger.com,1999:blog-68730239358084672.post-26275291047862885582010-11-30T15:10:00.000-08:002011-06-18T21:25:25.046-07:00Zen of NumPyWhile I was on-site working for a client, one of the developers I worked with would begin each day with a brief discussion of one of the tenets from the "Zen of Python." For those who are not familiar with this little pearl of Python goodness, you can find the "Zen of Python" as an Easter egg inside any Python distribution:<br />
<br />
<div style="background: #c0e0ff; overflow:auto;width:auto;color:black;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;"><pre style="margin: 0; line-height: 125%"><span style="color: #303030">>>></span> <span style="color: #008000; font-weight: bold">import</span> <span style="color: #0e84b5; font-weight: bold">this</span>
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
</pre></div><br />
The Zen of Python is often quoted from one Python user to another in trying to communicate something of the essence of what makes programming in Python different. While we were discussing one of the points, one of my co-workers suggested that there should be a "Zen of NumPy". This wasn't the first time I had heard that suggestion. Actually, David Morrill (author of Traits) was the first person who suggested there should be a book about the "Zen of NumPy." I totally agree with him. The only problem is that everybody involved with NumPy has apparently been too busy to write one :-)<br />
<br />
With this idea in my mind, when it came time to give a talk on NumPy at the New York Python Meetup group in Manhattan, I decided to create a first-draft of the Zen of NumPy. The phrases are included on one slide in the deck shared <a href="http://www.slideshare.net/enthought/talk-at-nyc-python-meetup-group">here</a>. <br />
<br />
I'm interested in feedback on these before proposing them for placement as <span style="font-family: "Courier New",Courier,monospace;">numpy.this</span>.<br />
<br />
Here is my attempt at a "Zen of NumPy":<br />
<br />
<pre>Strided is better than scattered
Contiguous is better than strided
Descriptive is better than imperative (use data-types)
Array-oriented is often better than object-oriented
Broadcasting is a great idea -- use where possible
Vectorized is better than an explicit loop
Unless it’s complicated --- then use numexpr, weave, or Cython
Think in higher dimensions
</pre><br />
I think there are useful edits as well as more statements that could be added to this list. Your feedback is welcome.Travis Oliphanthttp://www.blogger.com/profile/04514536132317233988noreply@blogger.com6tag:blogger.com,1999:blog-68730239358084672.post-13278977970125561022010-11-19T19:57:00.000-08:002010-11-20T12:59:19.515-08:00A New BlogLately I have been finding a need to have a voice --- an authentic voice. A voice, which I occasionally expressed in the days when I had the time to be more active on open source mailing lists (SciPy, NumPy, and even Python itself). When I was younger, I didn't have as many endearing entanglements to the future that depend on my present. As a result, I could spend much time pursuing efforts that gave me a tremendous sense of accomplishment. <br />
<br />
For as long as I can remember, I have been driven by discovery. Much to their annoyance, I would constantly ask my parents and 9 siblings "Why?" I used to be quite proud of myself as they would relate these stories of my inquisitive childhood at family gatherings. My particular combination of infused biochemistry that led to my knowledge addiction certainly drove most pursuits during my formative years, and this has had a strong impact on my life. <br />
<br />
During my nearly 40 years, however, I have encountered an impressive cadre of awe-inspiring people, each uniquely different. This has led me to conclude that what matters is not the particular physical circumstance I find myself in, but rather the particular use I am making of it. Do I pursue an agenda that barely extends beyond my internal neurobiology, or do I use my combination of skills and knowledge to seek a wider consistency that can harmonize with a beautifully complex society? <br />
<br />
Earlier tonight, I listened to technology leaders and entrepreneurs share their views of what society would be like if their respective companies were wildly successful. I heard this message in a stunning lecture hall in Peterhouse at Cambridge University. While they each brought a distinct perspective, their unifying message was the power of technology to change the world. <br />
<br />
Search for "Silicon valley comes to Cambridge" in a few days to get a summary and possibly even video of the talks. Megan Smith from Google (www.google.org) spoke of the power of big data to solve social injustices such as the sexual exploitation of children. Reid Hoffman, co-founder of LinkedIn, spoke of the power of inter-connectedness to solve big problems by bringing the right people together quickly. <br />
<br />
Other people spoke and gave interesting perspectives, including Mike Lynch, founder of <a href="http://www.autonomy.com/">Autonomy</a>, who gave a wonderful talk on the importance of meaningful interaction with data, so that our lives are enhanced and not enslaved by the explosion of data and technology. He also paid tribute to Thomas Bayes. Looking at his site, I noticed that he gives similar props to Claude Shannon. I'm already impressed. These are two thinkers who presented important concepts that remain under-appreciated. <br />
<br />
I do think that what people think is important. The ideas we carry in our heads are critical. It is these ideas which drive our necessarily individual pursuits and can lead to disharmony. I like to pass along useful information, colored of course by my own experiences and perspective, in the simple and perhaps naive hope that sustainable, lasting solutions can be discovered. <br />
<br />
Most of my posts will be technical, as I am hoping to use this forum as a way to write about the thoughts I am having in my own attempt to hone and pare them. In particular, most of these posts will be about technology that I am involved with or have some exposure to. Upcoming posts include "The Zen of NumPy", "7 Heresies of Technical Computing", and "What I've learned from SciPy and Open Source".<br />
<br />
If you happen to come across these musings, your feedback is welcome. I would love to hear about your experiences with any thoughts that are covered in my posts. <br />
<br />
-TravisTravis Oliphanthttp://www.blogger.com/profile/04514536132317233988noreply@blogger.com1