[bip] Blog post on bioinformatics and Python

Wed Sep 17 13:15:00 PDT 2008

On 17-Sep-08, at 12:11 PM, Andrew Dalke wrote:

> On Sep 17, 2008, at 5:57 PM, Ryan Raaum wrote:
>> 1. It is monolithic. It is like one of those honking huge Swiss Army
>> knives. It has a big knife, a small knife, a saw, a scissor, a fork, a
>> spoon, a magnifying glass, a toothpick, a small shovel, a beach
>> umbrella, and more!
>
> How come with Python "batteries included" is a good thing,
> but with Biopython it's not?
>
> Is the solution like what Zope's been doing - split itself
> into many smaller packages, and distribute them as eggs?
>

Well, I'd argue that "batteries included" in Python isn't all good -  
there are pros and cons. It's awesome for beginners since they have  
a useful set of libraries that they can start using right away. It  
helps prevent unnecessary forking by making people standardize on one  
implementation. But on the flip side it means that people can go longer  
in a Python project before worrying about packaging and  
deployment problems - and learning these skills is important. They  
have made the Python standard library a little smaller in Python 3,  
and I think that's a good thing.

I like the analogy of a bicycle with training wheels on it. Awesome  
when you are learning to ride, but when you're in a race and you want  
to take a hairpin corner at speed, the wheels are going to get in the  
way. When people try riding with the training wheels off, they naturally  
fall over at first and skin their knees. Some dust themselves off and  
try again, others say, "that's it, I'm putting the wheels back on and  
never taking them off again!"

Zope got huge, and that made the learning curve incredibly  
intimidating for anyone new to the project. Splitting the project into  
eggs has been very beneficial. People can now approach the project  
more easily, and can contribute to a single package without feeling  
like they first need to understand every package in the Zope  
ecosystem. People who don't use Zope at all are now able to use  
packages from the project.

We used to say, "well you can install all of Zope and then just ignore  
the parts you don't need, disk space is cheap" but dependencies  
between higher level parts of the system and lower level parts snuck  
in here and there, and it usually meant that pulling in one package  
meant "pulling in the world". Splitting a project into multiple eggs  
"keeps you honest" in terms of dependencies. And because you used to  
have to take an "all-or-nothing" approach with Zope, some chose "all"  
and others chose "nothing" and this contributed to a cultural divide  
between Zope and the rest of the Python web world.
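
To make the "pulling in the world" point concrete, here is a minimal  
sketch of how one innocent-looking dependency drags in everything it  
transitively depends on. The package names and the DEPENDS_ON table  
are invented for illustration - they are not real Zope packages, and  
this is not how any particular installer works internally.

```python
# Hypothetical dependency graph: package name -> direct dependencies.
# These names are made up for illustration; they are not real packages.
DEPENDS_ON = {
    "app.templates": ["app.core"],
    "app.core": ["app.security", "app.database"],
    "app.security": ["app.core"],  # a circular dependency that "snuck in"
    "app.database": [],
    "small.utils": [],
}


def install_closure(package, graph):
    """Return the set of packages that installing `package` pulls in."""
    seen = set()
    stack = [package]
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # already visited; this also makes cycles safe
        seen.add(name)
        stack.extend(graph.get(name, []))
    return seen


if __name__ == "__main__":
    for name in ("app.templates", "small.utils"):
        print(name, "->", sorted(install_closure(name, DEPENDS_ON)))
```

Asking for "app.templates" ends up with four packages, while  
"small.utils" stays on its own. When each package is a separate egg  
that has to declare what it needs, this kind of growth shows up in the  
declarations instead of staying hidden inside one big install.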

Mark Ramm's recent talk, "A TurboGears Guy talks about what Django can  
learn from Zope" is very relevant to the topic of a large, centralized  
project versus many smaller, interconnected projects:

http://compoundthinking.com/blog/index.php/2008/09/17/djangocon-and-learning-from-zope-2/

As discussed in the Q&A period at the end, there is a cost/benefit  
trade-off between these two approaches and there is no easy "this way  
is right, the other is wrong" answer - but exploring both approaches  
is beneficial whichever one you end up choosing.

Packaging and deployment in Python is still an area of  
experimentation, and it's kind of a hard one to approach, but it's  
getting better. I wrote up some of my experiences with this stuff  
recently:

http://www.bud.ca/blog/pony

I think using tools such as Buildout to produce "repeatable  
deployments" is particularly relevant in bioinformatics. In a  
non-scientific web app, people can say, "why bother going to all the  
work of maintaining a detailed description of all the parts that  
compose my project and recording each change to each part over time?"  
when the only use case for this might be historical curiosity. But all  
the time  
in bioinformatics we see research produced from a "one-off  
deployment" (aka Works On My Machine (TM)) and it's often not possible  
or feasible to go back and re-create the system at the time a  
particular result set was generated.
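
As a rough illustration of the "detailed description of all the  
parts" idea, here is a small sketch that records the interpreter  
version and every installed package version so the record can be  
stored next to a result set. This is not Buildout and not how Buildout  
works - it only uses the standard library's importlib.metadata  
(available in current Python 3), and the function names and the JSON  
layout are my own assumptions.

```python
import json
import platform
from importlib import metadata


def snapshot_environment():
    """Collect the Python version and the versions of installed packages."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata has no name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


def save_snapshot(path):
    """Write the snapshot as JSON, e.g. alongside an analysis output file."""
    with open(path, "w") as handle:
        json.dump(snapshot_environment(), handle, indent=2)


if __name__ == "__main__":
    # Hypothetical file name; in practice it would sit next to the results.
    save_snapshot("environment-snapshot.json")
```

A snapshot like this only describes what was installed; it does not  
rebuild anything. A tool that pins versions and re-creates the  
environment from that description is what actually makes the  
deployment repeatable.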