[bip] Blog post on bioinformatics and Python
Kevin Teague
kteague at bcgsc.ca
Wed Sep 17 13:15:00 PDT 2008
On 17-Sep-08, at 12:11 PM, Andrew Dalke wrote:
> On Sep 17, 2008, at 5:57 PM, Ryan Raaum wrote:
>> 1. It is monolithic. It is like one of those honking huge Swiss Army
>> knives. It has a big knife, a small knife, a saw, a scissor, a
>> fork, a
>> spoon, a magnifying glass, a toothpick, a small shovel, a beach
>> umbrella, and more!
>
> How come with Python "batteries included" is a good thing,
> but with Biopython it's not?
>
> Is the solution like what Zope's been doing - split itself
> into many smaller packages, and distribute them as eggs?
>
Well, I'd argue that "batteries included" in Python isn't all good -
there are pros and cons. It's awesome for beginners since they have
a useful set of libraries that they can start using right away. It
helps prevent unnecessary forking by making people standardize on one
implementation. But on the flip side it means that people can go longer
in a Python project before worrying about packaging and
deployment problems - and learning these skills is important. They
have made the Python standard library a little smaller in Python 3,
and I think that's a good thing.

I like the analogy of a bicycle with training wheels on it. Awesome
when you are learning to ride, but when you're in a race and you want
to take a hairpin corner at speed, the wheels are going to get in the
way. When people try riding with the training wheels off, they naturally
fall over at first and skin their knees. Some dust themselves off and
try again, others say, "that's it, I'm putting the wheels back on and
never taking them off again!"

Zope got huge, and that made the learning curve incredibly
intimidating for anyone new to the project. Splitting it into
eggs has been very beneficial. People can now approach
the project more easily, and can contribute to a
single package without feeling like they first need to
understand every package in the Zope ecosystem. People who don't
use Zope at all are now able to use packages from the project.

We used to say, "well you can install all of Zope and then just ignore
the parts you don't need, disk space is cheap" but dependencies
between higher level parts of the system and lower level parts snuck
in here and there, and it usually meant that pulling in one package
meant "pulling in the world". Splitting a project into multiple eggs
"keeps you honest" in terms of dependencies. And because you used to
have to take an "all-or-nothing" approach with Zope, some chose "all"
and others chose "nothing", and this contributed to a cultural divide
between Zope and the rest of the Python web world.

Mark Ramm's recent talk, "A TurboGears Guy talks about what Django can
learn from Zope" is very relevant to the topic of a large, centralized
project versus many smaller, interconnected projects:
http://compoundthinking.com/blog/index.php/2008/09/17/djangocon-and-learning-from-zope-2/
As is answered in the Q&A period at the end, there is a cost/benefit
trade-off between these two approaches and there is no easy "this way
is right, the other is wrong" answer - but exploring both approaches
regardless of which decision is made is absolutely beneficial.

Packaging and deployment in Python is still an area of
experimentation, and it's kind of a hard one to approach, but it's
getting better. I wrote up some of my experiences with this stuff
recently:
http://www.bud.ca/blog/pony

I think using tools such as Buildout to produce "repeatable
deployments" is particularly relevant in bioinformatics. In a
non-scientific web app, people can say, "why bother going to all the work
of maintaining a detailed description of all the parts that compose my
project and recording each change to each part over time" when the
only use case for this might be historical curiosity. But all the time
in bioinformatics we see research produced from a "one-off
deployment" (aka Works On My Machine (TM)) and it's often not possible
or feasible to go back and re-create the system as it was at the time a
particular result set was generated.
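
As a rough illustration of the "detailed description of all the parts"
idea, here is a minimal sketch that writes a manifest of the interpreter
and installed package versions next to a result set. This is not
Buildout and not how Buildout works internally; it only uses the
standard library (importlib.metadata, so it assumes Python 3.8 or
later), and the function names and the manifest filename are made up
for the example.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Describe the running interpreter and every installed distribution."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


def write_manifest(path):
    """Write the snapshot as JSON (hypothetical helper) and return it."""
    manifest = snapshot_environment()
    with open(path, "w") as handle:
        json.dump(manifest, handle, indent=2)
    return manifest


if __name__ == "__main__":
    # "environment-manifest.json" is an arbitrary, illustrative filename.
    write_manifest("environment-manifest.json")
```

A manifest like this only records what was installed; it does not
rebuild the environment the way a pinned Buildout configuration is
meant to, but keeping it alongside each result set at least makes it
possible to tell later which versions produced the data.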