[bip] Blog post on bioinformatics and Python

Wed Sep 17 08:57:27 PDT 2008

> As has previously been suggested on this list, what are the problems with
> BioPython and can these be fixed?

I have some issues with BioPython that are not necessarily _problems_
of BioPython, but choices that need to be made for any project. Much
of the time, I need to use something that has made different choices.

1. It is monolithic. It is like one of those honking huge Swiss Army
knives. It has a big knife, a small knife, a saw, scissors, a fork, a
spoon, a magnifying glass, a toothpick, a small shovel, a beach
umbrella, and more! (The analogy breaks down a bit because almost every
tool on a Swiss Army knife isn't very good, while much of BioPython is
very good.) Almost all of the time, I need a very small number of
tools for a project, not a huge, all-capable framework. If I only
worked on my computer and never uploaded anything to a web server or
shared a tool with a non-bioinformatician friend, the monolithic approach
could work for me. However, much of the time I do need to distribute
stuff in some way, and it can be very difficult to distribute tools
built on top of BioPython. Why do pygr, pycogent, and others roll
their own sequence import and file reading tools rather than using
BioPython? Some of it probably has to do with different needs, but
I'll bet some of it has to do with the monolithic design of BioPython.

2. It is not pure Python. I recognize the need for Numeric and C for
speed in many circumstances, but having those in the core framework
limits where and how it can be used. This could be worked around -
even in the monolithic approach - by having the largest possible core
that is pure Python, BioPython-Numeric for everything that is based on
Numeric but is otherwise pure Python, and BioPython-C for extensions
requiring compilation. However, I don't see this fairly radical change
happening - the project is too big and has too many interwoven
dependencies to allow a shake-up this big at this time.

3. It has dependencies that can make it difficult to install. I've
installed BioPython a number of times over the years. Most of the time
it goes reasonably smoothly. A few times I've had relatively minor
problems with my system configuration that I could solve (but that
none of my less computationally ept biologist colleagues would have
been able to manage), and once I spent a full day and couldn't get it
installed because I could not get a functional mxTextTools
installation (and I never did get it working on that machine). I will say
that the most recent times I've done it, it went well, and there has
been significant progress on the dependencies. It is good that the
only required dependencies are now a C compiler and Numerical Python, but
for full functionality, and to achieve the full benefits of the
monolithic philosophy, you have to install a host of large,
complicated external dependencies.
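
To make the layering suggested in point 2 concrete, here is a minimal
sketch of a function that uses a numerical backend when one is
installed and otherwise falls back to pure Python. Everything in it is
an assumption for illustration: the names gc_content and HAVE_NUMPY are
hypothetical, numpy stands in for the numerical layer, and none of this
is how BioPython is actually organized.

```python
# Hypothetical sketch: optional numerical backend with a pure-Python fallback.
# Not BioPython code; the names here are made up for illustration.
try:
    import numpy as np  # optional accelerated layer
    HAVE_NUMPY = True
except ImportError:
    np = None
    HAVE_NUMPY = False


def gc_content(seq):
    """Return the fraction of G and C characters in an ASCII sequence."""
    if not seq:
        return 0.0
    seq = seq.upper()
    if HAVE_NUMPY:
        # View the sequence as an array of byte values and count G/C.
        arr = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
        is_gc = (arr == ord("G")) | (arr == ord("C"))
        return float(is_gc.mean())
    # Pure-Python path: same result, no external dependency.
    return (seq.count("G") + seq.count("C")) / len(seq)
```

Callers see the same function either way, so a tool built on the
pure-Python path can be copied to a machine without a compiler or
numerical library and still run, just more slowly.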

I do use BioPython from time to time, but most often through finding a
small piece of functionality that I need that can be extracted. I
recently needed a basic, pure Python pairwise sequence alignment tool,
and the align2 module in BioPython did the trick and could be -
thankfully - pulled out of BioPython easily.
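
For a sense of how small a basic pure Python pairwise aligner can be,
here is a minimal sketch of global alignment scoring by dynamic
programming (Needleman-Wunsch with a linear gap penalty). It is not the
BioPython module mentioned above; the function name and the default
scores are assumptions chosen for illustration, and it returns only the
score, not the alignment itself.

```python
# Hypothetical sketch: global alignment score, pure Python, no dependencies.
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of sequences a and b."""
    # prev[j] holds the best score for aligning a[:i-1] against b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            ))
        prev = curr
    return prev[-1]
```

Keeping only two rows of the table holds memory at O(len(b)), at the
cost of not being able to trace back the alignment; a version that
reports the aligned sequences would need the full table.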

And, to add some kudos, I think the BioPython people have made some
very good choices in the past few years.

1. Changing from GPL to the BioPython license significantly expanded
the number of people who could contribute to and use the project
(especially among those in industry).

2. Reducing the number of required dependencies, and _especially_
working to reduce reliance on mxTextTools, has been a HUGE improvement.

3. The current work to switch to numpy will also make a big impact
when it is completed.

-Ryan