[bip] Blog post on bioinformatics and Python

Wed Sep 17 09:25:14 PDT 2008

On Wed, Sep 17, 2008 at 4:57 PM, Ryan Raaum <ryan.raaum at lehman.cuny.edu> wrote:
>
>> As has previously been suggested on this list, what are the problems
>> with BioPython and can these be fixed?
>
> I have some issues with BioPython that are not necessarily _problems_
> of BioPython, but choices that need to be made for any project. Much
> of the time, I need to use something that has made different choices.
>
> 1. It is monolithic. It is like one of those honking huge Swiss Army
> knives. It has a big knife, a small knife, a saw, a scissor, a fork, a
> spoon, a magnifying glass, a toothpick, a small shovel, a beach
> umbrella, and more! (The analogy breaks down a bit because most every
> tool on a Swiss Army knife isn't very good, while much of BioPython is
> very good). Almost all of the time, I need a very small number of
> tools for a project, not a huge, all-capable framework. IF I only
> worked on my computer and never uploaded anything to a web server nor
> shared a tool with a non-bioinformatic friend, the monolithic approach
> could work for me. However, much of the time I do need to distribute
> stuff in some way, and it can be very difficult to distribute tools
> built on top of BioPython. Why do pygr, pycogent, and others roll
> their own sequence import and file reading tools rather than using
> BioPython? Some of it probably has to do with different needs, but
> I'll bet some of it has to do with the monolithic design of BioPython.

Having one large package to install (with optional dependencies) to
support your distributed tool is surely easier than dealing with
several small ones?

> 2. It is not pure python. I recognize the need for Numeric and C for
> speed in many circumstances, but having those in the core framework
> limits where and how it can be used. This could be worked around -
> even in the monolithic approach - by having the largest possible core
> that is pure python, BioPython-Numeric for everything that is based on
> numeric but is otherwise pure python, and BioPython-C for extensions
> requiring compilation. However, I don't see this fairly radical change
> happening - the project is too big and has too many interwoven
> dependencies to allow a shake-up this big at this time.

You can install Biopython without Numeric or NumPy installed, and it
will work fine, provided you don't use the cluster library, PDB
parsing, or the other numerical bits.  If all you care about is
sequences, BLAST, and Entrez, for example, you'll be fine.
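
To illustrate the general idea of an optional numerical dependency,
here is a minimal sketch in modern Python 3.  It is not Biopython's
own code: the helper names (has_module, gc_fraction, mean_gc) are
hypothetical, and it only assumes the standard library is available.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module is importable (without importing it)."""
    return importlib.util.find_spec(name) is not None


def gc_fraction(seq):
    """Pure-Python GC fraction of a sequence string; no numerical library needed."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def mean_gc(seqs):
    """Mean GC fraction; uses numpy only if it happens to be installed."""
    values = [gc_fraction(s) for s in seqs]
    if not values:
        return 0.0
    if has_module("numpy"):
        import numpy  # optional dependency, imported lazily
        return float(numpy.mean(values))
    # Fallback path: plain Python arithmetic
    return sum(values) / len(values)
```

With this layout the sequence-only function works on any machine, and
the numerical library is only touched when it is actually present.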

> 3. It has dependencies that can make it difficult to install. I've
> installed BioPython a number of times over the years. Most of the time
> it goes reasonably smoothly. A few times I've had relatively minor
> problems with my system configuration that I could solve (but that
> none of my less computationally ept biologist colleagues would have
> been able to manage), and once I spent a full day and couldn't get it
> to happen because I could not get a functional mxTextTools
> installation (and never got it to happen on that machine). I will say
> that the most recent times I've done it, it went well, and there has
> been significant progress on the dependencies.

Yes, mxTextTools was a pain, but (as you note below) for several
releases it has been an optional library that most users do not need,
as it is/was only used in a few deprecated or rewritten parsers.

> It is good that the
> only required dependency is now a C compiler and numerical python,

On Windows we provide Biopython pre-compiled, so you don't even need a
C compiler.

> ... but for full functionality, and to achieve the full benefits of the
> monolithic philosophy, you have to install a host of large,
> complicated external dependencies.

Yes, if you want to use third-party tools such as an SQL database, the
NCBI's standalone BLAST tools, or ClustalW, you will need to install
them too.  I don't see how this is related to a "monolithic"
Biopython.  You can install Biopython and then add further packages
as and when you need them.
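
As a rough sketch of the "install it when you need it" approach, a
script can check at run time whether an external program is on the
PATH.  This uses only the Python 3 standard library; the function name
find_external_tools is hypothetical, and the executable names shown
("blastall", "clustalw") are assumptions that may differ on your
system.

```python
import shutil


def find_external_tools(names):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Executable names are assumptions; adjust them for your installation.
    for name, path in find_external_tools(["blastall", "clustalw"]).items():
        print(name, "->", path or "not installed")
```

A tool that reports a missing program this way can tell the user
exactly which extra package to install, rather than failing later.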

> I do use BioPython from time to time, but most often through finding a
> small piece of functionality that I need that can be extracted.
> I recently needed a basic, pure python pairwise sequence alignment tool,
> and the align2 module in BioPython did the trick and could be -
> thankfully - pulled out of BioPython easily.

The license here helps (as you note below).

> And, to add some kudos, I think the BioPython people have made some
> very good choices in the past few years.
>
> 1. Changing from GPL to the BioPython license significantly expanded
> the number of people who could contribute and use the project
> (especially among those in industry).

Was Biopython ever GPL?  I was under the impression that it has always
been under the (BSD/MIT style) Biopython License.

> 2. Reducing the number of required dependencies, and _especially_
> working to reduce reliance on mxTextTools has been a HUGE improvement.

I am hoping that for the next release all remaining bits of Biopython
using mxTextTools will be deprecated - we're almost there with the
current release.

> 3. The current work to switch to numpy will also make a big impact
> when it is completed.

This particular change may make more of an impact than I personally
had expected.

Peter