[bip] Blog post on bioinformatics and Python

Wed Sep 17 11:42:09 PDT 2008

Peter wrote:
> Hello all,
>
> I would have joined in with this conversation earlier but have been on holiday.
>
> I must confess I haven't read nearly as many of Walter Scott's novels
> as I would like, so didn't realise Titus was actually quoting "Old
> Mortality" with his 'tweaking the proboscis' line until Andrew's
> informative post.
>
> Bruce wrote:
>   
>> Also missing is the community because I tend to concur with Andrew,
>> communication must go both ways. Part of any project is to foster a
>> community as it changes and for the community to foster back.  The
>> latter is hard in multiple ways especially the requirement to understand
>> a person's coding style that is not your own. But I definitely agree
>> that starting a new project or forking one (what happened to that other
>> one) should be an option of last resort.
>>     
>
> For the record, I also agree with this stance about avoiding starting
> new projects or forking old ones - which might be inferred from the
> fact that I got involved with Biopython rather than starting something
> from scratch.
>
>   
>> As has previously been suggested on this list, what are the problems with
>> BioPython and can these be fixed?
>>     
>
> Specific feedback to Biopython would be very welcome - ideally on the
> Biopython mailing lists, but I am subscribed here too.
>   
Yeah, I mostly lurk there, with the expectation that I'll eventually get 
back to BioPython.

>   
>> Also keep in mind that Python 3K is very near, and it will require
>> extensive changes to any code base. So it is probably a good time to
>> start addressing any issues as the developers are likely to be amenable
>> to changes as the code has to change.
>>     
>
> I was under the impression that the Python team are trying to
> encourage libraries NOT to use Python 3k as an excuse for API changes.
>  While we can continue to gradually evolve the current Biopython API
> in a backwards compatible way, discussion about bigger API changes
> could help define goals for a possible "Biopython 2.0".
>   
Correct. Guido pointed out that it would be hard to tell which changes 
are API changes and which are due to Python 3K, so he recommended making 
API changes before migrating to Python 3K 
(http://www.artima.com/weblogs/viewpost.jsp?thread=227041). However, the 
Python developers did change the API in a way that will force numpy (see 
http://www.scipy.org/Python3k ) to change its API.

Yes, I would agree that aiming for BioPython 2.0 would be appropriate.

>   
>> Obviously there are dependencies
>> on third-party modules that also have to change or these dependencies
>> (e.g., Numeric) have to be removed in some way.
>>
>> To start this, one issue for me was the use of the unmaintained Numeric
>> when numarray (a fork of Numeric) was being maintained. Consequently,
>> BioPython did not fit into my code.
>>     
>
> I am a little surprised at your view - but hearing another perspective
> is instructive.  Yes, Numeric is unmaintained, but it is stable and
> pretty robust.  As there is nothing to prevent having both Numeric and
> numpy installed together, I personally had no problem with this when I
> started to use Biopython.  Were there any other stumbling blocks for
> you when you initially looked at Biopython?
Sure, BioPython is essentially a standalone module, so which Numerical 
Python module I use is somewhat irrelevant. But it did make me think 
twice, as my work at the time was more numerical/statistical than 
bioinformatics.

Other important ones include:

Hitting known 'bugs' because some database (SwissProt) changed, which 
required workarounds to avoid complete crashes. (Relying on distros to 
provide things that I need does not work, especially when someone says 
to get the latest version from svn or the distro provides broken 
packages.)

Use of iterators and what/how to get specific information out of 
BioPython objects.

Lack of elementtree - I respect the reasons for this.

I have my own sequence format that precluded 'off the shelf' parsers 
(basically I needed a simple way to support metadata; support for ';' 
comment lines in the fasta would have been so, so useful).
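
To make the ';' point concrete, here is a rough sketch of the kind of 
parser I mean. It is only an illustration under my own assumptions: the 
function name and the (title, comments, sequence) tuple are made up for 
this example and are not part of BioPython or any other library.

```python
def parse_fasta_with_comments(lines):
    """Yield (title, comments, sequence) for each FASTA-like record.

    Hypothetical format: '>' starts a record, and any line starting
    with ';' inside a record is kept as free-form metadata.
    """
    title, comments, chunks = None, [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new title closes the previous record, if there was one.
            if title is not None:
                yield title, comments, "".join(chunks)
            title, comments, chunks = line[1:].strip(), [], []
        elif title is None:
            continue  # ignore anything before the first record
        elif line.startswith(";"):
            comments.append(line[1:].strip())
        else:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if title is not None:
        yield title, comments, "".join(chunks)
```

Since it accepts any iterable of lines, an open file handle can be 
passed directly and the records are produced one at a time.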

Yes, I fully agree you can use Numeric, numarray and numpy as separate 
entities. Yes, I agree that you can use BioPython alone. But my answer 
to the question of whether or not BioPython would help me was no, for 
these different reasons. Once you start down that path you are not 
really coming back unless there is a good reason. In that regard, it 
doesn't help that Python is such a great language. Thus, the net result 
is that it seemed faster and easier to write the stuff I needed myself.
>   
>> NumPy (yes, another fork of Numeric
>> with some of the pieces of numarray) superseded both Numeric and
>> numarray and version 1.2 is due very, very soon (that includes some of
>> Andrew's fixes). Finally it would appear that BioPython will move to
>> numpy, although there is a mingw windows-64 bug that affects compilation
>> on that platform and may delay things.
>>     
>
> The current release, Biopython 1.48 does only support Numeric.
> However, in CVS we are currently moving to supporting this and numpy.
> For the pure python modules this is fairly simple, but we do have C
> code to deal with too - but I'm sure additional people getting
> involved and testing things would help.
>   
Yes, fixing it requires some deep knowledge of the code base, and it is 
easy to be out of one's depth!
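
For the pure Python side, the 'support this and numpy' idea could be 
sketched roughly as below. This is a guess at the general pattern, not 
how the Biopython CVS code actually does it; the helper name is 
hypothetical, and it does nothing about the API differences between the 
two packages or about the C code.

```python
import importlib


def import_first_available(names):
    """Return (name, module) for the first module in names that imports."""
    for name in names:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue  # try the next candidate
    raise ImportError(
        "none of these modules could be imported: " + ", ".join(names)
    )


# Assumed usage: prefer numpy and fall back to Numeric if it is missing.
# backend_name, array_backend = import_first_available(["numpy", "Numeric"])
```

The calling code still has to cope with whichever module it gets back, 
which is where the real work is.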

>   
>> Just a few worthless cents (bailouts are welcome),
>>     
>
> I wouldn't have said worthless (insert joke about the US dollar value here).
>
> Peter
>
>   

However, stumbling blocks are a little useless without trying to remove 
them. This also refers to other posts that arrived while I was writing 
this. I think that BioPython needs to be split. Maintaining multiple 
packages is a problem, which is why I like how Scientific Python does 
it. Scientific Python is really NumPy, SciPy and SciPy kits (scikits), 
ignoring the fact that SciPy has extra dependencies (like some language 
that I don't know) and was/is hard to install (try getting Atlas, or 
when distros screw up). Scikits (these were the sandboxes of earlier 
releases), like learn (machine learning), require SciPy but are 
otherwise independent and develop at a different pace. Really, this 
allows updating certain components and avoiding dependencies.

Sequence stuff (handling sequences and database records, addressing 
BLAST and multiple alignment, etc.) would be one component. I could split 
this further, but there is probably no gain and most people would want 
both anyhow. The second part would be things like logistic regression, 
cluster, and microarray-related affy stuff, as well as most of the topics 
covered in Jason Kinser's book.

Bruce