[bip] Blog post on bioinformatics and Python

Thu Sep 18 08:55:05 PDT 2008

Peter wrote:
> Bruce wrote:
>   
>> Other important ones include:
>>
>> Hitting known 'bugs' because some database changed (SwissProt) that
>> required workarounds to avoid complete crashes. (Relying on distros to
>> provide things that I need does not work especially when someone says
>> get the latest version from the svn or the distro provides broken
>> packages.)
>>     
>
> This isn't a problem in Biopython per se - having to update parsers
> due to file format changes (be these from databases or updated
> software tools) is something any bioinformatics library has to deal
> with. Most "stable" Linux distributions won't track the latest version
> of ANY software, so unfortunately if/when some file format next
> changes and breaks a parser, you will need to update Biopython
> manually - rather than via your distribution's packaging system.
> Would having official Biopython (or BioPerl etc) hosted debian (etc)
> packages help here?  In theory you could add this to your list of
> repositories and then automatically get official Biopython releases.
> This would be quite a big effort and we would need people with
> packaging experience to get involved.
>   
I am not the person to ask. While I use the distro's packages for most of 
the system, I install the important software, including svn versions, from 
source (I've been using Linux a long, long time). The worst part is getting 
the dependencies installed, which argues for at least a core component 
that does not need any dependencies. 

This is a community thing and requires people trained to do it. I don't 
fully remember but recently SUSE and Fedora (?) were offering ways to 
repackage software for different distributions.

>> Use of iterators and what/how to get specific information out of
>> BioPython objects.
>>     
>
> Could you clarify these points please?  Are you in favour of Biopython
> using python iterators (e.g. via generator functions)?  And what
> Biopython objects in particular were you trying to extract data from?
>   
Part of it is a lack of understanding, but I have not bothered to go 
back, so what I say is probably wrong and out of date. I do not really 
understand Python iterators and generators, as my knowledge is still 
mainly Python 2.0 and I have not bothered much with the newer language 
features. For what I wanted, using .next() really was not an option 
because I thought that I would need to get specific entries rather than 
proceed through them in order. Now I definitely need to access specific 
entries to match across files or databases. Today I looked at the 
BioPython tutorial Chapter 4 and saw SeqIO.to_dict, which would have 
helped in that regard.
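
For illustration, the idea of going from a one-at-a-time iterator to 
lookup by identifier can be sketched with the standard library alone. 
This is not Biopython's code: parse_fasta and index_by_id are 
hypothetical helpers, and the two records are made-up sample data. It 
only shows the general pattern that something like SeqIO.to_dict 
presumably follows, which is to consume the iterator once and keep 
every record in a dictionary.

```python
# Stdlib-only sketch: consume a sequential record iterator into a dict
# so that records can be fetched by identifier in any order.
# parse_fasta and index_by_id are hypothetical helpers, not Biopython functions.
import io


def parse_fasta(handle):
    """Yield (identifier, sequence) pairs one record at a time."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Use the first word after ">" as the identifier.
            words = line[1:].split()
            identifier, chunks = (words[0] if words else ""), []
        elif line and identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def index_by_id(records):
    """Build a dict from an iterator of (identifier, sequence) pairs."""
    index = {}
    for identifier, sequence in records:
        if identifier in index:
            raise ValueError("Duplicate identifier: %s" % identifier)
        index[identifier] = sequence
    return index


# Made-up sample data standing in for a FASTA file on disk.
example = io.StringIO(">seq1 first\nACGT\nTTGA\n>seq2 second\nGGCC\n")
sequences = index_by_id(parse_fasta(example))
print(sequences["seq2"])
```

The trade-off is memory: everything is held at once, which is fine for 
modest files but not for very large ones.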

I tend to blast multiple sequences at the same time (it is faster than 
one at a time), so .next() is not an option (and I do not see a to_dict 
option in the BLAST part of the tutorial). At the time I was also 
trying to extract less common things like 'Hsp_hit-frame' (?) 
that I did not find being output. I thought I would use BioPython 
1.47 (not sure how to find out the version under Python) to check this, 
so I just tried to run the tutorial code in Section 6.6.2, 'Parsing a 
file full of BLAST runs', on one of my xml files. First problem: 
undefined variables (I did file a bug - now fixed!!). Second problem: 
'ValueError: Unexpected end of stream', whose cause is hard to determine. 
However, this may be due to using blast version 2.2.18 (released March 
08), as I think something similar happened when I was first trying 
BioPython. It just highlights a frustration of using the BioPython 
codebase, as there is no clue to the problem or solution (could 
BioPython at least track which versions of blast are known to work?).
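
As a rough illustration of both points (which blast version wrote a 
file, and pulling out a less common field for several queries at once), 
here is a sketch that reads BLAST XML directly with Python's 
xml.etree.ElementTree rather than through Biopython. The element names 
follow the BLAST XML output format; the helper functions 
blast_version and hit_frames_by_query are hypothetical, and the 
embedded XML is a heavily trimmed, made-up sample, not real output.

```python
# Sketch: read the BLAST version and per-query 'Hsp_hit-frame' values
# straight from BLAST XML using only the standard library.
# The helper names and the sample document are illustrative assumptions.
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<BlastOutput>
  <BlastOutput_program>blastx</BlastOutput_program>
  <BlastOutput_version>blastx 2.2.18</BlastOutput_version>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query_one</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>hit_a</Hit_id>
          <Hit_hsps>
            <Hsp><Hsp_hit-frame>1</Hsp_hit-frame></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>query_two</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>hit_b</Hit_id>
          <Hit_hsps>
            <Hsp><Hsp_hit-frame>-2</Hsp_hit-frame></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def blast_version(root):
    """Return the text of BlastOutput_version, or None if it is absent."""
    return root.findtext("BlastOutput_version")


def hit_frames_by_query(root):
    """Map each query definition to a list of (hit id, hit frame) pairs."""
    frames = {}
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        # Queries without hits still get an (empty) entry.
        frames.setdefault(query, [])
        for hit in iteration.iter("Hit"):
            hit_id = hit.findtext("Hit_id")
            for hsp in hit.iter("Hsp"):
                frame = hsp.findtext("Hsp_hit-frame")
                value = int(frame) if frame is not None else None
                frames[query].append((hit_id, value))
    return frames


root = ET.fromstring(SAMPLE_XML)
version = blast_version(root)
frames = hit_frames_by_query(root)
print(version)
print(frames["query_two"])
```

For a real file, ET.parse(filename).getroot() would take the place of 
ET.fromstring. On the Biopython side, recent releases expose the 
installed version as Bio.__version__; whether a given older release 
defines it is an assumption to check.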

Bruce