[bip] Blog post on bioinformatics and Python

Peter biopython at maubp.freeserve.co.uk
Thu Sep 18 09:25:53 PDT 2008


>>> Use of iterators and what/how to get specific information out of
>>> BioPython objects.
>>
>> Could you clarify these points please?  Are you in favour of Biopython
>> using python iterators (e.g. via generator functions)?  And what
>> Biopython objects in particular were you trying to extract data from?
>>
> Part of it is a lack of understanding but I have not bothered to go
> back. So what I say is probably wrong and out of date. I do not really
> understand Python iterators and generators as my knowledge is still
> mainly Python 2.0 and have not bothered that much with the new language
> features. For what I wanted using .next() really was not an option
> because I thought that I would need to get specific entries not proceed
> in ordered approach. Now I definitely need to access specific entries to
> match across files or databases. Today I looked at the BioPython
> tutorial Chapter 4 and saw SeqIO.to_dict which would have helped in that
> regard.

I sort of follow you.  I personally think using iterators in for loops
is elegant, while the .next() thing is ugly, but has its place.
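
To make the comparison concrete, here is a minimal sketch in plain
Python.  The parse_records function is a hypothetical stand-in for a
parser that returns an iterator, not a real Biopython call, and the
toy records are made up for illustration.  Note that current Python
spells the manual step as the built-in next(records) rather than the
records.next() method.

```python
def parse_records(lines):
    """Hypothetical stand-in for a parser: lazily yield (query_id, hit_count)."""
    for line in lines:
        query_id, hit_count = line.split()
        yield query_id, int(hit_count)


# Made-up data: one line per query, giving its ID and number of hits.
data = ["query1 3", "query2 0", "query3 7"]

# Style 1: loop over the iterator, handling one record at a time.
with_hits = [query_id for query_id, count in parse_records(data) if count > 0]

# Style 2: step through the iterator by hand with next().
records = parse_records(data)
first = next(records)
second = next(records)
```

Either way only one record needs to be in memory at a time; the for
loop simply hides the stepping.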

> I tend to blast multiple sequences at the same time (it is faster than
> one at a time) so the .next() is not an option (and I do not see to_dict
> option in the BLAST part of the tutorial).

I also tend to use BLAST on multiple queries.  I tend to run my analysis
on the results query by query (looping over the returned results - but
you could use the next() method if you preferred).  I'm unclear on why
this was not an option for you.

If you have hundreds (or more) queries, then loading all the BLAST
results into memory at once as a dictionary may be a bad idea (too
much data).   If I want to cross-reference queries back to the
original input (e.g. to get the full query sequence), then I may store
the query FASTA file in memory as a dictionary.  If I really wanted to
analyse the BLAST results in a particular order, I would construct the
query input FASTA file in that order.
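
As a rough sketch of that approach in plain Python: the queries are
held in a dictionary, while the results are handled one query at a
time.  Both read_fasta and iter_results are hypothetical helpers
written for this example (in real code the dictionary could come from
SeqIO.to_dict and the results from a Biopython BLAST parser), and the
data is made up.

```python
def read_fasta(lines):
    """Toy FASTA reader returning {record_id: sequence} (illustration only)."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word after '>' as the record ID.
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def iter_results(rows):
    """Hypothetical stand-in for a BLAST parser: yield (query_id, hits) per query."""
    for query_id, hits in rows:
        yield query_id, hits


# Made-up query file and made-up results, in the same order as the queries.
fasta_lines = [">q1 first query", "ACGT", "ACGT", ">q2 second query", "GGCC"]
blast_rows = [("q1", ["hitA", "hitB"]), ("q2", [])]

# The (small) query file goes into memory as a dictionary...
queries = read_fasta(fasta_lines)

# ...while the (potentially large) results are processed query by query.
summary = []
for query_id, hits in iter_results(blast_rows):
    full_seq = queries[query_id]  # cross-reference back to the input
    summary.append((query_id, len(full_seq), len(hits)))
```

The dictionary lookup gives the random access by ID, so the results
themselves never need to be held in memory all at once.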

> At the time I was also
> trying to extract less common things like 'Hsp_hit-frame' (?)
> that I did not find as being outputted. I thought I would use BioPython
> 1.47 (not sure how to find out version under Python) to check this so I
> just tried to run the tutorial code in Section 6.6.2  'Parsing a file
> full of BLAST runs' on one of my xml files. First problem undefined
> variables (did file a bug - now fixed!!).  Second problem 'ValueError:
> Unexpected end of stream' which is hard to determine the cause. However,
> this may be due to using blast version 2.2.18 (released March 08) as I
> think that similar occurrence happened when I first was trying
> BioPython.

Thanks for that report about the variable names.  Did you realise that
Section 6.6 "Deprecated BLAST parsers" is on the plain text parser,
rather than the XML parser discussed in Section 6.4 "Parsing BLAST
output"?  Would you mind posting over on the biopython mailing list
with a little more detail about your problem (e.g. a snippet of code and
the traceback)?  Thanks

> Just highlights a frustration of using the BioPython codebase
> as there is no clue to the problem or solution (could BioPython at least
> track which version of blast is known to work?).

Biopython 1.48 works with XML from BLAST versions 2.2.18 and 2.2.18+
used by the NCBI online (and we have unit tests going back to 2.2.12).
We don't encourage people to use the plain text parser, but it should
work on recent versions of BLAST for single query output.  It doesn't
currently work on multi-query output due to one of the NCBI's recent
formatting changes.  Quoting the tutorial:

"Our plain text BLAST parser works a bit better [than the HTML
parser], but use it at your own risk. It may or may not work,
depending on which BLAST versions or programs you're using."

How would you suggest tracking this information?  Via a table on a
wiki page maybe?

Peter


