[bip] Blog post on bioinformatics and Python

C. Titus Brown ctb at msu.edu
Thu Sep 18 11:44:51 PDT 2008


-> > Why not:
-> >
-> >    data = list(data)
-> >
-> > ?  That will take any iterator/generator and turn it into a list.
-> >   
-> (Ignoring the previous answer given)
-> Easy: It is a waste of effort converting one unknown object into a list 
-> of unknown objects that may or may not be the same.
-> Slightly harder: need to preselect the data by different criteria (hit, 
-> score, evalue, query name) - now would have to parse a list of unknowns...
-> 
-> > There's no real penalty for doing this (if you need a random access
-> > list, then you need to fully parse the file anyway!), 
-> But you would be parsing the input multiple times.
-> 
-> > and you can
-> > convert it into a dictionary pretty easily, too.
-> >   
-> Sure, but it is a different matter to get adequate keys.
-> 
-> I do know that computers are faster and memory is than before (and going 
-> to change again - core i7). However, I do try to code 'efficiently' so 
-> converting multiple data types does not fit when you can do it the 
-> desired way the first time.

Bruce, either way you need to treat the objects less "opaquely", right?
That is, if you're doing a BLAST of multiple sequences against a
database, and you need to access the results by (say) query sequence
name, you need to have some way to get at the name in the BLAST result.
Broadly speaking, then, you need to minimally parse the BLAST result.

Using

	data = list(some_generator_or_iterator())

and then accessing 'data' doesn't run the generator more than once;
'data' is now a static list.
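
As a minimal sketch of that point (the Record type, the parse_records
generator and the 'name score' line format are all made up for
illustration; this is not BioPython's or blastparser's API), a counter
shows that the generator body runs once per record, no matter how often
'data' is used afterwards:

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed BLAST record; not a real parser's API.
Record = namedtuple("Record", ["query_name", "score"])

parse_calls = 0  # counts how many records the toy parser actually yields


def parse_records(lines):
    """Toy generator: each line is assumed to hold 'query_name score'."""
    global parse_calls
    for line in lines:
        name, score = line.split()
        parse_calls += 1
        yield Record(name, float(score))


raw = ["seqA 52.0", "seqB 31.5", "seqA 18.2"]
data = list(parse_records(raw))  # the generator is consumed here, once

# Random access, filtering and indexing all reuse the same static list.
first = data[0]
high = [r for r in data if r.score > 30]
by_query = {}
for r in data:
    by_query.setdefault(r.query_name, []).append(r)
```

Grouping into a dictionary of lists keyed by query name sidesteps the
"adequate keys" problem when one query has several hits.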

The only place where you would see a difference in performance is where
you first partially parse the header (e.g. to extract the query
sequence name) and then select records to parse completely based on
the header information.  I don't think BioPython (or my blastparser)
uses this kind of lazy parsing, so you would have to parse the entire
record anyway.
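
For illustration only, this is roughly what such lazy parsing could look
like.  The record layout (a 'Query= name' line followed by 'subject
score' lines, with blank lines between records) is a simplified
assumption rather than real BLAST output, and every function name here
is hypothetical:

```python
def split_records(text):
    """Split raw text into record chunks separated by blank lines."""
    return [chunk for chunk in text.strip().split("\n\n") if chunk]


def header_name(chunk):
    """Cheap partial parse: read only the first line, 'Query= <name>'."""
    first_line = chunk.split("\n", 1)[0]
    return first_line.split("=", 1)[1].strip()


def full_parse(chunk):
    """Full parse: the query name plus every (subject, score) hit."""
    lines = chunk.split("\n")
    hits = []
    for line in lines[1:]:
        subject, score = line.split()
        hits.append((subject, float(score)))
    return {"query": header_name(chunk), "hits": hits}


def parse_selected(text, wanted):
    """Yield fully parsed records only for query names in `wanted`."""
    for chunk in split_records(text):
        if header_name(chunk) in wanted:
            yield full_parse(chunk)


raw_text = """Query= seqA
subj1 52.0
subj2 18.2

Query= seqB
subj3 31.5
"""

# Only the seqB record goes through full_parse; seqA is skipped after
# its header line has been inspected.
selected = list(parse_selected(raw_text, {"seqB"}))
```

Only a parser structured like this would make selecting records before
the full parse cheaper than parsing everything up front.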

(Probably you understand this and have something else in mind, but I
wanted to point out that using iterators or generators for parsing is
generally not a disadvantage in terms of speed or memory.)
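
A small sketch of the memory side, again with a made-up 'name score'
line format and hypothetical function names: when the records are only
needed once, the generator can feed the consumer directly, and only the
summary is kept in memory.

```python
def iter_scores(lines):
    """Toy generator yielding (query_name, score) from 'name score' lines."""
    for line in lines:
        name, score = line.split()
        yield name, float(score)


def best_score_per_query(records):
    """Consume any iterable of (name, score) pairs; keep one value per query."""
    best = {}
    for name, score in records:
        if name not in best or score > best[name]:
            best[name] = score
    return best


raw = ["seqA 52.0", "seqB 31.5", "seqA 18.2", "seqB 40.0"]

# No intermediate list is built; records are handled one at a time.
best = best_score_per_query(iter_scores(raw))
```

The same consumer also accepts list(iter_scores(raw)), so choosing
between streaming and a static list can be left to the caller.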

cheers,
--titus
-- 
C. Titus Brown, ctb at msu.edu


