[bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd

Andrew Dalke dalke at dalkescientific.com
Tue Feb 5 19:48:27 PST 2008


On Feb 6, 2008, at 2:35 AM, Titus Brown wrote:
> Well, the mailing list archives are open and searchable, so I'm sure
> you're welcome to do so ;)

I've no problem.

> I suspect that the code could use some systematic code review; "we"
> (i.e. someone else :) could even write up something semi-formal if it
> turns out that the results are bogus.

It's hard to evaluate the BLAST parsing time without knowing how they
generated the data file.  I can read a million lines a second, while
they report:

    Python was the worst performer for parsing a BLAST file (Fig 3),
    taking more than 38 minutes to process the file compared to Perl,
    which took only 7.28 minutes. This difference did not arise from
    any inability of Python to handle large files, since it took only
    3.2 minutes to read the file without processing the lines. Perl
    accomplished the same task in only 1.4 minutes.

Assuming the same disk speed, and guessing about 30 bytes per line,

   1000000 lines/sec * 30 bytes/line = 30 MB/sec
   3.2 minutes * 60 sec/min * 30 MB/sec = 5.76 GB

As a rough guess that's 4GB or larger.  Hmm, but I was reading from a  
gzipped file.  Still, it'll have to be a *huge* file to get that slow  
performance.
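
For anyone who wants to redo the arithmetic with their own numbers, here is a minimal Python sketch of the estimate. The 30 bytes/line figure is only my guess at an average BLAST output line, and the names estimate_file_size and time_line_read are made up for illustration. The second function is one plausible way to time a plain line-by-line read of a gzipped file; it is not the code used in the paper.

```python
import gzip
import time


def estimate_file_size(lines_per_sec, bytes_per_line, read_minutes):
    """Return the file size in bytes implied by a read rate and a read time."""
    bytes_per_sec = lines_per_sec * bytes_per_line
    return read_minutes * 60 * bytes_per_sec


def time_line_read(path):
    """Iterate over a gzipped file line by line without processing the lines.

    Returns (line_count, elapsed_seconds).
    """
    start = time.perf_counter()
    count = 0
    with gzip.open(path, "rb") as infile:
        for _ in infile:
            count += 1
    return count, time.perf_counter() - start


if __name__ == "__main__":
    # Assumed inputs: 1,000,000 lines/sec, 30 bytes/line, 3.2 minutes.
    size = estimate_file_size(1_000_000, 30, 3.2)
    print(f"implied file size: {size / 1e9:.2f} GB")
```

Dividing the line count from time_line_read by the elapsed seconds gives a lines/sec figure to plug into estimate_file_size in place of my million-lines-a-second number.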


What constitutes bogus enough?  I think the results for Python are  
bogus, the methodology is bogus, and two of the three benchmarks,  
being without test data, are also bogus.



				Andrew
				dalke at dalkescientific.com




