[bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd

Andrew Dalke dalke at dalkescientific.com
Tue Feb 5 21:19:27 PST 2008


On Feb 6, 2008, at 4:40 AM, Bruce Southey wrote:
> The concern that I have is that showing incorrect code or results is
> not usually a constructive approach.  A far bigger concern for me is
> what is actually being compared.

I haven't gotten to the point of looking at the other code bases, or
reading the paper in detail.  At this point I'm reviewing the Python
code because I know it best.

What I've found so far is that it wasn't written by someone who knows
Python well.  A lot of the code is stylistically Perl or Fortran,
which affects both the line count and the performance numbers.  My
version of "parse" takes half the number of non-comment, non-blank
lines reported in the paper.
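
To sketch what I mean by idiomatic (this is my style, not the
paper's actual readFasta.py):

    def read_fasta(infile):
        # Yield (title, sequence) pairs from an open FASTA file.
        # Collecting the lines in a list and joining once avoids the
        # quadratic "seq += line" pattern that Perl-style Python uses.
        title = None
        chunks = []
        for line in infile:
            if line.startswith(">"):
                if title is not None:
                    yield title, "".join(chunks)
                title = line[1:].rstrip()
                chunks = []
            else:
                chunks.append(line.strip())
        if title is not None:
            yield title, "".join(chunks)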

In the one case where I was able to generate numbers, changing a  
"0.0" to a "0" gave a 10% performance improvement, and using psyco  
gave a 94% speedup.  The paper talks about using C extensions for  
Perl and Java, but not for Python.
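
To make both changes concrete (a sketch with made-up names, not the
benchmark's actual code):

    # Integer vs. float: in Python 2, starting a counter at 0.0 forces
    # float arithmetic through the whole loop; a plain 0 stays integer.
    def count_identical(s1, s2):
        n = 0                      # was: n = 0.0
        for a, b in zip(s1, s2):
            if a == b:
                n += 1
        return n

    # psyco is a two-line addition at the top of the main script:
    try:
        import psyco
        psyco.full()               # JIT-specialize every function
    except ImportError:
        pass                       # psyco not available on this platform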

These, to me, are methodological errors in the handling of the Python
code.  I assume the same is true for the other languages, more or less.


Looking a bit at the other code bases, I see what you mean.  A
serious question is: what's the point of the paper?

    This benchmark provides a comparison of six commonly used programming
    languages under two different operating systems. The overall comparison
    shows that a developer should choose an appropriate language carefully,
    taking into account the performance expected and the library
    availability for each language.

That's not a shocker.  That's ... boring.

> I don't think that the same algorithm
> is being implemented in each language! For example, the readFasta.py
> involves a list of classes that I don't think is done in the other
> languages - anyhow it is also so inefficient that it is trivial to
> reduce its time by nearly half. Also, the code doesn't seem to have any
> error checking ability or test ability to assess if the code is even
> correct.

The outputs of the Perl and Python versions are the same, except for
an extra newline added because the author thought that Python's
"print" needs a "\n" just like Perl's does.  The C output adds the
text "Align1 " and "Align2 ", making it harder to diff.


Some additional questions to address:
   - Should each of the implementations be idiomatically correct?

   - Should an implementation trade off performance against readability?
        (a tradeoff which is itself language dependent; Python
         programmers don't read regexes as often as Perl programmers do)

   - Are 3rd-party packages allowed?  For example, Numeric for Python,
      ragel for generating a parser in C, or psyco.

   - Should the programs be "normal" or "expert"?  That is, if I were
to write a Python parser for performance, with access to 3rd-party
packages, I could pull in mxTextTools and make a very fast system.
But it wouldn't be what most people would do, and it wouldn't be
maintainable in most environments.

Or I could use mmap and probably get a decent boost that way.  See,
for example, the "wide finder" discussion at
   http://www.dalkescientific.com/writings/diary/archive/2007/10/07/wide_finder.html
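
A sketch of the mmap approach, using a simple record count as the
task (my example, not the benchmark's):

    import mmap

    def count_fasta_records(path):
        # Map the file into memory; the kernel pages it in on demand
        # and find() scans the raw bytes without building per-line
        # string objects.
        f = open(path, "rb")
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        count = 0
        pos = mm.find(">")
        while pos != -1:
            count += 1
            pos = mm.find(">", pos + 1)
        mm.close()
        f.close()
        return count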

   - Why is the Python version for Linux running 2.4.4 (a bug-fix
release from October 18, 2006) and not Python 2.5 (released
September 19, 2006)?  I ask this in part because I helped add some
string optimizations to Python's core specifically to make some
parsing tasks go faster, and others added other optimizations.

   The paper says "Only C# and Python appeared consistently faster in
every program on Windows."  That's not because Windows is special.
It's because the Windows tests used Python 2.5, which included all
of those optimizations.
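
As one example of the sort of code those optimizations help (a
sketch, not a claim about this benchmark's actual hot spots):

    def first_word(line):
        # str.find (and "in" substring tests) gained a much faster
        # search algorithm in Python 2.5, so find-heavy parsing like
        # this speeds up with no source changes at all.
        i = line.find(" ")
        if i == -1:
            return line
        return line[:i]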


> I found it more disturbing the lack of literature review including
> different papers (http://www.cis.udel.edu/~silber/470STUFF/article.pdf
> or http://page.mi.fu-berlin.de/prechelt/Biblio/jccpprtTR.pdf ) or

The latter Prechelt paper is [8] in the reference list.  However,
it's on a different topic.  In those two studies the purpose was to
see how people with roughly equal skill in a given language implement
solutions to the same problem, and then to evaluate the aggregate
properties of those solutions.  Multiple people implemented the
program for each language.

In this case there's one person implementing the same program in
different languages, with a different skill level in each language.

Because the author of the programs was a coauthor of the paper,
there's a possibility of bias as well.  This wasn't a blinded
comparison, and there was no incentive to improve every program for
better performance, less memory use, or fewer LOC.

				Andrew
				dalke at dalkescientific.com




