[bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd

Tue Feb 5 15:18:23 PST 2008

On Feb 5, 2008, at 11:36 PM, Cory Tobin wrote:
> One example, on line 54 of parse.py he compiles a
> regular expression inside of a loop.  Placing the re.compile() before
> the loop could save them plenty of CPU cycles.

The performance is going to suffer, but because of the re cache it's  
a couple of function call overheads and a bit more, and not a full re- 
eval.  More importantly, that regexp isn't even needed.  As far as I  
can tell, it's better to use a strip()

> I got a good chuckle on line 45 of parse.py, they use the deprecated
> string module to remove the newline character rather than the strip()
> function.

The main loop of that file starts

while 1:
         line = f.readline()

         if not line:
                 break
         line = string.replace(line,'\n','')

         if re.match('>', line):
                 name=line.replace('>','',1)

                 while 1:
                         line = f.readline()
                         if re.match("\s+Length", line):
                                 break
                         p = re.compile("^\s+(\S.+)$")
                         line = string.replace(line,'\n','')
                         m = p.match(line)
                         if m:
                                 name+=" "+m.group(1)

                 name=name.replace(",", ".")
                 continue

The more idiomatic approach, assuming I understand the logic  
correctly and assuming the file format is always correct (note the  
written code has an infinite loop if the "Length" is never found), is

f = iter(f)

for line in f:
   if line[:1] == ">":
     name = line[1:-1]
     for line in f:
       if line.startswith("    Length"):  # assuming that \s+ is a  
fixed length
         break
       # O(n**2) operation, but n is nearly always less than 4
       name += " " + line.lstrip()
     name = name.replace(",", ".")
     continue

That's followed by

   #Score
         m_score=p_score.match(line)
         if m_score:
                 bit=m_score.group(1)
                 e=m_score.group(2)
                 continue

   #Identities
         m_id=p_id.match(line)
         if m_id:
                 idtt=m_id.group(1)
                 pos=m_id.group(2)
                 print name+","+bit+","+e+","+idtt+","+pos+"\n"

where
p_score = re.compile("^\sScore\s+=\s+(.+)\s+bits\s+\(\d+\),\s+Expect\ 
(?\d?\)?\s+=\s+(.+)")
p_id    = re.compile("^\sIdentities\s+=\s+(\d+\/\d+)\s+\(\d+%\),\s 
+Positives\s+=\s+(\d+
\/\d+)\s+\(\d+%\).*")

This could as correctly be written with the faster and easier to  
understand

   words = line.split()
   if words[0] == "Score":
     bit = words[2]
     e = words[-1]
   elif words[0] == "Identities":
     idtt = words[2]
     pos = words[8]  # again, this is a guess
     print "%s,%s,%s,%s,%s" % (name, bit, e, idtt, pos)

Now, you might respond "Andrew, it doesn't verify that the line  
matches correctly," but as I pointed out, other bits of the code  
assumes the format is correct.  I could add a couple of more tests to  
make sure that there's no accidental confusion with other lines in  
the file.

Then again, even these regular expressions assume the input is only  
somewhat correct, and something like:

   Score = 123 bits (456), Expect(789) = 123

The pattern will also accept an ill-formatted line like

   Score = 123 there is extra junk here bits (456), Expect(789) = 123

with $1 = "123 there is extra junk here".

This is due to the (.+) in the pattern match.  Which ends up doing a  
lot of back tracking because it greedily matches to the end, then  
backs up until it finds each space.  It really should be (\S+).

Grrr.

> The methodology of this paper was a complete disgrace and lacked any
> scientific objectivity.  If they actually wanted to be somewhat
> objective they should have found people who are adept in each of those
> languages and told them to write the fastest code they could.

Hear hear.  It's an "I can write Perl in language X" paper.

This paper should have been rejected, or sent back for massive  
cleanup, by the reviewers.

				Andrew
				dalke at dalkescientific.com