[bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd

Tue Feb 5 15:38:19 PST 2008

I will comment on my blog and if permitted I would like to include 
everyone's view about it.

I have found some other "strange" stuff too, but I can't say I am enough 
of a Python expert to weigh in much.

Cheers

Paulo

Andrew Dalke wrote:
> On Feb 5, 2008, at 11:36 PM, Cory Tobin wrote:
>   
>> One example, on line 54 of parse.py he compiles a
>> regular expression inside of a loop.  Placing the re.compile() before
>> the loop could save them plenty of CPU cycles.
>>     
>
> Performance will suffer, but because of the re module's cache the
> cost is a couple of function calls plus a bit more, not a full
> recompilation.  More importantly, that regexp isn't even needed.
> As far as I can tell, it's better to use strip().
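
A minimal sketch of the two variants being compared, written for Python 3. The function names and sample lines are made up for illustration; only the pattern comes from the quoted loop. Both functions return the same matches. The difference is that the first pays for a re.compile() call (a cache lookup in CPython once the pattern has been seen) on every iteration.

```python
import re

PATTERN = r"^\s+(\S.+)$"  # the continuation-line pattern from the quoted loop


def compile_inside_loop(lines):
    """Call re.compile() on every iteration, as the quoted code does."""
    found = []
    for line in lines:
        p = re.compile(PATTERN)  # after the first call this is a cache lookup
        m = p.match(line)
        if m:
            found.append(m.group(1))
    return found


def compile_once(lines):
    """Hoist re.compile() out of the loop, as suggested above."""
    p = re.compile(PATTERN)
    found = []
    for line in lines:
        m = p.match(line)
        if m:
            found.append(m.group(1))
    return found


# Hypothetical sample input, not taken from a real BLAST report.
sample = ["   first continuation\n", "no leading space\n", "\tsecond one\n"]
```

Timing the two with timeit on real input would show how large the per-call overhead actually is; no figures are claimed here.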
>
>
>   
>> I got a good chuckle on line 45 of parse.py, they use the deprecated
>> string module to remove the newline character rather than the strip()
>> function.
>>     
>
>
> The main loop of that file starts
>
> while 1:
>          line = f.readline()
>
>          if not line:
>                  break
>          line = string.replace(line,'\n','')
>
>          if re.match('>', line):
>                  name=line.replace('>','',1)
>
>                  while 1:
>                          line = f.readline()
>                          if re.match("\s+Length", line):
>                                  break
>                          p = re.compile("^\s+(\S.+)$")
>                          line = string.replace(line,'\n','')
>                          m = p.match(line)
>                          if m:
>                                  name+=" "+m.group(1)
>
>                  name=name.replace(",", ".")
>                  continue
>
>
> The more idiomatic approach, assuming I understand the logic  
> correctly and assuming the file format is always correct (note the  
> written code has an infinite loop if the "Length" is never found), is
>
> f = iter(f)
>
> for line in f:
>    if line[:1] == ">":
>      name = line[1:-1]
>      for line in f:
>        # assuming that \s+ is a fixed length
>        if line.startswith("    Length"):
>          break
>        # O(n**2) operation, but n is nearly always less than 4
>        # strip() also drops the trailing newline, which lstrip() would keep
>        name += " " + line.strip()
>      name = name.replace(",", ".")
>      continue
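
For anyone who wants to try the iterator approach on its own, here is a self-contained Python 3 sketch. The function name iter_names and the sample text are hypothetical, and two choices are assumptions rather than anything stated in the thread: blank lines are skipped, and "Length" is accepted after any amount of leading whitespace (mirroring the original \s+Length test). Because the inner loop is a plain for loop, it ends at end of file instead of spinning forever when "Length" is missing.

```python
def iter_names(handle):
    """Yield one joined, comma-free description per '>' record."""
    lines = iter(handle)
    for line in lines:
        if not line.startswith(">"):
            continue
        parts = [line[1:].strip()]
        # Continue consuming from the same iterator until the Length line.
        for line in lines:
            if line[:1].isspace() and line.lstrip().startswith("Length"):
                break
            if line.strip():  # assumption: ignore blank lines
                parts.append(line.strip())
        yield " ".join(parts).replace(",", ".")


# Hypothetical input shaped like the header block of a BLAST hit.
SAMPLE = [
    ">gi|1 first, part\n",
    "   second part\n",
    "          Length = 100\n",
    "\n",
    ">gi|2 other\n",
    "   Length = 5\n",
]
```

Calling list(iter_names(SAMPLE)) collects the names; passing an open file object works the same way, since files iterate line by line.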
>
>
>
> That's followed by
>
>    #Score
>          m_score=p_score.match(line)
>          if m_score:
>                  bit=m_score.group(1)
>                  e=m_score.group(2)
>                  continue
>
>    #Identities
>          m_id=p_id.match(line)
>          if m_id:
>                  idtt=m_id.group(1)
>                  pos=m_id.group(2)
>                  print name+","+bit+","+e+","+idtt+","+pos+"\n"
>
>
>
>
> where
> p_score = re.compile("^\sScore\s+=\s+(.+)\s+bits\s+\(\d+\),\s+Expect\(?\d?\)?\s+=\s+(.+)")
> p_id    = re.compile("^\sIdentities\s+=\s+(\d+\/\d+)\s+\(\d+%\),\s+Positives\s+=\s+(\d+\/\d+)\s+\(\d+%\).*")
>
> This could as correctly be written with the faster and easier to  
> understand
>
>    words = line.split()
>    if not words:
>      continue  # skip blank lines
>    if words[0] == "Score":
>      bit = words[2]
>      e = words[-1]
>    elif words[0] == "Identities":
>      idtt = words[2]
>      pos = words[6]  # seventh token, per the p_id pattern above
>      print("%s,%s,%s,%s,%s" % (name, bit, e, idtt, pos))
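
A runnable Python 3 sketch of the split-based idea, under stated assumptions: the function name parse_hits, the fixed name argument, and the sample lines are all hypothetical, and the token positions are read off the two patterns quoted above rather than checked against real BLAST output. It collects rows instead of printing, and adds length checks so that a short or out-of-order line is skipped rather than raising an exception.

```python
def parse_hits(lines, name="query"):
    """Return 'name,bit,e,identities,positives' rows from Score/Identities lines."""
    rows = []
    bit = e = None
    for line in lines:
        words = line.split()
        if not words:
            continue  # blank line
        if words[0] == "Score" and len(words) >= 3:
            bit, e = words[2], words[-1]
        elif words[0] == "Identities" and len(words) >= 7 and bit is not None:
            # Tokens: Identities = N/N (N%), Positives = N/N (N%)
            rows.append(",".join([name, bit, e, words[2], words[6]]))
    return rows


# Hypothetical lines shaped to fit the quoted patterns.
HIT = [
    " Score = 123 bits (456), Expect = 1e-20\n",
    " Identities = 20/30 (66%), Positives = 25/30 (83%)\n",
]
```

Here parse_hits(HIT) gives a single row, "query,123,1e-20,20/30,25/30".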
>
>
> Now, you might respond "Andrew, it doesn't verify that the line
> matches correctly," but as I pointed out, other bits of the code
> assume the format is correct.  I could add a couple more tests to
> make sure that there's no accidental confusion with other lines in
> the file.
>
> Then again, even these regular expressions check the input only
> loosely.  Besides a well-formed line like
>
>    Score = 123 bits (456), Expect(7) = 123
>
> the pattern will also accept an ill-formatted line like
>
>    Score = 123 there is extra junk here bits (456), Expect(7) = 123
>
> with $1 = "123 there is extra junk here".
>
> This is due to the (.+) in the pattern, which ends up doing a lot
> of backtracking because it greedily matches to the end of the line,
> then backs up a character at a time until the rest of the pattern
> matches.  It really should be (\S+).
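
The difference is easy to check directly. This Python 3 sketch compares the quoted Score pattern with a variant where both (.+) groups become (\S+). The names LOOSE, STRICT, good and junk are made up for illustration. The sample lines use a single digit inside Expect(...) because the \d? in the pattern allows at most one digit there.

```python
import re

# The quoted pattern, with greedy (.+) groups.
LOOSE = re.compile(
    r"^\sScore\s+=\s+(.+)\s+bits\s+\(\d+\),\s+Expect\(?\d?\)?\s+=\s+(.+)"
)
# Same pattern with (\S+), so a captured value cannot contain whitespace.
STRICT = re.compile(
    r"^\sScore\s+=\s+(\S+)\s+bits\s+\(\d+\),\s+Expect\(?\d?\)?\s+=\s+(\S+)"
)

good = " Score = 123 bits (456), Expect(7) = 123"
junk = " Score = 123 there is extra junk here bits (456), Expect(7) = 123"

loose_good = LOOSE.match(good).group(1)   # "123"
loose_junk = LOOSE.match(junk).group(1)   # "123 there is extra junk here"
strict_good = STRICT.match(good).groups() # ("123", "123")
strict_junk = STRICT.match(junk)          # None: the junk line is rejected
```

With (\S+) the engine also has far fewer positions to back up through, since each group stops at the first whitespace character.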
>
> Grrr.
>
>   
>> The methodology of this paper was a complete disgrace and lacked any
>> scientific objectivity.  If they actually wanted to be somewhat
>> objective they should have found people who are adept in each of those
>> languages and told them to write the fastest code they could.
>>     
>
>
> Hear hear.  It's an "I can write Perl in language X" paper.
>
> This paper should have been rejected, or sent back for massive  
> cleanup, by the reviewers.
>
> 				Andrew
> 				dalke at dalkescientific.com
>
>
>
> _______________________________________________
> biology-in-python mailing list - bip at lists.idyll.org.
>
> See http://bio.scipy.org/ for our Wiki. 
>