[bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd
Paulo Nuin
nuin at genedrift.org
Tue Feb 5 15:38:19 PST 2008
I will comment on my blog and, if permitted, I would like to include
everyone's views about it.
I have found some other "strange" stuff too, but I can't say I am enough
of a Python expert for my opinion to carry much weight.
Cheers
Paulo
Andrew Dalke wrote:
> On Feb 5, 2008, at 11:36 PM, Cory Tobin wrote:
>
>> One example, on line 54 of parse.py he compiles a
>> regular expression inside of a loop. Placing the re.compile() before
>> the loop could save them plenty of CPU cycles.
>>
>
> The performance is going to suffer, but because of the re cache the
> cost is a couple of function call overheads and a bit more, not a full
> re-evaluation of the pattern. More importantly, that regexp isn't even
> needed. As far as I can tell, it's better to use strip().
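
A minimal sketch of the point above, written for Python 3 rather than the Python 2 of the thread. The function names and the sample line are made up for illustration. The re module keeps recently compiled patterns in an internal cache, so calling re.compile() on every iteration costs a lookup rather than a recompilation, and strip() avoids the regex altogether.

```python
import re

PATTERN = r"^\s+(\S.+)$"  # the continuation-line pattern from the quoted loop


def continuation_regex(line):
    # Compiles on every call, as the criticized loop does. After the first
    # call the pattern comes out of the re module's cache.
    m = re.compile(PATTERN).match(line.replace("\n", ""))
    return m.group(1) if m else None


def continuation_strip(line):
    # Regex-free alternative: drops surrounding whitespace, newline included.
    return line.strip()


sample = "     precursor [Homo sapiens]\n"  # hypothetical continuation line
print(continuation_regex(sample))
print(continuation_strip(sample))
```

The two are not strictly equivalent: the regex returns nothing for a line without leading whitespace, while strip() returns the text either way.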
>
>
>
>> I got a good chuckle on line 45 of parse.py, they use the deprecated
>> string module to remove the newline character rather than the strip()
>> function.
>>
>
>
> The main loop of that file starts
>
> while 1:
>     line = f.readline()
>
>     if not line:
>         break
>     line = string.replace(line,'\n','')
>
>     if re.match('>', line):
>         name=line.replace('>','',1)
>
>         while 1:
>             line = f.readline()
>             if re.match("\s+Length", line):
>                 break
>             p = re.compile("^\s+(\S.+)$")
>             line = string.replace(line,'\n','')
>             m = p.match(line)
>             if m:
>                 name+=" "+m.group(1)
>
>         name=name.replace(",", ".")
>         continue
>
>
> The more idiomatic approach, assuming I understand the logic
> correctly and assuming the file format is always correct (note the
> written code has an infinite loop if the "Length" is never found), is
>
> f = iter(f)
>
> for line in f:
>     if line[:1] == ">":
>         name = line[1:-1]
>         for line in f:
>             if line.startswith(" Length"):  # assuming that \s+ is a fixed length
>                 break
>             # O(n**2) operation, but n is nearly always less than 4
>             # strip() rather than lstrip(), so the trailing newline goes too
>             name += " " + line.strip()
>         name = name.replace(",", ".")
>         continue
>
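
For anyone who wants to run the loop above, here is a self-contained Python 3 sketch under a few assumptions: the sample record is invented, read_names is a hypothetical helper, and the "Length" test ignores leading whitespace instead of assuming a fixed width. Because the inner loop is a for loop over the same iterator, it simply stops at end of file if "Length" never appears, instead of spinning forever.

```python
import io

# Invented BLAST-like header, only to exercise the loop.
SAMPLE = (
    ">gi|0000|ref|XX_0000.1| hypothetical protein, first part\n"
    "           second part of the description\n"
    "          Length = 321\n"
)


def read_names(handle):
    """Collect one description per '>' record, joining continuation lines."""
    names = []
    lines = iter(handle)
    for line in lines:
        if line.startswith(">"):
            parts = [line[1:].strip()]
            # Shares the iterator with the outer loop, so it resumes where
            # the outer loop left off and ends cleanly at end of file.
            for line in lines:
                if line.lstrip().startswith("Length"):
                    break
                parts.append(line.strip())
            # join() sidesteps the repeated string concatenation.
            names.append(" ".join(parts).replace(",", "."))
    return names


print(read_names(io.StringIO(SAMPLE)))
```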
>
>
> That's followed by
>
>     #Score
>     m_score=p_score.match(line)
>     if m_score:
>         bit=m_score.group(1)
>         e=m_score.group(2)
>         continue
>
>     #Identities
>     m_id=p_id.match(line)
>     if m_id:
>         idtt=m_id.group(1)
>         pos=m_id.group(2)
>         print name+","+bit+","+e+","+idtt+","+pos+"\n"
>
>
>
>
> where
>     p_score = re.compile("^\sScore\s+=\s+(.+)\s+bits\s+\(\d+\),\s+Expect\(?\d?\)?\s+=\s+(.+)")
>     p_id = re.compile("^\sIdentities\s+=\s+(\d+\/\d+)\s+\(\d+%\),\s+Positives\s+=\s+(\d+\/\d+)\s+\(\d+%\).*")
>
> This could as correctly be written with the faster and easier to
> understand
>
>     words = line.split()
>     if words[0] == "Score":
>         bit = words[2]
>         e = words[-1]
>     elif words[0] == "Identities":
>         idtt = words[2]
>         pos = words[6]  # again, this is a guess; index 6 follows the p_id pattern above
>         print "%s,%s,%s,%s,%s" % (name, bit, e, idtt, pos)
>
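
A runnable Python 3 sketch of the split-based version, with assumptions stated up front: the sample lines are invented to match the layout the p_score and p_id patterns expect, parse_hit_lines and the placeholder name are hypothetical, and the guard for blank lines is an addition (an empty split() result has no words[0]). With that layout the "Positives" fraction is the seventh token, hence index 6.

```python
def parse_hit_lines(lines, name="hypothetical hit"):
    """Return one CSV row per Identities line, using the latest Score line."""
    bit = e = None
    rows = []
    for line in lines:
        words = line.split()
        if not words:  # blank line: nothing to index
            continue
        if words[0] == "Score":
            bit, e = words[2], words[-1]
        elif words[0] == "Identities":
            idtt, pos = words[2], words[6]
            rows.append("%s,%s,%s,%s,%s" % (name, bit, e, idtt, pos))
    return rows


# Invented lines in the shape the two regular expressions describe.
SAMPLE_LINES = [
    " Score = 87.4 bits (215), Expect = 2e-17\n",
    " Identities = 45/120 (37%), Positives = 67/120 (55%)\n",
    "\n",
]

for row in parse_hit_lines(SAMPLE_LINES):
    print(row)
```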
>
> Now, you might respond "Andrew, it doesn't verify that the line
> matches correctly," but as I pointed out, other bits of the code
> assume the format is correct. I could add a couple more tests to
> make sure that there's no accidental confusion with other lines in
> the file.
>
> Then again, even these regular expressions only check that the input
> is somewhat correct. They are written for something like:
>
> Score = 123 bits (456), Expect(789) = 123
>
> The pattern will also accept an ill-formatted line like
>
> Score = 123 there is extra junk here bits (456), Expect(789) = 123
>
> with $1 = "123 there is extra junk here".
>
> This is due to the (.+) in the pattern match, which ends up doing a
> lot of backtracking because it greedily matches to the end, then
> backs up until it finds each space. It really should be (\S+).
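
The difference is easy to check with a short Python 3 sketch. GREEDY is the p_score pattern as quoted. STRICT is a hypothetical variant with both (.+) groups replaced by (\S+). The sample lines are invented, carry the leading space the pattern requires, and use "Expect = ..." without parentheses, because \d? allows at most one digit between them.

```python
import re

# The p_score pattern as quoted, and a variant using (\S+) for both groups.
GREEDY = re.compile(
    r"^\sScore\s+=\s+(.+)\s+bits\s+\(\d+\),\s+Expect\(?\d?\)?\s+=\s+(.+)"
)
STRICT = re.compile(
    r"^\sScore\s+=\s+(\S+)\s+bits\s+\(\d+\),\s+Expect\(?\d?\)?\s+=\s+(\S+)"
)

good = " Score = 123 bits (456), Expect = 1e-10"
junk = " Score = 123 there is extra junk here bits (456), Expect = 1e-10"

# (.+) swallows everything up to the last " bits", junk included.
print(GREEDY.match(junk).group(1))

# (\S+) cannot cross the spaces, so the malformed line is rejected.
print(STRICT.match(junk))

# Both patterns agree on the well-formed line.
print(GREEDY.match(good).groups(), STRICT.match(good).groups())
```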
>
> Grrr.
>
>
>> The methodology of this paper was a complete disgrace and lacked any
>> scientific objectivity. If they actually wanted to be somewhat
>> objective they should have found people who are adept in each of those
>> languages and told them to write the fastest code they could.
>>
>
>
> Hear hear. It's an "I can write Perl in language X" paper.
>
> This paper should have been rejected, or sent back for massive
> cleanup, by the reviewers.
>
> Andrew
> dalke at dalkescientific.com
>
>
>
> _______________________________________________
> biology-in-python mailing list - bip at lists.idyll.org.
>
> See http://bio.scipy.org/ for our Wiki.
>