[bip] Bioinformatics Programming Language Shootout, Python performance poopoo'd
Andrew Dalke
dalke at dalkescientific.com
Tue Feb 5 15:18:23 PST 2008
On Feb 5, 2008, at 11:36 PM, Cory Tobin wrote:
> One example, on line 54 of parse.py he compiles a
> regular expression inside of a loop. Placing the re.compile() before
> the loop could save them plenty of CPU cycles.
The performance is going to suffer, but because of the re cache it's
a couple of function call overheads and a bit more, and not a full re-
eval. More importantly, that regexp isn't even needed. As far as I
can tell, it's better to use a strip()
> I got a good chuckle on line 45 of parse.py, they use the deprecated
> string module to remove the newline character rather than the strip()
> function.
The main loop of that file starts
while 1:
line = f.readline()
if not line:
break
line = string.replace(line,'\n','')
if re.match('>', line):
name=line.replace('>','',1)
while 1:
line = f.readline()
if re.match("\s+Length", line):
break
p = re.compile("^\s+(\S.+)$")
line = string.replace(line,'\n','')
m = p.match(line)
if m:
name+=" "+m.group(1)
name=name.replace(",", ".")
continue
The more idiomatic approach, assuming I understand the logic
correctly and assuming the file format is always correct (note the
written code has an infinite loop if the "Length" is never found), is
f = iter(f)
for line in f:
if line[:1] == ">":
name = line[1:-1]
for line in f:
if line.startswith(" Length"): # assuming that \s+ is a
fixed length
break
# O(n**2) operation, but n is nearly always less than 4
name += " " + line.lstrip()
name = name.replace(",", ".")
continue
That's followed by
#Score
m_score=p_score.match(line)
if m_score:
bit=m_score.group(1)
e=m_score.group(2)
continue
#Identities
m_id=p_id.match(line)
if m_id:
idtt=m_id.group(1)
pos=m_id.group(2)
print name+","+bit+","+e+","+idtt+","+pos+"\n"
where
p_score = re.compile("^\sScore\s+=\s+(.+)\s+bits\s+\(\d+\),\s+Expect\
(?\d?\)?\s+=\s+(.+)")
p_id = re.compile("^\sIdentities\s+=\s+(\d+\/\d+)\s+\(\d+%\),\s
+Positives\s+=\s+(\d+
\/\d+)\s+\(\d+%\).*")
This could as correctly be written with the faster and easier to
understand
words = line.split()
if words[0] == "Score":
bit = words[2]
e = words[-1]
elif words[0] == "Identities":
idtt = words[2]
pos = words[8] # again, this is a guess
print "%s,%s,%s,%s,%s" % (name, bit, e, idtt, pos)
Now, you might respond "Andrew, it doesn't verify that the line
matches correctly," but as I pointed out, other bits of the code
assumes the format is correct. I could add a couple of more tests to
make sure that there's no accidental confusion with other lines in
the file.
Then again, even these regular expressions assume the input is only
somewhat correct, and something like:
Score = 123 bits (456), Expect(789) = 123
The pattern will also accept an ill-formatted line like
Score = 123 there is extra junk here bits (456), Expect(789) = 123
with $1 = "123 there is extra junk here".
This is due to the (.+) in the pattern match. Which ends up doing a
lot of back tracking because it greedily matches to the end, then
backs up until it finds each space. It really should be (\S+).
Grrr.
> The methodology of this paper was a complete disgrace and lacked any
> scientific objectivity. If they actually wanted to be somewhat
> objective they should have found people who are adept in each of those
> languages and told them to write the fastest code they could.
Hear hear. It's an "I can write Perl in language X" paper.
This paper should have been rejected, or sent back for massive
cleanup, by the reviewers.
Andrew
dalke at dalkescientific.com
More information about the biology-in-python
mailing list