[bip] agile software development
Andrew Dalke
dalke at dalkescientific.com
Tue Jul 31 19:36:24 PDT 2007
On Jul 31, 2007, at 6:22 PM, Titus Brown wrote:
> I like this idea.
>
> Chris Lee and I are planning to bring some code to the biology BoF at
> SciPy. I don't know that we'll have time for much of a review, but
> it's
> a good idea for the next SciPy.
I don't know about SciPy the conference, but SciPy the
workshop (the year before the first conference) didn't have
many bio people. But starting small is perhaps best.
> -> > See http://genomebiology.com/2007/8/2/103 for one significant
> -> > example of computational science gone awry.
with commentary in
> http://pyre.third-bit.com/blog/archives/877.html
Unlike some of the comments, I don't think that will change
much of anything. Mistakes happen. Sign errors happen.
Some years back I implemented:
[KAB76] Kabsch, W.. A Solution for the Best Rotation to Relate Two
Sets of Vectors,
Acta Cryst. vol. A32. 1976, pp. 922-923.
It has a sign error. It would sometimes return the mirror image.
The corrected algorithm is in:
[Kabsch, 1978] Kabsch, W. (1978). A discussion of the solution
for the best rotation to related two sets of vectors. Acta.
Crystal,
34A:827-828.
It wasn't even in code - it was in the math.
Aren't papers supposed to be one step in the process, with
the next step being that other people try to reproduce or
otherwise verify the result? Peer review is only one part
of the self-correcting mechanism, yes?
Because of the previously mentioned lack of interest in
doing code reviews, do you think that if the code had been
available (along with everyone else's code) then the bug
would have been found earlier?
Linus's Law, via Eric S. Raymond:
given enough eyeballs, all bugs are shallow
There aren't enough interested eyeballs in this field to
review all of the source code that's made.
> You might also be interested in the Software Carpentry stuff that Greg
> Wilson and others (me included) are pushing. http://
> www.swc.scipy.org/
Well, I am part of the PSF, which helped fund that ;)
It's ... not at the right level for what I'm talking
about. I'm trying to come up with a better description than that.
There's nothing there on state machines. The closest are
regular expressions. There's nothing on parsing with tools
like Ply, lex/yacc, etc. There's nothing on estimating
algorithm runtime performance, nor measuring and tuning
a programs performance. How to support plugins. Nothing on
HCI, including nothing on API design.
(Nice video on that last topic, titled
How to Design a Good API and Why it Matters
at http://video.google.com/videoplay?docid=-3733345136856180693 )
The Software Carpentry work is useful. I and others have
pointed students over to it. But my dismay is more that
these interesting, useful, and relevant topics are not more
widely known by biology and chemistry developers.
> -> I think the reference to "waterfall design", when brought up
> -> in the various pro-agile/xp debates, is a straw man argument.
> -> Very few people use that approach.
>
> ... you would be surprised how many people are *taught* that, and how
> many teachers will *defend* it. It's not a straw man to them ;)
Sorry, it's my lack of experience in the Big Wide World showing
through here. In the world of scientific software, and in Python
software, there's very few people using that approach.
My knowledge of the rest of the world comes through reading, as
in things like:
http://www.idinews.com/waterfall.html
There's no such thing as the Waterfall Approach!
(and there never was)
The Wikipedia page claims that most people use "modified waterfall
models."
Reading "The Language Log"
http://itre.cis.upenn.edu/~myl/languagelog/
I read over and over about the different ways that various
people assert rules and guidelines about the English language,
such as "no split infinitives" and "no passive tense/voice".
The authors take a firm stance against "The Elements of Style",
for exampe, "Strunk and White's poisonous little collection of
bad grammatical advice"; see
http://itre.cis.upenn.edu/%7Emyl/languagelog/archives/000994.html
The idea that "once upon a time everyone did the waterfall
model" is an easy one to believe, and most CS teachers and
software developers are not trained in software engineering
or even have a background in software project management or
methodologies.
I read "Mythical Man Month". That project didn't use the
waterfall model.
My deepest knowledge about waterfall model comes from McConnell's
"Rapid Development", which pretty much says "don't use it".
Note that the Wikipedia page does not link to any projects
which actually used waterfall (compared with more iterative
variations of waterfall), and the page they make after
"upon many large government projects"
goes to archive.org copy of an NASA site saying
The standard waterfall model is associated with the
failure or cancellation of a number of large systems.
It can also be very expensive.
So until I have a few pointers to actual projects done
using waterfall, I'm going to be very skeptical about
the claim that real projects are done using "the standard
waterfall model."
> -> Have you come across the term "quality without a name", usually
> -> written "QWAN"? It has unfortunate Zen connotations. I think
> -> of it in part as a way to say "but I'm not naming any buzzword."
>
> Yes -- see Richard Gabriel's _Patterns of Software_,
>
> http://www.dreamsongs.com/Files/PatternsOfSoftware.pdf
>
> for an excellent discussion of software development in general.
That book was where I first learned the term. I didn't know
it was downloadable now - I bought the book. His commentary
on Lisp and the "New Jersey" school of thought was also thought
provoking.
I think most of the discussion about the QWAN is too mystical.
Search for: qwan Quality Without A Name
and the description in the first hit uses "ineffable" and
in the second "expressing a oneness".
When I was in grad school, trying to become a physicist,
my professor pointed out that my problem with a certain
assignment was that I "wasn't thinking like a physicist."
Indeed, I was thinking like a mathematician.
How does a baker know how long to knead bread?
How does an artist know how to sketch a line to capture
the essence of a shape?
How does a batter know when to swing?
Lots of practice, and review, and feedback.
Q: How do you loose weight and get in shape?
A #1: The Atkin's diet (or any of a bajillion other buzzwords)
A #2: Eat less, eat healthy foods, and exercise.
But there's no buzzword for the latter.
> Are you familiar with pygr,
>
> http://bioinfo.mbi.ucla.edu/pygr
>
> ?
You mentioned it on your blog a year or two back. I looked
at it then. It wasn't useful for what I'm doing in chemistry,
and I noticed enough code that suggested it was written by
a novice programmer that I decided to go no further.
Graphs are like strings in that you think they are simple
until you get into the details. Strings are simple? See
http://www.and.org/vstr/comparison
although it isn't working for me right now. For molecules
I want various atom/bond properties, I want substructure
searching (via graph isomorphism), and a few other things.
Looking at it again now:
It unzips to a non-versioned directory 'pygr' instead of
pygr-0.7-beta1 or somesuch
The setup.py file is
v1, v2 = sys.version_info[:2]
if v1 < 2:
raise 'pygr does not support python 1.x'
if v1 == 2 and v2 < 2:
raise 'pygr does not support python2.1 or earlier version'
while version_info was only added in Python 2.0, and
string exceptions were deprecated in Python 1.5. Which
was an awful long time ago.
The code I remember seeing last time was something like
os.system("rm %s" % filename)
The closest I could find now was:
def repeat_mask(seq,progname='RepeatMasker -xsmall',opts=''):
'Run RepeatMasker on a sequence, return lowercase-masked string'
temppath=os.tempnam()
ofile=file(temppath,'w')
write_fasta(ofile,seq)
ofile.close()
cmd=progname+' '+opts+' '+temppath
if os.system(cmd)!=0:
raise OSError('command %s failed' % cmd)
ofile=file(temppath+'.masked')
for id,title,seq_masked in read_fasta(ofile):
break # JUST READ ONE SEQUENCE
ofile.close()
cmd='rm -f %s %s.*' % (temppath,temppath)
if os.system(cmd)!=0:
raise OSError('command '+cmd+' failed')
return seq_masked
This should use the subprocess module, and use
shutil.rmtree instead of that os.system() call.
It also mans the README should change
In theory, pygr should work on any platform that adequately
supports python.
to
In theory, pygr should work on any sufficiently Unix-like platform
that
adequately supports python.
This
exit_status=os.system('cp %s/pygr/cgraph.c %s/pygr/cgraph.h %
s/pygr/cdict.pxd .'
% (pygrpath,pygrpath,pygrpath))
should use shutil.copy
if not os.access(self.filepath+'.nsd',os.R_OK) \
and not os.access(self.filepath+'.psd',os.R_OK) \
and not os.access(self.filepath+'.00.nsd',os.R_OK) \
and not os.access(self.filepath+'.00.psd',os.R_OK):
is probably better using os.path.exists instead of os.access.
if not os.access(self.filepath+'.nsd',os.R_OK) \
and not os.access(self.filepath+'.psd',os.R_OK) \
and not os.access(self.filepath+'.00.nsd',os.R_OK) \
and not os.access(self.filepath+'.00.psd',os.R_OK):
Interesting. The code from earlier, and a few other places
in the code base, use
ofile=file(temppath+'.masked')
for id,title,seq_masked in read_fasta(ofile):
break # JUST READ ONE SEQUENCE
Turns out read_fasta has a 2nd argument, 'onlyReadOneLine'.
I would have made two different functions, like
"read_first_fasta". Oh, wait, I'm mistaken. It means
to only read the first line of the sequence in a FASTA record.
There should be a function that works like:
id, title, seq_masked = read_first_fasta_record(ofile)
And the file variable name should be 'ifile' and not 'ofile',
since it appears that those are normally used for "input file"
and "output file".
Here's the code
def read_fasta(ifile,onlyReadOneLine=False):
"Get one sequence at a time from stream ifile"
id=None
title=''
seq=''
for line in ifile:
if '>'==line[0]:
if id!=None and len(seq)>0:
yield id,title,seq
seq = ''
id=line[1:].split()[0]
title=line[len(id)+2:]
elif id!=None: # READ SEQUENCE
for word in line.split(): # GET RID OF WHITESPACE
seq += word
if onlyReadOneLine and len(seq)>0:
yield id,title,seq
if id!=None and len(seq)>0:
yield id,title,seq
Some comments
def read_fasta(ifile,onlyReadOneLine=False):
"Get one sequence at a time from stream ifile"
id=None
title=''
seq=''
Following PEP 8's recommendations:
id = None
title = ''
seq = ''
(BTW, I prefer "" because they are easier to see.)
for line in ifile:
if '>'==line[0]:
if id!=None and len(seq)>0:
This should be:
if id is not None and seq:
yield id,title,seq
seq = ''
id=line[1:].split()[0]
title=line[len(id)+2:]
I've been trying to figure out if this is clever or strange.
Here are possible problems:
- there's an IndexError if no title exists
- are the spaces after the first space between the id and
the title supposed to be in the title
- is the possible trailing newline supposed to be in the title?
Perhaps the rather inelegant
fields = line[1:].rstrip().split(None, 1)
if not fields:
id = title = ""
elif len(fields) == 1:
id = fields[0]; title = ""
else:
id, title = fields
elif id!=None: # READ SEQUENCE
use "is not None" rather than "!= None"
for word in line.split(): # GET RID OF WHITESPACE
seq += word
Thankfully this is an O(N) amortized cost in modern Pythons.
It used to be O(N**2). Better is
seq = []
...
seq.extend(line.split()) # GET RID OF WHITESPACE
...
yield id, title, "".join(seq)
if onlyReadOneLine and len(seq)>0:
yield id,title,seq
There's some ugliness as it parses all the fields in a
sequence even when onlyReadOneLine is enable. Try ...
No, don't try. I don't know what onlyReadOneLine is supposed
to do. When is this flag, which produces this output, useful?
>>> list(read_fasta([">Hello there\n", "ATCG\n"]))
[('Hello', 'there\n', 'ATCG')]
>>> list(read_fasta([">Hello there\n", "ATCG\n", "AAAA\n"]))
[('Hello', 'there\n', 'ATCGAAAA')]
>>> list(read_fasta([">Hello there\n", "ATCG\n", "AAAA\n"], 1))
[('Hello', 'there\n', 'ATCG'), ('Hello', 'there\n', 'ATCGAAAA'),
('Hello', 'there\n', 'ATCGAAAA')]
>>>
>>> list(read_fasta([">Hello there\n", "ATCG\n", "AAAA\n", "\n",
">Bye\n", "A\n"], 1))
[('Hello', 'there\n', 'ATCG'), ('Hello', 'there\n', 'ATCGAAAA'),
('Hello', 'there\n', 'ATCGAAAA'), ('Hello', 'there\n', 'ATCGAAAA'),
('Bye', '', 'A'), ('Bye', '', 'A')]
>>>
I moved over to graphquery.py
def __iter__(self,k):
for k,v in self.iteritems():
yield k
def iteritems(self):
for dataNode,queryNode in self.compiler.dataMatch.items():
yield queryNode,dataNode # RETURN NODE MAPPINGS
for i in range(self.compiler.n): # ALSO SAVE MAPPINGS TO
DATA EDGES
gqi=self.compiler.gqi[i] # RETURN EDGE MAPPINGS
yield (gqi.fromNode,gqi.queryNode),self.compiler.dataEdge
[i]
def items(self):
return [x for x in self.iteritems()]
is more succinctly written as
def __iter__(self,k):
return self.iteritems()
def iteritems(self):
for item in self.compiler.dataMatch.items():
yield item # Return node mappings
for i, gqi in enumerate(self.compiler.gqi):
yield (gqi.fromNode, gqi.queryNode),
self.compiler.dataEdge[i]
def items(self):
return self.items()
There's an eval
def methodFactory(methodList,methodStr,localDict):
for methodName in methodList:
localDict[methodName]=eval(methodStr%methodName)
used like this
classutil.methodFactory(['__contains__'],'lambda self,obj:self.d.
%s(obj.id)',
locals())
I think it should be the rather simpler:
def __contains__(self, obj):
return obj.id in self.d
???
These
def __iter__(self):
for node in self.d:
yield self.unpack_source(node)
def keys(self): return [k for k in self]
should be
def __iter__(self):
return iter(self.unpack_source(node))
def keys(self):
list(self)
This is without knowing what the code does on
the overall level. I might have other comments
if I knew more about it.
Andrew
dalke at dalkescientific.com
More information about the biology-in-python
mailing list