[bip] agile software development

Tue Jul 31 19:36:24 PDT 2007

On Jul 31, 2007, at 6:22 PM, Titus Brown wrote:
> I like this idea.
>
> Chris Lee and I are planning to bring some code to the biology BoF at
> SciPy.  I don't know that we'll have time for much of a review, but
> it's a good idea for the next SciPy.

I don't know about SciPy the conference, but SciPy the
workshop (the year before the first conference) didn't have
many bio people.  But starting small is perhaps best.

> -> > See http://genomebiology.com/2007/8/2/103 for one significant
> -> > example of computational science gone awry.

with commentary in

> http://pyre.third-bit.com/blog/archives/877.html

Unlike some of the comments, I don't think that will change
much of anything.  Mistakes happen.  Sign errors happen.
Some years back I implemented:

[KAB76] Kabsch, W. A Solution for the Best Rotation to Relate Two
	Sets of Vectors, Acta Cryst. vol. A32. 1976, pp. 922-923.

It has a sign error.  It would sometimes return the mirror image.
The corrected algorithm is in:

[Kabsch, 1978] Kabsch, W. (1978). A discussion of the solution
     for the best rotation to relate two sets of vectors.
     Acta Cryst., A34:827-828.

It wasn't even in code - it was in the math.

Aren't papers supposed to be one step in the process, with
the next step being that other people try to reproduce or
otherwise verify the result?  Peer review is only one part
of the self-correcting mechanism, yes?

Because of the previously mentioned lack of interest in
doing code reviews, do you think that if the code had been
available (along with everyone else's code) then the bug
would have been found earlier?

Linus's Law, via Eric S. Raymond:
   given enough eyeballs, all bugs are shallow

There aren't enough interested eyeballs in this field to
review all of the source code that's made.

> You might also be interested in the Software Carpentry stuff that Greg
> Wilson and others (me included) are pushing.
> http://www.swc.scipy.org/

Well, I am part of the PSF, which helped fund that ;)

It's ... not at the right level for what I'm talking
about.  I'm trying to come up with a better description than that.

There's nothing there on state machines.  The closest are
regular expressions.  There's nothing on parsing with tools
like Ply, lex/yacc, etc.  There's nothing on estimating
algorithm runtime performance, nor on measuring and tuning
a program's performance.  Nothing on how to support plugins.
Nothing on HCI, including nothing on API design.

(Nice video on that last topic, titled
     How to Design a Good API and Why it Matters
at http://video.google.com/videoplay?docid=-3733345136856180693 )

The Software Carpentry work is useful.  I and others have
pointed students over to it.  But my dismay is more that
these interesting, useful, and relevant topics are not more
widely known by biology and chemistry developers.

> -> I think the reference to "waterfall design", when brought up
> -> in the various pro-agile/xp debates, is a straw man argument.
> -> Very few people use that approach.
>
> ... you would be surprised how many people are *taught* that, and how
> many teachers will *defend* it.  It's not a straw man to them ;)

Sorry, it's my lack of experience in the Big Wide World showing
through here.  In the world of scientific software, and in Python
software, there are very few people using that approach.

My knowledge of the rest of the world comes through reading, as
in things like:
    http://www.idinews.com/waterfall.html

      There's no such thing as the Waterfall Approach!
              (and there never was)

The Wikipedia page claims that most people use "modified waterfall
models."

Reading "The Language Log"
   http://itre.cis.upenn.edu/~myl/languagelog/
I read over and over about the different ways that various
people assert rules and guidelines about the English language,
such as "no split infinitives" and "no passive tense/voice".
The authors take a firm stance against "The Elements of Style",
for example, "Strunk and White's poisonous little collection of
bad grammatical advice"; see
  http://itre.cis.upenn.edu/%7Emyl/languagelog/archives/000994.html

The idea that "once upon a time everyone did the waterfall
model" is an easy one to believe, and most CS teachers and
software developers are not trained in software engineering,
nor do they have a background in software project management
or methodologies.

I read "Mythical Man Month".  That project didn't use the
waterfall model.

My deepest knowledge about the waterfall model comes from McConnell's
"Rapid Development", which pretty much says "don't use it".

Note that the Wikipedia page does not link to any projects
which actually used waterfall (compared with more iterative
variations of waterfall), and the link they make after
   "upon many large government projects"
goes to an archive.org copy of a NASA site saying

    The standard waterfall model is associated with the
    failure or cancellation of a number of large systems.
    It can also be very expensive.

So until I have a few pointers to actual projects done
using waterfall, I'm going to be very skeptical about
the claim that real projects are done using "the standard
waterfall model."

> -> Have you come across the term "quality without a name", usually
> -> written "QWAN"?  It has unfortunate Zen connotations.  I think
> -> of it in part as a way to say "but I'm not naming any buzzword."
>
> Yes -- see Richard Gabriel's _Patterns of Software_,
>
> 	http://www.dreamsongs.com/Files/PatternsOfSoftware.pdf
>
> for an excellent discussion of software development in general.

That book was where I first learned the term.  I didn't know
it was downloadable now - I bought the book.  His commentary
on Lisp and the "New Jersey" school of thought was also
thought-provoking.

I think most of the discussion about the QWAN is too mystical.
Search for: qwan Quality Without A Name
and the description in the first hit uses "ineffable" and
in the second "expressing a oneness".

When I was in grad school, trying to become a physicist,
my professor pointed out that my problem with a certain
assignment was that I "wasn't thinking like a physicist."
Indeed, I was thinking like a mathematician.

How does a baker know how long to knead bread?
   How does an artist know how to sketch a line to capture
           the essence of a shape?
     How does a batter know when to swing?

Lots of practice, and review, and feedback.

Q: How do you lose weight and get in shape?

A #1: The Atkins diet (or any of a bajillion other buzzwords)

A #2: Eat less, eat healthy foods, and exercise.

But there's no buzzword for the latter.

> Are you familiar with pygr,
>
> 	http://bioinfo.mbi.ucla.edu/pygr
>
> ?

You mentioned it on your blog a year or two back.  I looked
at it then. It wasn't useful for what I'm doing in chemistry,
and I noticed enough code that suggested it was written by
a novice programmer that I decided to go no further.

Graphs are like strings in that you think they are simple
until you get into the details.  Strings are simple?  See
   http://www.and.org/vstr/comparison
although it isn't working for me right now.  For molecules
I want various atom/bond properties, I want substructure
searching (via graph isomorphism), and a few other things.

Looking at it again now:

It unzips to a non-versioned directory 'pygr' instead of
        pygr-0.7-beta1 or somesuch

The setup.py file is

    v1, v2 = sys.version_info[:2]
    if v1 < 2:
       raise 'pygr does not support python 1.x'
    if v1 == 2 and v2 < 2:
       raise 'pygr does not support python2.1 or earlier version'

while version_info was only added in Python 2.0 (so the first
test can never fire), and string exceptions have been discouraged
since class-based exceptions arrived in Python 1.5.  Which
was an awful long time ago.
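
For comparison, here is a minimal sketch of the same check done with a
tuple comparison and a real exception class.  The function name, its
parameters, and the message text are placeholders of mine, not anything
taken from pygr's setup.py.

```python
import sys

def require_python(minimum=(2, 2), version_info=None):
    """Raise RuntimeError if the interpreter is older than `minimum`.

    `version_info` defaults to the running interpreter; it is a
    parameter only so the check can be exercised with other values.
    """
    if version_info is None:
        version_info = sys.version_info
    # Tuples compare element by element, so (2, 1) < (2, 2) < (3, 0).
    if tuple(version_info[:2]) < tuple(minimum):
        raise RuntimeError(
            "Python %d.%d or later is required" % tuple(minimum))
    return True
```

One comparison replaces the two separate tests, and the failure is an
exception that callers can actually catch by class.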

The code I remember seeing last time was something like
    os.system("rm %s" % filename)

The closest I could find now was:

def repeat_mask(seq,progname='RepeatMasker -xsmall',opts=''):
     'Run RepeatMasker on a sequence, return lowercase-masked string'
     temppath=os.tempnam()
     ofile=file(temppath,'w')
     write_fasta(ofile,seq)
     ofile.close()
     cmd=progname+' '+opts+' '+temppath
     if os.system(cmd)!=0:
         raise OSError('command %s failed' % cmd)
     ofile=file(temppath+'.masked')
     for id,title,seq_masked in read_fasta(ofile):
         break # JUST READ ONE SEQUENCE
     ofile.close()
     cmd='rm -f %s %s.*' % (temppath,temppath)
     if os.system(cmd)!=0:
         raise OSError('command '+cmd+' failed')
     return seq_masked

This should use the subprocess module, and do its cleanup from
Python (os.remove on each file, or shutil.rmtree on a temporary
directory) instead of that os.system() call.
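
Roughly what I have in mind, as a sketch only.  The function name and
the "input.fasta" filename are made up, and the only thing I'm assuming
about the external tool is what the code above assumes: it takes the
input path as its last argument and writes its result to that path plus
".masked".

```python
import os
import shutil
import subprocess
import tempfile

def run_on_tempfile(argv, text, suffix=".masked"):
    """Write `text` to a temp file, run argv + [path], and return
    the contents of path + suffix.

    `argv` is a list, so no shell is involved and odd characters
    in filenames or options are passed through untouched.
    """
    tempdir = tempfile.mkdtemp()
    try:
        path = os.path.join(tempdir, "input.fasta")
        with open(path, "w") as outfile:
            outfile.write(text)
        # check_call raises CalledProcessError on a non-zero exit status
        subprocess.check_call(list(argv) + [path])
        with open(path + suffix) as infile:
            return infile.read()
    finally:
        # Removes the input, the output, and anything else the tool
        # left in the directory, on any platform.
        shutil.rmtree(tempdir, ignore_errors=True)
```

Putting everything in a private directory also sidesteps os.tempnam()
and the 'rm -f %s.*' glob.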

It also means the README should change
   In theory, pygr should work on any platform that adequately
   supports python.
to
   In theory, pygr should work on any sufficiently Unix-like platform
   that adequately supports python.

This
         exit_status=os.system('cp %s/pygr/cgraph.c %s/pygr/cgraph.h %s/pygr/cdict.pxd .'
                               % (pygrpath,pygrpath,pygrpath))
should use shutil.copy
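
Something along these lines, as a sketch.  The helper name and its
arguments are mine; the point is only that shutil.copy does the work
without a shell or a 'cp' program.

```python
import os
import shutil

def copy_files(source_dir, filenames, dest_dir="."):
    """Copy each named file from source_dir into dest_dir."""
    for name in filenames:
        # shutil.copy raises an exception (IOError/OSError) on failure,
        # so there is no exit status to check by hand.
        shutil.copy(os.path.join(source_dir, name), dest_dir)
```

Used for the case above it would be something like
copy_files(os.path.join(pygrpath, "pygr"), ["cgraph.c", "cgraph.h", "cdict.pxd"]).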

         if not os.access(self.filepath+'.nsd',os.R_OK) \
                and not os.access(self.filepath+'.psd',os.R_OK) \
                and not os.access(self.filepath+'.00.nsd',os.R_OK) \
                and not os.access(self.filepath+'.00.psd',os.R_OK):
is probably better using os.path.exists instead of os.access.
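
A sketch of how that test could read, with the suffixes pulled out of
the condition.  The function name is mine.  Note that os.path.exists
only says the file is there, while os.access(..., os.R_OK) also asks
whether it is readable, so this is not an exact drop-in.

```python
import os

def any_index_file_exists(filepath,
                          suffixes=(".nsd", ".psd", ".00.nsd", ".00.psd")):
    """Return True if filepath plus any of the suffixes exists."""
    return any(os.path.exists(filepath + suffix) for suffix in suffixes)
```

The original condition then becomes
"if not any_index_file_exists(self.filepath):".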

Interesting. The code from earlier, and a few other places
in the code base, use

     ofile=file(temppath+'.masked')
     for id,title,seq_masked in read_fasta(ofile):
         break # JUST READ ONE SEQUENCE

Turns out read_fasta has a 2nd argument, 'onlyReadOneLine'.
I would have made two different functions, like
"read_first_fasta".  Oh, wait, I'm mistaken.  It means
to only read the first line of the sequence in a FASTA record.

There should be a function that works like:

    id, title, seq_masked = read_first_fasta_record(ofile)

And the file variable name should be 'ifile' and not 'ofile',
since it appears that those are normally used for "input file"
and "output file".
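
A sketch of the helper I mean, written against any iterable of records
so it doesn't depend on the details of read_fasta.  The name and the
choice of ValueError are mine.

```python
def read_first_record(records):
    """Return the first item from an iterable of records.

    Raises ValueError if there are none, instead of leaving the
    caller with loop variables that were never assigned.
    """
    for record in records:
        return record
    raise ValueError("no records found")
```

With that, the loop-and-break becomes
"id, title, seq_masked = read_first_record(read_fasta(ifile))", and an
empty file gives a clear error instead of a NameError at the return.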

Here's the code

def read_fasta(ifile,onlyReadOneLine=False):
     "Get one sequence at a time from stream ifile"
     id=None
     title=''
     seq=''
     for line in ifile:
         if '>'==line[0]:
             if id!=None and len(seq)>0:
                 yield id,title,seq
                 seq = ''
             id=line[1:].split()[0]
             title=line[len(id)+2:]
         elif id!=None: # READ SEQUENCE
             for word in line.split(): # GET RID OF WHITESPACE
                 seq += word
             if onlyReadOneLine and len(seq)>0:
                 yield id,title,seq
     if id!=None and len(seq)>0:
         yield id,title,seq

Some comments

def read_fasta(ifile,onlyReadOneLine=False):
     "Get one sequence at a time from stream ifile"
     id=None
     title=''
     seq=''

Following PEP 8's recommendations:
     id = None
     title = ''
     seq = ''

(BTW, I prefer "" because they are easier to see.)

     for line in ifile:
         if '>'==line[0]:
             if id!=None and len(seq)>0:

This should be:
             if id is not None and seq:

                 yield id,title,seq
                 seq = ''
             id=line[1:].split()[0]
             title=line[len(id)+2:]

I've been trying to figure out if this is clever or strange.
Here are possible problems:
   - there's an IndexError if no id exists (a bare ">" line)
   - are the spaces after the first space between the id and
       the title supposed to be in the title?
   - is the possible trailing newline supposed to be in the title?
Perhaps the rather inelegant
             fields = line[1:].rstrip().split(None, 1)
             if not fields:
                 id = title = ""
             elif len(fields) == 1:
                 id = fields[0]; title = ""
             else:
                 id, title = fields

         elif id!=None: # READ SEQUENCE
use "is not None" rather than "!= None"

             for word in line.split(): # GET RID OF WHITESPACE
                 seq += word

Thankfully this is an O(N) amortized cost in modern Pythons.
It used to be O(N**2).  Better is
      seq = []
          ...
             seq.extend(line.split()) # GET RID OF WHITESPACE
          ...
                yield id, title, "".join(seq)

             if onlyReadOneLine and len(seq)>0:
                 yield id,title,seq

There's some ugliness as it parses all the fields in a
sequence even when onlyReadOneLine is enabled. Try ...

No, don't try.  I don't know what onlyReadOneLine is supposed
to do.  When is this flag, which produces this output, useful?

 >>> list(read_fasta([">Hello there\n", "ATCG\n"]))
[('Hello', 'there\n', 'ATCG')]
 >>> list(read_fasta([">Hello there\n", "ATCG\n", "AAAA\n"]))
[('Hello', 'there\n', 'ATCGAAAA')]
 >>> list(read_fasta([">Hello there\n", "ATCG\n", "AAAA\n"], 1))
[('Hello', 'there\n', 'ATCG'), ('Hello', 'there\n', 'ATCGAAAA'),
('Hello', 'there\n', 'ATCGAAAA')]
 >>>
 >>> list(read_fasta([">Hello there\n", "ATCG\n", "AAAA\n", "\n",
 ...                  ">Bye\n", "A\n"], 1))
[('Hello', 'there\n', 'ATCG'), ('Hello', 'there\n', 'ATCGAAAA'),
('Hello', 'there\n', 'ATCGAAAA'), ('Hello', 'there\n', 'ATCGAAAA'),
('Bye', '', 'A'), ('Bye', '', 'A')]
 >>>
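
Pulling the comments above together, here is a sketch of how I would
write the reader.  It drops the onlyReadOneLine flag, since I don't
know what it is for, and it strips the trailing newline from the
title, which is a guess on my part about the intended behavior.  The
variable name record_id is mine (it avoids shadowing the builtin id).

```python
def read_fasta(ifile):
    """Yield (id, title, sequence) for each record in a FASTA stream."""
    record_id = None
    title = ""
    seq = []
    for line in ifile:
        if line.startswith(">"):
            # Emit the previous record, if it had any sequence data
            if record_id is not None and seq:
                yield record_id, title, "".join(seq)
            seq = []
            # At most two fields: the id, then the rest as the title
            fields = line[1:].rstrip().split(None, 1)
            record_id = fields[0] if fields else ""
            title = fields[1] if len(fields) > 1 else ""
        elif record_id is not None:
            seq.extend(line.split())  # get rid of whitespace
    if record_id is not None and seq:
        yield record_id, title, "".join(seq)
```

Like the original, it skips records that have no sequence data and
ignores anything before the first ">" line.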

I moved over to graphquery.py

     def __iter__(self,k):
         for k,v in self.iteritems():
             yield k
     def iteritems(self):
         for dataNode,queryNode in self.compiler.dataMatch.items():
             yield queryNode,dataNode # RETURN NODE MAPPINGS
         for i in range(self.compiler.n): # ALSO SAVE MAPPINGS TO DATA EDGES
             gqi=self.compiler.gqi[i] # RETURN EDGE MAPPINGS
             yield (gqi.fromNode,gqi.queryNode),self.compiler.dataEdge[i]
     def items(self):
         return [x for x in self.iteritems()]

is more succinctly written as (assuming self.compiler.gqi has
exactly self.compiler.n entries)

     def __iter__(self):
         return (k for k, v in self.iteritems())
     def iteritems(self):
         for dataNode, queryNode in self.compiler.dataMatch.items():
             yield queryNode, dataNode  # Return node mappings
         for i, gqi in enumerate(self.compiler.gqi):
             # Return edge mappings
             yield (gqi.fromNode, gqi.queryNode), self.compiler.dataEdge[i]
     def items(self):
         return list(self.iteritems())

There's an eval

def methodFactory(methodList,methodStr,localDict):
     for methodName in methodList:
         localDict[methodName]=eval(methodStr%methodName)

used like this

     classutil.methodFactory(['__contains__'],
                             'lambda self,obj:self.d.%s(obj.id)',
                             locals())

I think it should be the rather simpler:
     def __contains__(self, obj):
         return obj.id in self.d
???
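
If a factory really is wanted for several forwarding methods, a
closure does the same job without eval.  This is only a sketch:
forward_to_d, Node, and IdLookup are names I made up, and I'm assuming
self.d is a dict keyed by obj.id, as the lambda above suggests.

```python
def forward_to_d(method_name):
    """Build a method that calls self.d.<method_name>(obj.id)."""
    def method(self, obj):
        return getattr(self.d, method_name)(obj.id)
    method.__name__ = method_name
    return method

class Node(object):
    # Hypothetical stand-in for an object with an 'id' attribute
    def __init__(self, id):
        self.id = id

class IdLookup(object):
    # Hypothetical container; 'd' maps ids to anything
    __contains__ = forward_to_d("__contains__")

    def __init__(self, d):
        self.d = d
```

The generated method shows up in tracebacks under its own name, and
nothing is built from strings at import time.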

These

     def __iter__(self):
         for node in self.d:
             yield self.unpack_source(node)
     def keys(self): return [k for k in self]

should be

     def __iter__(self):
         return (self.unpack_source(node) for node in self.d)
     def keys(self):
         return list(self)

This is without knowing what the code does on
the overall level.  I might have other comments
if I knew more about it.

				Andrew
				dalke at dalkescientific.com