[bip] Reproducible research

Andrew Dalke dalke at dalkescientific.com
Wed Mar 4 15:36:36 PST 2009


On Mar 4, 2009, at 6:56 PM, Giovanni Marco Dall'Olio wrote:
> I understand that it is impossible to write full suites of tests for
> an analysis or an experiment; but I was criticizing the fact that nobody
> seems to care about this problem.

I was criticizing the statement that tests are a
sufficient "solution to reproducibility". They aren't.

> If you look at the programs of any master or course in bioinformatics,
> you'll see that none ever explain what a unit test is, or give
> general ideas about the concept of testing.

I've taught some of those. And I'm part of the problem. Getting
the idea of a for-loop and dictionaries and how to define functions
takes time, and I didn't think my students got to the point where
unittest made sense.

Why? Because I don't teach OO programming. That is, I don't
recall ever teaching how to define a new class, how inheritance
works, etc. (Maybe to some of the most advanced.) Because the
people I teach are consumers of libraries, not developers
of new ones.

But unittest has what I think is a requirement, that people
write their tests using classes.

class MyTestCases(unittest.TestCase):
    def testBlah(self):
        ...
        self.assertEqual(x, y)
        ...

How do I teach unit testing this way when I've never
covered the first line, or where "self.assertEqual"
comes from?

Or should I restructure my teaching to start with OO
programming? Which I think is not needed for most of
what my students do.

Now, this was a couple of years ago, and nose has come out
since then, which means people can write their tests as
"def test_me(): ..." That's a bit simpler to understand,
but I haven't had a chance to teach that yet.
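
For illustration, here's a minimal sketch of the function-style
tests that nose collects: a plain function whose name starts with
"test_", no class and no self. The gc_content() function is a
made-up example, not from any particular library.

def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / float(len(seq))

def test_gc_content():
    # nose finds this by name; a bare assert is all that's needed.
    assert gc_content("GGCC") == 1.0
    assert gc_content("ATGC") == 0.5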

(I currently do training in cheminformatics, but only
for 2-3 days, and that's not enough time to go into
version control, unit testing, etc. for more than a taste
and a pointer to the Software Carpentry site.)

> A good test is a test that allows you to compare two or more different
> implementations of the same problem.

I disagree. A unittest is designed to test a library, and part
of the unittest design reflects the implementation. Unittests
are written by the module developers, the module is designed
to be tested, and the unit tests may reach into non-public
APIs in order to test lower-level components.

Unittests are not blackbox tests.

Unittests are good things.

But by your definition, they are not good tests, and the
only good tests are what the testing field calls "acceptance
tests."


> If you write a fasta format parser, you can only compare it with other
> implementations by executing the same tests on both and compare the
> results.

??? I thought I raised a big hullabaloo on this list almost
two years ago, wherein I compared different FASTA parser
implementations at the source code level and people complained
about my code details?

Code inspection does seem like one way to compare two parsers
without running any tests. Inspection and tests give two
different, though overlapping, ways to compare code.
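
And if the goal really is to run the same tests against two
implementations, one way looks like the sketch below. Both parse
functions are toy stand-ins, and the shared calling convention is
itself an assumption that real libraries don't necessarily satisfy.

def parse_fasta_a(text):
    # Toy implementation one: split on ">" markers.
    records = []
    for chunk in text.split(">")[1:]:
        header, _, rest = chunk.partition("\n")
        records.append((header.strip(), rest.replace("\n", "")))
    return records

def parse_fasta_b(text):
    # Toy implementation two: walk line by line.
    records, header, seq = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:].strip(), []
        else:
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    return records

SAMPLE = ">seq1\nACGT\n>seq2\nTTTT\n"
EXPECTED = [("seq1", "ACGT"), ("seq2", "TTTT")]

def test_parsers_agree():
    # The same acceptance-style check, applied to both parsers.
    for parse in (parse_fasta_a, parse_fasta_b):
        assert parse(SAMPLE) == EXPECTED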

> In part, it is not important to know the inner workings of an
> algorithm to understand the final results of an in silico experiment.

In other words, "In part, it is important to know the inner
workings ..."?

> You wrote a new sorting algorithm and published a paper where you have
> used it to calculate some interesting biological results: why should I
> care about the inner implementation of your algorithm?

Mmm, the original topic is:
> Another solution to reproducibility, in a perfect world, would be that
> people write good tests for their programs.

I gave an example from math where I figured it was obvious
that having good tests doesn't mean reproducibility. It
only means the results can be verified against those tests.
It doesn't mean reproducibility of the original program.

I also figured the analogies to a biological problem were
easy to come up with.

Here's one more biologically related. I use a genetic
algorithm to come up with a multiple sequence alignment
(or some other NP-hard problem). I publish my results, I
release the tests, which are very complete. But I
omitted releasing the initial seed for the PRNG.
Maybe I lost it.

I can publish the results because it's verifiable
that I have a good solution. I could have said
it came to me in a dream, à la Kekulé's snakes,
and it would still be publishable.

Can you reproduce my results? What if the result
is critically dependent on the initial seed?  (Which
is one of those things I don't like about GAs.)

In this case I can even release the source code
and it would be hard to reproduce.

You can verify that the result is a good alignment.
But you can't reproduce the method.
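
A toy version of that scenario, with a hypothetical
noisy_search() standing in for the GA: any run's output can be
verified, but without the seed there is no way to regenerate one
particular run.

import random

def noisy_search(n_items, seed=None):
    # Stand-in for a stochastic optimizer: the output depends
    # entirely on the PRNG state.
    rng = random.Random(seed)
    order = list(range(n_items))
    rng.shuffle(order)
    return order

result_a = noisy_search(10, seed=42)
result_b = noisy_search(10, seed=42)
result_c = noisy_search(10)      # seed never recorded

assert result_a == result_b      # reproducible because the seed is known
# result_c is just as verifiable a result, but without its seed
# there is no way to produce exactly that ordering again.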

> It is impossible to cover all the possible cases, but it is not a good
> excuse to decide not to test anything.

Never claimed it was. My only assertion is that
"good tests" are not sufficient for reproducibility.

> http://www.biology.arizona.edu/IMMUNOLOGY/activities/western_blot/west2.html
> That image is self-explanatory: you don't need to study all the
> details in the western blot protocol to understand it, but just by
> looking at which are the controls and the samples, you can understand
> a lot of the underlying experiment.

Umm, you do realize I'm a software developer by
practice with a background in molecular modeling
and training in physics, now working in chemical
informatics? I've heard of "western blot" many
times but don't know what it means. I looked
at the picture. It's a gel plate. I know the
physics of how it works, but the biology is not
self-explanatory.

If it was DNA testing I would say that C is
likely related to 1. ;)


> Yeah, but now you know that any implementation of a swiss-prot parser
> which lacks a test for multi-line AC fields can possibly suffer from
> this problem.

That bioperl's parser was broken here is a symptom of
a larger issue. People write optimistic parsers that
assume the underlying format will only change in a
certain number of fixed ways, even though reality
sometimes goes otherwise.

The current bioperl parser will silently ignore a
second SV field, should one ever appear. That isn't
the case now, but perhaps it will be in the future?
Ditto for the organism field. Should they have
checks in case that day ever comes? Then do those
each need tests? Is the current code broken?
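
The difference in attitude fits in a few lines. This is an
illustrative sketch, not bioperl's or Biopython's actual parsing
code: the lenient handler keeps whatever SV line came last and
says nothing, while the strict one at least warns when the
one-SV-line assumption breaks.

import warnings

def handle_sv_lenient(record, value):
    # Silently overwrites any earlier SV value.
    record["SV"] = value

def handle_sv_strict(record, value):
    # Flags the day the format grows a second SV line.
    if "SV" in record:
        warnings.warn("unexpected second SV line: %r" % (value,))
    record["SV"] = value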


Most people justify lenient parsers with one of:
   - it just has to work enough for this problem
   - strict parsers break too often for meaningless changes
   - "Postel's Law"

When I was deeply involved in this I had a list of
all the ways that GenBank, SWISS-PROT, PROSITE, PIR,
etc. were releasing invalid data sets. Things that
were not only against their spec but obviously
wrong, like a publication year of "19985" in what
was supposed to be a 4-digit field.

No one else's parsers found these errors; they
blithely accepted the invalid records without even
issuing a warning.
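
The kind of check that catches a "19985" is a few lines against
the spec; a hypothetical sketch:

import re

def check_publication_year(field):
    # The spec says a 4-digit year; accept nothing else.
    if not re.match(r"^\d{4}$", field):
        raise ValueError("year %r does not fit the 4-digit spec" % field)
    return int(field)

check_publication_year("1998")     # passes
# check_publication_year("19985")  # raises ValueError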

Which is acceptable, because most people in biology
aren't that concerned about <0.01% error rates
and know that human review (of the spec change
documentation, of the code, and through looking
at results) will eventually catch those problems.

I've been thinking about that ... all the software
needs to do is fit within the experimental error
bars. Does it really need to be 100% correct?
What's good-enough correct?


Again, from experience I know that my verification
effort was an almost thankless job. Who likes to
get reports full of things like:

    "the specification on page 123 says XYZ while
     record ABC shows XZY."

    "the documentation says that the x-ray
     resolution is interpreted as a floating
     point value but record 1aa1 has resolution
     of 2.0 while record 2bb2 has resolution 2.00.
     Are the significant digits meaningful?"

The exception was PROSITE, which gave me feedback
that made me happy about helping them fix their
documentation and data set.

> If I want to know whether the biopython implementation is wrong, I can
> look at its tests, or write a new test myself, and see if it fails or
> not.

You've modified your original assertion. Now you need
to be able to run the program in order to test whether
an implementation is wrong. That's more than having
access to only the tests, which is what the original
reproducibility claim assumed.

> This is without having to read biopython's code: which is an
> advantage, because it is quicker and less prone to
> mis-interpretations.

And a disadvantage, because it offers no clues on
what should be tested.

				Andrew
				dalke at dalkescientific.com




