[bip] Reproducible research

Wed Mar 4 09:56:12 PST 2009

On Wed, Mar 4, 2009 at 4:16 PM, Andrew Dalke <dalke at dalkescientific.com> wrote:
> On Mar 4, 2009, at 12:14 PM, Giovanni Marco Dall'Olio wrote:
>> Another solution to reproducibility, in a perfect world, would be that
>> people write good tests for their programs.
>> Let's say that I write a program that predicts the coding sequences in
>> a nucleotide sequence.
>> If I provide good tests for it, people should be able to reproduce my
>> analysis and understand it even if they don't know the programming
>> language that I have used, or even without having to have the source
>> code of my scripts.
>

eheh, nice discussion :)

I understand that it is impossible to write a full suite of tests for
an analysis or an experiment, but I was criticizing the fact that nobody
seems to care about this problem.
If you look at the curriculum of any master's degree or course in
bioinformatics, you'll see that none of them ever explains what a unit
test is, or gives a general idea of the concept of testing.
It is also very difficult to find articles or discussions on this
topic, even online.

> What defines the "good tests" in "If I provide good tests"?
> Or was the following text an elaboration?

A good test is a test that allows you to compare two or more different
implementations of a solution to the same problem.
If you write a fasta format parser, you can only compare it with other
implementations by executing the same tests on both and comparing the
results.
Most people already do something like this; they just don't know
they can call it testing, and they usually don't formalize the tests as
scripts.
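
To make this concrete, here is a minimal sketch of one set of tests run
against two implementations. Both parsers (parse_fasta_a, parse_fasta_b),
the SHARED_CASES list and the failed_cases helper are hypothetical names
made up for this example; they are not taken from biopython or any other
project.

```python
def parse_fasta_a(text):
    """Hypothetical implementation A: read the input line by line."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_fasta_b(text):
    """Hypothetical implementation B: split the whole input on '>'."""
    records = []
    for block in text.split(">")[1:]:
        lines = block.splitlines()
        header = lines[0].strip() if lines else ""
        sequence = "".join(part.strip() for part in lines[1:])
        records.append((header, sequence))
    return records


# The shared tests: (name, input text, expected list of (header, sequence)).
SHARED_CASES = [
    ("empty input", "", []),
    ("single record", ">seq1\nACGT\n", [("seq1", "ACGT")]),
    ("blank line inside sequence", ">seq1\nAC\n\nGT\n", [("seq1", "ACGT")]),
    ("header without sequence", ">seq1\n", [("seq1", "")]),
    ("two records", ">a\nAA\n>b\nCC\n", [("a", "AA"), ("b", "CC")]),
]


def failed_cases(parser):
    """Return the names of the shared cases that the given parser fails."""
    return [name for name, text, expected in SHARED_CASES
            if parser(text) != expected]


if __name__ == "__main__":
    for parser in (parse_fasta_a, parse_fasta_b):
        print(parser.__name__, "fails:", failed_cases(parser))
```

The point is that SHARED_CASES does not depend on how either parser is
written, so the same list could be run against a third implementation,
even one written in another language, as long as it is wrapped in a
function with the same input and output.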

> Aren't most of the same factors which make good tests
> true of good code?
>
> If I write my tests in, say, APL, wouldn't people need to know
> that language to understand the tests? What if the variables
> were in Russian, as was the case in one code base I was trying
> to debug?
>
> If I've written a probabilistic tester for primality, my tests
> are "is X prime?", but that's not going to help understand the
> algorithm.
>
> If I write a new sorting algorithm, the tests are "is the result
> sorted?", which again reveals nothing about the inner workings
> of the algorithm itself.

To some extent, you don't need to know the inner workings of an
algorithm to understand the final results of an in silico experiment.
It doesn't matter whether you use blast or blat to align two
sequences: the only important thing is that you must be able to
compare the results with others, having a p-value or something that
tells you how likely it is that your result is a false positive.

Say you wrote a new sorting algorithm and published a paper where you
used it to calculate some interesting biological results: why should I
care about the internal implementation of your algorithm?
I only need to know that it works and that the results are sorted
correctly, and I can't easily judge that just by looking at
your source code.
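
As a rough sketch of such a black-box check: the code below treats the
sorting function as opaque and only looks at its output. The names
is_sorted and check_sort, and the choice of random integer lists as
input, are assumptions made for this example. Passing the check is
evidence that the function works, not a proof.

```python
import random
from collections import Counter


def is_sorted(items):
    """True if every element is <= the one that follows it."""
    return all(a <= b for a, b in zip(items, items[1:]))


def check_sort(sort_function, trials=200, seed=0):
    """Black-box check: only the output of sort_function is examined.

    For each random input, the output must be in order and must contain
    exactly the same elements as the input.
    """
    rng = random.Random(seed)  # fixed seed, so the check is repeatable
    for _ in range(trials):
        data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 30))]
        result = sort_function(list(data))  # pass a copy, keep data intact
        if not is_sorted(result):
            return False
        if Counter(result) != Counter(data):
            return False
    return True


if __name__ == "__main__":
    print("built-in sorted passes:", check_sort(sorted))
    print("identity function passes:", check_sort(lambda items: items))
```

The second condition (same elements in and out) matters: without it, a
function that always returns an empty list would pass the "is the result
sorted?" test.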

>> Let's say I write a program to convert a fasta sequence to genbank.
>> Instead of relying on you to look at the source code, I'll tell you
>> that I have tested the script over a blank sequence, a sequence with a
>> blank line in the middle of the sequence, a sequence with a wrong
>> header, etc... and I provide you the instructions to run these tests
>> again if you need.
>
> And if I think you missed an important test case? How much
> of the infinite input space do you have to test before you can
> convince someone else that the code is correct? The Pentium
> FDIV bug shows that lots of tests still don't catch everything.

It is impossible to cover all the possible cases, but that is not a good
excuse for not testing anything.
You can understand much from the results of a western blot just by
knowing which control sets have been used, and by knowing that some
best practice rules have been respected.
For example, look at the last image here:
http://www.biology.arizona.edu/IMMUNOLOGY/activities/western_blot/west2.html
That image is self-explanatory: you don't need to study all the
details of the western blot protocol to understand it. Just by
looking at which are the controls and which are the samples, you can
understand a lot about the underlying experiment.

> In your scenario, suppose you had a hard-coded buffer to
> read lines from the FASTA file. Some FASTA files have very
> long header lines, like those which contain all record
> identifiers with that sequence. Perhaps there's a buffer
> overflow in your code you didn't notice, which occurs in
> rare cases and causes corrupted results?
>
> I've given an example from years ago when the AC line of
> SWISS-PROT went from one line to one-or-more lines. The BioPerl
> parser didn't handle that case, and no one noticed it for a
> year.  The code ended up reporting the accession numbers
> from the last AC line. Tests wouldn't have helped because
> when that code was written, there were no multi-line AC
> fields and the spec said it was only a single line.

Yeah, but now you know that any implementation of a swiss-prot parser
which lacks a test for multi-line AC fields may suffer from
this problem.
If I want to know whether the biopython implementation is wrong, I can
look at its tests, or write a new test myself, and see whether it fails.
I can do this without having to read biopython's code, which is an
advantage because it is quicker and less prone to
misinterpretation.
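
A sketch of what such a test could look like. This is not biopython's
API: read_accessions and read_accessions_last_line_only are hypothetical
stand-ins for "the parser under test", and the record and its accession
numbers are made up. The only thing assumed about the format is that
accession numbers sit on lines starting with "AC" and are each terminated
by a semicolon.

```python
def read_accessions(record_text):
    """Hypothetical reader: collect accessions from every AC line."""
    accessions = []
    for line in record_text.splitlines():
        if line.startswith("AC "):
            fields = line[2:].split(";")
            accessions.extend(f.strip() for f in fields if f.strip())
    return accessions


def read_accessions_last_line_only(record_text):
    """Deliberately buggy reader: each AC line overwrites the previous one."""
    accessions = []
    for line in record_text.splitlines():
        if line.startswith("AC "):
            accessions = [f.strip() for f in line[2:].split(";") if f.strip()]
    return accessions


# Made-up record with the AC field spread over two lines.
MULTI_LINE_AC = (
    "ID   EXAMPLE_RECORD\n"
    "AC   A00001; A00002; A00003;\n"
    "AC   A00004; A00005;\n"
    "DE   Made-up record used only for this test.\n"
)


def handles_multi_line_ac(reader):
    """The test itself: all five accessions must be reported, in order."""
    expected = ["A00001", "A00002", "A00003", "A00004", "A00005"]
    return reader(MULTI_LINE_AC) == expected


if __name__ == "__main__":
    for reader in (read_accessions, read_accessions_last_line_only):
        print(reader.__name__, "passes:", handles_multi_line_ac(reader))
```

To check a real parser, one would wrap it in a function with the same
signature and pass that to handles_multi_line_ac; the test does not care
how the accessions are extracted.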

>
>> As another example, imagine if all the openbio projects
>> had a common place to store their use cases and tests. Wouldn't
>> it be easier to compare the various bio.* projects, and see how each
>> one solves each problem?
>
> There have been on-and-off projects for that for years, such
> as collecting different BLAST outputs for testing reference.
> It's a thankless job and it's never gone anywhere.
>
>                                Andrew
>                                dalke at dalkescientific.com

-- 

My blog on bioinformatics (now in English): http://bioinfoblog.it