[bip] Reproducible research

Wed Mar 4 03:14:19 PST 2009

On Wed, Mar 4, 2009 at 11:53 AM, Andrew Perry <ajperry at pansapiens.com> wrote:
>
>> > On Tue, Mar 3, 2009 at 2:03 PM, James Casbon <casbon at gmail.com> wrote:
>>
>> > software does not keep pace with the major changes in various computer
>> > languages or compilers. For example, updates to GNU gcc have meant
>> > older code can be very hard to update, Java has gone a couple of
>> > versions so some old functions are no longer present or open source
>> > options don't work. Even Python has made a change but not yet a factor
>> > as I am not sure if people are jumping to 3.0 yet. So unless people
>> > actually release code into an open source environment such as
>> > Biopython, only the dedication of original authors and perhaps a few
>> > generous others to keep it functional or updated.
>>
>
> Beyond Makefiles or other similar tools for reproducing workflows, one
> additional solution for long-term reproducibility that doesn't seem to have
> been explored much is providing a fully functional virtual machine with the
> OS + your software + dependencies setup and ready to run. IMO this should
> greatly increase the chances that older software will run as it did for the
> original authors, and results should then be easily reproducible many years
> down the track.

Another solution to reproducibility, in a perfect world, would be for
people to write good tests for their programs.
Let's say that I write a program that predicts the coding sequences in
a nucleotide sequence.
If I provide good tests for it, people should be able to reproduce my
analysis and understand it even if they don't know the programming
language that I used, or even without having the source code of my
scripts.
If I attach a description of all the tests I have made (this can be
generated automatically from unittests and similar) to the
supplementary data of the paper I publish, people will be able to
judge whether my analysis is right or wrong, without even having to
look at the code.
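
A minimal sketch of how such a description could be generated from
unittests, assuming Python's standard unittest module. The names
find_start_codons, StartCodonTests and describe_tests are hypothetical
and made up for this illustration; the toy function only stands in for
a real coding-sequence predictor.

```python
import unittest


def find_start_codons(sequence):
    """Hypothetical toy function: return 0-based positions of every ATG."""
    sequence = sequence.upper()
    return [i for i in range(len(sequence) - 2) if sequence[i:i + 3] == "ATG"]


class StartCodonTests(unittest.TestCase):
    def test_empty_sequence(self):
        """An empty sequence contains no start codons."""
        self.assertEqual(find_start_codons(""), [])

    def test_lowercase_input(self):
        """Lowercase input is treated like uppercase."""
        self.assertEqual(find_start_codons("ccatgaa"), [2])


def describe_tests(test_case_class):
    """Return one 'name: description' line per test method.

    The description is the first line of each test's docstring.
    """
    lines = []
    for name in unittest.defaultTestLoader.getTestCaseNames(test_case_class):
        test = test_case_class(name)
        lines.append("%s: %s" % (name, test.shortDescription()))
    return lines
```

Printing the lines returned by describe_tests(StartCodonTests) gives a
plain-text list of what was tested, which could go into the
supplementary data as it is.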

Let's say I write a program to convert a FASTA sequence to GenBank
format. Instead of relying on you to look at the source code, I'll
tell you that I have tested the script on a blank sequence, a sequence
with a blank line in the middle, a sequence with a wrong header, etc.,
and I'll provide you with the instructions to run these tests again if
you need to.
Then you will be able to 'reproduce' the results of my scripts by
writing your own implementation, as wet-lab biologists have long done
with western blots and similar techniques.
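
The test cases listed above could look roughly like this. This is only
a sketch: parse_fasta is a hypothetical helper covering just the FASTA
reading step (not the GenBank output), and skipping blank lines and
rejecting a record without a '>' header are assumptions about what the
"right" behaviour is, not something fixed by the format.

```python
import unittest


def parse_fasta(text):
    """Hypothetical helper: parse one FASTA record into (header, sequence)."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA record must start with a '>' header line")
    header = lines[0][1:].strip()
    # Assumption: blank lines inside the record are skipped, not errors.
    sequence = "".join(line.strip() for line in lines[1:] if line.strip())
    return header, sequence


class ParseFastaTests(unittest.TestCase):
    def test_blank_sequence(self):
        """A header with no sequence gives an empty sequence."""
        self.assertEqual(parse_fasta(">seq1\n"), ("seq1", ""))

    def test_blank_line_in_the_middle(self):
        """A blank line inside the sequence is ignored."""
        self.assertEqual(
            parse_fasta(">seq1\nACGT\n\nTTGA\n"), ("seq1", "ACGTTTGA")
        )

    def test_wrong_header(self):
        """A record without a leading '>' is rejected."""
        with self.assertRaises(ValueError):
            parse_fasta("seq1\nACGT\n")
```

Saved in a file, these can be re-run with "python -m unittest". Someone
writing their own implementation in another language only needs the
three inputs and the expected results, not this code.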

I think this is a more scientific approach, but maybe more difficult
to implement (and people are so scared by the concept of testing that
they prefer not to write tests at all - I think that's bad science).

As another example, imagine if all the openbio projects had a common
place to store their use cases and tests. Wouldn't it be easier to
compare the various bio.* projects, and see how each one solves each
problem?

> Storing the whole OS image + software would take more storage space; but
> space is supposedly cheap these days. Qemu / Virtualbox / VMware etc can all
> deal with the standard VMware format virtual machines, and these VMs should
> be runnable in many years to come, even when the older dependencies have
> become difficult to get running (ever try installing very old libraries on a
> new Linux distro, or installing Red Hat 6.2 [circa 2000] on very new
> hardware ? You may have to dig out an old Pentium II ...).
>
> Archive.org would be one place things like this could be stored, although
> I'm not sure if they really have the resources, or if this is their focus
> (they seem to do 'arts & culture' more than 'science'). Institutional
> repositories might be another storage+distribution option. My feeling is
> that journals should accept source code and VM images as Supplementary data
> upon publication. The good ones should minimally _require_ source code
> before publication, just like every journal that publishes reports of
> biomolecular structures requires the coordinates to be submitted to the PDB.
>
> This all assumes that your whole software stack is freely redistributable of
> course ... not always an option for everyone.
>
> Andrew Perry
>
> _______________________________________________
> biology-in-python mailing list - bip at lists.idyll.org.
>
> See http://bio.scipy.org/ for our Wiki.
>

-- 

My blog on bioinformatics (now in English): http://bioinfoblog.it