[bip] Reproducible research

Wed Mar 4 02:53:53 PST 2009

> > On Tue, Mar 3, 2009 at 2:03 PM, James Casbon <casbon at gmail.com>
> wrote:
>
> > software does not keep pace with the major changes in various computer
> > languages or compilers. For example, updates to GNU gcc have meant
> > older code can be very hard to update, Java has gone a couple of
> > versions so some old functions are no longer present or open source
> > options don't work. Even Python has made a change but not yet a factor
> > as I am not sure if people are jumping to 3.0 yet. So unless people
> > actually release code into an open source environment such as
> > Biopython, only the dedication of original authors and perhaps a few
> > generous others to keep it functional or updated.
>
>
Beyond Makefiles or other similar tools for reproducing workflows, one
additional solution for long-term reproducibility that doesn't seem to have
been explored much is providing a fully functional virtual machine with the
OS + your software + dependencies set up and ready to run. IMO this should
greatly increase the chances that older software will run as it did for the
original authors, and results should then be easily reproducible many years
down the track.

Storing the whole OS image + software would take more storage space, but
space is supposedly cheap these days. QEMU, VirtualBox, VMware etc. can all
deal with standard VMware-format virtual machines, and these VMs should
still be runnable many years from now, even when the older dependencies have
become difficult to get running (ever tried installing very old libraries on
a new Linux distro, or installing Red Hat 6.2 [circa 2000] on very new
hardware? You may have to dig out an old Pentium II ...).
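
To make the point about image formats a bit more concrete, here is a minimal
sketch of how a VMware-format disk could be converted to QEMU's own format
with the qemu-img tool. It only builds and prints the command; it assumes
qemu-img is installed if you actually want to run it, and the function name
and the file names (analysis-vm.vmdk, analysis-vm.qcow2) are hypothetical.

```python
import shlex


def qemu_convert_command(src, dst, src_format="vmdk", dst_format="qcow2"):
    """Build (but do not run) a qemu-img command that converts a disk image.

    The defaults assume a VMware-format (vmdk) source and a qcow2 target;
    both are illustrative choices, not a recommendation.
    """
    return ["qemu-img", "convert", "-f", src_format, "-O", dst_format, src, dst]


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    cmd = qemu_convert_command("analysis-vm.vmdk", "analysis-vm.qcow2")
    # Print the shell-quoted command so it can be inspected before running.
    print(shlex.join(cmd))
```

Keeping the original image alongside any converted copy seems the safer
choice for archiving, since the conversion itself depends on a tool that may
change over time.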

Archive.org would be one place things like this could be stored, although
I'm not sure if they really have the resources, or if this is their focus
(they seem to do 'arts & culture' more than 'science'). Institutional
repositories might be another storage+distribution option. My feeling is
that journals should accept source code and VM images as Supplementary data
upon publication. The good ones should, at a minimum, _require_ source code
before publication, just as every journal that publishes reports of
biomolecular structures requires the coordinates to be submitted to the PDB.

This all assumes that your whole software stack is freely redistributable,
of course ... not always an option for everyone.

Andrew Perry