[bip] Reproducible research
Kevin Teague
kteague at bcgsc.ca
Thu Mar 5 11:08:01 PST 2009
For larger, more complex software stacks I've made attempts at
versioning the software.
Of course, this is only a slice of the reproducibility picture, since
you also need to version the data and the steps that were taken to run
that data through the software ... but it can be useful to just have
versioning of the software if the quality of data is in question
because of suspected bugs in a part of the software stack. Or if parts
of the stack are being upgraded by hand, the state of the installation
can get pretty murky - especially in software stacks where there is
turnover in the staff maintaining them. Being able to completely
rebuild the stack from scratch to a known state with a single command
is very good.
I used BuildIt (http://pypi.python.org/pypi/buildit/) with my first
attempt. This gives you a "root.ini" file which can be used to specify
the versions of all the parts of a stack. This would look something
like this:
[namespaces]
python = ${buildoutdir}/tasks/mkpython.ini [2.4.4]
postgres = ${buildoutdir}/tasks/postgres.ini [8.1.9]
java = ${buildoutdir}/tasks/java.ini [1.6.0_06]
exonerate = ${buildoutdir}/tasks/exonerate.ini [1.4.0]
fftw = ${buildoutdir}/tasks/fftw.ini [3.1.2]
expat = ${buildoutdir}/tasks/expat.ini [2.0.1]
eland = ${buildoutdir}/tasks/eland.ini [0.3.0b3]
# Python Packages
psycopg2 = ${buildoutdir}/tasks/psycopg2.ini [2.0.6]
pil = ${buildoutdir}/tasks/pil.ini [1.1.5]
sqlalchemy = ${buildoutdir}/tasks/sqlalchemy.ini [0.3.10]
numpy = ${buildoutdir}/tasks/numpy.ini [1.0.3.1]
matplotlib = ${buildoutdir}/tasks/matplotlib.ini [0.90.1]
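
Since every entry above follows the same "task file [version]" shape, the pinned versions can be pulled out with a few lines of Python, for example to write a manifest next to a set of results. This is only a minimal sketch: it does not use BuildIt's own API, and the pinned_versions helper, the regular expression and the shortened sample file are my own assumptions for illustration.

```python
import configparser
import re

# Hypothetical, shortened sample in the same shape as the root.ini above.
SAMPLE_ROOT_INI = """\
[namespaces]
python = ${buildoutdir}/tasks/mkpython.ini [2.4.4]
postgres = ${buildoutdir}/tasks/postgres.ini [8.1.9]
# Python Packages
numpy = ${buildoutdir}/tasks/numpy.ini [1.0.3.1]
"""

# Each value is assumed to look like "<path to task file> [<version>]".
ENTRY = re.compile(r"^(?P<task>.+?)\s*\[(?P<version>[^\]]+)\]\s*$")


def pinned_versions(ini_text, section="namespaces"):
    """Return {part name: version} for every entry in the given section."""
    # interpolation=None keeps "${buildoutdir}" as literal text.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    versions = {}
    for name, value in parser.items(section):
        match = ENTRY.match(value)
        if match is None:
            raise ValueError(f"no [version] found for {name!r}: {value!r}")
        versions[name] = match.group("version")
    return versions


if __name__ == "__main__":
    for name, version in sorted(pinned_versions(SAMPLE_ROOT_INI).items()):
        print(f"{name} {version}")
```

Storing that output alongside the SVN revision of the build project would give a compact record of which stack produced a given result.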
The "build project" was kept separate from the other Python projects
used to process the data. In SVN trunk, the build tasks and config
would be updated to point to newer releases that were being developed
or tested against. Then an "instance branch" was maintained in SVN at
/branches/production/. A working copy of this branch was used to
install all of the software that was run to generate final results.
This way, viewing the changelog of /branches/production/ gave a log
of changes to the software stack used to generate results.
This is quite a bit more work than just using a virtual machine and
taking snapshot images! But it does give you the benefit of being able
to install the stack directly onto bare hardware, which is necessary
if you need to use a cluster to generate results. If you are compiling
all software from source, you can also usually maintain builds for
different platforms, e.g. Linux servers and Mac OS X workstations.
BuildIt was a pretty good tool, but it's no longer being actively
maintained.
These days I use Buildout (http://pypi.python.org/pypi/zc.buildout)
and I'm really happy with it as an installation tool.
Buildout's killer feature is that it installs each part using
"recipes" that are themselves packaged as separate Python packages. So
when you write a recipe to install a specific part (e.g. PostgreSQL,
Python or PIL) it's very easy to then re-use that recipe in other
software stacks. For example, rather than having to write my own set
of commands to configure an LDAP instance, I can just add to a
project's buildout.cfg file:
[slapd]
recipe = z3c.recipe.ldap
urls = ldap://127.0.0.1:1700
allow = bind_v2
When Buildout installs the "slapd" part, it will go to PyPI (if
needed) and fetch the latest release of z3c.recipe.ldap and install it
as an egg. And if you want to have reproducible installation recipes,
you can version those as well, with syntax such as "recipe =
z3c.recipe.ldap == 0.1".
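
To check that nothing in a buildout floats to "latest release", a small script can list the parts whose recipe has no pinned version. This is a rough sketch rather than anything Buildout provides: the unpinned_recipes helper is hypothetical, the "postgres" part and its recipe name are made up, and it assumes pins live either in a [versions] section or inline on the recipe line as a setuptools-style "==" requirement.

```python
import configparser
import re

# Hypothetical buildout.cfg: "slapd" mirrors the example above, while the
# "postgres" part and its recipe name are invented for illustration.
SAMPLE_BUILDOUT_CFG = """\
[buildout]
parts = slapd postgres
versions = versions

[versions]
z3c.recipe.ldap = 0.1

[slapd]
recipe = z3c.recipe.ldap
urls = ldap://127.0.0.1:1700

[postgres]
recipe = example.recipe.postgres
"""

# Leading distribution name of a recipe spec such as "pkg", "pkg:entry"
# or "pkg == 1.0".
DIST_NAME = re.compile(r"[A-Za-z0-9._-]+")


def unpinned_recipes(cfg_text):
    """Return [(part, recipe distribution)] for recipes without a pin."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    pinned = set()
    if parser.has_section("versions"):
        # configparser already lowercases option names.
        pinned = set(parser.options("versions"))
    unpinned = []
    for part in parser.sections():
        if part in ("buildout", "versions"):
            continue
        if not parser.has_option(part, "recipe"):
            continue
        spec = parser.get(part, "recipe").strip()
        match = DIST_NAME.match(spec)
        if match is None:
            raise ValueError(f"cannot parse recipe for part {part!r}: {spec!r}")
        dist = match.group(0)
        # Assumption: an inline "==" means the recipe is pinned in place.
        if "==" in spec or dist.lower() in pinned:
            continue
        unpinned.append((part, dist))
    return unpinned


if __name__ == "__main__":
    for part, dist in unpinned_recipes(SAMPLE_BUILDOUT_CFG):
        print(f"part [{part}] uses unpinned recipe {dist}")
```

A check like this could run before committing to the production branch, so that an unpinned recipe never makes it into the stack used for final results.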
Buildout has also gained good critical mass. There are over 100
install recipes available now on PyPI
(http://pypi.python.org/pypi?:action=browse&c=512). There are recipes
for installing standard "CMMI" (configure, make, make install)
software, databases, web applications, collections of
Python libraries, templates for generating config files, and even Perl
libraries :P. There are tools for generating source releases of a
buildout which can be permanently archived (zc.sourcerelease).
Archiving was a problem with my initial BuildIt-based attempt: most
software was fetched with wget, and because of link rot I later
switched to copying all tarballs to a local mirror first and fetching
them from there. There are also tools to help automate releases
(zest.releaser and collective.releaser).
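
The local-mirror approach to link rot can be sketched in Python roughly as follows. This is not how BuildIt or zc.sourcerelease work internally: the fetch_from_mirror and sha256_of names, the optional checksum and the injectable download function are all assumptions of mine, meant only to show the "copy once, then always install from the mirror" idea.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_from_mirror(url, mirror_dir, expected_sha256=None,
                      download=urlretrieve):
    """Return the mirrored copy of `url`, downloading only if it is missing.

    `download` is any callable taking (url, destination path); it defaults
    to urllib's urlretrieve and can be swapped out for testing.
    """
    mirror_dir = Path(mirror_dir)
    mirror_dir.mkdir(parents=True, exist_ok=True)
    filename = Path(urlparse(url).path).name
    if not filename:
        raise ValueError(f"cannot derive a file name from {url!r}")
    target = mirror_dir / filename
    if not target.exists():
        # Download under a temporary name so that a failed transfer never
        # leaves a truncated tarball in the mirror.
        partial = target.with_name(target.name + ".part")
        download(url, str(partial))
        partial.replace(target)
    if expected_sha256 is not None and sha256_of(target) != expected_sha256:
        raise ValueError(f"checksum mismatch for {target}")
    return target
```

Once a tarball is in the mirror the upstream URL is never contacted again, so a rebuild keeps working after the original link has disappeared, and the checksum guards against the mirrored file being silently replaced.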
However, there's still a long way to go before these kinds of tools
stop being a steep learning curve and a burden for your average
bioinformatician. Documentation is sparse, tools to help automate
the tedious stuff, such as making releases and creating source
archives, are still fairly young, and a clearer set of best practices
has yet to be established. But for certain use cases of
reproducibility, I think it's a very good way to go.
More information about the biology-in-python
mailing list