[bip] Reproducible research

Thu Mar 5 11:08:01 PST 2009

For larger, more complex software stacks I've made attempts at  
versioning the software.

Of course, this is only a slice of the reproducibility picture, since  
you also need to version the data and the steps that were taken to run  
that data through the software ... but it can be useful to just have  
versioning of the software if the quality of data is in question  
because of suspected bugs in a part of the software stack. And if parts
of the stack are being upgraded by hand, the state of the installation
can get pretty murky, especially in software stacks where there is
turnover in the staff maintaining them. Being able to completely
rebuild the stack from scratch to a known state with a single command is
very valuable.

I used BuildIt (http://pypi.python.org/pypi/buildit/) for my first
attempt. This gives you a "root.ini" file which can be used to specify
the versions of all the parts of a stack. This would look something
like this:

[namespaces]
python = ${buildoutdir}/tasks/mkpython.ini [2.4.4]
postgres = ${buildoutdir}/tasks/postgres.ini [8.1.9]
java = ${buildoutdir}/tasks/java.ini [1.6.0_06]
exonerate = ${buildoutdir}/tasks/exonerate.ini [1.4.0]
fftw = ${buildoutdir}/tasks/fftw.ini [3.1.2]
expat = ${buildoutdir}/tasks/expat.ini [2.0.1]
eland = ${buildoutdir}/tasks/eland.ini [0.3.0b3]

# Python Packages
psycopg2 = ${buildoutdir}/tasks/psycopg2.ini [2.0.6]
pil = ${buildoutdir}/tasks/pil.ini [1.1.5]
sqlalchemy = ${buildoutdir}/tasks/sqlalchemy.ini [0.3.10]
numpy = ${buildoutdir}/tasks/numpy.ini [1.0.3.1]
matplotlib = ${buildoutdir}/tasks/matplotlib.ini [0.90.1]

The "build project" was kept separate from the other Python projects  
used to process the data. In SVN trunk, the build tasks and config  
would be updated to point to newer releases that were being developed  
or tested against. Then an "instance branch" was maintained in SVN at
/branches/production/. A working copy of this branch was used to
install all of the software that was run to generate final results.
This way, viewing the changelog of /branches/production/ gave a log
of changes to the software stack used to generate results.

This is quite a bit more work than just using a virtual machine and  
taking snapshot images! But it does give you the benefit of being able  
to install the stack directly onto bare hardware, which is necessary if you
need to use a cluster to generate results. If you are compiling all
software from source, you can also usually maintain builds for
different platforms, e.g. Linux servers and Mac OS X workstations.

BuildIt was a pretty good tool, but it's no longer being actively
maintained.

These days I use Buildout (http://pypi.python.org/pypi/zc.buildout)  
and I'm really happy with it as an installation tool.

Buildout's killer feature is that it installs each part using
"recipes" that are themselves packaged as separate Python packages. So
when you write a recipe to install a specific part (e.g. PostgreSQL,  
Python or PIL) it's very easy to then re-use that recipe in other  
software stacks. For example, rather than having to write my own set  
of commands to configure an LDAP instance, I can just add to a  
project's buildout.cfg file:

[slapd]
recipe = z3c.recipe.ldap
urls = ldap://127.0.0.1:1700
allow = bind_v2

When Buildout installs the "slapd" part, it will go to PyPI (if  
needed) and fetch the latest release of z3c.recipe.ldap and install it  
as an egg. And if you want to have reproducible installation recipes,
you can version those as well, with syntax such as "recipe =
z3c.recipe.ldap==0.1".
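
Another way to pin the recipe, as a minimal sketch and assuming a
zc.buildout release that supports the "versions" option, is to collect
all pins in one section of buildout.cfg. The section name "versions" is
just a convention; the parts list here is illustrative:

```ini
[buildout]
parts = slapd
# Look up version pins in the [versions] section below.
versions = versions

[versions]
# Pin the recipe egg itself so a rebuild fetches the same release.
z3c.recipe.ldap = 0.1

[slapd]
recipe = z3c.recipe.ldap
urls = ldap://127.0.0.1:1700
allow = bind_v2
```

Keeping the pins in one section means the changelog of that section
doubles as a record of which releases the stack was built from.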

Buildout has also gained good critical mass. There are over 100
install recipes available now on PyPI
(http://pypi.python.org/pypi?:action=browse&c=512).
There are recipes for installing standard "CMMI" (configure, make,
make install) software, databases, web applications, collections of
Python libraries, templates for generating config files, and even Perl
libraries :P. There are tools for generating source releases of a
buildout which can be permanently archived (zc.sourcerelease). Archiving
was a problem with my initial BuildIt-based attempt: most software
was fetched with wget, but due to link rot I later switched to copying
all tarballs to a local mirror first and fetching them from
there. There are also tools to help automate releases (zest.releaser
and collective.releaser).
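
To illustrate the local-mirror approach with a CMMI recipe, here is a
rough sketch of a part built with zc.recipe.cmmi. The mirror host and
path are hypothetical placeholders, and the tarball name is only a guess
based on the fftw version listed earlier:

```ini
[buildout]
parts = fftw

[fftw]
recipe = zc.recipe.cmmi
# Hypothetical local mirror; point this at wherever the tarballs are archived.
url = http://mirror.example.org/tarballs/fftw-3.1.2.tar.gz
```

Because the URL points at a mirror you control, the build does not
break when the upstream download link disappears.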

However, there is still a long way to go before these kinds of tools
stop being a steep learning curve and a burden for your average
bioinformatician. Documentation is sparse, tools to help automate
the tedious stuff, such as making releases and creating source
archives, are still fairly young, and a clearer set of best practices
has yet to be established. But for certain use cases of
reproducibility, I think it's a very good way to go.