[bip] Reproducible research

James Casbon casbon at gmail.com
Fri Mar 6 02:59:01 PST 2009


2009/3/5 Kevin Teague <kteague at bcgsc.ca>:
> For larger, more complex software stacks I've made attempts at
> versioning the software.
>
> Of course, this is only a slice of the reproducibility picture, since
> you also need to version the data and the steps that were taken to run
> that data through the software ... but it can be useful to just have
> versioning of the software if the quality of data is in question
> because of suspected bugs in a part of the software stack. Or if parts
> of the stack are being upgraded by hand, the state of the installation
> can get pretty murky - especially in software stacks where there is
> turnover in the staff maintaining them. Being able to completely
> rebuild the stack from scratch to a known state with a single command
> is very good.
>
> I used BuildIt (http://pypi.python.org/pypi/buildit/) with my first
> attempt. This gives you a "root.ini" file which can be used to specify
> the versions of all the parts of a stack. This would look something
> like this:
>
> [namespaces]
> python = ${buildoutdir}/tasks/mkpython.ini [2.4.4]
> postgres = ${buildoutdir}/tasks/postgres.ini [8.1.9]
> java = ${buildoutdir}/tasks/java.ini [1.6.0_06]
> exonerate = ${buildoutdir}/tasks/exonerate.ini [1.4.0]
> fftw = ${buildoutdir}/tasks/fftw.ini [3.1.2]
> expat = ${buildoutdir}/tasks/expat.ini [2.0.1]
> eland = ${buildoutdir}/tasks/eland.ini [0.3.0b3]
>
> # Python Packages
> psycopg2 = ${buildoutdir}/tasks/psycopg2.ini [2.0.6]
> pil = ${buildoutdir}/tasks/pil.ini [1.1.5]
> sqlalchemy = ${buildoutdir}/tasks/sqlalchemy.ini [0.3.10]
> numpy = ${buildoutdir}/tasks/numpy.ini [1.0.3.1]
> matplotlib = ${buildoutdir}/tasks/matplotlib.ini [0.90.1]
>
> The "build project" was kept separate from the other Python projects
> used to process the data. In SVN trunk, the build tasks and config
> would be updated to point to newer releases that were being developed
> or tested against. Then an "instance branch" was maintained in SVN at
> /branches/production/. A working copy of this branch was used to
> install all of the software that was run to generate final results.
> This way viewing the changelog of the /branches/production/ gave a log
> of changes to the software stack used to generate results.
>
> This is quite a bit more work than just using a virtual machine and
> taking snapshot images! But it does give you the benefit of being able
> to install the stack directly onto bare hardware. Necessary if you
> need to use a cluster to generate results. If you are compiling all
> software from source, you can also usually maintain builds for
> different platforms, e.g. linux servers and Mac OS X workstations.
>
> BuildIt was a pretty good tool, but it's no longer being actively
> maintained.
>
> These days I use Buildout (http://pypi.python.org/pypi/zc.buildout)
> and I'm really happy with it as an installation tool.
>
> Buildout's killer feature is that it installs each part using
> "recipes" that are themselves packaged as separate Python packages. So
> when you write a recipe to install a specific part (e.g. PostgreSQL,
> Python or PIL) it's very easy to then re-use that recipe in other
> software stacks. For example, rather than having to write my own set
> of commands to configure an LDAP instance, I can just add to a
> project's buildout.cfg file:
>
> [slapd]
> recipe = z3c.recipe.ldap
> urls = ldap://127.0.0.1:1700
> allow = bind_v2
>
> When Buildout installs the "slapd" part, it will go to PyPI (if
> needed) and fetch the latest release of z3c.recipe.ldap and install it
> as an egg. And if you want to have reproducible installation recipes,
> you can version those as well, with syntax such as "recipe =
> z3c.recipe.ldap==0.1".
>
> Buildout has also gained good critical mass. There are over 100
> install recipes available now on PyPI
> (http://pypi.python.org/pypi?:action=browse&c=512). There are recipes
> for installing standard "CMMI" (configure, make, make install)
> software, databases, web applications, collections of Python
> libraries, templates for generating config files, and even Perl
> libraries :P. There are tools for generating source releases of a
> buildout which can be permanently archived (zc.sourcerelease).
> Archiving was a problem with my initial BuildIt-based attempt: most
> software was fetched with wget, and because of link rot I later
> switched to copying all tarballs to a local mirror first and fetching
> them from there. There are also tools to help automate releases
> (zest.releaser and collective.releaser).
>
> However, there's still a long way to go before using these kinds of
> tools stops being a steep learning curve and a burden for your average
> bioinformatician. Documentation is sparse, tools to help automate
> the tedious stuff, such as making releases and creating source
> archives, are still fairly young, and a clearer set of best practices
> has yet to be established. But for certain use cases of
> reproducibility, I think it's a very good way to go.

Thanks, that's really interesting stuff.  I had always assumed
Buildout to be a Zope-only kind of thing, but it seems not.  I notice
that Buildout will work with virtualenv as well:
http://wiki.python.org/moin/buildout/pycon2008%20tutorial
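
On pinning: a minimal sketch of how the recipe version from your slapd
example could be fixed in buildout.cfg, assuming a [versions] section
referenced from the "versions" option in [buildout]. The part and the
0.1 pin come from your example; "allow-picked-versions = false" is my
addition, meant to make buildout fail rather than quietly pick the
latest release of anything left unpinned. Untested against
z3c.recipe.ldap itself.

[buildout]
parts = slapd
versions = versions
# Assumption: refuse to install any distribution that is not pinned below.
allow-picked-versions = false

[versions]
# Pin taken from the quoted example; every other egg the build pulls in
# would need its own line here.
z3c.recipe.ldap = 0.1

[slapd]
recipe = z3c.recipe.ldap
urls = ldap://127.0.0.1:1700
allow = bind_v2

Keeping the pins in one section should mean that the diff of
buildout.cfg doubles as the changelog of the stack, much like your
/branches/production/ log.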


