[bip] Reproducible research

Fri Mar 6 07:28:19 PST 2009

Just out of interest (and probably slightly off-topic), has anyone ever
suggested getting journals to support a code repository as well? The journal
the paper was published in could run an SVN/{insert your preferred one here}
server. This would probably increase the cost of publishing, but with places
like github around, surely the hosting could be outsourced, or the terms and
requirements could simply be altered to force the code to be made available
on code.google or github.
This would prevent issues with access to source code. At the moment a girl
working on a project has requested the code for some software published in
Bioinformatics from the author, and was instead asked to send them her data
(yeah right!) so that they could run it themselves, which is completely
against the terms and conditions of publishing in that journal. If code had
to be made publicly available at the time of publication this problem would
disappear, as would the problem of trying to get code from someone who has
changed their e-mail address or moved jobs.

I'm of the opinion that open-sourcing code and methods/pipelines would help
to catch problems like the one below faster and cheaper:
http://www.sciencemag.org/cgi/content/full/314/5807/1875b

Data, however, is a much bigger issue and is going to be a major problem.
It still annoys me that no-one ever thought about audit trails in
biological databases and annotation projects, and look at the state of
them now.

I'll definitely be looking at buildout for use with my current project.

Would people agree that publishing code to public repositories might lead to
more reproducible research than the current practice of hiding it away?
Surely the quality of the code produced would go up as well, with authors
being too embarrassed to publish their uncommented and filthy Perl code and
switching to Python (yes, I'm dreaming).

Many thanks,

Nathan

2009/3/6 James Casbon <casbon at gmail.com>

> 2009/3/5 Kevin Teague <kteague at bcgsc.ca>:
> > For larger, more complex software stacks I've made attempts at
> > versioning the software.
> >
> > Of course, this is only a slice of the reproducibility picture, since
> > you also need to version the data and the steps that were taken to run
> > that data through the software ... but it can be useful to just have
> > versioning of the software if the quality of data is in question
> > because of suspected bugs in a part of the software stack. Or if parts
> > of the stack are being upgraded by hand, the state of the installation
> > can get pretty murky - especially in software stacks where there is
> > turnover in the staff maintaining them. Being able to completely
> > rebuild the stack from scratch to a known state with a single command is
> > very good.
> >
> > I used BuildIt (http://pypi.python.org/pypi/buildit/) with my first
> > attempt. This gives you a "root.ini" file which can be used to specify
> > the versions of all the parts of a stack. This would look something
> > like this:
> >
> > [namespaces]
> > python = ${buildoutdir}/tasks/mkpython.ini [2.4.4]
> > postgres = ${buildoutdir}/tasks/postgres.ini [8.1.9]
> > java = ${buildoutdir}/tasks/java.ini [1.6.0_06]
> > exonerate = ${buildoutdir}/tasks/exonerate.ini [1.4.0]
> > fftw = ${buildoutdir}/tasks/fftw.ini [3.1.2]
> > expat = ${buildoutdir}/tasks/expat.ini [2.0.1]
> > eland = ${buildoutdir}/tasks/eland.ini [0.3.0b3]
> >
> > # Python Packages
> > psycopg2 = ${buildoutdir}/tasks/psycopg2.ini [2.0.6]
> > pil = ${buildoutdir}/tasks/pil.ini [1.1.5]
> > sqlalchemy = ${buildoutdir}/tasks/sqlalchemy.ini [0.3.10]
> > numpy = ${buildoutdir}/tasks/numpy.ini [1.0.3.1]
> > matplotlib = ${buildoutdir}/tasks/matplotlib.ini [0.90.1]
> >
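The pinned versions in a listing like the one above can also be read back
programmatically, for example to store them next to a set of results. A
minimal sketch, assuming every entry has the form "name = path [version]" as
in the quoted file; the function name parse_pins is made up for illustration
and is not part of BuildIt:

```python
import re

# Matches lines such as: python = ${buildoutdir}/tasks/mkpython.ini [2.4.4]
PIN_LINE = re.compile(
    r"^\s*(?P<name>[\w.-]+)\s*=\s*(?P<path>\S+)\s*\[(?P<version>[^\]]+)\]\s*$"
)


def parse_pins(text):
    """Return {part name: version} for every 'name = path [version]' line."""
    pins = {}
    for line in text.splitlines():
        match = PIN_LINE.match(line)
        if match:  # section headers, comments and blank lines are skipped
            pins[match.group("name")] = match.group("version")
    return pins


# A shortened copy of the quoted root.ini, used here as sample input.
ROOT_INI = """\
[namespaces]
python = ${buildoutdir}/tasks/mkpython.ini [2.4.4]
postgres = ${buildoutdir}/tasks/postgres.ini [8.1.9]

# Python Packages
numpy = ${buildoutdir}/tasks/numpy.ini [1.0.3.1]
"""

if __name__ == "__main__":
    for name, version in sorted(parse_pins(ROOT_INI).items()):
        print(name, version)
```

Writing the resulting mapping into the output directory of each analysis run
would tie results to the stack versions that produced them.
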
> > The "build project" was kept separate from the other Python projects
> > used to process the data. In SVN trunk, the build tasks and config
> > would be updated to point to newer releases that were being developed
> > or tested against. Then an "instance branch" was maintained in SVN at /
> > branches/production/. A working copy of this branch was used to
> > install all of the software that was run to generate final results.
> > This way viewing the changelog of the /branches/production/ gave a log
> > of changes to the software stack used to generate results.
> >
> > This is quite a bit more work than just using a virtual machine and
> > taking snapshot images! But it does give you the benefit of being able
> > to install the stack directly onto bare hardware, which is necessary if
> > you need to use a cluster to generate results. If you are compiling all
> > software from source, you can also usually maintain builds for
> > different platforms, e.g. linux servers and Mac OS X workstations.
> >
> > BuildIt was a pretty good tool, but it's no longer being actively
> > maintained.
> >
> > These days I use Buildout (http://pypi.python.org/pypi/zc.buildout)
> > and I'm really happy with it as an installation tool.
> >
> > Buildout's killer feature is that it installs each part using
> > "recipes" that are themselves packaged as separate Python packages. So
> > when you write a recipe to install a specific part (e.g. PostgreSQL,
> > Python or PIL) it's very easy to then re-use that recipe in other
> > software stacks. For example, rather than having to write my own set
> > of commands to configure an LDAP instance, I can just add to a
> > project's buildout.cfg file:
> >
> > [slapd]
> > recipe = z3c.recipe.ldap
> > urls = ldap://127.0.0.1:1700
> > allow = bind_v2
> >
> > When Buildout installs the "slapd" part, it will go to PyPI (if
> > needed) and fetch the latest release of z3c.recipe.ldap and install it
> > as an egg. And if you want to have reproducible installation recipes,
> > you can version those as well, with syntax such as "recipe =
> > z3c.recipe.ldap == 0.1".
> >
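A quick way to see which parts would still float to the latest recipe
release is to scan the config for recipes without a pin. A minimal sketch,
assuming pins are written in setuptools requirement style ("name ==
version"); unpinned_recipes is a hypothetical helper and not part of
Buildout, and example.recipe.postgres is a made-up recipe name:

```python
import configparser


def unpinned_recipes(cfg_text):
    """Return {part: recipe} for parts whose recipe carries no '==' pin."""
    # Interpolation is disabled so that Buildout's ${...} references are
    # left alone instead of being treated as configparser syntax.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    loose = {}
    for part in parser.sections():
        recipe = parser.get(part, "recipe", fallback=None)
        if recipe is not None and "==" not in recipe:
            loose[part] = recipe
    return loose


# Sample input: one pinned part and one unpinned (hypothetical) part.
BUILDOUT_CFG = """\
[buildout]
parts = slapd db

[slapd]
recipe = z3c.recipe.ldap == 0.1
urls = ldap://127.0.0.1:1700

[db]
recipe = example.recipe.postgres
"""

if __name__ == "__main__":
    for part, recipe in unpinned_recipes(BUILDOUT_CFG).items():
        print("unpinned:", part, recipe)
```

This only checks the recipe lines themselves; it does not look at any other
place where versions might be pinned.
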
> > Buildout has also gained good critical mass. There are over 100
> > install recipes available now on PyPI
> > (http://pypi.python.org/pypi?:action=browse&c=512). There are recipes
> > for installing standard "CMMI" (configure, make, make install)
> > software, databases, web applications, collections of Python
> > libraries, templates for generating config files, and even Perl
> > libraries :P. There are tools for generating source releases of a
> > buildout which can be permanently archived (zc.sourcerelease).
> > Archiving was a problem with my initial BuildIt-based attempt: most
> > software was fetched with wget, and because of link rot I later
> > switched to copying all tarballs to a local mirror first and fetching
> > them from there. There are also tools to help automate releases
> > (zest.releaser and collective.releaser).
> >
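When tarballs are copied to a local mirror, recording a checksum for each
one makes it possible to confirm later that an archived file is the one the
stack was built from. A minimal sketch using only the standard library; the
manifest format (file name mapped to SHA-256 hex digest) and the function
names are assumptions for illustration, not something the tools above
provide:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_mirror(mirror_dir, manifest):
    """Return sorted names of files that are missing or whose digest differs.

    manifest maps a file name inside mirror_dir to its expected digest.
    """
    problems = []
    for name, expected in manifest.items():
        path = Path(mirror_dir) / name
        if not path.is_file() or sha256_of(path) != expected:
            problems.append(name)
    return sorted(problems)
```

An empty list from verify_mirror means every tarball named in the manifest
is present and unchanged.
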
> > However, there is still a long way to go before these kinds of tools
> > stop being a huge learning curve and burden for your average
> > bioinformatician. Documentation is sparse, tools to help automate
> > the tedious stuff, such as making releases and creating source
> > archives, are still fairly young, and a clearer set of best practices
> > has yet to be established. But for certain use cases of
> > reproducibility, I think it's a very good way to go.
>
> Thanks, that's really interesting stuff.  I had always assumed
> buildout to be a zope only kind of thing, but it seems not.  I notice
> that buildout will work with virtualenv as well:
> http://wiki.python.org/moin/buildout/pycon2008%20tutorial
>
> _______________________________________________
> biology-in-python mailing list - bip at lists.idyll.org.
>
> See http://bio.scipy.org/ for our Wiki.
>