[bip] Reproducible research

Bruce Southey bsouthey at gmail.com
Wed Mar 4 09:01:22 PST 2009


James Casbon wrote:
> 2009/3/4 Bruce Southey <bsouthey at gmail.com>:
>   
>> On Tue, Mar 3, 2009 at 2:03 PM, James Casbon <casbon at gmail.com> wrote:
>>     
>>> Hi Bip,
>>>
>>> I've been thinking about reproducible computational research in
>>> biology recently and I thought I'd drop it your way.  There seem to be
>>> several components of this, some already recognised and some not.
>>>
>>> Database and software tools are already known to be badly maintained:
>>> http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000136
>>>       
>> That paper is badly flawed in its most critical assumption: that the
>> requirement of the Nucleic Acids Research Web Server issue is that a
>> web version has to remain functional. I thought there was an
>> expectation that the site should remain up for some time, but that is
>> not so in the current instructions
>> (http://www.oxfordjournals.org/our_journals/nar/for_authors/submission_webserver.html
>> ). It is also flawed in that there was no Google or other web search
>> for URL changes - I know a couple where that has happened. Even two
>> years in bioinformatics is a really long time (let alone the 10 years
>> of Biopython!).
>>     
>
> My own take is that for databases and software, a version should be
> placed on archive.org.
>   
That will not work for web applications, which is what the web server
issue of NAR is all about! Furthermore, in terms of databases, these must
be kept up-to-date - who wants to use a 10-year-old version of the human
genome?

Also, how does this keep up with bug fixes and enhancements to the original?
When these things go public it is possible that new and unexpected
consequences appear simply because of the range of possibilities or the
'laziness' of the programmer (guilty of that). In my case, I have made two
significant changes to the original version as well as the odd bug fix
within that period, but the rules of the web server issue prevent new
papers within a couple of years of the first publication. There is,
however, a link to the original version on the page.

Also, browsers change - see the issues with Protein Explorer, which could
not work with Internet Explorer and then Mozilla. Some were related to
using Chime (though it is now trying to work with Jmol) as well as to
JavaScript changes.
>   
>> However, I don't think anyone has a right to complain if they do not
>> support people's efforts other than by just using their code.
>>
>>     
>>> But that problem is very difficult.  What I am more interested here is
>>> the day to day work of making an analysis work again and again, and
>>> applying it to other data.
>>>       
>> If you can guarantee the exact same input and output, you just write
>> scripts and perhaps a master script that you either run when you decide
>> to or via some cron job. But it is very hard to ensure that the input
>> and output are the same. Biopython and numpy/scipy do a good job at
>> this by having sufficiently generic modules that handle the important
>> stuff, so you only need to focus on the input and output aspects.
>>     
>
> Just using a script is far less useful than a make-style system.  If
> the first step of the script is, say, to download and formatdb a large
> database, you only want to run these once if possible.  Yet you want
> them to be in the script because they are necessary first steps.
>   
That is complete rubbish because the makefile is just a script!
Further, I do not see how a makefile would be useful here. Sure, you can
tell it to run formatdb if the database has 'changed', but it is not easy
to check whether it should download a database.
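
To be concrete about what such a make-style check actually buys you, here
is a rough sketch in plain Python (the file names, URL and formatdb flags
are made up for illustration, not anyone's real pipeline): the download
runs only when the local copy is missing, and formatdb is re-run only when
the FASTA file is newer than its indices. Deciding whether the *remote*
database has changed is exactly the hard part and is not attempted here.

import os
import subprocess

DB_URL = "http://example.org/nr.fasta.gz"   # hypothetical download location
FASTA = "nr.fasta"
DB_INDEX = FASTA + ".phr"   # one of the index files formatdb writes

def out_of_date(target, source):
    # A target is out of date if it is missing or older than its source.
    return (not os.path.exists(target)
            or os.path.getmtime(target) < os.path.getmtime(source))

# Step 1: download only if the local copy is missing (assumes wget and gunzip).
if not os.path.exists(FASTA):
    subprocess.check_call(["wget", "-O", FASTA + ".gz", DB_URL])
    subprocess.check_call(["gunzip", FASTA + ".gz"])

# Step 2: re-run formatdb only when the FASTA file is newer than its indices.
if out_of_date(DB_INDEX, FASTA):
    subprocess.check_call(["formatdb", "-i", FASTA, "-p", "T"])

That is essentially all a makefile would record anyway, which is the point:
the dependency logic fits in a few lines of the script itself.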

>   
>>> Makefiles are the obvious way of doing this, and there has been some
>>> work around this:
>>> http://skam.sourceforge.net/
>>> http://biowiki.org/MakefileManifesto
>>> AFAIK python + make = scons
>>> http://www.scons.org/
>>> And these guys are doing interesting stuff with scons:
>>> http://reproducibility.org/ (but their tools are a bit domain specific
>>> for what I want)
>>>
>>> Then, there are the workflow engines, of which taverna seems the most
>>> enterprisey (grid!, Web services!):
>>> http://taverna.sourceforge.net/
>>> Galaxy's workflows has been coming on a bit as well:
>>> http://galaxy.psu.edu/
>>> But you can't run them from the command line (and looking at the code,
>>> the controller and the view are so coupled you won't be able to).  And
>>> you can't parametrize them.
>>>       
>> Are you talking about computer languages like Python or actual
>> bioinformatics programs?
>> It does not matter much if the code is not maintained. Usually
>>     
>
> Not sure I agree with that.
>   
With what? If the code is not being maintained then you are stuck in the
past, running old versions with various old bugs that you may expose. For
example, Biopython had a Numeric bug associated with memory alignment that
was fixed in numpy, but you would have to rewrite the Biopython code to
avoid it or learn the relevant Numeric code. So who fixes those bugs?

Bruce


