[bip] Reproducible research

Wed Mar 4 09:17:18 PST 2009

2009/3/4 Bruce Southey <bsouthey at gmail.com>:
> James Casbon wrote:
>> 2009/3/4 Bruce Southey <bsouthey at gmail.com>:
>>
>>> On Tue, Mar 3, 2009 at 2:03 PM, James Casbon <casbon at gmail.com> wrote:
>>>
>>>> Hi Bip,
>>>>
>>>> I've been thinking about reproducible computational research in
>>>> biology recently and I thought I'd drop it your way.  There seem to be
>>>> several components of this, some already recognised and some not.
>>>>
>>>> Database and software tools are already known to be badly maintained:
>>>> http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000136
>>>>
>>> That paper is badly flawed in its most critical assumption: that the
>>> Nucleic Acids Research Web Server issue requires a web version to
>>> remain functional. I thought there was an expectation that the site
>>> should remain up for some time, but that is not so in the current
>>> instructions
>>> (http://www.oxfordjournals.org/our_journals/nar/for_authors/submission_webserver.html
>>> ). It is also flawed in that there was no Google or other web search
>>> for URL changes - I know of a couple of cases where that has happened.
>>> Even two years in bioinformatics is a really long time (let alone the
>>> 10 years of Biopython!).
>>>
>>
>> My own take is that for databases and software, a version should be
>> placed on archive.org.
>>
> That will not work for web applications, which is what the web server
> issue of NAR is all about! Furthermore, databases must be kept
> up-to-date: who wants to use a 10-year-old version of the human
> genome?

You would if you wanted to reproduce a result.

>
> Also, how does this keep up with bug fixes and enhancements to the original?
> When these things go public, it is possible that new and unexpected
> consequences appear simply because of the range of possibilities or the
> 'laziness' of the programmer (guilty of that). In my case, I have made two
> significant changes to the original version as well as the odd bug fix
> within that period, but the rules of the web server issue prevent new
> papers within a couple of years of the first publication. But there is a
> link to the original version on the page.
>
> Also, browsers change; see the issues with Protein Explorer, which could
> not work with Internet Explorer and then Mozilla. Some were related to
> using Chime (it is now trying to work with Jmol) as well as JavaScript
> changes.
>>
>>> However, I don't think anyone has a right to complain if you do not
>>> support people's efforts beyond just using their code.
>>>
>>>
>>>> But that problem is very difficult.  What I am more interested in here
>>>> is the day-to-day work of making an analysis work again and again, and
>>>> applying it to other data.
>>>>
>>> If you can guarantee the exact same input and output, you just write
>>> scripts and perhaps a master script that either you run when you decide
>>> to or some cron job runs. But it is very hard to ensure that the input
>>> and output are the same. Biopython and numpy/scipy do a good job at
>>> this by having sufficiently generic modules that can handle the
>>> important stuff so you only need to focus on input and output aspects.
>>>
>>
>> Just using a script is far less useful than a make-style system.  If
>> the first steps in the script are, say, to download and formatdb a large
>> database, you only want to run these once if possible.  Yet you want
>> them to be in the script because they are necessary first steps.
>>
> That is complete rubbish because the makefile is just a script!
> Further, I do not see how a makefile would be useful here. Sure, you can
> tell it to run formatdb if the database has 'changed', but it is not easy
> to check whether it should download a database.

Hmmm.   Are you the Bip troll?  If a makefile is just a script, then
why does make exist?
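
For what it's worth, the difference from a plain script is the staleness
check: a make-style rule only runs when the target is missing or older than
one of its dependencies. A minimal Python sketch of that idea follows. The
names is_stale and build are made up for illustration, and this is not how
make or scons is actually implemented.

```python
import os


def is_stale(target, dependencies):
    """Return True if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Compare modification times, which is the basic make heuristic.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def build(target, dependencies, action):
    """Run action() only when target is stale; return whether it ran.

    action is a hypothetical callable that is expected to create or
    refresh target (for example, by invoking formatdb on a FASTA file).
    """
    if is_stale(target, dependencies):
        action()
        return True
    return False
```

An expensive step such as formatdb would be passed in as action, so it stays
in the pipeline but is skipped when its output is already up to date.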

To check whether to download a database, you check whether the local
copy exists.
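
Something like the following would do, as a rough sketch. fetch_if_missing
is a hypothetical helper, the ".part" suffix is my own convention, and the
fetch parameter is only there so the download can be swapped out; by default
it uses the standard library's urllib.request.urlretrieve.

```python
import os
import urllib.request


def fetch_if_missing(url, local_path, fetch=urllib.request.urlretrieve):
    """Download url to local_path unless a local copy already exists.

    Returns True if a download happened, False if the local copy was reused.
    """
    if os.path.exists(local_path):
        return False
    # Download to a temporary name first so that an interrupted transfer
    # is not mistaken for a complete local copy on the next run.
    tmp_path = local_path + ".part"
    fetch(url, tmp_path)
    os.replace(tmp_path, local_path)
    return True
```

Note this only tests for existence; it does not tell you whether the remote
database has changed since the local copy was made.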

>
>>
>>>> Makefiles are the obvious way of doing this, and there has been some
>>>> work around this:
>>>> http://skam.sourceforge.net/
>>>> http://biowiki.org/MakefileManifesto
>>>> AFAIK python + make = scons
>>>> http://www.scons.org/
>>>> And these guys are doing interesting stuff with scons:
>>>> http://reproducibility.org/ (but their tools are a bit domain specific
>>>> for what I want)
>>>>
>>>> Then, there are the workflow engines, of which Taverna seems the most
>>>> enterprisey (grid!, Web services!):
>>>> http://taverna.sourceforge.net/
>>>> Galaxy's workflows have been coming on a bit as well:
>>>> http://galaxy.psu.edu/
>>>> But you can't run them from the command line (and looking at the code,
>>>> the controller and the view are so coupled you won't be able to).  And
>>>> you can't parametrize them.
>>>>
>>> Are you talking about computer languages like Python or actual
>>> bioinformatics programs?
>>> It does not matter much if the code is not maintained. Usually
>>>
>>
>> Not sure I agree with that.
>>
> With what? If the code is not being maintained then you are stuck in the
> past so you have to run old versions that have various old bugs that you
> may expose. For example, Biopython had a Numeric bug that was fixed in
> numpy associated with memory alignment but you would have to rewrite the
> Biopython code to avoid it or learn the related Numeric code. So who
> fixes those bugs?

Now I'm confused, are you disagreeing with your initial position?

Statement 1: "It does not matter much if the code is not maintained"
Statement 2: "If the code is not being maintained then you are stuck
in the past so you have to run old versions that have various old
bugs"