[bip] Reproducible research

Wed Mar 4 10:02:27 PST 2009

James Casbon wrote:
> 2009/3/4 Bruce Southey <bsouthey at gmail.com>:
>   
>> James Casbon wrote:
>>     
>>> 2009/3/4 Bruce Southey <bsouthey at gmail.com>:
>>>
>>>       
>>>> On Tue, Mar 3, 2009 at 2:03 PM, James Casbon <casbon at gmail.com> wrote:
>>>>
>>>>         
>>>>> Hi Bip,
>>>>>
>>>>> I've been thinking about reproducible computational research in
>>>>> biology recently and I thought I'd drop it your way.  There seem to be
>>>>> several components of this, some already recognised and some not.
>>>>>
>>>>> Database and software tools are already known to be badly maintained:
>>>>> http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000136
>>>>>
>>>>>           
>>>> That paper is badly flawed in its most critical assumption: that a
>>>> requirement for the Nucleic Acids Research Web Server issue is that a
>>>> web version has to remain functional. I thought there was an
>>>> expectation that the site should remain up for some time, but that is
>>>> not so in the current instructions
>>>> (http://www.oxfordjournals.org/our_journals/nar/for_authors/submission_webserver.html
>>>> ). It is also flawed in that there was no Google or other web search
>>>> for URL changes - I know a couple where that has happened. Even two
>>>> years in bioinformatics is a really long time (let alone the 10 years
>>>> of Biopython!).
>>>>
>>>>         
>>> My own take is that for databases and software, a version should be
>>> placed on archive.org.
>>>
>>>       
>> That will not work for web applications, which is what the web server
>> issue of NAR is all about! Furthermore, in terms of databases, these must
>> be kept up-to-date, as who wants to use a 10-year-old version of the human
>> genome?
>>     
>
> You would if you wanted to reproduce a result.
>
>   
>> Also, how does this keep up with bug fixes and enhancements to the original?
>> When these things go public, it is possible that new and unexpected
>> consequences appear simply because of the range of possibilities or the
>> 'laziness' of the programmer (guilty of that). In my case, I have made two
>> significant changes to the original version as well as the odd bug fix
>> within that period, but the rules of the web server issue prevent new
>> papers within a couple of years of the first publication. But there is a
>> link to the original version on the page.
>>
>> Also, browsers change; see the issues with Protein Explorer, which could
>> not work with Internet Explorer and then Mozilla. Some were related to
>> using Chime (but now it is trying to work with Jmol) as well as JavaScript
>> changes.
>>     
>>>> However, I don't think anyone has a right to complain if you do not
>>>> support people's efforts beyond just using their code.
>>>>
>>>>
>>>>         
>>>>> But that problem is very difficult.  What I am more interested in here is
>>>>> the day to day work of making an analysis work again and again, and
>>>>> applying it to other data.
>>>>>
>>>>>           
>>>> If you can guarantee the exact same input and output, you just write
>>>> scripts and perhaps a master script that either you decide when to run
>>>> or that some cron job runs. But it is very hard to ensure that the input
>>>> and output are the same. Biopython and numpy/scipy do a good job at
>>>> this by having sufficiently generic modules that can handle the
>>>> important stuff so you only need to focus on input and output aspects.
>>>>
>>>>         
>>> Just using a script is far less useful than a make-style system.  If
>>> the first step in the script is, say, to download and formatdb a large
>>> database you only want to run these once if possible.  Yet you want
>>> them to be in the script because they are necessary first steps.
>>>
>>>       
>> That is complete rubbish because the makefile is just a script!
>> Further, I do not see how a makefile would be useful here. Sure, you can
>> tell it to run formatdb if the database has 'changed', but it is not easy
>> to check if it should download a database.
>>     
>
> Hmmm.   Are you the Bip troll?  
I have not found any taxonomy or sequence information on the existence
of trolls to answer that question.

Bruce
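
For illustration, the make-style behaviour discussed above (keep an
expensive step such as a download or formatdb in the pipeline, but run it
only when needed) can be sketched in Python. This is a minimal sketch, not
anything proposed in the thread: the names needs_update and run_step are
hypothetical, and the action is any callable you supply. It only compares
local file modification times, so it does not address the harder question
raised above of deciding when a remote database should be downloaded again.

```python
import os


def needs_update(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def run_step(target, sources, action):
    """Run action(target, sources) only when target is out of date.

    Returns True if the action ran and False if it was skipped.
    This mirrors the basic make rule: rebuild when the target is
    absent or when a prerequisite has a newer modification time.
    """
    if needs_update(target, sources):
        action(target, sources)
        return True
    return False
```

A pipeline would then call run_step once per stage, passing each stage's
output file as the target and its input files as the sources, so that
rerunning the whole script skips stages whose outputs are already current.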