[bip] Reproducible research

Wed Mar 4 01:32:52 PST 2009

2009/3/4 Bruce Southey <bsouthey at gmail.com>:
> On Tue, Mar 3, 2009 at 2:03 PM, James Casbon <casbon at gmail.com> wrote:
>> Hi Bip,
>>
>> I've been thinking about reproducible computational research in
>> biology recently and I thought I'd drop it your way.  There seem to be
>> several components of this, some already recognised and some not.
>>
>> Database and software tools are already known to be badly maintained:
>> http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000136
> That paper is badly flawed in its most critical assumption: that the
> Nucleic Acids Research Web Server issue requires a web version to
> remain functional. I thought there was an expectation that the site
> should remain up for some time, but that is not so in the current
> instructions
> (http://www.oxfordjournals.org/our_journals/nar/for_authors/submission_webserver.html
> ). It is also flawed in that there was no Google or other web search for URL
> changes - I know a couple where that has happened. Even two years in
> bioinformatics is a really long time (let alone the 10 years of
> Biopython!).

My own take is that for databases and software, a version should be
placed on archive.org.

>
> However, I don't think anyone has a right to complain if you do not
> support people's efforts other than just using their code.
>
>> But that problem is very difficult.  What I am more interested in here is
>> the day to day work of making an analysis work again and again, and
>> applying it to other data.
>
>
> If you can guarantee the exact same input and output, you just write
> scripts and perhaps a master script that you either run by hand or
> from some cron job. But it is very hard to ensure that the input
> and output are the same. Biopython and numpy/scipy do a good job at
> this by having sufficiently generic modules that can handle the
> important stuff so you only need to focus on input and output aspects.

Just using a script is far less useful than a make-style system.  If
the first step in the script is, say, to download and formatdb a large
database, you only want to run these once if possible.  Yet you want
them to be in the script because they are necessary first steps.
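
As a rough sketch of the make-style idea (not the API of make, scons or
any other tool), a step can be skipped when its output already exists
and is no older than its inputs.  Everything named here is hypothetical:
build(), needs_rebuild(), the file names, and the two fake_* functions,
which only stand in for a real download and a real formatdb run.

```python
import os
import tempfile


def needs_rebuild(target, deps):
    """Return True if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in deps)


def build(target, deps, action):
    """Run action(target, deps) only when target is out of date.

    Returns True if the action ran, False if it was skipped.
    """
    if needs_rebuild(target, deps):
        action(target, deps)
        return True
    return False


def fake_download(target, deps):
    # Hypothetical stand-in for an expensive database download.
    with open(target, "w") as handle:
        handle.write(">seq1\nACGT\n")


def fake_formatdb(target, deps):
    # Hypothetical stand-in for indexing the database (e.g. formatdb).
    with open(deps[0]) as src, open(target, "w") as out:
        out.write("indexed %d bytes\n" % len(src.read()))


def run_pipeline(workdir):
    """Run both steps; return the names of the steps that executed."""
    db = os.path.join(workdir, "db.fasta")
    index = os.path.join(workdir, "db.index")
    ran = []
    if build(db, [], fake_download):
        ran.append("download")
    if build(index, [db], fake_formatdb):
        ran.append("formatdb")
    return ran


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    print(run_pipeline(workdir))  # first run: both steps execute
    print(run_pipeline(workdir))  # second run: nothing is out of date
```

Both steps stay in the script, so the analysis is fully described, but
rerunning it costs nothing until an output is deleted or an input
changes.  Timestamp comparison is the simplest test; a real tool might
compare content hashes instead.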

>
>>
>> Makefiles are the obvious way of doing this, and there has been some
>> work around this:
>> http://skam.sourceforge.net/
>> http://biowiki.org/MakefileManifesto
>> AFAIK python + make = scons
>> http://www.scons.org/
>> And these guys are doing interesting stuff with scons:
>> http://reproducibility.org/ (but their tools are a bit domain specific
>> for what I want)
>>
>> Then, there are the workflow engines, of which taverna seems the most
>> enterprisey (grid!, Web services!):
>> http://taverna.sourceforge.net/
>> Galaxy's workflows have been coming on a bit as well:
>> http://galaxy.psu.edu/
>> But you can't run them from the command line (and looking at the code,
>> the controller and the view are so coupled you won't be able to).  And
>> you can't parametrize them.
>
> Are you talking about computer languages like Python or actual
> bioinformatics programs?
> It does not matter much if the code is not maintained. Usually

Not sure I agree with that.

> software does not keep pace with the major changes in various computer
> languages or compilers. For example, updates to GNU gcc have meant
> older code can be very hard to update, Java has gone a couple of
> versions so some old functions are no longer present or open source
> options don't work. Even Python has made a change, but it is not yet a
> factor as I am not sure if people are jumping to 3.0 yet. So unless
> people actually release code into an open source environment such as
> Biopython, only the dedication of the original authors and perhaps a
> few generous others keeps it functional or updated.

Biopython is pretty segmented.  I have been very lazy at updating the
stuff I submitted, and no-one else has.

>>
>> So how is BIP doing this?  I really want something simple, that can be
>> used at a command line or the web, and preferably in python.
>>
>
> It is not clear what you want to do. Do you mean doing something like
> Biopython where there are modules to run external applications like
> blast and clustalw? Or do you run all sequences at once in blast and
> parse the output in Biopython or Python?
> If so, you write code (often more than one script) to read and parse
> the input into a desired format, call the appropriate routines and
> parse the output. Exactly what you do depends on the problem.

I'm being deliberately vague about what I want to do.  Combining all of
the above in many different ways is probably the closest I can get.
Obviously I am aware that you can write scripts and 'call appropriate
routines'.

James