[bip] Reproducible research

Wed Mar 4 01:49:05 PST 2009

On Wed, Mar 4, 2009 at 10:32 AM, James Casbon <casbon at gmail.com> wrote:
> 2009/3/4 Bruce Southey <bsouthey at gmail.com>:
>> On Tue, Mar 3, 2009 at 2:03 PM, James Casbon <casbon at gmail.com> wrote:
>>> Hi Bip,
>>>
>>> I've been thinking about reproducible computational research in
>>> biology recently and I thought I'd drop it your way.  There seem to be
>>> several components of this, some already recognised and some not.
>>>
>>> Database and software tools are already known to be badly maintained:
>>> http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000136
>> That paper is badly flawed in the most critical assumption that the
>> requirement for the Nucleic Acids Research Web issue is that a web
>> version has to remain functional. I thought there was an expectation
>> that the site should remain up for some time but that is not so in the
>> current instructions
>> (http://www.oxfordjournals.org/our_journals/nar/for_authors/submission_webserver.html
>> ). Also flawed in that there was no Google or other web search for url
>> changes - I know a couple where that has happened. Even two years in
>> bioinformatics is a really long time (let alone the 10 years of
>> Biopython!).
>
> My own take is that for databases and software, a version should be
> placed on archive.org.
>
>>
>> However, I don't think anyone has a right to complain if you do not
>> support people's efforts other than by just using their code.
>>
>>> But that problem is very difficult.  What I am more interested in here is
>>> the day to day work of making an analysis work again and again, and
>>> applying it to other data.
>>
>>
>> If you can guarantee the exact same input and output, you just write
>> scripts and perhaps a master script that you either run by hand or
>> run from a cron job. But it is very hard to ensure that the input
>> and output are the same. Biopython and numpy/scipy do a good job at
>> this by having sufficiently generic modules that can handle the
>> important stuff so you only need to focus on input and output aspects.
>
> Just using a script is far less useful than a make style system.  If
> the first step on the script is, say, to download and formatdb a large
> database you only want to run these once if possible.  Yet you want
> them to be in the script because they are necessary first steps.
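
The run-once behaviour described above is what make's timestamp
comparison gives you. A minimal sketch of that check in Python, assuming
file modification times are a good enough signal; the names
is_up_to_date and build are hypothetical, not part of make or any library:

```python
import os


def is_up_to_date(target, sources):
    """Return True if target exists and is at least as new as every source.

    Raises OSError if a source file is missing.
    """
    if not os.path.exists(target):
        return False
    target_mtime = os.path.getmtime(target)
    return all(os.path.getmtime(src) <= target_mtime for src in sources)


def build(target, sources, action):
    """Call action() only when target is missing or older than a source.

    Returns True if action() was run, False if the step was skipped.
    """
    if is_up_to_date(target, sources):
        return False  # nothing to do, the expensive step is skipped
    action()
    return True
```

Here action would be whatever produces the target, for example a
function that runs the download or the formatdb step.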

You should use makefiles in the simplest way possible: forget about
dependencies and the other complicated features, and just treat each
target as a named procedure followed by its list of commands.

For example (the command line under the target must start with a tab):
$ cat Makefile
print_hello:
	@echo 'hello world'
$ make print_hello
hello world

This way you can produce makefiles that other people find very easy to
understand, much more so than wrapper scripts, and you won't have to
spend much time learning make's syntax.
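
For those who would rather stay in Python, the same targets-as-procedures
idea can be sketched roughly as below. This is an illustration only, not
a replacement for make; the step names and their bodies are made-up
placeholders.

```python
import sys


def download_db():
    """Hypothetical step: a real version would fetch the database."""
    return "downloading database"


def format_db():
    """Hypothetical step: a real version would call formatdb."""
    return "formatting database"


def run_blast():
    """Hypothetical step: a real version would run the search."""
    return "running blast"


# Each target is just a named procedure; there is no dependency tracking.
TARGETS = {
    "download_db": download_db,
    "format_db": format_db,
    "run_blast": run_blast,
}


def run(target):
    """Run one named target, print its message and return it."""
    if target not in TARGETS:
        raise KeyError("unknown target: %s" % target)
    message = TARGETS[target]()
    print(message)
    return message


if __name__ == "__main__":
    # Run the targets named on the command line, in the order given.
    for name in sys.argv[1:]:
        run(name)
```

Saved as, say, pipeline.py (a hypothetical file name), running
"python pipeline.py download_db format_db" would execute those two
steps in order.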

P.S. You didn't send your message to the mailing list; can you send it again?

>
>>
>>>
>>> Makefiles are the obvious way of doing this, and there has been some
>>> work around this:
>>> http://skam.sourceforge.net/
>>> http://biowiki.org/MakefileManifesto
>>> AFAIK python + make = scons
>>> http://www.scons.org/
>>> And these guys are doing interesting stuff with scons:
>>> http://reproducibility.org/ (but their tools are a bit domain specific
>>> for what I want)
>>>
>>> Then, there are the workflow engines, of which taverna seems the most
>>> enterprisey (grid!, Web services!):
>>> http://taverna.sourceforge.net/
>>> Galaxy's workflows have been coming on a bit as well:
>>> http://galaxy.psu.edu/
>>> But you can't run them from the command line (and looking at the code,
>>> the controller and the view are so coupled you won't be able to).  And
>>> you can't parametrize them.
>>
>> Are you talking about computer languages like Python or actual
>> bioinformatics programs?
>> It does not matter much if the code is not maintained. Usually
>
> Not sure I agree with that.
>
>> software does not keep pace with the major changes in various computer
>> languages or compilers. For example, updates to GNU gcc have meant
>> older code can be very hard to update, Java has gone a couple of
>> versions so some old functions are no longer present or open source
>> options don't work. Even Python has made a change but not yet a factor
>> as I am not sure if people are jumping to 3.0 yet. So unless people
>> actually release code into an open source environment such as
>> Biopython, only the dedication of the original authors and perhaps a
>> few generous others keeps it functional or updated.
>
> Biopython is pretty segmented.  I have been very lazy at updating the
> stuff I submitted, and no-one else has.
>
>>>
>>> So how is BIP doing this?  I really want something simple, that can be
>>> used at a command line or the web, and preferably in python.
>>>
>>
>> It is not clear what you want to do. Do you mean doing something like
>> Biopython where there are modules to run external applications like
>> blast and clustalw? Or you run all sequences at once in blast and
>> parse the output in Biopython or Python.
>> If so, you write code (often more than one script) to read and parse
>> the input into a desired format, call the appropriate routines and
>> parse the output. Exactly what you do depends on the problem.
>
> I'm being deliberately vague about what I want to do.  Combining all of
> the above in many different ways is probably the closest I can say.
> Obviously I am aware that you can write scripts and 'call appropriate
> routines'.
>
> James

-- 

My blog on bioinformatics (now in English): http://bioinfoblog.it