[bip] Reproducible research

C. Titus Brown ctb at msu.edu
Mon Mar 9 06:39:07 PDT 2009

On Mon, Mar 09, 2009 at 10:42:47AM +0000, Leighton Pritchard wrote:
-> Apologies for the length.  It's a character trait ;)


-> On 07/03/2009 04:39, "C. Titus Brown" <ctb at msu.edu> wrote:
-> > On Thu, Mar 05, 2009 at 09:51:48AM +0000, Leighton Pritchard wrote:
-> > -> There's another issue with reproducing work from others' publications that
-> > -> hasn't come up yet: the work is frequently described inadequately for
-> > -> reproduction, in the methods section.
-> > -> 
-> > -> In my experience, this is depressingly often the case for publications that
-> > -> apply bioinformatics.
-> > [...] There's very little incentive for an accurate description of
-> > the process by which you arrived at your results.
-> > I am mildly skeptical that there's significant value to demanding
-> > exact reproducibility in many circumstances.
-> We could quibble over what you mean by 'exact' in that statement but, in
-> general terms, if your work is not reproducible you are not doing Science,
-> but rather Pseudoscience (or, in a best-case scenario, 'hypothesis
-> generation').  In effect, you're doing no more than generating an anecdote
-> (which isn't as denigrating as it might sound - many anecdotes have proven
-> to be useful starting points for real insight).  Not that reproducibility is
-> a *sufficient* claim to a correct result describing the 'true' state of the
-> world around us, but it is *necessary* for the Scientific Method.

And yet... presumably you agree science Has Been Done, in bioinformatics
and elsewhere?  Despite the generally rather abysmal quality of Methods
sections and the lack of open source software using version-locked
databases?  Methinks you have a contradiction ;)

While I have respect for the argument that some form of reproducibility
is important, I think our discussion on this list is taking it a bit
far.  I personally don't care too much about the exact version of nr you
are using, unless it's somehow critically important for the analysis
(which then suggests to me that you're doing the wrong analysis ;)

In a previous e-mail, I said that I'd like to have access to your
source, so I can run, modify, and grok your code.  I'd also like to have
access to the important parts of your raw data, so I can run it through
my own tools.  I think these are important to doing Science because they
speak directly to the question of whether or not your research is
reproducible.  Being able to reproduce every jot and tittle of your
publication, however, is not so important to me; I can only think of
five or six papers over the last 10 years where I would have even wanted
to try to reproduce their results.

I think demanding exact reproducibility is a bit of a distraction.  It's
easy to get sidetracked by questions of whether or not a particular
analysis is reproducible.  I'd rather focus on questions like:

 - are the results interesting?
 - is the analysis interesting, e.g. does it reveal novel structure in
   data, or is the analysis technique particularly sensitive?

If either of these is true, then someone (in the original group or
elsewhere; eventually, someone else, if it's interesting enough) will
follow up on the research, and we will eventually find out if the
results truly were reproducible.

Anecdote: I attended a talk the other day on next-gen sequencing, and it
will soon be cheaper for the sequencing center to rerun a particular
Illumina GA run than it is to store the resulting image data (~1 TB?)
for 6 months.  So primary data is simply getting tossed.  What do Strong
Reproducibilists think should be done with that data?  It *could* be
important... but it's probably not.  How much effort and expense do we
want to go to here?

Leighton, as I think you acknowledge, someone has to make the decisions
about what is important and what's not, and it's every scientist's job
to do that as appropriately as they can.  Discussing these things and
providing technology to facilitate them are good.  Arguing that we should
be individually responsible for retaining every bit of data that anyone
might find relevant -- as I think some have on this list -- is,
ultimately, silly.  Or at least distracting. ;)

(I *still* think it should be mandatory to make source code available for
review along with the results in a publication, mind you.)
