[bip] Reproducible research

Mon Mar 9 08:46:54 PDT 2009

2009/3/9 C. Titus Brown <ctb at msu.edu>:
> On Mon, Mar 09, 2009 at 10:42:47AM +0000, Leighton Pritchard wrote:
> -> Apologies for the length.  It's a character trait ;)
>
> ;)
>
> -> On 07/03/2009 04:39, "C. Titus Brown" <ctb at msu.edu> wrote:
> ->
> -> > On Thu, Mar 05, 2009 at 09:51:48AM +0000, Leighton Pritchard wrote:
> -> > -> There's another issue with reproducing work from others' publications that
> -> > -> hasn't come up yet: the work is frequently described inadequately for
> -> > -> reproduction, in the methods section.
> -> > ->
> -> > -> In my experience, this is depressingly often the case for publications that
> -> > -> apply bioinformatics.
> -> > [...] There's very little incentive for an accurate description of
> -> > the process by which you arrived at your results.
> -> > I am mildly skeptical that there's significant value to demanding
> -> > exact reproducibility in many circumstances.
> ->
> -> We could quibble over what you mean by 'exact' in that statement but, in
> -> general terms, if your work is not reproducible you are not doing Science,
> -> but rather Pseudoscience (or, in a best-case scenario, 'hypothesis
> -> generation').  In effect, you're doing no more than generating an anecdote
> -> (which isn't as denigrating as it might sound - many anecdotes have proven
> -> to be useful starting points for real insight).  Not that reproducibility is
> -> a *sufficient* claim to a correct result describing the 'true' state of the
> -> world around us, but it is *necessary* for the Scientific Method.
>
> And yet... presumably you agree science Has Been Done, in bioinformatics
> and elsewhere?  Despite the generally rather abysmal quality of Methods
> sections and the lack of open source software using version-locked
> databases?  Methinks you have a contradiction ;)
>
> While I have respect for the argument that some form of reproducibility
> is important, I think our discussion on this list is taking it a bit
> far.  I personally don't care too much about the exact version of nr you
> are using, unless it's somehow critically important for the analysis
> (which then suggests to me that you're doing the wrong analysis ;)
>
> In a previous e-mail, I said that I'd like to have access to your
> source, so I can run, modify, and grok your code.  I'd also like to have
> access to the important parts of your raw data, so I can run it through
> my own tools.  I think these are important to doing Science because they
> speak directly to the question of whether or not your research is
> reproducible.  Being able to reproduce every jot and tittle of your
> publication, however, is not so important to me; I can only think of
> five or six papers over the last 10 years where I would have even wanted
> to try to reproduce their results.
>
> I think it's a bit of a distraction.  It's easy to get sidetracked by
> questions of whether or not a particular analysis is reproducible.  I'd
> rather focus on questions like:
>
>  - are the results interesting?
>  - is the analysis interesting, e.g. does it reveal novel structure in
>        data, or is the analysis technique particularly sensitive?
>
> If either of these is true, then someone (in the original group or
> elsewhere; eventually, someone else, if it's interesting enough) will
> follow up on the research, and we will eventually find out if the
> results truly were reproducible.
>
> Anecdote: I attended a talk the other day on next-gen sequencing, and it
> will soon be cheaper for the sequencing center to rerun a particular
> Illumina GA run than it is to store the resulting image data (~1 TB?)
> for 6 months.  So primary data is simply getting tossed.  What do Strong
> Reproducibilists think should be done with that data?  It *could* be
> important... but it's probably not.  How much effort and expense do we
> want to go to here?

*If* you can preserve the original library material, then the image
data is derived material and therefore is one of the things to be
reproduced.  That's a big 'if', though ;)

But this points to the other facet of science that has gone out the
window along with reproducibility - the hypothesis.  Now you can just
trawl for data and fit the hypothesis in hindsight.

>
> Leighton, as I think you acknowledge, someone has to make the decisions
> about what is important and what's not, and it's every scientist's job
> to do that as appropriately as they can.  Discussing these things and
> providing technology to facilitate them is good. Arguing that we should
> be individually responsible for retaining every bit of data that anyone
> might find relevant -- as I think some have on this list -- is,
> ultimately, silly.  Or at least distracting. ;)
>
> (I *still* think it should be mandatory to make source code available for
> review along with the results in a publication, mind you.